NIST 2009 Open Machine Translation Evaluation (MT09)
Informal System Combination Results

Date of release: Tue Oct 27 15:48:58 2009
Version: mt09_public_v1

Introduction

The NIST 2009 Open Machine Translation Evaluation (MT09) is part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations in order to support machine translation (MT) research and help advance the state-of-the-art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT09 evaluation plan.

Informal System Combination was an informal, diagnostic MT09 task, offered after the official evaluation period. Output from several MT09 systems on the Arabic-to-English and Urdu-to-English Current tests was anonymized and provided for system combination purposes. Participants in this category produced new output based on those provided translations.

Scores reported here are limited to primary Informal System Combination submissions.
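Combination methods were not prescribed by NIST and varied by site. Purely as a hypothetical illustration of the task's shape, and not any participant's actual method, a minimal consensus-based hypothesis selection over the anonymized system outputs for one segment might look like the following (all names are illustrative):

```python
from collections import Counter

def overlap(a, b):
    """Unigram F1-style overlap between two token lists."""
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())
    return 2 * inter / (len(a) + len(b)) if (a or b) else 1.0

def select_consensus(system_outputs):
    """system_outputs: one tokenized hypothesis per anonymized system,
    all translating the same source segment. Returns the hypothesis
    most similar, on average, to the other systems' outputs."""
    best, best_score = None, -1.0
    for i, hyp in enumerate(system_outputs):
        score = sum(overlap(hyp, other)
                    for j, other in enumerate(system_outputs) if j != i)
        if score > best_score:
            best, best_score = hyp, score
    return best
```

Real entries in this track used substantially more sophisticated techniques (e.g., confusion-network or lattice-based combination with rescoring); the sketch above only conveys the input/output contract of the task.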

Disclaimer

These results are not to be construed, or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Since MT09 was an evaluation of research algorithms, the MT09 test design required local implementation by each participant. As such, participants were only required to submit their translation system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

Certain commercial equipment, instruments, software, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by NIST, nor is it intended to imply that the equipment, instruments, software or materials are necessarily the best available for the purpose.

There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.
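While no single metric is definitive, the mechanics of the most widely used one are simple to sketch. The following is a simplified corpus-level BLEU-4 (modified n-gram precision combined with a brevity penalty); it is not the mteval-v13a implementation used for the scores below, which differs in tokenization and other details:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU.
    hypotheses: list of token lists, one per segment.
    references: list of lists of token lists (multiple refs per segment)."""
    clipped = [0] * max_n   # clipped n-gram matches, per order
    total = [0] * max_n     # total hypothesis n-grams, per order
    hyp_len = ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # closest reference length, for the brevity penalty
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                for g, c in ngrams(r, n).items():
                    max_ref[g] = max(max_ref[g], c)
            clipped[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if any(c == 0 for c in clipped):
        return 0.0  # no smoothing in this sketch
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_prec)
```

The other metrics reported below follow different designs: NIST weights n-grams by informativeness, TER counts edit operations (so lower is better), and METEOR aligns unigrams with stemming and synonym matching.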

The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. Changes in the data domain, or in the amount of data used to build a system, can greatly influence system performance, and changing the task protocols could reveal different performance strengths and weaknesses for these same systems.

Because of the above reasons, this should not be interpreted as a product testing exercise and the results should not be used to make conclusions regarding which commercial products are best for a particular application.

Evaluation Data

System output for the Informal System Combination track included output of the Arabic-to-English and Urdu-to-English Current tests. Approximately 30% of the test data was designated as a development set for system combination. The remainder of the system output was provided as the test set.

Language Pair        Data Genre   Development Set   Evaluation Set
------------------------------------------------------------------
Arabic-to-English    Newswire     17 documents      42 documents
Arabic-to-English    Web          16 documents      40 documents
Urdu-to-English      Newswire     20 documents      48 documents
Urdu-to-English      Web          48 documents      114 documents
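The actual development/evaluation partition was fixed by NIST in advance. As an illustrative sketch only, a reproducible document-level split at roughly the 30% rate described above could be drawn as follows (the function name and seed are hypothetical, not part of the MT09 procedure):

```python
import random

def split_documents(doc_ids, dev_fraction=0.30, seed=0):
    """Deterministically partition document IDs into a development set
    and an evaluation set. Splitting at the document level keeps all
    segments of a document on the same side of the split."""
    ids = sorted(doc_ids)          # fixed order before shuffling
    rng = random.Random(seed)      # seeded for reproducibility
    rng.shuffle(ids)
    n_dev = round(len(ids) * dev_fraction)
    return ids[:n_dev], ids[n_dev:]
```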

Informal System Combination Results

Arabic-to-English (Table 1)

In each sub-table below, the final row (system08_unconstrained.xml) is the highest individual system score in the ISC test set, i.e., the system with the highest BLEU-4 score on the Overall data set. For TER, lower scores are better; for all other metrics, higher scores are better.

BLEU-4 (mteval-v13a)

Site ID         System                           Overall   Newswire   Web
---------------------------------------------------------------------------
bbn             BBN_a2e_isc_primary              0.5747    0.6440     0.4940
sri             SRI_a2e_isc_primary              0.5543    0.6292     0.4733
cmu-statxfer    CMU-Stat-Xfer_a2e_isc_primary    0.5530    0.6332     0.4663
rwth            RWTH_a2e_isc_primary             0.5515    0.6412     0.4523
jhu             jhu_a2e_isc_primary              0.5483    0.6294     0.4577
hit-ltrc        HIT-LTRC_a2e_isc_primary         0.5037    0.5997     0.3982
tubitak-uekae   TUBITAK_a2e_isc_primary          0.4603    0.5371     0.3779
(best single)   system08_unconstrained.xml       0.5008    0.5719     0.4245

IBM BLEU (bleu-1.04)

Site ID         System                           Overall   Newswire   Web
---------------------------------------------------------------------------
bbn             BBN_a2e_isc_primary              0.5747    0.6440     0.4938
sri             SRI_a2e_isc_primary              0.5542    0.6291     0.4732
cmu-statxfer    CMU-Stat-Xfer_a2e_isc_primary    0.5529    0.6330     0.4662
rwth            RWTH_a2e_isc_primary             0.5517    0.6411     0.4523
jhu             jhu_a2e_isc_primary              0.5481    0.6291     0.4574
hit-ltrc        HIT-LTRC_a2e_isc_primary         0.5038    0.6000     0.3981
tubitak-uekae   TUBITAK_a2e_isc_primary          0.4603    0.5371     0.3779
(best single)   system08_unconstrained.xml       0.5007    0.5720     0.4243

NIST (mteval-v13a)

Site ID         System                           Overall   Newswire   Web
---------------------------------------------------------------------------
bbn             BBN_a2e_isc_primary              11.82     11.84      10.41
sri             SRI_a2e_isc_primary              11.68     11.79      10.26
cmu-statxfer    CMU-Stat-Xfer_a2e_isc_primary    11.62     11.80      10.15
rwth            RWTH_a2e_isc_primary             11.56     11.86      9.879
jhu             jhu_a2e_isc_primary              11.55     11.73      10.01
hit-ltrc        HIT-LTRC_a2e_isc_primary         10.65     11.48      8.406
tubitak-uekae   TUBITAK_a2e_isc_primary          10.31     10.75      8.726
(best single)   system08_unconstrained.xml       11.04     11.28      9.598

TER (tercom-0.7.25; lower is better)

Site ID         System                           Overall   Newswire   Web
---------------------------------------------------------------------------
bbn             BBN_a2e_isc_primary              0.3761    0.3220     0.4298
sri             SRI_a2e_isc_primary              0.3788    0.3244     0.4328
cmu-statxfer    CMU-Stat-Xfer_a2e_isc_primary    0.3854    0.3279     0.4427
rwth            RWTH_a2e_isc_primary             0.3923    0.3229     0.4613
jhu             jhu_a2e_isc_primary              0.3862    0.3272     0.4448
hit-ltrc        HIT-LTRC_a2e_isc_primary         0.4135    0.3472     0.4793
tubitak-uekae   TUBITAK_a2e_isc_primary          0.4525    0.3942     0.5105
(best single)   system08_unconstrained.xml       0.4229    0.3641     0.4813

METEOR (meteor-0.7)

Site ID         System                           Overall   Newswire   Web
---------------------------------------------------------------------------
bbn             BBN_a2e_isc_primary              0.7043    0.7601     0.6469
sri             SRI_a2e_isc_primary              0.6989    0.7474     0.6493
cmu-statxfer    CMU-Stat-Xfer_a2e_isc_primary    0.7033    0.7518     0.6538
rwth            RWTH_a2e_isc_primary             0.6928    0.7568     0.6272
jhu             jhu_a2e_isc_primary              0.6919    0.7494     0.6330
hit-ltrc        HIT-LTRC_a2e_isc_primary         0.6596    0.7249     0.5922
tubitak-uekae   TUBITAK_a2e_isc_primary          0.6263    0.6882     0.5625
(best single)   system08_unconstrained.xml       0.6694    0.7271     0.6104

Urdu-to-English (Table 2)

In each sub-table below, the final row (system09_constrained.xml) is the highest individual system score in the ISC test set, i.e., the system with the highest BLEU-4 score on the Overall data set. For TER, lower scores are better; for all other metrics, higher scores are better.

BLEU-4 (mteval-v13a)

Site ID         System                           Overall   Newswire   Web
---------------------------------------------------------------------------
rwth            RWTH_u2e_isc_primary(1)          0.3232    0.3768     0.2737
jhu             jhu_u2e_isc_primary              0.3193    0.3796     0.2627
cmu-statxfer    CMU-Stat-Xfer_u2e_isc_primary    0.3188    0.3821     0.2602
hit-ltrc        HIT-LTRC_u2e_isc_primary         0.3103    0.3774     0.2453
(best single)   system09_constrained.xml         0.3104    0.3774     0.2456

IBM BLEU (bleu-1.04)

Site ID         System                           Overall   Newswire   Web
---------------------------------------------------------------------------
rwth            RWTH_u2e_isc_primary(1)          0.3235    0.3767     0.2740
jhu             jhu_u2e_isc_primary              0.3191    0.3792     0.2627
cmu-statxfer    CMU-Stat-Xfer_u2e_isc_primary    0.3188    0.3821     0.2602
hit-ltrc        HIT-LTRC_u2e_isc_primary         0.3104    0.3773     0.2455
(best single)   system09_constrained.xml         0.3104    0.3773     0.2456

NIST (mteval-v13a)

Site ID         System                           Overall   Newswire   Web
---------------------------------------------------------------------------
rwth            RWTH_u2e_isc_primary(1)          8.822     9.274      7.425
jhu             jhu_u2e_isc_primary              8.736     9.197      7.418
cmu-statxfer    CMU-Stat-Xfer_u2e_isc_primary    8.694     9.154      7.353
hit-ltrc        HIT-LTRC_u2e_isc_primary         8.639     9.195      7.271
(best single)   system09_constrained.xml         8.640     9.196      7.276

TER (tercom-0.7.25; lower is better)

Site ID         System                           Overall   Newswire   Web
---------------------------------------------------------------------------
rwth            RWTH_u2e_isc_primary(1)          0.5630    0.5383     0.5833
jhu             jhu_u2e_isc_primary              0.5590    0.5317     0.5815
cmu-statxfer    CMU-Stat-Xfer_u2e_isc_primary    0.5741    0.5422     0.6004
hit-ltrc        HIT-LTRC_u2e_isc_primary         0.5820    0.5416     0.6152
(best single)   system09_constrained.xml         0.5816    0.5414     0.6146

METEOR (meteor-0.7)

Site ID         System                           Overall   Newswire   Web
---------------------------------------------------------------------------
rwth            RWTH_u2e_isc_primary(1)          0.5539    0.6105     0.5046
jhu             jhu_u2e_isc_primary              0.5512    0.6073     0.5022
cmu-statxfer    CMU-Stat-Xfer_u2e_isc_primary    0.5560    0.6170     0.5030
hit-ltrc        HIT-LTRC_u2e_isc_primary         0.5519    0.6184     0.4941
(best single)   system09_constrained.xml         0.5522    0.6186     0.4945

(1) rescored