NIST 2009 Open Machine Translation Evaluation (MT09)
Informal System Combination Results

Date of release: Tue Oct 27 15:48:58 2009
Version: mt09_public_v1

Introduction

The NIST 2009 Open Machine Translation Evaluation (MT09) is part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations in order to support machine translation (MT) research and help advance the state of the art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT09 evaluation plan.

Informal System Combination was an informal, diagnostic MT09 task, offered after the official evaluation period. Output from several MT09 systems on the Arabic-to-English and Urdu-to-English Current tests was anonymized and provided for system combination purposes. Participants in this category produced new output based on those provided translations.

Scores reported here are limited to primary Informal System Combination submissions.

Disclaimer

These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Since MT09 was an evaluation of research algorithms, the MT09 test design required local implementation by each participant. As such, participants were only required to submit their translation system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

Certain commercial equipment, instruments, software, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by NIST, nor is it intended to imply that the equipment, instruments, software or materials are necessarily the best available for the purpose.

There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.

The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems.

For the reasons above, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions regarding which commercial products are best for a particular application.

History

2009/10/27 : First public release

Evaluation Data

System output for the Informal System Combination track included output of the Arabic-to-English and Urdu-to-English Current tests. Approximately 30% of the test data was designated as a development set for system combination. The remainder of the system output was provided as the test set.

Language Pair	Data Genre	Development Set	Evaluation Set
Arabic-to-English	Newswire	17 documents	42 documents
Arabic-to-English	Web	16 documents	40 documents
Urdu-to-English	Newswire	20 documents	48 documents
Urdu-to-English	Web	48 documents	114 documents
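
As a quick check of the "approximately 30%" figure, the development share can be recomputed from the document counts in the table above. This is a minimal sketch in Python; the variable and function names are illustrative, and it measures the split by document count, which is an assumption and may not be the unit used to define the split.

```python
# Document counts copied from the Evaluation Data table: (development, evaluation).
doc_counts = {
    ("Arabic-to-English", "Newswire"): (17, 42),
    ("Arabic-to-English", "Web"): (16, 40),
    ("Urdu-to-English", "Newswire"): (20, 48),
    ("Urdu-to-English", "Web"): (48, 114),
}


def dev_share(dev_docs: int, eval_docs: int) -> float:
    """Fraction of all documents that fall in the development set."""
    return dev_docs / (dev_docs + eval_docs)


# Development share per (language pair, genre), by document count.
shares = {key: dev_share(*counts) for key, counts in doc_counts.items()}

for (pair, genre), share in shares.items():
    print(f"{pair:18s} {genre:9s} {share:.1%}")
```

By document count, every language pair and genre falls between 28% and 30%, which is consistent with the stated split.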

Informal System Combination Results

Arabic-to-English (Table 1)

Metrics (scoring tool): BLEU-4 (mteval-v13a), IBM BLEU (bleu-1.04), NIST (mteval-v13a), TER (tercom-0.7.25), METEOR (meteor-0.7)
Site ID	System	BLEU-4 Overall	BLEU-4 Newswire	BLEU-4 Web	IBM BLEU Overall	IBM BLEU Newswire	IBM BLEU Web	NIST Overall	NIST Newswire	NIST Web	TER Overall	TER Newswire	TER Web	METEOR Overall	METEOR Newswire	METEOR Web
bbn	BBN_a2e_isc_primary	0.5747	0.6440	0.4940	0.5747	0.6440	0.4938	11.82	11.84	10.41	0.3761	0.3220	0.4298	0.7043	0.7601	0.6469
sri	SRI_a2e_isc_primary	0.5543	0.6292	0.4733	0.5542	0.6291	0.4732	11.68	11.79	10.26	0.3788	0.3244	0.4328	0.6989	0.7474	0.6493
cmu-statxfer	CMU-Stat-Xfer_a2e_isc_primary	0.5530	0.6332	0.4663	0.5529	0.6330	0.4662	11.62	11.80	10.15	0.3854	0.3279	0.4427	0.7033	0.7518	0.6538
rwth	RWTH_a2e_isc_primary	0.5515	0.6412	0.4523	0.5517	0.6411	0.4523	11.56	11.86	9.879	0.3923	0.3229	0.4613	0.6928	0.7568	0.6272
jhu	jhu_a2e_isc_primary	0.5483	0.6294	0.4577	0.5481	0.6291	0.4574	11.55	11.73	10.01	0.3862	0.3272	0.4448	0.6919	0.7494	0.6330
hit-ltrc	HIT-LTRC_a2e_isc_primary	0.5037	0.5997	0.3982	0.5038	0.6000	0.3981	10.65	11.48	8.406	0.4135	0.3472	0.4793	0.6596	0.7249	0.5922
tubitak-uekae	TUBITAK_a2e_isc_primary	0.4603	0.5371	0.3779	0.4603	0.5371	0.3779	10.31	10.75	8.726	0.4525	0.3942	0.5105	0.6263	0.6882	0.5625
Highest individual system score in ISC test set (system with highest BLEU-4 score on Overall data set)
system08_unconstrained.xml		0.5008	0.5719	0.4245	0.5007	0.5720	0.4243	11.04	11.28	9.598	0.4229	0.3641	0.4813	0.6694	0.7271	0.6104
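
The gap between each combination submission and the best individual input system can be read off Table 1 directly. The following Python sketch does this for the Overall BLEU-4 (mteval-v13a) column only; the scores are copied from the table, while the names `combo_bleu`, `best_single_bleu`, and `gains` are illustrative.

```python
# Overall BLEU-4 (mteval-v13a) scores copied from Table 1, keyed by site ID.
combo_bleu = {
    "bbn": 0.5747,
    "sri": 0.5543,
    "cmu-statxfer": 0.5530,
    "rwth": 0.5515,
    "jhu": 0.5483,
    "hit-ltrc": 0.5037,
    "tubitak-uekae": 0.4603,
}

# Best individual system on the same column (system08_unconstrained.xml).
best_single_bleu = 0.5008

# Absolute BLEU-4 difference; positive means the combination scored higher.
gains = {
    site: round(score - best_single_bleu, 4)
    for site, score in combo_bleu.items()
}

for site, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{site:14s} {gain:+.4f}")
```

On this column, six of the seven submissions score above the best individual system and one scores below it. The comparison covers a single metric and should be read with the Disclaimer in mind.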

Urdu-to-English (Table 2)

Metrics (scoring tool): BLEU-4 (mteval-v13a), IBM BLEU (bleu-1.04), NIST (mteval-v13a), TER (tercom-0.7.25), METEOR (meteor-0.7)
Site ID	System	BLEU-4 Overall	BLEU-4 Newswire	BLEU-4 Web	IBM BLEU Overall	IBM BLEU Newswire	IBM BLEU Web	NIST Overall	NIST Newswire	NIST Web	TER Overall	TER Newswire	TER Web	METEOR Overall	METEOR Newswire	METEOR Web
rwth	RWTH_u2e_isc_primary⁽¹⁾	0.3232	0.3768	0.2737	0.3235	0.3767	0.2740	8.822	9.274	7.425	0.5630	0.5383	0.5833	0.5539	0.6105	0.5046
jhu	jhu_u2e_isc_primary	0.3193	0.3796	0.2627	0.3191	0.3792	0.2627	8.736	9.197	7.418	0.5590	0.5317	0.5815	0.5512	0.6073	0.5022
cmu-statxfer	CMU-Stat-Xfer_u2e_isc_primary	0.3188	0.3821	0.2602	0.3188	0.3821	0.2602	8.694	9.154	7.353	0.5741	0.5422	0.6004	0.5560	0.6170	0.5030
hit-ltrc	HIT-LTRC_u2e_isc_primary	0.3103	0.3774	0.2453	0.3104	0.3773	0.2455	8.639	9.195	7.271	0.5820	0.5416	0.6152	0.5519	0.6184	0.4941
Highest individual system score in ISC test set (system with highest BLEU-4 score on Overall data set)
system09_constrained.xml		0.3104	0.3774	0.2456	0.3104	0.3773	0.2456	8.640	9.196	7.276	0.5816	0.5414	0.6146	0.5522	0.6186	0.4945

⁽¹⁾rescored
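
The Disclaimer notes that no single metric is completely indicative of system performance. Table 2 gives a small illustration: the rank order of the submissions depends on which metric is used. The Python sketch below ranks the four Urdu-to-English submissions on three of the Overall columns. It assumes the usual convention that higher is better for BLEU-4 and METEOR and lower is better for TER, which is an error rate; the names `overall`, `metrics`, and `rank_sites` are illustrative.

```python
# Overall scores copied from Table 2: (BLEU-4, TER, METEOR).
overall = {
    "rwth": (0.3232, 0.5630, 0.5539),
    "jhu": (0.3193, 0.5590, 0.5512),
    "cmu-statxfer": (0.3188, 0.5741, 0.5560),
    "hit-ltrc": (0.3103, 0.5820, 0.5519),
}

# Column index in the tuples above, and whether a higher value is better.
metrics = {"BLEU-4": (0, True), "TER": (1, False), "METEOR": (2, True)}


def rank_sites(metric: str) -> list:
    """Return site IDs ordered from best to worst on the given metric."""
    column, higher_is_better = metrics[metric]
    return sorted(
        overall,
        key=lambda site: overall[site][column],
        reverse=higher_is_better,
    )


rankings = {metric: rank_sites(metric) for metric in metrics}

for metric, order in rankings.items():
    print(f"{metric:7s} {' > '.join(order)}")
```

With these values, a different site ranks first under each of the three metrics, so small differences in Table 2 should not be read as a definitive ordering.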
