Date of release: Tue Oct 27 15:48:58 2009
Version: mt09_public_v1
The NIST 2009 Open Machine
Translation Evaluation (MT09) is part of an ongoing series of evaluations of
human language translation technology. NIST conducts these evaluations in order
to support machine translation (MT) research and help advance the state-of-the-art
in machine translation technology. These evaluations provide an important
contribution to the direction of research efforts and the calibration of
technical capabilities. The evaluation was administered as outlined in the official
MT09 evaluation plan.
Informal System Combination was a diagnostic MT09 task, offered after the
official evaluation period.
Output from several MT09 systems on the Arabic-to-English and Urdu-to-English
Current tests was anonymized and provided for system combination purposes.
Participants in this category produced new output based on those provided
translations.
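Participants' combination methods are not described here; approaches in the
literature range from confusion-network decoding to re-ranking and hypothesis
selection. Purely to illustrate the track's input and output, and not as any
participant's actual method, the sketch below performs a minimal
consensus-based selection: for each segment it returns the provided
translation that agrees most, by unigram F-measure, with the other systems'
outputs. All names and data in it are hypothetical.

    from collections import Counter

    def overlap_f1(a, b):
        """Unigram F1 between two token lists; a crude pairwise similarity."""
        ca, cb = Counter(a), Counter(b)
        match = sum(min(ca[w], cb[w]) for w in ca)
        if match == 0:
            return 0.0
        precision, recall = match / len(a), match / len(b)
        return 2 * precision * recall / (precision + recall)

    def consensus_select(segment_hyps):
        """Pick the candidate translation closest on average to all others.

        segment_hyps: one candidate string per input system, for one segment.
        """
        toks = [h.split() for h in segment_hyps]
        best = max(
            range(len(toks)),
            key=lambda i: sum(overlap_f1(toks[i], toks[j])
                              for j in range(len(toks)) if j != i),
        )
        return segment_hyps[best]

    # Three hypothetical system outputs for one segment; the candidate that
    # shares the most vocabulary with the others is selected.
    print(consensus_select([
        "the talks resumed on monday",
        "talks resumed monday",
        "the talks resumed monday",
    ]))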
Scores reported here are
limited to primary Informal System Combination submissions.
These results
are not to be construed or represented as endorsements of any participant's
system or commercial product, or as official findings on the part of NIST or
the U.S. Government. Note that the results submitted by developers of
commercial MT products were generally from research systems, not commercially
available products. Since MT09 was an evaluation of research algorithms, the
MT09 test design required local implementation by each participant. As such,
participants were only required to submit their translation system output to
NIST for uniform scoring and analysis. The systems themselves were not
independently evaluated by NIST.
Certain
commercial equipment, instruments, software, or materials are identified in
this paper in order to specify the experimental procedure adequately. Such
identification is not intended to imply recommendation or endorsement by NIST,
nor is it intended to imply that the equipment, instruments, software, or
materials are necessarily the best available for the purpose.
There is ongoing
discussion within the MT research community regarding the most informative
metrics for machine translation. The design and implementation of these metrics
are themselves very much part of the research. At the present time, there is no
single metric that has been deemed to be completely indicative of all aspects
of system performance.
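As a concrete example of one such metric, the sketch below computes a
simplified corpus-level BLEU-4, the first metric reported in the tables that
follow. It assumes whitespace tokenization and a single reference per segment;
the official mteval-v13a and bleu-1.04 implementations differ in details such
as text normalization, multiple-reference handling, and the reference length
used for the brevity penalty, which is why the two BLEU columns in the tables
below disagree slightly.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu4(hypotheses, references):
        """Simplified corpus-level BLEU-4: one reference, no smoothing."""
        match = [0] * 4   # clipped n-gram matches, n = 1..4
        total = [0] * 4   # hypothesis n-gram counts, n = 1..4
        hyp_len = ref_len = 0
        for hyp, ref in zip(hypotheses, references):
            h, r = hyp.split(), ref.split()
            hyp_len += len(h)
            ref_len += len(r)
            for n in range(1, 5):
                h_counts = Counter(ngrams(h, n))
                r_counts = Counter(ngrams(r, n))
                # modified precision: clip each n-gram by its reference count
                match[n - 1] += sum(min(c, r_counts[g])
                                    for g, c in h_counts.items())
                total[n - 1] += max(len(h) - n + 1, 0)
        if any(m == 0 for m in match):
            return 0.0  # the unsmoothed geometric mean is zero here
        log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / 4
        # brevity penalty for hypotheses shorter than the references
        bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
        return bp * math.exp(log_prec)

    print(bleu4(["the cat sat on the mat"],
                ["the cat sat on the mat"]))   # 1.0 for a perfect match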
The data,
protocols, and metrics employed in this evaluation were chosen to support MT
research and should not be construed as indicating how well these systems would
perform in applications. Changes in the data domain, or in the amount of data
used to build a system, can greatly influence system performance, and changes
in the task protocols could reveal different performance strengths and
weaknesses for these same systems.
For these reasons, this evaluation should not be interpreted as a
product-testing exercise, and the results should not be used to draw
conclusions about which commercial products are best suited to a particular
application.
The Informal System Combination (ISC) track used system output from the
Arabic-to-English and Urdu-to-English Current tests. Approximately 30% of the
test data was designated as a development set for system combination; the
remainder of the system output was provided as the test set. The document
counts are given in the table below.
| Language Pair     | Data Genre | Development Set | Evaluation Set |
|-------------------|------------|-----------------|----------------|
| Arabic-to-English | Newswire   | 17 documents    | 42 documents   |
| Arabic-to-English | Web        | 16 documents    | 40 documents   |
| Urdu-to-English   | Newswire   | 20 documents    | 48 documents   |
| Urdu-to-English   | Web        | 48 documents    | 114 documents  |
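The partition above was fixed by NIST and released with the ISC data. Purely
for illustration, assuming per-document IDs are available, a deterministic
document-level split in the spirit of the roughly 30/70 division could be
produced as follows; hashing the document ID keeps every segment of a document
on the same side of the split.

    import hashlib

    def split_documents(doc_ids, dev_fraction=0.30):
        """Deterministically assign whole documents to dev or test sets."""
        dev, test = [], []
        for doc_id in doc_ids:
            bucket = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16) % 100
            (dev if bucket < dev_fraction * 100 else test).append(doc_id)
        return dev, test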
Arabic-to-English results. Each metric cell gives Overall / Newswire / Web
scores; lower is better for TER, higher is better for all other metrics.

| Site ID       | System                        | BLEU-4 (mteval-v13a)     | IBM BLEU (bleu-1.04)     | NIST (mteval-v13a)    | TER (tercom-0.7.25)      | METEOR (meteor-0.7)      |
|---------------|-------------------------------|--------------------------|--------------------------|-----------------------|--------------------------|--------------------------|
| bbn           | BBN_a2e_isc_primary           | 0.5747 / 0.6440 / 0.4940 | 0.5747 / 0.6440 / 0.4938 | 11.82 / 11.84 / 10.41 | 0.3761 / 0.3220 / 0.4298 | 0.7043 / 0.7601 / 0.6469 |
| sri           | SRI_a2e_isc_primary           | 0.5543 / 0.6292 / 0.4733 | 0.5542 / 0.6291 / 0.4732 | 11.68 / 11.79 / 10.26 | 0.3788 / 0.3244 / 0.4328 | 0.6989 / 0.7474 / 0.6493 |
| cmu-statxfer  | CMU-Stat-Xfer_a2e_isc_primary | 0.5530 / 0.6332 / 0.4663 | 0.5529 / 0.6330 / 0.4662 | 11.62 / 11.80 / 10.15 | 0.3854 / 0.3279 / 0.4427 | 0.7033 / 0.7518 / 0.6538 |
| rwth          | RWTH_a2e_isc_primary          | 0.5515 / 0.6412 / 0.4523 | 0.5517 / 0.6411 / 0.4523 | 11.56 / 11.86 / 9.879 | 0.3923 / 0.3229 / 0.4613 | 0.6928 / 0.7568 / 0.6272 |
| jhu           | jhu_a2e_isc_primary           | 0.5483 / 0.6294 / 0.4577 | 0.5481 / 0.6291 / 0.4574 | 11.55 / 11.73 / 10.01 | 0.3862 / 0.3272 / 0.4448 | 0.6919 / 0.7494 / 0.6330 |
| hit-ltrc      | HIT-LTRC_a2e_isc_primary      | 0.5037 / 0.5997 / 0.3982 | 0.5038 / 0.6000 / 0.3981 | 10.65 / 11.48 / 8.406 | 0.4135 / 0.3472 / 0.4793 | 0.6596 / 0.7249 / 0.5922 |
| tubitak-uekae | TUBITAK_a2e_isc_primary       | 0.4603 / 0.5371 / 0.3779 | 0.4603 / 0.5371 / 0.3779 | 10.31 / 10.75 / 8.726 | 0.4525 / 0.3942 / 0.5105 | 0.6263 / 0.6882 / 0.5625 |
|               | system08_unconstrained.xml    | 0.5008 / 0.5719 / 0.4245 | 0.5007 / 0.5720 / 0.4243 | 11.04 / 11.28 / 9.598 | 0.4229 / 0.3641 / 0.4813 | 0.6694 / 0.7271 / 0.6104 |

The final row reports the highest individual system score in the ISC test set
(the system with the highest BLEU-4 score on the Overall data set).
Urdu-to-English results. Each metric cell gives Overall / Newswire / Web
scores; lower is better for TER, higher is better for all other metrics.

| Site ID      | System                        | BLEU-4 (mteval-v13a)     | IBM BLEU (bleu-1.04)     | NIST (mteval-v13a)    | TER (tercom-0.7.25)      | METEOR (meteor-0.7)      |
|--------------|-------------------------------|--------------------------|--------------------------|-----------------------|--------------------------|--------------------------|
| rwth         | RWTH_u2e_isc_primary(1)       | 0.3232 / 0.3768 / 0.2737 | 0.3235 / 0.3767 / 0.2740 | 8.822 / 9.274 / 7.425 | 0.5630 / 0.5383 / 0.5833 | 0.5539 / 0.6105 / 0.5046 |
| jhu          | jhu_u2e_isc_primary           | 0.3193 / 0.3796 / 0.2627 | 0.3191 / 0.3792 / 0.2627 | 8.736 / 9.197 / 7.418 | 0.5590 / 0.5317 / 0.5815 | 0.5512 / 0.6073 / 0.5022 |
| cmu-statxfer | CMU-Stat-Xfer_u2e_isc_primary | 0.3188 / 0.3821 / 0.2602 | 0.3188 / 0.3821 / 0.2602 | 8.694 / 9.154 / 7.353 | 0.5741 / 0.5422 / 0.6004 | 0.5560 / 0.6170 / 0.5030 |
| hit-ltrc     | HIT-LTRC_u2e_isc_primary      | 0.3103 / 0.3774 / 0.2453 | 0.3104 / 0.3773 / 0.2455 | 8.639 / 9.195 / 7.271 | 0.5820 / 0.5416 / 0.6152 | 0.5519 / 0.6184 / 0.4941 |
|              | system09_constrained.xml      | 0.3104 / 0.3774 / 0.2456 | 0.3104 / 0.3773 / 0.2456 | 8.640 / 9.196 / 7.276 | 0.5816 / 0.5414 / 0.6146 | 0.5522 / 0.6186 / 0.4945 |

The final row reports the highest individual system score in the ISC test set
(the system with the highest BLEU-4 score on the Overall data set).

(1) Rescored.