Machine Translation, Volume 23, No. 1, February 2009, pp. 1-22
Automatically generated parallel treebanks and their exploitability in machine translation
John Tinsley
Received: 25 November 2008 / Accepted: 15 December 2009 / Published online: 23 January 2010
© Springer Science+Business Media B.V. 2010
Abstract Given much recent discussion and the shift in focus of the field, it is becoming apparent that the incorporation of syntax is the way forward for improvements to the current state-of-the-art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. However, until recently there has been no other means to build them than by hand. In this paper, we describe how we make use of new tools to automatically build a large parallel treebank and extract a set of linguistically-motivated phrase pairs from it. We show that adding these phrase pairs to the translation model of a baseline phrase-based statistical MT (PB-SMT) system leads to significant improvements in translation quality. Following this, we describe experiments in which we exploit the information encoded in the parallel treebank in other areas of the PB-SMT framework, while investigating the conditions under which the incorporation of parallel treebank data performs optimally. Finally, we discuss the possibility of exploiting automatically-generated parallel treebanks further in syntax-aware paradigms of MT.
Keywords Parallel treebanks · Statistical machine translation · Phrase-based statistical machine translation · Syntax in machine translation · Sub-tree alignment · Translation modelling · Resource combination · Word alignment · Phrase alignment · Hybrid models
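To make the extraction step concrete, here is a minimal sketch (not the authors' actual pipeline) of how linguistically-motivated phrase pairs can be read off a parallel treebank: every pair of aligned subtrees contributes the pair of surface strings the two subtrees cover. The Tree class and the toy English-German example are illustrative assumptions.

# A minimal sketch of extracting phrase pairs from a parallel treebank:
# every pair of aligned subtrees yields the pair of surface strings the
# two subtrees cover. The Tree class and example alignment are
# illustrative assumptions, not the authors' actual tools.

from dataclasses import dataclass, field

@dataclass
class Tree:
    label: str
    children: list = field(default_factory=list)  # Tree or str leaves

    def yield_tokens(self):
        """Collect the surface string covered by this subtree."""
        tokens = []
        for child in self.children:
            if isinstance(child, Tree):
                tokens.extend(child.yield_tokens())
            else:
                tokens.append(child)
        return tokens

def extract_phrase_pairs(alignments):
    """Each aligned (source_subtree, target_subtree) pair becomes a phrase pair."""
    return [(" ".join(s.yield_tokens()), " ".join(t.yield_tokens()))
            for s, t in alignments]

# Toy example: English NP "the house" aligned with German NP "das Haus".
src = Tree("NP", [Tree("DT", ["the"]), Tree("NN", ["house"])])
tgt = Tree("NP", [Tree("ART", ["das"]), Tree("NN", ["Haus"])])
print(extract_phrase_pairs([(src, tgt)]))  # [('the house', 'das Haus')]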
Machine Translation, Volume 23, No. 1, February 2009, pp. 23-63
Symbolic-to-statistical hybridization: extending generation-heavy machine translation
Nizar Habash · Bonnie Dorr · Christof Monz
Received: 28 January 2009 / Accepted: 5 October 2009 / Published online: 11 November 2009
© The Author(s) 2009. This article is published with open access at Springerlink.com
Abstract The last few years have witnessed an increasing interest in hybridizing surface-based statistical approaches and rule-based symbolic approaches to machine translation (MT). Much of that work is focused on extending statistical MT systems with symbolic knowledge and components. In the brand of hybridization discussed here, we go in the opposite direction: adding statistical bilingual components to a symbolic system. Our base system is Generation-heavy machine translation (GHMT), a primarily symbolic asymmetrical approach that addresses the issue of Interlingual MT resource poverty in source-poor/target-rich language pairs by exploiting symbolic and statistical target-language resources. GHMT's statistical components are limited to target-language models, which arguably makes it a simple form of a hybrid system. We extend the hybrid nature of GHMT by adding statistical bilingual components. We also describe the details of retargeting it to Arabic. The morphological richness of Arabic brings several challenges to the hybridization task. We conduct an extensive evaluation of multiple system variants. Our evaluation shows that this new variant of GHMT, a primarily symbolic system extended with monolingual and bilingual statistical components, has a higher degree of grammaticality than a phrase-based statistical MT system, where grammaticality is measured in terms of correct verb-argument realization and long-distance dependency translation.
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 71-103
The NIST 2008 Metrics for machine translation challenge: overview, methodology, metrics, and results
Mark Przybocki · Kay Peterson · Sebastian Bronsart · Gregory Sanders
Received: 18 May 2009 / Accepted: 26 November 2009 / Published online: 14 January 2010
© US Government 2010
Abstract This paper discusses the evaluation of automated metrics developed for the purpose of evaluating machine translation (MT) technology. A general discussion of the usefulness of automated metrics is offered. The NIST MetricsMATR evaluation of MT metrology is described, including its objectives, protocols, participants, and test data. The methodology employed to evaluate the submitted metrics is reviewed. A summary is provided for the general classes of evaluated metrics. Overall results of this evaluation are presented, primarily by means of correlation statistics, showing the degree of agreement between the automated metric scores and the scores of human judgments. Metrics are analyzed at the sentence, document, and system level, with results conditioned by various properties of the test data. This paper concludes with some perspective on the improvements that should be incorporated into future evaluations of metrics for MT evaluation.
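As a rough illustration of the correlation statistics the paper reports, the following sketch computes segment-level agreement between metric scores and human judgments; the score lists are invented, and only the scipy correlation functions are real.

# Hedged sketch of agreement statistics between automatic metric scores
# and human judgments at the segment level. The scores are hypothetical
# toy data; scipy provides the correlation functions.

from scipy.stats import pearsonr, spearmanr, kendalltau

metric_scores = [0.42, 0.61, 0.35, 0.78, 0.50]  # hypothetical metric output
human_scores  = [3.0,  4.5,  2.5,  5.0,  3.5]   # hypothetical adequacy ratings

print("Pearson  r   =", pearsonr(metric_scores, human_scores)[0])
print("Spearman rho =", spearmanr(metric_scores, human_scores)[0])
print("Kendall  tau =", kendalltau(metric_scores, human_scores)[0])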
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 105-115
The Meteor metric for automatic evaluation of machine translation
Alon Lavie · Michael J. Denkowski
Received: 16 May 2009 / Accepted: 13 October 2009 / Published online: 1 November 2009
© Springer Science+Business Media B.V. 2009
Abstract The Meteor Automatic Metric for Machine Translation evaluation, originally developed and released in 2004, was designed with the explicit goal of producing sentence-level scores which correlate well with human judgments of translation quality. Several key design decisions were incorporated into Meteor in support of this goal. In contrast with IBM's Bleu, which uses only precision-based features, Meteor uses and emphasizes recall in addition to precision, a property that has been confirmed by several metrics evaluations as being critical for high correlation with human judgments. Meteor also addresses the problem of reference translation variability by utilizing flexible word matching, allowing for morphological variants and synonyms to be taken into account as legitimate correspondences. Furthermore, the feature ingredients within Meteor are parameterized, allowing for the tuning of the metric's free parameters in search of values that result in optimal correlation with human judgments. Optimal parameters can be separately tuned for different types of human judgments and for different languages. We discuss the initial design of the Meteor metric, subsequent improvements, and performance in several independent evaluations in recent years.
Keywords Machine translation · MT · Evaluation · Automatic metrics
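A minimal sketch of the parameterized scoring scheme the abstract describes, assuming the commonly published Meteor formulation: a recall-weighted harmonic mean of unigram precision and recall, discounted by a fragmentation penalty. The parameter values shown are illustrative choices, not official defaults.

# Sketch of a Meteor-style score from unigram match statistics.
# alpha/beta/gamma are the kind of free parameters the metric tunes.

def meteor_style_score(matches, hyp_len, ref_len, chunks,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """Score a hypothesis given unigram match statistics.

    matches  -- number of matched unigrams (surface/stem/synonym)
    hyp_len  -- hypothesis length in words
    ref_len  -- reference length in words
    chunks   -- number of contiguous matched chunks (order information)
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Recall-weighted harmonic mean: alpha near 1 emphasizes recall.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: fewer, longer chunks mean better word order.
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean

# 8 of 10 hypothesis words match 8 of 11 reference words, in 3 chunks.
print(round(meteor_style_score(8, 10, 11, 3), 3))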
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 117-127
TER-Plus: paraphrase, semantic, and alignment enhancements to Translation Edit Rate
Matthew G. Snover · Nitin Madnani · Bonnie Dorr · Richard Schwartz
Received: 15 May 2009 / Accepted: 16 November 2009 / Published online: 15 December 2009
© Springer Science+Business Media B.V. 2009
Abstract This paper describes a new evaluation metric, TER-Plus (TERp), for automatic evaluation of machine translation (MT). TERp is an extension of Translation Edit Rate (TER). It builds on the success of TER as an evaluation metric and alignment tool and addresses several of its weaknesses through the use of paraphrases, stemming, and synonyms, as well as edit costs that can be automatically optimized to correlate better with various types of human judgments. We present a correlation study comparing TERp to BLEU, METEOR and TER, and illustrate that TERp can better evaluate translation adequacy.
Keywords Machine translation evaluation · Paraphrasing · Alignment
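For reference, the TER score that TERp extends is standardly defined as the minimum number of edits (insertions, deletions, substitutions, and shifts of word sequences) needed to turn the hypothesis into one of the references, normalized by the average reference length:

    TER = (number of edits) / (average number of reference words)

TERp keeps this edit-based core but adds paraphrase, stemming, and synonym edits whose costs can be tuned against human judgments.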
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 129-140
Edit distances with block movements and error rate confidence estimates
Gregor Leusch · Hermann Ney
Received: 9 May 2009 / Accepted: 23 November 2009 / Published online: 13 December 2009
© Springer Science+Business Media B.V. 2009
Abstract We present two evaluation measures for Machine Translation (MT), which are defined as error rates extended by block moves. In contrast to TER, these measures are constrained in a way that allows for an exact calculation in polynomial time. We then investigate three methods to estimate the standard error of error rates, and compare them to bootstrap estimates. We assess the correlation of our proposed measures with human judgment using data from the National Institute of Standards and Technology (NIST) 2008 MetricsMATR workshop.
Keywords Machine translation · Evaluation · Bootstrap · Confidence intervals
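The bootstrap baseline they compare against can be sketched as follows: resample sentences with replacement and take the standard deviation of the resampled corpus-level error rates. This is a generic illustration with invented toy counts, not the authors' implementation.

# Bootstrap estimate of the standard error of a corpus-level error rate,
# resampling at the sentence level. Toy per-sentence counts below.

import random

def bootstrap_stderr(edits, lengths, replicates=1000, seed=0):
    """Standard error of sum(edits)/sum(lengths) via sentence resampling."""
    rng = random.Random(seed)
    n = len(edits)
    rates = []
    for _ in range(replicates):
        idx = [rng.randrange(n) for _ in range(n)]
        rates.append(sum(edits[i] for i in idx) / sum(lengths[i] for i in idx))
    mean = sum(rates) / replicates
    var = sum((r - mean) ** 2 for r in rates) / (replicates - 1)
    return var ** 0.5

edits   = [3, 1, 4, 0, 2, 5, 1, 2]      # toy per-sentence edit counts
lengths = [10, 8, 12, 7, 9, 15, 6, 11]  # toy reference lengths
print(round(bootstrap_stderr(edits, lengths), 4))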
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 141-155
ATEC: automatic evaluation of machine translation via word choice and word order
Billy Wong · Chunyu Kit
Received: 15 May 2009 / Accepted: 29 October 2009 / Published online: 12 December 2009
© Springer Science+Business Media B.V. 2009
Abstract We propose a novel metric ATEC for automatic MT evaluation based on explicit assessment of word choice and word order in an MT output in comparison to its reference translation(s), the two most fundamental factors in the construction of meaning for a sentence. The former is assessed by matching word forms at various linguistic levels, including surface form, stem, sound and sense, and further by weighing the informativeness of each word. The latter is quantified in terms of the discordance of word position and word sequence between a translation candidate and its reference. In the evaluations using the MetricsMATR08 data set and the LDC MTC2 and MTC4 corpora, ATEC demonstrates an impressive positive correlation to human judgments at the segment level, highly comparable to the few state-of-the-art evaluation metrics.
Keywords MT evaluation · Evaluation metrics · ATEC · Word choice · Word order
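As an illustration of the word-order component, one simple way to quantify positional discordance is to average the normalized position differences of words shared between candidate and reference. The formulation below is a toy stand-in, not the actual ATEC measure.

# Toy word-order discordance: compare normalized positions of words
# shared by the candidate and the reference.

def position_discordance(candidate, reference):
    """Mean normalized position difference over words on both sides."""
    shared = set(candidate) & set(reference)
    if not shared:
        return 1.0
    diffs = []
    for word in shared:
        c_pos = candidate.index(word) / max(len(candidate) - 1, 1)
        r_pos = reference.index(word) / max(len(reference) - 1, 1)
        diffs.append(abs(c_pos - r_pos))
    return sum(diffs) / len(diffs)

cand = "the cat sat on the mat".split()
ref  = "on the mat the cat sat".split()
print(round(position_discordance(cand, ref), 3))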
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 157-168
MaxSim: performance and effects of translation fluency
Yee Seng Chan · Hwee Tou Ng
Received: 10 May 2009 / Accepted: 10 October 2009 / Published online: 31 October 2009
© Springer Science+Business Media B.V. 2009
Abstract This paper evaluates the performance of our recently proposed automatic machine translation evaluation metric MaxSim and examines the impact of translation fluency on the metric. MaxSim calculates a similarity score between a pair of English system-reference sentences by comparing information items such as n-grams across the sentence pair. Unlike most metrics which perform binary matching, MaxSim also computes similarity scores between items and models them as nodes in a bipartite graph to select a maximum weight matching. Our experiments show that MaxSim is competitive with state-of-the-art metrics on benchmark datasets.
Keywords Machine translation evaluation · MaxSim · Bipartite matching · WMT · MetricsMATR
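The maximum weight matching step can be sketched with a real-valued similarity matrix and the assignment solver from scipy; the similarity function below is a crude stand-in for MaxSim's actual comparison of information items.

# Bipartite maximum-weight matching over real-valued similarity scores,
# rather than binary matches. toy_similarity is an illustrative stand-in.

import numpy as np
from scipy.optimize import linear_sum_assignment

def toy_similarity(a, b):
    """Crude character-overlap similarity between two words (illustrative)."""
    overlap = len(set(a) & set(b))
    return overlap / max(len(set(a) | set(b)), 1)

hyp = ["cats", "sit", "quietly"]
ref = ["the", "cat", "sits", "quiet"]

sim = np.array([[toy_similarity(h, r) for r in ref] for h in hyp])
rows, cols = linear_sum_assignment(-sim)  # negate to maximize similarity
print("matched pairs:", [(hyp[i], ref[j]) for i, j in zip(rows, cols)])
print("matching weight:", round(float(sim[rows, cols].sum()), 3))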
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 169-179
Expected dependency pair match: predicting translation quality with expected syntactic structure
Jeremy G. Kahn · Matthew Snover · Mari Ostendorf
Received: 15 May 2009 / Accepted: 10 October 2009 / Published online: 31 October 2009
© Springer Science+Business Media B.V. 2009
Abstract Recent efforts to develop new machine translation evaluation methods have tried to account for allowable wording differences either in terms of syntactic structure or synonyms/paraphrases. This paper primarily considers syntactic structure, combining scores from partial syntactic dependency matches with standard local n-gram matches using a statistical parser, and taking advantage of N-best parse probabilities. The new scoring metric, expected dependency pair match (EDPM), is shown to outperform BLEU and TER in terms of correlation to human judgments and as a predictor of HTER. Further, we combine the syntactic features of EDPM with the alternative wording features of TERp, showing a benefit to accounting for syntactic structure on top of semantic equivalency features.
Keywords Machine translation evaluation · Syntax · Dependency trees
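A hedged sketch of the "expected" ingredient: dependency pairs are weighted by the posterior probability of each parse in an N-best list, and hypothesis and reference are compared through a precision/recall-style overlap of those expected counts. The parses and probabilities are invented toy data, and the combination with n-gram scores is omitted.

# Expected dependency-pair counts over an N-best parse list, scored by
# fractional precision/recall overlap. All inputs are toy data.

from collections import Counter

def expected_dep_counts(nbest):
    """nbest: list of (posterior, [(head, dependent), ...]) parses."""
    expected = Counter()
    for prob, deps in nbest:
        for dep in deps:
            expected[dep] += prob
    return expected

def f1_overlap(hyp_counts, ref_counts):
    """Harmonic mean of fractional precision/recall over expected counts."""
    overlap = sum(min(hyp_counts[d], ref_counts[d]) for d in hyp_counts)
    p = overlap / max(sum(hyp_counts.values()), 1e-9)
    r = overlap / max(sum(ref_counts.values()), 1e-9)
    return 2 * p * r / max(p + r, 1e-9)

hyp_nbest = [(0.7, [("sat", "cat"), ("sat", "mat")]),
             (0.3, [("sat", "cat"), ("cat", "mat")])]
ref_nbest = [(1.0, [("sat", "cat"), ("sat", "mat")])]

print(round(f1_overlap(expected_dep_counts(hyp_nbest),
                       expected_dep_counts(ref_nbest)), 3))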
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 181-193
Measuring machine translation quality as semantic equivalence: a metric based on entailment features
Sebastian Padó · Daniel Cer · Michel Galley · Dan Jurafsky · Christopher D. Manning
Received: 10 May 2009 / Accepted: 15 October 2009 / Published online: 8 November 2009
© Springer Science+Business Media B.V. 2009
Abstract Current evaluation metrics for machine translation have increasing difficulty in distinguishing good from merely fair translations. We believe the main problem to be their inability to properly capture meaning: a good translation candidate means the same thing as the reference translation, regardless of formulation. We propose a metric that assesses the quality of MT output through its semantic equivalence to the reference translation, based on a rich set of match and mismatch features motivated by textual entailment. We first evaluate this metric in an evaluation setting against a combination metric of four state-of-the-art scores. Our metric predicts human judgments better than the combination metric. Combining the entailment and traditional features yields further improvements. Then, we demonstrate that the entailment metric can also be used as a learning criterion in minimum error rate training (MERT) to improve parameter estimation in MT system training. A manual evaluation of the resulting translations indicates that the new model obtains a significant improvement in translation quality.
Machine Translation, Volume 23, No. 4, November 2009, pp. 195-240
The hare and the tortoise: speed and accuracy in translation retrieval
Timothy Baldwin
Received: 28 January 2009 / Accepted: 24 November 2009 / Published online: 26 February 2010
© Springer Science+Business Media B.V. 2010
Abstract This research looks at the effects of segment order and segmentation on translation retrieval performance for an experimental Japanese-English translation memory system. We implement a number of both bag-of-words and segment-order-sensitive string comparison methods, and test each over character-based and word-based indexing using n-grams of various orders. To evaluate accuracy, we propose an automatic method which identifies the target-language string(s) which would lead to the optimal translation for a given input, based on analysis of the held-out translation and the current contents of the translation memory. Our results indicate that character-based indexing is superior to word-based indexing, and also that bag-of-words methods are equivalent to segment-order-sensitive methods in terms of accuracy but vastly superior in terms of retrieval speed, suggesting that word segmentation and segment-order sensitivity are unnecessary luxuries for translation retrieval.
Keywords Translation memory · Translation retrieval · Character- and word-based indexing · Segmentation
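The winning configuration, bag-of-words comparison over character n-grams, can be sketched as indexing translation-memory entries by character bigrams and ranking by Dice coefficient; the memory contents below are toy examples (romanized Japanese for readability), not the paper's actual data.

# Bag-of-words retrieval over character n-grams, ranked by Dice
# coefficient. Translation-memory entries are toy examples.

from collections import Counter

def char_ngrams(text, n=2):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def dice(a, b):
    """Dice coefficient between two n-gram multisets."""
    overlap = sum((a & b).values())
    return 2 * overlap / max(sum(a.values()) + sum(b.values()), 1)

tm = ["watashi wa gakusei desu", "kore wa hon desu", "gakusei ga hon o yomu"]
query = "watashi wa sensei desu"

q = char_ngrams(query)
ranked = sorted(tm, key=lambda s: dice(char_ngrams(s), q), reverse=True)
print(ranked[0])  # best translation-memory match for the query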
Machine Translation, Volume 23, No. 4, November 2009, pp. 241-263
A process study of computer-aided translation
Philipp Koehn
Received: 10 June 2009 / Accepted: 20 April 2010 / Published online: 8 July 2010
© Springer Science+Business Media B.V. 2010
Abstract We investigate novel types of assistance for human translators, based on statistical machine translation methods. We developed the computer-aided tool Caitra that makes suggestions for sentence completion, shows word and phrase translation options, and allows postediting of machine translation output. We carried out a study of the translation process that involved non-professional translators who were native speakers of either French or English, and recorded their interaction with the tool. Users translated 192 sentences from French news stories into English. Most translators were faster and better when assisted by our tool. A detailed examination of the logs also provides insight into the human translation process, such as time spent on different activities and length of pauses.
Keywords Computer-aided translation · Interactive translation · Translation process study · Statistical machine translation
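As a toy illustration of the sentence-completion style of assistance described, the sketch below suggests a continuation from translation options for the not-yet-covered source words; the miniature phrase table and the selection rule are assumptions for illustration, not Caitra's actual model.

# Toy sentence-completion suggestion from a miniature phrase table:
# offer the best-scoring target phrase the user has not yet typed.

phrase_table = {  # source phrase -> list of (target phrase, probability)
    "le chat": [("the cat", 0.8), ("the tomcat", 0.2)],
    "dort": [("sleeps", 0.7), ("is sleeping", 0.3)],
}

def suggest_completion(source_phrases, user_prefix):
    """Return a likely continuation for the first uncovered source phrase."""
    covered = user_prefix.lower()
    for src in source_phrases:
        best_tgt, _ = max(phrase_table[src], key=lambda x: x[1])
        if best_tgt not in covered:
            return best_tgt
    return ""

print(suggest_completion(["le chat", "dort"], "the cat "))  # -> "sleeps"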