Machine Translation, Volume 23, No. 1, February 2009, pp. 1-22
Automatically generated parallel treebanks and their exploitability in machine translation
John Tinsley
Received: 25 November 2008 / Accepted: 15 December 2009 / Published online: 23 January 2010
© Springer Science+Business Media B.V. 2010
Abstract Given much recent discussion and the shift in focus of the field, it is becoming apparent that the incorporation of syntax is the way forward for improvements to the current state-of-the-art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. However, until recently there has been no other means to build them than by hand. In this paper, we describe how we make use of new tools to automatically build a large parallel treebank and extract a set of linguistically-motivated phrase pairs from it. We show that adding these phrase pairs to the translation model of a baseline phrase-based statistical MT (PB-SMT) system leads to significant improvements in translation quality. Following this, we describe experiments in which we exploit the information encoded in the parallel treebank in other areas of the PB-SMT framework, while investigating the conditions under which the incorporation of parallel treebank data performs optimally. Finally, we discuss the possibility of exploiting automatically-generated parallel treebanks further in syntax-aware paradigms of MT.
Keywords Parallel treebanks · Statistical machine translation · Phrase-based statistical machine translation · Syntax in machine translation · Sub-tree alignment · Translation modelling · Resource combination · Word alignment · Phrase alignment · Hybrid models
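To make the extraction step concrete, here is a minimal sketch (not the authors' actual pipeline) of how linguistically-motivated phrase pairs can be read off a parallel treebank: every pair of aligned subtrees contributes the pair of surface strings the two subtrees cover. The Tree class and the toy English-German example are illustrative assumptions.

# A minimal sketch of extracting phrase pairs from a parallel treebank:
# every pair of aligned subtrees yields the pair of surface strings the
# two subtrees cover. The Tree class and example alignment are
# illustrative assumptions, not the authors' actual tools.

from dataclasses import dataclass, field

@dataclass
class Tree:
    label: str
    children: list = field(default_factory=list)  # Tree or str leaves

    def yield_tokens(self):
        """Collect the surface string covered by this subtree."""
        tokens = []
        for child in self.children:
            if isinstance(child, Tree):
                tokens.extend(child.yield_tokens())
            else:
                tokens.append(child)
        return tokens

def extract_phrase_pairs(alignments):
    """Each aligned (source_subtree, target_subtree) pair becomes a phrase pair."""
    return [(" ".join(s.yield_tokens()), " ".join(t.yield_tokens()))
            for s, t in alignments]

# Toy example: English NP "the house" aligned with German NP "das Haus".
src = Tree("NP", [Tree("DT", ["the"]), Tree("NN", ["house"])])
tgt = Tree("NP", [Tree("ART", ["das"]), Tree("NN", ["Haus"])])
print(extract_phrase_pairs([(src, tgt)]))  # [('the house', 'das Haus')]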
Machine Translation, Volume 23, No. 1, February 2009, pp. 23-63
Symbolic-to-statistical hybridization: extending generation-heavy machine translation
Nizar Habash · Bonnie Dorr · Christof Monz
Received: 28 January 2009 / Accepted: 5 October 2009 / Published online: 11 November 2009
© The Author(s) 2009. This article is published with open access at Springerlink.com
Abstract The last few years have witnessed an increasing interest in hybridizing surface-based statistical approaches and rule-based symbolic approaches to machine translation (MT). Much of that work is focused on extending statistical MT systems with symbolic knowledge and components. In the brand of hybridization discussed here, we go in the opposite direction: adding statistical bilingual components to a symbolic system. Our base system is Generation-heavy machine translation (GHMT), a primarily symbolic asymmetrical approach that addresses the issue of Interlingual MT resource poverty in source-poor/target-rich language pairs by exploiting symbolic and statistical target-language resources. GHMT's statistical components are limited to target-language models, which arguably makes it a simple form of a hybrid system. We extend the hybrid nature of GHMT by adding statistical bilingual components. We also describe the details of retargeting it to Arabic. The morphological richness of Arabic brings several challenges to the hybridization task. We conduct an extensive evaluation of multiple system variants. Our evaluation shows that this new variant of GHMT, a primarily symbolic system extended with monolingual and bilingual statistical components, has a higher degree of grammaticality than a phrase-based statistical MT system, where grammaticality is measured in terms of correct verb-argument realization and long-distance dependency translation.
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 71-103
The NIST 2008 Metrics for machine translation challenge: overview, methodology, metrics, and results
Mark Przybocki · Kay Peterson · Sebastian Bronsart · Gregory Sanders
Received: 18 May 2009 / Accepted: 26 November 2009 / Published online: 14 January 2010
© US Government 2010
Abstract This paper discusses the evaluation of automated metrics developed for the purpose of evaluating machine translation (MT) technology. A general discussion of the usefulness of automated metrics is offered. The NIST MetricsMATR evaluation of MT metrology is described, including its objectives, protocols, participants, and test data. The methodology employed to evaluate the submitted metrics is reviewed. A summary is provided for the general classes of evaluated metrics. Overall results of this evaluation are presented, primarily by means of correlation statistics, showing the degree of agreement between the automated metric scores and the scores of human judgments. Metrics are analyzed at the sentence, document, and system level, with results conditioned by various properties of the test data. This paper concludes with some perspective on the improvements that should be incorporated into future evaluations of metrics for MT evaluation.
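As a rough illustration of the correlation statistics the paper reports, the following sketch computes segment-level agreement between metric scores and human judgments; the score lists are invented, and only the scipy correlation functions are real.

# Hedged sketch of agreement statistics between automatic metric scores
# and human judgments at the segment level. The scores are hypothetical
# toy data; scipy provides the correlation functions.

from scipy.stats import pearsonr, spearmanr, kendalltau

metric_scores = [0.42, 0.61, 0.35, 0.78, 0.50]  # hypothetical metric output
human_scores  = [3.0,  4.5,  2.5,  5.0,  3.5]   # hypothetical adequacy ratings

print("Pearson  r   =", pearsonr(metric_scores, human_scores)[0])
print("Spearman rho =", spearmanr(metric_scores, human_scores)[0])
print("Kendall  tau =", kendalltau(metric_scores, human_scores)[0])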
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 105-115
The Meteor metric for automatic evaluation of machine translation
Alon Lavie · Michael J. Denkowski
Received: 16 May 2009 / Accepted: 13 October 2009 / Published online: 1 November 2009
© Springer Science+Business Media B.V. 2009
Abstract The Meteor Automatic Metric for Machine Translation evaluation, originally developed and released in 2004, was designed with the explicit goal of producing sentence-level scores which correlate well with human judgments of translation quality. Several key design decisions were incorporated into Meteor in support of this goal. In contrast with IBM's Bleu, which uses only precision-based features, Meteor uses and emphasizes recall in addition to precision, a property that has been confirmed by several metrics evaluations as being critical for high correlation with human judgments. Meteor also addresses the problem of reference translation variability by utilizing flexible word matching, allowing for morphological variants and synonyms to be taken into account as legitimate correspondences. Furthermore, the feature ingredients within Meteor are parameterized, allowing for the tuning of the metric's free parameters in search of values that result in optimal correlation with human judgments. Optimal parameters can be separately tuned for different types of human judgments and for different languages. We discuss the initial design of the Meteor metric, subsequent improvements, and performance in several independent evaluations in recent years.
Keywords Machine translation · MT · Evaluation · Automatic metrics
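A minimal sketch of the parameterized scoring scheme the abstract describes, assuming the commonly published Meteor formulation: a recall-weighted harmonic mean of unigram precision and recall, discounted by a fragmentation penalty. The parameter values shown are illustrative choices, not official defaults.

# Sketch of a Meteor-style score from unigram match statistics.
# alpha/beta/gamma are the kind of free parameters the metric tunes.

def meteor_style_score(matches, hyp_len, ref_len, chunks,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """Score a hypothesis given unigram match statistics.

    matches  -- number of matched unigrams (surface/stem/synonym)
    hyp_len  -- hypothesis length in words
    ref_len  -- reference length in words
    chunks   -- number of contiguous matched chunks (order information)
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Recall-weighted harmonic mean: alpha near 1 emphasizes recall.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: fewer, longer chunks mean better word order.
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean

# 8 of 10 hypothesis words match 8 of 11 reference words, in 3 chunks.
print(round(meteor_style_score(8, 10, 11, 3), 3))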
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 117-127
TER-Plus: paraphrase, semantic, and alignment enhancements to Translation Edit Rate
Matthew G. Snover · Nitin Madnani · Bonnie Dorr · Richard Schwartz
Received: 15 May 2009 / Accepted: 16 November 2009 / Published online: 15 December 2009
© Springer Science+Business Media B.V. 2009
Abstract This paper describes a new evaluation metric, TER-Plus (TERp), for automatic evaluation of machine translation (MT). TERp is an extension of Translation Edit Rate (TER). It builds on the success of TER as an evaluation metric and alignment tool and addresses several of its weaknesses through the use of paraphrases, stemming, and synonyms, as well as edit costs that can be automatically optimized to correlate better with various types of human judgments. We present a correlation study comparing TERp to BLEU, METEOR and TER, and illustrate that TERp can better evaluate translation adequacy.
Keywords Machine translation evaluation · Paraphrasing · Alignment
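For reference, the TER score that TERp extends is standardly defined as the minimum number of edits (insertions, deletions, substitutions, and shifts of word sequences) needed to turn the hypothesis into one of the references, normalized by the average reference length:

    TER = (number of edits) / (average number of reference words)

TERp keeps this edit-based core but adds paraphrase, stemming, and synonym edits whose costs can be tuned against human judgments.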
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 129-140
Edit distances with block movements and error rate confidence estimates
Gregor Leusch · Hermann Ney
Received: 9 May 2009 / Accepted: 23 November 2009 / Published online: 13 December 2009
© Springer Science+Business Media B.V. 2009
Abstract We present two evaluation measures for Machine Translation (MT), which are defined as error rates extended by block moves. In contrast to TER, these measures are constrained in a way that allows for an exact calculation in polynomial time. We then investigate three methods to estimate the standard error of error rates, and compare them to bootstrap estimates. We assess the correlation of our proposed measures with human judgment using data from the National Institute of Standards and Technology (NIST) 2008 MetricsMATR workshop.
Keywords Machine translation · Evaluation · Bootstrap · Confidence intervals
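The bootstrap baseline they compare against can be sketched as follows: resample sentences with replacement and take the standard deviation of the resampled corpus-level error rates. This is a generic illustration with invented toy counts, not the authors' implementation.

# Bootstrap estimate of the standard error of a corpus-level error rate,
# resampling at the sentence level. Toy per-sentence counts below.

import random

def bootstrap_stderr(edits, lengths, replicates=1000, seed=0):
    """Standard error of sum(edits)/sum(lengths) via sentence resampling."""
    rng = random.Random(seed)
    n = len(edits)
    rates = []
    for _ in range(replicates):
        idx = [rng.randrange(n) for _ in range(n)]
        rates.append(sum(edits[i] for i in idx) / sum(lengths[i] for i in idx))
    mean = sum(rates) / replicates
    var = sum((r - mean) ** 2 for r in rates) / (replicates - 1)
    return var ** 0.5

edits   = [3, 1, 4, 0, 2, 5, 1, 2]      # toy per-sentence edit counts
lengths = [10, 8, 12, 7, 9, 15, 6, 11]  # toy reference lengths
print(round(bootstrap_stderr(edits, lengths), 4))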
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 141-155
ATEC: automatic evaluation of machine translation via word choice and word order
Billy Wong · Chunyu Kit
Received: 15 May 2009 / Accepted: 29 October 2009 / Published online: 12 December 2009
© Springer Science+Business Media B.V. 2009
Abstract We propose a novel metric ATEC for automatic MT evaluation based on explicit assessment of word choice and word order in an MT output in comparison to its reference translation(s), the two most fundamental factors in the construction of meaning for a sentence. The former is assessed by matching word forms at various linguistic levels, including surface form, stem, sound and sense, and further by weighing the informativeness of each word. The latter is quantified in terms of the discordance of word position and word sequence between a translation candidate and its reference. In the evaluations using the MetricsMATR08 data set and the LDC MTC2 and MTC4 corpora, ATEC demonstrates an impressive positive correlation to human judgments at the segment level, highly comparable to the few state-of-the-art evaluation metrics.
Keywords MT evaluation · Evaluation metrics · ATEC · Word choice · Word order
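As an illustration of the word-order component, one simple way to quantify positional discordance is to average the normalized position differences of words shared between candidate and reference. The formulation below is a toy stand-in, not the actual ATEC measure.

# Toy word-order discordance: compare normalized positions of words
# shared by the candidate and the reference.

def position_discordance(candidate, reference):
    """Mean normalized position difference over words on both sides."""
    shared = set(candidate) & set(reference)
    if not shared:
        return 1.0
    diffs = []
    for word in shared:
        c_pos = candidate.index(word) / max(len(candidate) - 1, 1)
        r_pos = reference.index(word) / max(len(reference) - 1, 1)
        diffs.append(abs(c_pos - r_pos))
    return sum(diffs) / len(diffs)

cand = "the cat sat on the mat".split()
ref  = "on the mat the cat sat".split()
print(round(position_discordance(cand, ref), 3))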
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 157-168
MaxSim: performance and effects of translation fluency
Yee Seng Chan · Hwee Tou Ng
Received: 10 May 2009 / Accepted: 10 October 2009 / Published online: 31 October 2009
© Springer Science+Business Media B.V. 2009
Abstract This paper evaluates the performance of our recently proposed automatic machine translation evaluation metric MaxSim and examines the impact of translation fluency on the metric. MaxSim calculates a similarity score between a pair of English system-reference sentences by comparing information items such as n-grams across the sentence pair. Unlike most metrics which perform binary matching, MaxSim also computes similarity scores between items and models them as nodes in a bipartite graph to select a maximum weight matching. Our experiments show that MaxSim is competitive with state-of-the-art metrics on benchmark datasets.
Keywords Machine translation evaluation · MaxSim · Bipartite matching · WMT · MetricsMATR
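The maximum weight matching step can be sketched with a real-valued similarity matrix and the assignment solver from scipy; the similarity function below is a crude stand-in for MaxSim's actual comparison of information items.

# Bipartite maximum-weight matching over real-valued similarity scores,
# rather than binary matches. toy_similarity is an illustrative stand-in.

import numpy as np
from scipy.optimize import linear_sum_assignment

def toy_similarity(a, b):
    """Crude character-overlap similarity between two words (illustrative)."""
    overlap = len(set(a) & set(b))
    return overlap / max(len(set(a) | set(b)), 1)

hyp = ["cats", "sit", "quietly"]
ref = ["the", "cat", "sits", "quiet"]

sim = np.array([[toy_similarity(h, r) for r in ref] for h in hyp])
rows, cols = linear_sum_assignment(-sim)  # negate to maximize similarity
print("matched pairs:", [(hyp[i], ref[j]) for i, j in zip(rows, cols)])
print("matching weight:", round(float(sim[rows, cols].sum()), 3))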
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 169-179
Expected dependency pair match: predicting translation quality with expected syntactic structure
Jeremy G. Kahn · Matthew Snover · Mari Ostendorf
Received: 15 May 2009 / Accepted: 10 October 2009 / Published online: 31 October 2009
© Springer Science+Business Media B.V. 2009
Abstract Recent efforts to develop new machine translation evaluation methods have tried to account for allowable wording differences either in terms of syntactic structure or synonyms/paraphrases. This paper primarily considers syntactic structure, combining scores from partial syntactic dependency matches with standard local n-gram matches using a statistical parser, and taking advantage of N-best parse probabilities. The new scoring metric, expected dependency pair match (EDPM), is shown to outperform BLEU and TER in terms of correlation to human judgments and as a predictor of HTER. Further, we combine the syntactic features of EDPM with the alternative wording features of TERp, showing a benefit to accounting for syntactic structure on top of semantic equivalency features.
Keywords Machine translation evaluation · Syntax · Dependency trees
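A hedged sketch of the "expected" ingredient: dependency pairs are weighted by the posterior probability of each parse in an N-best list, and hypothesis and reference are compared through a precision/recall-style overlap of those expected counts. The parses and probabilities are invented toy data, and the combination with n-gram scores is omitted.

# Expected dependency-pair counts over an N-best parse list, scored by
# fractional precision/recall overlap. All inputs are toy data.

from collections import Counter

def expected_dep_counts(nbest):
    """nbest: list of (posterior, [(head, dependent), ...]) parses."""
    expected = Counter()
    for prob, deps in nbest:
        for dep in deps:
            expected[dep] += prob
    return expected

def f1_overlap(hyp_counts, ref_counts):
    """Harmonic mean of fractional precision/recall over expected counts."""
    overlap = sum(min(hyp_counts[d], ref_counts[d]) for d in hyp_counts)
    p = overlap / max(sum(hyp_counts.values()), 1e-9)
    r = overlap / max(sum(ref_counts.values()), 1e-9)
    return 2 * p * r / max(p + r, 1e-9)

hyp_nbest = [(0.7, [("sat", "cat"), ("sat", "mat")]),
             (0.3, [("sat", "cat"), ("cat", "mat")])]
ref_nbest = [(1.0, [("sat", "cat"), ("sat", "mat")])]

print(round(f1_overlap(expected_dep_counts(hyp_nbest),
                       expected_dep_counts(ref_nbest)), 3))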
Machine Translation, Volume 23, Nos. 2-3, September 2009, pp. 181-193
Measuring machine translation quality as semantic equivalence: a metric based on entailment features
Sebastian Padó · Daniel Cer · Michel Galley · Dan Jurafsky · Christopher D. Manning
Received: 10 May 2009 / Accepted: 15 October 2009 / Published online: 8 November 2009
© Springer Science+Business Media B.V. 2009
Abstract Current evaluation metrics for machine translation have increasing difficulty in distinguishing good from merely fair translations. We believe the main problem to be their inability to properly capture meaning: a good translation candidate means the same thing as the reference translation, regardless of formulation. We propose a metric that assesses the quality of MT output through its semantic equivalence to the reference translation, based on a rich set of match and mismatch features motivated by textual entailment. We first evaluate this metric in an evaluation setting against a combination metric of four state-of-the-art scores. Our metric predicts human judgments better than the combination metric. Combining the entailment and traditional features yields further improvements. Then, we demonstrate that the entailment metric can also be used as a learning criterion in minimum error rate training (MERT) to improve parameter estimation in MT system training. A manual evaluation of the resulting translations indicates that the new model obtains a significant improvement in translation quality.
Machine Translation, Volume 23, No. 4, November 2009, pp. 195-240
The hare and the tortoise: speed and accuracy in translation retrieval
Timothy Baldwin
Received: 28 January 2009 / Accepted: 24 November 2009 / Published online: 26 February 2010
© Springer Science+Business Media B.V. 2010
Abstract This research looks at the effects of segment order and segmentation on translation retrieval performance for an experimental Japanese-English translation memory system. We implement a number of both bag-of-words and segment-order-sensitive string comparison methods, and test each over character-based and word-based indexing using n-grams of various orders. To evaluate accuracy, we propose an automatic method which identifies the target-language string(s) which would lead to the optimal translation for a given input, based on analysis of the held-out translation and the current contents of the translation memory. Our results indicate that character-based indexing is superior to word-based indexing, and also that bag-of-words methods are equivalent to segment-order-sensitive methods in terms of accuracy but vastly superior in terms of retrieval speed, suggesting that word segmentation and segment-order sensitivity are unnecessary luxuries for translation retrieval.
Keywords Translation memory · Translation retrieval · Character- and word-based indexing · Segmentation
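The winning configuration, bag-of-words comparison over character n-grams, can be sketched as indexing translation-memory entries by character bigrams and ranking by Dice coefficient; the memory contents below are toy examples (romanized Japanese for readability), not the paper's actual data.

# Bag-of-words retrieval over character n-grams, ranked by Dice
# coefficient. Translation-memory entries are toy examples.

from collections import Counter

def char_ngrams(text, n=2):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def dice(a, b):
    """Dice coefficient between two n-gram multisets."""
    overlap = sum((a & b).values())
    return 2 * overlap / max(sum(a.values()) + sum(b.values()), 1)

tm = ["watashi wa gakusei desu", "kore wa hon desu", "gakusei ga hon o yomu"]
query = "watashi wa sensei desu"

q = char_ngrams(query)
ranked = sorted(tm, key=lambda s: dice(char_ngrams(s), q), reverse=True)
print(ranked[0])  # best translation-memory match for the query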
Machine Translation, Volume 23, No. 4, November 2009, pp. 241-263
A process study of computer-aided translation
Philipp Koehn
Received: 10 June 2009 / Accepted: 20 April 2010 / Published online: 8 July 2010
© Springer Science+Business Media B.V. 2010
Abstract We investigate novel types of assistance for human translators, based on statistical machine translation methods. We developed the computer-aided tool Caitra that makes suggestions for sentence completion, shows word and phrase translation options, and allows postediting of machine translation output. We carried out a study of the translation process that involved non-professional translators who were native speakers of either French or English, and recorded their interaction with the tool. Users translated 192 sentences from French news stories into English. Most translators were faster and better when assisted by our tool. A detailed examination of the logs also provides insight into the human translation process, such as time spent on different activities and length of pauses.
Keywords Computer-aided translation · Interactive translation · Translation process study · Statistical machine translation
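As a toy illustration of the sentence-completion style of assistance described, the sketch below suggests a continuation from translation options for the not-yet-covered source words; the miniature phrase table and the selection rule are assumptions for illustration, not Caitra's actual model.

# Toy sentence-completion suggestion from a miniature phrase table:
# offer the best-scoring target phrase the user has not yet typed.

phrase_table = {  # source phrase -> list of (target phrase, probability)
    "le chat": [("the cat", 0.8), ("the tomcat", 0.2)],
    "dort": [("sleeps", 0.7), ("is sleeping", 0.3)],
}

def suggest_completion(source_phrases, user_prefix):
    """Return a likely continuation for the first uncovered source phrase."""
    covered = user_prefix.lower()
    for src in source_phrases:
        best_tgt, _ = max(phrase_table[src], key=lambda x: x[1])
        if best_tgt not in covered:
            return best_tgt
    return ""

print(suggest_completion(["le chat", "dort"], "the cat "))  # -> "sleeps"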