Machine Translation, vol.21, no.1, March 2007, pp.1-28
A method of creating new valency entries
Sanae Fujita · Francis Bond
Received: 25 April 2006 / Accepted: 20 February 2008 / Published
online: 28 June 2008
© Springer Science+Business Media B.V. 2008
Abstract Information on subcategorization and selectional
restrictions in a valency dictionary is important for natural
language processing tasks such as monolingual parsing,
accurate rule-based machine translation and automatic summarization. In this paper
we present an efficient method of assigning valency
information and selectional restrictions to
entries in a bilingual dictionary, based on information in an existing valency dictionary. The method is based on two
assumptions: words with similar meaning
have similar subcategorization frames and selectional restrictions; and words with the same translations have similar meanings.
Based on these assumptions, new valency entries are constructed for words in a plain bilingual dictionary,
using entries with similar source-language meaning and the same target-language
translations. We evaluate the
effects of various measures of semantic similarity.
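The two assumptions above amount to a simple borrowing procedure, sketched below as a toy example. The dictionaries, the similarity scores, and the threshold are all invented for illustration; they stand in for the paper's actual resources and semantic-similarity measures.

```python
# Toy sketch: a word in a plain bilingual dictionary inherits the valency
# frame of a known word that (a) shares its target-language translation
# and (b) has similar source-language meaning. All data is illustrative.

# Existing valency dictionary: source word -> (subcat frame, translation)
valency_dict = {
    "taberu": ("NP-ga NP-o V", "eat"),
    "nomu":   ("NP-ga NP-o V", "drink"),
}

# Plain bilingual dictionary lacking valency information
bilingual_dict = {"kuu": "eat"}

def similarity(w1, w2):
    # Stand-in for a real semantic-similarity measure (toy scores)
    return {("kuu", "taberu"): 0.9, ("kuu", "nomu"): 0.3}.get((w1, w2), 0.0)

def create_valency_entry(word, threshold=0.5):
    translation = bilingual_dict[word]
    best, best_sim = None, threshold
    for known, (frame, known_trans) in valency_dict.items():
        sim = similarity(word, known)
        # require the same target translation and sufficient similarity
        if known_trans == translation and sim >= best_sim:
            best, best_sim = (frame, translation), sim
    return best

print(create_valency_entry("kuu"))  # ('NP-ga NP-o V', 'eat')
```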
Keywords Valency dictionary · Bilingual dictionary · Similarity · Merge
Machine Translation, vol.21, no.1, March 2007, pp.29-53
Methods for extracting and classifying pairs of cognates
and
false friends
Ruslan Mitkov · Viktor Pekar · Dimitar Blagoev · Andrea Mulloni
Received: 18 January 2007 / Accepted: 27 February 2008 / Published
online: 17 May 2008
© Springer Science+Business Media B.V. 2008
Abstract The identification of cognates has
attracted the attention of researchers working
in the area of Natural Language Processing, but the identification of false friends
is still an under-researched area. This paper proposes novel methods for the automatic identification of both cognates and
false friends from comparable bilingual corpora. The methods are not dependent on the existence of parallel
texts, and make use of only
monolingual corpora and a bilingual dictionary necessary for the mapping of co-occurrence data across languages. In
addition, the methods do not require that the newly discovered cognates or false friends are present in the dictionary
and hence are capable of operating on out-of-vocabulary expressions. These
methods are evaluated on English, French, German and Spanish corpora in order
to identify English-French, English-German,
English-Spanish and French-Spanish pairs of cognates or false friends. The experiments were performed in two
settings: (i) assuming ideal extraction of
cognates and false friends from plain-text corpora, i.e. when the evaluation data contains only cognates and false friends, and
(ii) a real-world extraction scenario where
cognates and false friends have to first be identified among words found in two comparable corpora in different languages.
The evaluation results show that the
developed methods identify
cognates and false friends with very satisfactory results for both recall and
precision, with methods that incorporate background semantic knowledge, in
addition to co-occurrence data obtained from the corpora, delivering the best
results.
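The combination of signals described here can be sketched in miniature: orthographic similarity proposes candidate pairs, and distributional similarity, computed after mapping one language's co-occurrence vector through a bilingual dictionary, separates cognates from false friends. The thresholds, context vectors, and dictionary entries below are invented for illustration, not the paper's actual features or data.

```python
# Toy cognate/false-friend classifier: orthographic similarity first,
# then cross-language distributional similarity via a bilingual
# dictionary. All numbers and entries are illustrative.
import math

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def orth_sim(a, b):
    return 1 - edit_distance(a, b) / max(len(a), len(b))

def cosine(v1, v2):
    dot = sum(c * v2.get(w, 0) for w, c in v1.items())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def classify(pair, vec_l1, vec_l2, dictionary, t_orth=0.7, t_sem=0.4):
    if orth_sim(*pair) < t_orth:
        return "unrelated"
    # map the L2 context vector into L1 vocabulary via the dictionary
    mapped = {dictionary.get(w, w): c for w, c in vec_l2.items()}
    return "cognate" if cosine(vec_l1, mapped) >= t_sem else "false friend"

# English "actual" vs Spanish "actual" ('current'): identical spelling
# but disjoint contexts, hence a false friend
print(classify(("actual", "actual"),
               {"real": 2, "fact": 1},
               {"momento": 2, "hoy": 1},
               {"momento": "moment", "hoy": "today"}))  # false friend
```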
Keywords Cognates · Faux amis · Orthographic similarity · Distributional similarity · Semantic similarity · Translational equivalence
Machine Translation, vol.21, no.1, March 2007, pp.55-68
Power shifts in web-based translation memory
Ignacio Garcia
Received: 1 August 2007 / Accepted: 20 February 2008 / Published
online: 18 April 2008
© Springer Science+Business Media B.V. 2008
Abstract Web-based translation
memory (TM) is a recent and little-studied development that is changing the way localisation projects are
conducted. This article looks at the technology that allows for the sharing of
TM databases over the internet to find out
how it shapes the translator's working environment. It shows that so-called pre-translation, until now the standard
way for clients to manage translation tasks with freelancers, is giving way to
web-interactive translation. Thus, rather than interacting with their own desktop databases as before, translators now
interface with each other through server-based translation memories, so
that a newly entered term or segment can be
retrieved moments later by another translator working at a remote site. The study finds that, while the interests
of most stakeholders in the localisation process are well served by this web-based arrangement, it can involve
drawbacks for freelancers. Once an
added value, technical expertise becomes less of a determining factor in employability, while translators lose
autonomy through an inability to retain the linguistic assets they generate. Web-based TM is, therefore, seen to
risk disempowering and de-skilling
freelancers, relegating them from valued localisation partners to mere servants
of the new technology.
Keywords Translation memory · Localization · Internationalization · Machine-aided translation · Web-based translation
Machine Translation, vol.21, no.2, June 2007, pp.77-94
Semi-supervised model adaptation for statistical
machine translation
Nicola Ueffing · Gholamreza Haffari · Anoop Sarkar
Received: 31 July 2007 / Accepted: 23 April 2008 / Published online: 10
June 2008
© Springer Science+Business Media B.V. 2008
Abstract Statistical machine translation systems are
usually trained on large amounts of bilingual text (used to learn a translation
model), and also large amounts of monolingual text in the target language (used to
train a language model). In this article we explore the use of semi-supervised
model adaptation methods for the effective use of monolingual data from the
source language in order to improve translation quality. We propose several
algorithms with this aim, and present the strengths and weaknesses of each one. We
present detailed experimental evaluations on the French-English EuroParl data set and on data from the NIST Chinese-English
large-data
track. We show a significant improvement in translation quality on both tasks.
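The self-training idea underlying this family of methods can be sketched as a loop: translate the source-language monolingual data with the current model, keep only confident outputs, add them to the bitext as pseudo-parallel data, and re-train. The word-for-word "model" and coverage-based confidence score below are placeholders for a real SMT system and its confidence estimation, not the paper's actual algorithms.

```python
# Highly simplified self-training loop for SMT adaptation. The toy
# "model" is a word-for-word dictionary; confidence is the fraction of
# known source words. Both are stand-ins for real components.

def train(bitext):
    # toy model: word dictionary from position-aligned sentence pairs
    model = {}
    for src, tgt in bitext:
        for s, t in zip(src.split(), tgt.split()):
            model[s] = t
    return model

def translate(model, sentence):
    return " ".join(model.get(w, w) for w in sentence.split())

def confidence(model, sentence):
    words = sentence.split()
    return sum(w in model for w in words) / len(words)

def self_train(bitext, mono_source, rounds=2, threshold=1.0):
    bitext = list(bitext)
    for _ in range(rounds):
        model = train(bitext)
        for sent in mono_source:
            if confidence(model, sent) >= threshold:
                pair = (sent, translate(model, sent))
                if pair not in bitext:
                    bitext.append(pair)  # pseudo-parallel training data
    return train(bitext)

model = self_train([("le chat", "the cat"), ("un chien", "a dog")],
                   ["le chien", "un chat"])
print(translate(model, "le chien"))  # the dog
```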
Keywords Statistical machine translation · Self-training · Semi-supervised learning · Domain adaptation · Model adaptation
Machine Translation, vol.21, no.2, June 2007, pp.95-119
Evaluating machine translation with LFG dependencies
Karolina Owczarzak · Josef van Genabith · Andy Way
Received: 31 October 2007 / Accepted: 22 May 2008 / Published online: 6
August 2008
© Springer Science+Business Media B.V. 2008
Abstract In this paper we
show how labelled dependencies produced by a Lexical-Functional Grammar parser can
be used in Machine Translation evaluation. In contrast to most popular evaluation
metrics based on surface string comparison, our dependency-based method does not
unfairly penalize perfectly valid syntactic variations in the translation, shows less
bias towards statistical models, and the addition of WordNet provides a way to accommodate lexical differences.
In comparison with other metrics on a Chinese-English newswire text, our method obtains
high correlation with human scores at both segment and system level.
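The core scoring step can be sketched as precision and recall over labelled dependency triples (relation, head, dependent) shared between the candidate and reference parses. The triples below are written by hand for illustration; in the paper they come from an LFG parser.

```python
# Sketch: score a candidate translation by F-score over labelled
# dependency triples rather than surface n-grams, so legitimate
# word-order variation is not penalized.

def triple_fscore(candidate, reference):
    cand, ref = set(candidate), set(reference)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

# A reordered but syntactically equivalent candidate yields the same
# triples as the reference, so it receives a perfect score.
reference = {("subj", "resign", "chairman"),
             ("adj", "chairman", "former"),
             ("obj", "resign", "post")}
candidate = {("obj", "resign", "post"),
             ("subj", "resign", "chairman"),
             ("adj", "chairman", "former")}
print(triple_fscore(candidate, reference))  # 1.0
```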
Keywords Machine translation · Evaluation metrics · Lexical-Functional Grammar · Labelled dependencies
Machine Translation, vol.21, no.2, June 2007, pp.121-133
Capturing practical natural language transformations
Kevin Knight
Received: 19 March 2008 / Accepted: 10 July 2008 / Published online: 6
August 2008
© Springer Science+Business Media B.V. 2008
Abstract We study automata for
capturing the transformations in practical natural language processing (NLP)
systems, especially those that translate between human languages. For
several variations of finite-state string and tree transducers, we survey answers
to formal questions about their expressiveness, modularity, teachability,
and generalization.
We conclude that no formal device yet captures everything that is desirable,
and we point to future research.
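As a minimal illustration of the simplest device in this hierarchy, a finite-state string transducer can use its state to perform the bounded local reordering typical of translation. The transition table below is an invented toy, not an example from the survey.

```python
# Toy finite-state string transducer: each transition consumes one input
# token, emits zero or more output tokens, and moves to a new state.
# The state lets it delay an English adjective until after the French
# noun -- the kind of bounded reordering such devices can capture.

# (state, input token) -> (output tokens, next state)
transitions = {
    ("q0", "the"):        (["la"], "q0"),
    ("q0", "white"):      ([], "q_white"),               # hold the adjective
    ("q_white", "house"): (["maison", "blanche"], "q0"), # noun, then adjective
    ("q0", "house"):      (["maison"], "q0"),
}

def transduce(transitions, tokens, state="q0"):
    output = []
    for tok in tokens:
        emitted, state = transitions[(state, tok)]
        output.extend(emitted)
    return " ".join(output)

print(transduce(transitions, ["the", "white", "house"]))  # la maison blanche
```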
Keywords Translation · Automata
Machine Translation, vol.21, no.3, September 2007, pp.139-163
Automatic extraction of translations from web-based bilingual materials
Qibo Zhu · Diana Inkpen · Ash Asudeh
Received: 14 September 2007 / Accepted: 7 August 2008 / Published
online: 20 September 2008
© Springer Science+Business Media B.V. 2008
Abstract This paper describes the framework of the StatCan Daily Translation Extraction System (SDTES), a computer
system that maps and compares web-based translation texts of Statistics Canada (StatCan) news releases in the StatCan
publication
The Daily. The goal is to extract translations for translation memory
systems,
for translation terminology building, for cross-language information retrieval and for corpus-based
machine translation systems. Three years of officially published statistical news
release texts at http://www.statcan.ca were collected to compose the StatCan Daily data
bank. The English and French texts in this collection were roughly aligned using the
Gale-Church statistical algorithm. After this, boundary markers of text segments and
paragraphs were adjusted and the Gale-Church algorithm was run a second time for a
more fine-grained text segment alignment. To detect misaligned areas of texts
and to prevent mismatched translation pairs from being selected, key textual and
structural properties of the mapped texts were automatically identified and used as
anchoring features for comparison and misalignment detection. The proposed method has been
tested with web-based bilingual materials from five other Canadian
government websites. Results show that the SDTES model is very efficient in
extracting translations from published government texts, and very accurate in
identifying mismatched translations. With parameters tuned, the text-mapping
part can be used to
align corpus data collected from official government websites; and the text-comparing component can be applied in
prepublication translation quality control and in
evaluating the results of statistical machine translation systems.
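The rough-alignment step rests on length-based dynamic programming in the style of Gale and Church. The sketch below keeps only the DP skeleton: a crude absolute-length-difference cost and 1:1/1:0/0:1 moves stand in for the original probabilistic cost over character-length ratios and its 2:1/1:2 merges.

```python
# Simplified Gale-Church-style alignment over segment lengths.
# Real Gale-Church uses a Gaussian cost on length ratios and also
# allows 2:1/1:2 merges; this is only the DP skeleton.

def align(src_lens, tgt_lens, skip_penalty=10):
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1:1 match
                c = cost[i][j] + abs(src_lens[i] - tgt_lens[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n:            # 1:0 (unmatched source segment)
                c = cost[i][j] + skip_penalty
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j)
            if j < m:            # 0:1 (unmatched target segment)
                c = cost[i][j] + skip_penalty
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j)
    # trace back the alignment path, keeping only 1:1 pairs
    pairs, ij = [], (n, m)
    while ij != (0, 0):
        pi, pj = back[ij[0]][ij[1]]
        if ij == (pi + 1, pj + 1):
            pairs.append((pi, pj))
        ij = (pi, pj)
    return pairs[::-1]

# The long middle source segment has no counterpart and is skipped
print(align([20, 50, 15], [21, 14]))  # [(0, 0), (2, 1)]
```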
Keywords Automatic translation extraction · Bitext mapping · Machine translation · Parallel alignment · Translation memory system
Machine Translation, vol.21, no.3, September 2007, pp.165-181
Pivot language approach for phrase-based statistical machine translation
Hua Wu · Haifeng Wang
Received: 1 April 2008 / Accepted: 11 August 2008 / Published online:
23 September 2008
© Springer Science+Business Media B.V. 2008
Abstract This paper proposes a novel method for phrase-based statistical machine
translation based on the use of a pivot language. To translate between languages
Ls and Lt with limited bilingual resources, we bring in a third language, Lp,
called the pivot language. For the language pairs Ls-Lp and Lp-Lt, there exist
large bilingual corpora. Using only Ls-Lp and Lp-Lt bilingual corpora, we can
build a translation model for Ls-Lt. The advantage of this method lies in the
fact that we can perform translation between Ls and Lt even if there is no
bilingual corpus available for this language pair. Using BLEU as a metric, our
pivot language approach significantly outperforms the standard model trained on
a small bilingual corpus. Moreover, with a small Ls-Lt bilingual corpus
available, our method can further improve translation quality by using the
additional Ls-Lp and Lp-Lt bilingual corpora.
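The pivot construction can be sketched by marginalizing over pivot-language phrases, p(t|s) = sum_p p(t|p) p(p|s). The toy tables below (a Spanish source, English pivot, French target) are invented examples, not the paper's data.

```python
# Sketch: induce an Ls-Lt phrase table from Ls-Lp and Lp-Lt tables by
# summing over pivot phrases. All probabilities are toy values.

# Ls -> Lp phrase translation probabilities
s2p = {"gato": {"cat": 0.9, "kitty": 0.1}}
# Lp -> Lt phrase translation probabilities
p2t = {"cat": {"chat": 0.8, "matou": 0.2}, "kitty": {"chaton": 1.0}}

def pivot_table(s2p, p2t):
    s2t = {}
    for s, pivot_dist in s2p.items():
        dist = {}
        for p, p_ps in pivot_dist.items():
            for t, p_tp in p2t.get(p, {}).items():
                # p(t|s) += p(t|p) * p(p|s)
                dist[t] = dist.get(t, 0.0) + p_tp * p_ps
        s2t[s] = dist
    return s2t

table = pivot_table(s2p, p2t)
print(table["gato"])  # chat ~0.72, matou ~0.18, chaton ~0.10
```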
Keywords Pivot language · Phrase-based statistical machine translation · Scarce bilingual resources
Machine Translation, vol.21, no.4, December 2007, pp.187-207
Bilingual LSA-based adaptation for statistical machine translation
Yik-Cheung Tam
Received: 27 March 2008 / Accepted: 31 October 2008 / Published online:
19 November 2008
© Springer Science+Business Media B.V. 2008
Abstract We propose a novel
approach to cross-lingual language model and translation lexicon adaptation for
statistical machine translation (SMT) based on bilingual latent semantic analysis. Bilingual
LSA enables latent topic distributions to be efficiently transferred across
languages by enforcing a one-to-one topic correspondence during training. Using the proposed bilingual LSA framework,
model adaptation can be performed
by, first, inferring the topic posterior distribution of the source text and
then applying the inferred distribution to an n-gram language model of the
target language and translation
lexicon via marginal adaptation. The background phrase table is enhanced with the additional phrase scores
computed using the adapted translation lexicon. The proposed framework
also features rapid bootstrapping of LSA models for new languages based on a source LSA model of another language. Our
approach is evaluated on the
Chinese-English MT06 test set using the medium-scale SMT system and the GALE SMT system measured in BLEU and NIST
scores. Improvement in both scores is
observed on both systems when the adapted language model and the adapted translation
lexicon are applied individually. When the adapted language model and the adapted translation lexicon are applied
simultaneously, the gain is additive. At the 95% confidence interval of the unadapted
baseline system, the gain in both scores is statistically significant
using the medium-scale SMT system, while the gain in the NIST score is
statistically significant using the GALE SMT system.
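The marginal-adaptation step can be sketched at the unigram level: each background word probability is scaled by (p_topic(w)/p_background(w))**beta and the distribution is renormalized. In the paper the topic marginal comes from the bilingual-LSA topic posterior inferred on the source text; the distributions and beta below are toy values.

```python
# Sketch of unigram marginal adaptation: boost words favoured by the
# inferred topic, damp the rest, renormalize. All numbers are toys.

def marginal_adapt(background, topic_marginal, beta=0.5):
    scaled = {w: p * (topic_marginal.get(w, p) / p) ** beta
              for w, p in background.items()}
    z = sum(scaled.values())
    return {w: s / z for w, s in scaled.items()}

background = {"bank": 0.2, "river": 0.3, "money": 0.5}
topic      = {"bank": 0.5, "money": 0.4, "river": 0.1}  # finance-heavy topic
adapted = marginal_adapt(background, topic)
# words favoured by the topic gain probability mass, others lose it
```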
Keywords Bilingual latent semantic analysis · Latent Dirichlet-tree allocation · Cross-lingual language model adaptation · Lexicon adaptation · Topic distribution transfer · Statistical machine translation
Machine Translation, vol.21, no.4, December 2007, pp.209-252
Simultaneous translation of lectures and speeches
Christian Fügen · Alex Waibel · Muntsin Kolss
Received: 27 August 2008 / Accepted: 4 November 2008 / Published
online: 22 November 2008
© Springer Science+Business Media B.V. 2008
Abstract With increasing globalization, communication
across language and cultural boundaries is becoming an essential requirement of
doing business, delivering education, and providing public services. Due to the
considerable cost of human translation services, only a small fraction of text
documents and an even smaller percentage of spoken encounters, such as
international meetings and conferences, are translated, with most resorting to
the use of a common language (e.g. English) or not taking place at all.
Technology may provide a potentially revolutionary way out if real-time, domain-independent,
simultaneous speech translation can be realized. In this paper, we present a
simultaneous speech translation system based on statistical recognition and translation
technology. We discuss the technology, various system improvements and propose
mechanisms for user-friendly delivery of the result. Over extensive component and
end-to-end system evaluations and comparisons with human translation performance,
we conclude that machines can already deliver comprehensible simultaneous translation
output. Moreover, while machine performance is affected by recognition errors
(and thus can be improved), human performance is limited by the cognitive challenge
of performing the task in real time.
Keywords Simultaneous translation · Interpretation · Speech-to-speech translation · Spoken language translation · Machine translation · Speech recognition · Lecture recognition · Lectures · Speeches