Machine Translation, vol.21, no.1, March 2007, pp.1-28
A method of creating new valency entries
Sanae Fujita · Francis Bond
Received: 25 April 2006 / Accepted: 20 February 2008 / Published
online: 28 June 2008
© Springer Science+Business Media B.V. 2008
Abstract Information on subcategorization and selectional
restrictions in a valency dictionary is important for natural
language processing tasks such as monolingual parsing,
accurate rule-based machine translation and automatic summarization. In this paper
we present an efficient method of assigning valency
information and selectional restrictions to
entries in a bilingual dictionary, based on information in an existing valency dictionary. The method is based on two
assumptions: words with similar meaning
have similar subcategorization frames and selectional restrictions; and words with the same translations have similar meanings.
Based on these assumptions, new valency entries are constructed for words in a plain bilingual dictionary,
using entries with similar source-language meaning and the same target-language
translations. We evaluate the
effects of various measures of semantic similarity.
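The two assumptions above amount to a simple borrowing procedure, sketched below as a toy example. The dictionaries, the similarity scores, and the threshold are all invented for illustration; they stand in for the paper's actual resources and semantic-similarity measures.

```python
# Toy sketch: a word in a plain bilingual dictionary inherits the valency
# frame of a known word that (a) shares its target-language translation
# and (b) has similar source-language meaning. All data is illustrative.

# Existing valency dictionary: source word -> (subcat frame, translation)
valency_dict = {
    "taberu": ("NP-ga NP-o V", "eat"),
    "nomu":   ("NP-ga NP-o V", "drink"),
}

# Plain bilingual dictionary lacking valency information
bilingual_dict = {"kuu": "eat"}

def similarity(w1, w2):
    # Stand-in for a real semantic-similarity measure (toy scores)
    return {("kuu", "taberu"): 0.9, ("kuu", "nomu"): 0.3}.get((w1, w2), 0.0)

def create_valency_entry(word, threshold=0.5):
    translation = bilingual_dict[word]
    best, best_sim = None, threshold
    for known, (frame, known_trans) in valency_dict.items():
        sim = similarity(word, known)
        # require the same target translation and sufficient similarity
        if known_trans == translation and sim >= best_sim:
            best, best_sim = (frame, translation), sim
    return best

print(create_valency_entry("kuu"))  # ('NP-ga NP-o V', 'eat')
```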
Keywords Valency dictionary · Bilingual dictionary · Similarity · Merge
Machine Translation, vol.21, no.1, March 2007, pp.29-53
Methods for extracting and classifying pairs of cognates
and
false friends
Ruslan Mitkov · Viktor Pekar · Dimitar Blagoev · Andrea Mulloni
Received: 18 January 2007 / Accepted: 27 February 2008 / Published
online: 17 May 2008
© Springer Science+Business Media B.V. 2008
Abstract The identification of cognates has
attracted the attention of researchers working
in the area of Natural Language Processing, but the identification of false friends
is still an under-researched area. This paper proposes novel methods for the automatic identification of both cognates and
false friends from comparable bilingual corpora. The methods are not dependent on the existence of parallel
texts, and make use of only
monolingual corpora and a bilingual dictionary necessary for the mapping of co-occurrence data across languages. In
addition, the methods do not require that the newly discovered cognates or false friends are present in the dictionary
and hence are capable of operating on out-of-vocabulary expressions. These
methods are evaluated on English, French, German and Spanish corpora in order
to identify English-French, English-German,
English-Spanish and French-Spanish pairs of cognates or false friends. The experiments were performed in two
settings: (i) assuming ideal extraction of
cognates and false friends from plain-text corpora, i.e. when the evaluation data contains only cognates and false friends, and
(ii) a real-world extraction scenario where
cognates and false friends have to first be identified among words found in two comparable corpora in different languages.
The evaluation results show that the
developed methods identify
cognates and false friends with very satisfactory results for both recall and
precision, with methods that incorporate background semantic knowledge, in
addition to co-occurrence data obtained from the corpora, delivering the best
results.
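The combination of signals described here can be sketched in miniature: orthographic similarity proposes candidate pairs, and distributional similarity, computed after mapping one language's co-occurrence vector through a bilingual dictionary, separates cognates from false friends. The thresholds, context vectors, and dictionary entries below are invented for illustration, not the paper's actual features or data.

```python
# Toy cognate/false-friend classifier: orthographic similarity first,
# then cross-language distributional similarity via a bilingual
# dictionary. All numbers and entries are illustrative.
import math

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def orth_sim(a, b):
    return 1 - edit_distance(a, b) / max(len(a), len(b))

def cosine(v1, v2):
    dot = sum(c * v2.get(w, 0) for w, c in v1.items())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def classify(pair, vec_l1, vec_l2, dictionary, t_orth=0.7, t_sem=0.4):
    if orth_sim(*pair) < t_orth:
        return "unrelated"
    # map the L2 context vector into L1 vocabulary via the dictionary
    mapped = {dictionary.get(w, w): c for w, c in vec_l2.items()}
    return "cognate" if cosine(vec_l1, mapped) >= t_sem else "false friend"

# English "actual" vs Spanish "actual" ('current'): identical spelling
# but disjoint contexts, hence a false friend
print(classify(("actual", "actual"),
               {"real": 2, "fact": 1},
               {"momento": 2, "hoy": 1},
               {"momento": "moment", "hoy": "today"}))  # false friend
```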
Keywords Cognates · Faux amis · Orthographic similarity · Distributional similarity · Semantic similarity · Translational equivalence
Machine Translation, vol.21, no.1, March 2007, pp.55-68
Power shifts in web-based translation memory
Ignacio Garcia
Received: 1 August 2007 / Accepted: 20 February 2008 / Published
online: 18 April 2008
© Springer Science+Business Media B.V. 2008
Abstract Web-based translation
memory (TM) is a recent and little-studied development that is changing the way localisation projects are
conducted. This article looks at the technology that allows for the sharing of
TM databases over the internet to find out
how it shapes the translator's working environment. It shows that so-called pre-translation, until now the standard
way for clients to manage translation tasks with freelancers, is giving way to
web-interactive translation. Thus, rather than interacting with their own desktop databases as before, translators now
interface with each other through server-based translation memories, so
that a newly entered term or segment can be
retrieved moments later by another translator working at a remote site. The study finds that, while the interests
of most stakeholders in the localisation process are well served by this web-based arrangement, it can involve
drawbacks for freelancers. Once an
added value, technical expertise becomes less of a determining factor in employability, while translators lose
autonomy through an inability to retain the linguistic assets they generate. Web-based TM is, therefore, seen to
risk disempowering and de-skilling
freelancers, relegating them from valued localisation partners to mere servants
of the new technology.
Keywords Translation memory · Localization · Internationalization · Machine-aided translation · Web-based translation
Machine Translation, vol.21, no.2, June 2007, pp.77-94
Semi-supervised model adaptation for statistical
machine translation
Nicola Ueffing · Gholamreza Haffari · Anoop Sarkar
Received: 31 July 2007 / Accepted: 23 April 2008 / Published online: 10
June 2008
© Springer Science+Business Media B.V. 2008
Abstract Statistical machine translation systems are
usually trained on large amounts of bilingual text (used to learn a translation
model), and also large amounts of monolingual text in the target language (used to
train a language model). In this article we explore the use of semi-supervised
model adaptation methods for the effective use of monolingual data from the
source language in order to improve translation quality. We propose several
algorithms with this aim, and present the strengths and weaknesses of each one. We
present detailed experimental evaluations on the French-English EuroParl data set and on data from the NIST Chinese-English
large-data
track. We show a significant improvement in translation quality on both tasks.
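The self-training idea underlying this family of methods can be sketched as a loop: translate the source-language monolingual data with the current model, keep only confident outputs, add them to the bitext as pseudo-parallel data, and re-train. The word-for-word "model" and coverage-based confidence score below are placeholders for a real SMT system and its confidence estimation, not the paper's actual algorithms.

```python
# Highly simplified self-training loop for SMT adaptation. The toy
# "model" is a word-for-word dictionary; confidence is the fraction of
# known source words. Both are stand-ins for real components.

def train(bitext):
    # toy model: word dictionary from position-aligned sentence pairs
    model = {}
    for src, tgt in bitext:
        for s, t in zip(src.split(), tgt.split()):
            model[s] = t
    return model

def translate(model, sentence):
    return " ".join(model.get(w, w) for w in sentence.split())

def confidence(model, sentence):
    words = sentence.split()
    return sum(w in model for w in words) / len(words)

def self_train(bitext, mono_source, rounds=2, threshold=1.0):
    bitext = list(bitext)
    for _ in range(rounds):
        model = train(bitext)
        for sent in mono_source:
            if confidence(model, sent) >= threshold:
                pair = (sent, translate(model, sent))
                if pair not in bitext:
                    bitext.append(pair)  # pseudo-parallel training data
    return train(bitext)

model = self_train([("le chat", "the cat"), ("un chien", "a dog")],
                   ["le chien", "un chat"])
print(translate(model, "le chien"))  # the dog
```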
Keywords Statistical machine translation · Self-training · Semi-supervised learning · Domain adaptation · Model adaptation
Machine Translation, vol.21, no.2, June 2007, pp.95-119
Evaluating machine translation with LFG dependencies
Karolina Owczarzak · Josef van Genabith · Andy Way
Received: 31 October 2007 / Accepted: 22 May 2008 / Published online: 6
August 2008
© Springer Science+Business Media B.V. 2008
Abstract In this paper we
show how labelled dependencies produced by a Lexical-Functional Grammar parser can
be used in Machine Translation evaluation. In contrast to most popular evaluation
metrics based on surface string comparison, our dependency-based method does not
unfairly penalize perfectly valid syntactic variations in the translation, shows less
bias towards statistical models, and the addition of WordNet provides a way to accommodate lexical differences.
In comparison with other metrics on a Chinese-English newswire text, our method obtains
high correlation with human scores at both segment and system level.
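The core scoring step can be sketched as precision and recall over labelled dependency triples (relation, head, dependent) shared between the candidate and reference parses. The triples below are written by hand for illustration; in the paper they come from an LFG parser.

```python
# Sketch: score a candidate translation by F-score over labelled
# dependency triples rather than surface n-grams, so legitimate
# word-order variation is not penalized.

def triple_fscore(candidate, reference):
    cand, ref = set(candidate), set(reference)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

# A reordered but syntactically equivalent candidate yields the same
# triples as the reference, so it receives a perfect score.
reference = {("subj", "resign", "chairman"),
             ("adj", "chairman", "former"),
             ("obj", "resign", "post")}
candidate = {("obj", "resign", "post"),
             ("subj", "resign", "chairman"),
             ("adj", "chairman", "former")}
print(triple_fscore(candidate, reference))  # 1.0
```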
Keywords Machine translation · Evaluation metrics · Lexical-Functional Grammar · Labelled dependencies
Machine Translation, vol.21, no.2, June 2007, pp.121-133
Capturing practical natural language transformations
Kevin Knight
Received: 19 March 2008 / Accepted: 10 July 2008 / Published online: 6
August 2008
© Springer Science+Business Media B.V. 2008
Abstract We study automata for
capturing the transformations in practical natural language processing (NLP)
systems, especially those that translate between human languages. For
several variations of finite-state string and tree transducers, we survey answers
to formal questions about their expressiveness, modularity, teachability,
and generalization.
We conclude that no formal device yet captures everything that is desirable,
and we point to future research.
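As a minimal illustration of the simplest device in this hierarchy, a finite-state string transducer can use its state to perform the bounded local reordering typical of translation. The transition table below is an invented toy, not an example from the survey.

```python
# Toy finite-state string transducer: each transition consumes one input
# token, emits zero or more output tokens, and moves to a new state.
# The state lets it delay an English adjective until after the French
# noun -- the kind of bounded reordering such devices can capture.

# (state, input token) -> (output tokens, next state)
transitions = {
    ("q0", "the"):        (["la"], "q0"),
    ("q0", "white"):      ([], "q_white"),               # hold the adjective
    ("q_white", "house"): (["maison", "blanche"], "q0"), # noun, then adjective
    ("q0", "house"):      (["maison"], "q0"),
}

def transduce(transitions, tokens, state="q0"):
    output = []
    for tok in tokens:
        emitted, state = transitions[(state, tok)]
        output.extend(emitted)
    return " ".join(output)

print(transduce(transitions, ["the", "white", "house"]))  # la maison blanche
```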
Keywords Translation · Automata
Machine Translation, vol.21, no.3, September 2007, pp.139-163
Automatic extraction of translations from web-based bilingual materials
Qibo Zhu · Diana Inkpen · Ash Asudeh
Received: 14 September 2007 / Accepted: 7 August 2008 / Published
online: 20 September 2008
© Springer Science+Business Media B.V. 2008
Abstract This paper describes the framework of the StatCan Daily Translation Extraction System (SDTES), a computer
system that maps and compares web-based translation texts of Statistics Canada (StatCan) news releases in the StatCan
publication
The Daily. The goal is to extract translations for translation memory
systems,
for translation terminology building, for cross-language information retrieval and for corpus-based
machine translation systems. Three years of officially published statistical news
release texts at http://www.statcan.ca were collected to compose the StatCan Daily data
bank. The English and French texts in this collection were roughly aligned using the
Gale-Church statistical algorithm. After this, boundary markers of text segments and
paragraphs were adjusted and the Gale-Church algorithm was run a second time for a
more fine-grained text segment alignment. To detect misaligned areas of texts
and to prevent mismatched translation pairs from being selected, key textual and
structural properties of the mapped texts were automatically identified and used as
anchoring features for comparison and misalignment detection. The proposed method has been
tested with web-based bilingual materials from five other Canadian
government websites. Results show that the SDTES model is very efficient in
extracting translations from published government texts, and very accurate in
identifying mismatched translations. With parameters tuned, the text-mapping
part can be used to
align corpus data collected from official government websites; and the text-comparing component can be applied in
prepublication translation quality control and in
evaluating the results of statistical machine translation systems.
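The rough-alignment step rests on length-based dynamic programming in the style of Gale and Church. The sketch below keeps only the DP skeleton: a crude absolute-length-difference cost and 1:1/1:0/0:1 moves stand in for the original probabilistic cost over character-length ratios and its 2:1/1:2 merges.

```python
# Simplified Gale-Church-style alignment over segment lengths.
# Real Gale-Church uses a Gaussian cost on length ratios and also
# allows 2:1/1:2 merges; this is only the DP skeleton.

def align(src_lens, tgt_lens, skip_penalty=10):
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1:1 match
                c = cost[i][j] + abs(src_lens[i] - tgt_lens[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n:            # 1:0 (unmatched source segment)
                c = cost[i][j] + skip_penalty
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j)
            if j < m:            # 0:1 (unmatched target segment)
                c = cost[i][j] + skip_penalty
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j)
    # trace back the alignment path, keeping only 1:1 pairs
    pairs, ij = [], (n, m)
    while ij != (0, 0):
        pi, pj = back[ij[0]][ij[1]]
        if ij == (pi + 1, pj + 1):
            pairs.append((pi, pj))
        ij = (pi, pj)
    return pairs[::-1]

# The long middle source segment has no counterpart and is skipped
print(align([20, 50, 15], [21, 14]))  # [(0, 0), (2, 1)]
```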
Keywords Automatic translation extraction · Bitext mapping · Machine translation · Parallel alignment · Translation memory system
Machine Translation, vol.21, no.3, September 2007, pp.165-181
Pivot language approach for phrase-based statistical machine translation
Hua Wu · Haifeng Wang
Received: 1 April 2008 / Accepted: 11 August 2008 / Published online:
23 September 2008
© Springer Science+Business Media B.V. 2008
Abstract This paper proposes a novel method for phrase-based statistical machine
translation based on the use of a pivot language. To translate between languages
Ls and Lt with limited bilingual resources, we bring in a third language, Lp,
called the pivot language. For the language pairs Ls-Lp and Lp-Lt, there exist
large bilingual corpora. Using only Ls-Lp and Lp-Lt bilingual corpora, we can
build a translation model for Ls-Lt. The advantage of this method lies in the
fact that we can perform translation between Ls and Lt even if there is no
bilingual corpus available for this language pair. Using BLEU as a metric, our
pivot language approach significantly outperforms the standard model trained on
a small bilingual corpus. Moreover, with a small Ls-Lt bilingual corpus
available, our method can further improve translation quality by using the
additional Ls-Lp and Lp-Lt bilingual corpora.
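The pivot construction can be sketched by marginalizing over pivot-language phrases, p(t|s) = sum_p p(t|p) p(p|s). The toy tables below (a Spanish source, English pivot, French target) are invented examples, not the paper's data.

```python
# Sketch: induce an Ls-Lt phrase table from Ls-Lp and Lp-Lt tables by
# summing over pivot phrases. All probabilities are toy values.

# Ls -> Lp phrase translation probabilities
s2p = {"gato": {"cat": 0.9, "kitty": 0.1}}
# Lp -> Lt phrase translation probabilities
p2t = {"cat": {"chat": 0.8, "matou": 0.2}, "kitty": {"chaton": 1.0}}

def pivot_table(s2p, p2t):
    s2t = {}
    for s, pivot_dist in s2p.items():
        dist = {}
        for p, p_ps in pivot_dist.items():
            for t, p_tp in p2t.get(p, {}).items():
                # p(t|s) += p(t|p) * p(p|s)
                dist[t] = dist.get(t, 0.0) + p_tp * p_ps
        s2t[s] = dist
    return s2t

table = pivot_table(s2p, p2t)
print(table["gato"])  # chat ~0.72, matou ~0.18, chaton ~0.10
```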
Keywords Pivot language · Phrase-based statistical machine translation · Scarce bilingual resources
Machine Translation, vol.21, no.4, December 2007, pp.187-207
Bilingual LSA-based adaptation for statistical machine translation
Yik-Cheung Tam
Received: 27 March 2008 / Accepted: 31 October 2008 / Published online:
19 November 2008
© Springer Science+Business Media B.V. 2008
Abstract We propose a novel
approach to cross-lingual language model and translation lexicon adaptation for
statistical machine translation (SMT) based on bilingual latent semantic analysis. Bilingual
LSA enables latent topic distributions to be efficiently transferred across
languages by enforcing a one-to-one topic correspondence during training. Using the proposed bilingual LSA framework,
model adaptation can be performed
by, first, inferring the topic posterior distribution of the source text and
then applying the inferred distribution to an n-gram language model of the
target language and translation
lexicon via marginal adaptation. The background phrase table is enhanced with the additional phrase scores
computed using the adapted translation lexicon. The proposed framework
also features rapid bootstrapping of LSA models for new languages based on a source LSA model of another language. Our
approach is evaluated on the
Chinese-English MT06 test set using the medium-scale SMT system and the GALE SMT system measured in BLEU and NIST
scores. Improvement in both scores is
observed on both systems when the adapted language model and the adapted translation
lexicon are applied individually. When the adapted language model and the adapted translation lexicon are applied
simultaneously, the gain is additive. At the 95% confidence interval of the unadapted
baseline system, the gain in both scores is statistically significant
using the medium-scale SMT system, while the gain in the NIST score is
statistically significant using the GALE SMT system.
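The marginal-adaptation step can be sketched at the unigram level: each background word probability is scaled by (p_topic(w)/p_background(w))**beta and the distribution is renormalized. In the paper the topic marginal comes from the bilingual-LSA topic posterior inferred on the source text; the distributions and beta below are toy values.

```python
# Sketch of unigram marginal adaptation: boost words favoured by the
# inferred topic, damp the rest, renormalize. All numbers are toys.

def marginal_adapt(background, topic_marginal, beta=0.5):
    scaled = {w: p * (topic_marginal.get(w, p) / p) ** beta
              for w, p in background.items()}
    z = sum(scaled.values())
    return {w: s / z for w, s in scaled.items()}

background = {"bank": 0.2, "river": 0.3, "money": 0.5}
topic      = {"bank": 0.5, "money": 0.4, "river": 0.1}  # finance-heavy topic
adapted = marginal_adapt(background, topic)
# words favoured by the topic gain probability mass, others lose it
```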
Keywords Bilingual latent semantic analysis · Latent Dirichlet-tree allocation · Cross-lingual language model adaptation · Lexicon adaptation · Topic distribution transfer · Statistical machine translation
Machine Translation, vol.21, no.4, December 2007, pp.209-252
Simultaneous translation of lectures and speeches
Christian Fügen · Alex Waibel · Muntsin Kolss
Received: 27 August 2008 / Accepted: 4 November 2008 / Published
online: 22 November 2008
© Springer Science+Business Media B.V. 2008
Abstract With increasing globalization, communication
across language and cultural boundaries is becoming an essential requirement of
doing business, delivering education, and providing public services. Due to the
considerable cost of human translation services, only a small fraction of text
documents and an even smaller percentage of spoken encounters, such as
international meetings and conferences, are translated, with most resorting to
the use of a common language (e.g. English) or not taking place at all.
Technology may provide a potentially revolutionary way out if real-time, domain-independent,
simultaneous speech translation can be realized. In this paper, we present a
simultaneous speech translation system based on statistical recognition and translation
technology. We discuss the technology, various system improvements and propose
mechanisms for user-friendly delivery of the result. Over extensive component and
end-to-end system evaluations and comparisons with human translation performance,
we conclude that machines can already deliver comprehensible simultaneous translation
output. Moreover, while machine performance is affected by recognition errors
(and thus can be improved), human performance is limited by the cognitive challenge
of performing the task in real time.
Keywords Simultaneous translation · Interpretation · Speech-to-speech translation · Spoken language translation · Machine translation · Speech recognition · Lecture recognition · Lectures · Speeches