Machine Translation, vol.20, no.1, 2006, pp.1-23

 

EBMT by tree-phrasing

Philippe Langlais • Fabrizio Gotti

Received: 6 January 2006 / Accepted: 12 September 2006 /

Published online: 22 November 2006

© Springer Science+Business Media B.V. 2006

Abstract This article presents an attempt to build a repository storing associations between simple syntactic dependency treelets in a source language and their corresponding phrases in a target language. We assess the usefulness of this resource in two different settings. First, we show that it improves upon a standard subsentential translation memory. Second, we observe improvements in translation quality when a standard statistical phrase-based translation engine is augmented with the ability to exploit such a repository.
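The repository described above can be pictured as a keyed store of treelet-phrase associations. The sketch below is a toy stand-in: the treelet encoding (a set of head-dependent word pairs), the class name, and the example entry are invented for illustration and are not taken from the paper.

```python
from collections import defaultdict

class TreeletRepository:
    """Toy store mapping linearized source treelets to target phrases.

    A treelet is represented here as an iterable of (head, dependent)
    word pairs; a real system would store richer structures together
    with alignment and frequency information.
    """

    def __init__(self):
        self._table = defaultdict(set)

    @staticmethod
    def _key(treelet):
        # Canonical, order-independent key for the treelet's edges.
        return tuple(sorted(treelet))

    def add(self, treelet, target_phrase):
        self._table[self._key(treelet)].add(target_phrase)

    def lookup(self, treelet):
        # .get avoids creating empty entries on misses.
        return self._table.get(self._key(treelet), set())

repo = TreeletRepository()
# Hypothetical entry: the French treelet "voiture -> rouge" paired with
# the English phrase "red car".
repo.add([("voiture", "rouge")], "red car")
print(repo.lookup([("voiture", "rouge")]))  # {'red car'}
```

Lookups for unseen treelets simply return the empty set, which lets a translation memory or decoder fall back to other knowledge sources.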

Keywords   Example-based machine translation • Translation memory • Statistical phrase-based machine translation

 


Machine Translation, vol.20, no.1, 2006, pp.25-41

 

Example-based machine translation based on tree-string
correspondence and statistical generation

Zhanyi Liu • Haifeng Wang • Hua Wu

Received: 19 December 2005 / Accepted: 13 September 2006 /

Published online: 2 November 2006

© Springer Science+Business Media B.V. 2006

Abstract This paper describes an example-based machine translation (EBMT) method based on tree-string correspondence (TSC) and statistical generation. In this method, a translation example is represented as a TSC, which is a triple consisting of a parse tree in the source language, a string in the target language, and the correspondences between the leaf nodes of the source-language tree and the substrings of the target-language string. An input sentence is first parsed into a tree. Then the TSC forest which best matches the input tree is searched
for. Finally, the translation is generated using a statistical generation model that combines the target-language strings of the TSCs. The generation model consists of three features: the semantic similarity between the tree in the TSC and the input tree, the probability of translating the source word into the target word, and the language-model probability of the target-language string. Based on this method, we build an English-to-Chinese MT system. Experimental results indicate that the performance of our system is comparable with that of phrase-based statistical MT systems.
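The three-feature generation model lends itself to a log-linear sketch. The combination below is a minimal illustration; the weights and the probability values are invented for the example, not the paper's tuned parameters.

```python
import math

def score_candidate(semantic_sim, trans_prob, lm_prob,
                    weights=(1.0, 1.0, 1.0)):
    """Log-linear combination of the three generation features.

    semantic_sim: similarity between the TSC tree and the input tree, in (0, 1]
    trans_prob:   source-to-target word translation probability, in (0, 1]
    lm_prob:      target language-model probability, in (0, 1]
    The uniform weights are illustrative; a real system would tune them.
    """
    w1, w2, w3 = weights
    return (w1 * math.log(semantic_sim)
            + w2 * math.log(trans_prob)
            + w3 * math.log(lm_prob))

# A candidate whose TSC matches the input tree closely and whose output
# is fluent outscores one that matches poorly and reads badly.
good = score_candidate(0.9, 0.5, 0.1)
bad = score_candidate(0.3, 0.5, 0.01)
assert good > bad
```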

Keywords Example-based machine translation • Translation example • Tree-string correspondence • Statistical generation

 


Machine Translation, vol.20, no.1, 2006, pp.43-65

 

Dependency treelet translation: the convergence

of statistical and example-based machine translation?

Christopher Quirk • Arul Menezes

Received: 5 January 2006 / Accepted: 25 August 2006 /

Published online: 8 February 2007

© Springer Science+Business Media B.V. 2007

Abstract We describe a novel approach to MT that combines the strengths of the two leading corpus-based approaches: Phrasal SMT and EBMT. We use a syntactically informed decoder and reordering model based on the source dependency tree, in combination with conventional SMT models, to combine the power of phrasal SMT with the linguistic generality available in a parser. We show that this approach significantly outperforms a leading string-based Phrasal SMT decoder and an EBMT system. We present results from two radically different language pairs, and investigate the sensitivity of this approach to parse quality by using two distinct parsers and oracle experiments. We also validate our automated BLEU scores with a small human evaluation.
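One way to picture a source-dependency-based ordering model is as a head-relative placement of treelet nodes: each dependent is assigned a signed position relative to its head, and the tree is flattened accordingly. The sketch below illustrates only that flattening step under invented node encodings; it is not the authors' decoder.

```python
def linearize(node):
    """Flatten a dependency treelet into surface word order.

    Each node is a dict {"word": str, "children": [(pos, child), ...]},
    where a negative `pos` places the child before its head and a
    positive one after it, mimicking the head-relative orderings a
    treelet reordering model would predict.
    """
    children = sorted(node["children"], key=lambda t: t[0])
    pre = [linearize(c) for pos, c in children if pos < 0]
    post = [linearize(c) for pos, c in children if pos > 0]
    return ([w for sub in pre for w in sub] + [node["word"]]
            + [w for sub in post for w in sub])

# "the red car" analysed with "car" as head and both modifiers before it:
tree = {"word": "car",
        "children": [(-2, {"word": "the", "children": []}),
                     (-1, {"word": "red", "children": []})]}
assert linearize(tree) == ["the", "red", "car"]
```

For a language pair with different modifier placement, the model would simply predict positive positions for the same dependents, yielding a reordered surface string from the same tree.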

Keywords   Example-based machine translation • EBMT • Statistical machine translation • SMT • Syntax • Dependency analysis

 


Machine Translation, vol.20, no.2, 2006, pp.67-79

 

Translators and TM: An investigation of translators'
perceptions of translation memory adoption

Sarah Dillon • Janet Fraser

Received: 9 January 2006 / Accepted: 4 August 2006 / Published online: 14 February 2007
© Springer Science+Business Media B.V. 2007

Abstract There has been little research on the role of translation memory (TM) in practitioners' working practices, apart from reviews and a survey into ownership and rates issues. The present study provides a comprehensive snapshot of the perceptions of UK-based professional translators with regard to TM as a tool in their working environment. Moore and Benbasat's instrument for measuring perceptions with regard to the adoption of an information technology innovation was adapted and used to investigate three hypotheses: that translators who are relatively new to the translation industry have a more positive general perception of TM than experienced translators; that translators who use TM have a more positive general perception of it than translators who do not; and, finally, that translators' perception of the value of TM is not linked with their perceived IT proficiency. The study found that younger translators took a positive general view of TM irrespective of actual use, in particular attributing esteem to more experienced translators using (or perceived to be using) TM. Non-users at all experience levels, however, had a negative general view of TM. Both findings point to the significance of adequate knowledge in adoption decisions. Perceived IT proficiency, finally, was found to play a key role in translators' perceptions of the benefits of TM. These findings are discussed in the light of recent trends in the translation industry, including Continuing Professional Development, quality assurance and regulation.

Keywords   Translators • Translation memory • Translation tools • Technology adoption • Translator training

 

Machine Translation, vol.20, no.2, 2006, pp.81-138

 

Syntactic mismatches in machine translation

Igor Mel’čuk • Leo Wanner

Received: 13 August 2005 / Accepted: 31 August 2006 / Published online: 14 March 2007
© Springer Science+Business Media B.V. 2007

Abstract This paper addresses one of the central problems arising at the transfer stage in machine translation: syntactic mismatches, that is, mismatches between a source-language sentence structure and its equivalent target-language sentence structure. The level at which we assume the transfer to be carried out is the Deep-Syntactic Structure (DSyntS) as proposed in the Meaning-Text Theory (MTT). DSyntS is abstract enough to avoid all types of divergences that result either from restricted lexical co-occurrence or from surface-syntactic discrepancies between languages. As for the remaining types of syntactic divergences, all of them occur not only interlinguistically, but also intralinguistically; this means that establishing correspondences between semantically equivalent expressions of the source and target languages that diverge with respect to their syntactic structure is nothing other than paraphrasing. This allows us to adapt the powerful intralinguistic paraphrasing mechanism developed in MTT for the purposes of interlinguistic transfer.

Keywords   Transfer •   Syntactic mismatch •   Paraphrasing •   Deep-syntactic structure • Meaning-Text Theory

 


Machine Translation, vol.20, no.3, 2006, pp.147-166

 

Improving phrase-based statistical machine translation
with morphosyntactic transformation

Thai Phuong Nguyen • Akira Shimazu

Received: 11 October 2006 / Accepted: 3 April 2007 / Published online: 6 September 2007
© Springer Science+Business Media B.V. 2007

Abstract We present a phrase-based statistical machine translation approach which uses linguistic analysis in the preprocessing phase. The linguistic analysis includes morphological transformation and syntactic transformation. Since the word-order problem is solved using syntactic transformation, there is no reordering in the decoding phase. For morphological transformation, we use hand-crafted transformational rules. For syntactic transformation, we propose a transformational model based on a probabilistic context-free grammar. This model is trained using a bilingual corpus and a broad-coverage parser of the source language. This approach is applicable to language pairs in which the target language is poor in resources. We considered translation from English to Vietnamese and from English to French. Our experiments showed significant BLEU-score improvements in comparison with Pharaoh, a state-of-the-art phrase-based SMT system.
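The syntactic-transformation step can be illustrated with a toy local reordering rule applied to a POS-tagged sentence before monotone decoding. This is a deliberate simplification: the paper trains a PCFG-based transformational model on a bilingual corpus, whereas the rule table here is hand-written for illustration.

```python
def reorder(tagged_tokens, rules):
    """Apply local reordering rules to a POS-tagged token sequence.

    `rules` maps a tuple of tags to the permutation of that window,
    e.g. {("JJ", "NN"): (1, 0)} swaps an adjective-noun pair to
    approximate the noun-adjective order of a target language such as
    French or Vietnamese.
    """
    out = list(tagged_tokens)
    i = 0
    while i < len(out):
        for pattern, perm in rules.items():
            n = len(pattern)
            window = out[i:i + n]
            if tuple(tag for _, tag in window) == pattern:
                out[i:i + n] = [window[j] for j in perm]
                i += n - 1  # skip past the rewritten window
                break
        i += 1
    return out

rules = {("JJ", "NN"): (1, 0)}  # adjective-noun -> noun-adjective
tagged = [("the", "DT"), ("red", "JJ"), ("car", "NN")]
print([w for w, _ in reorder(tagged, rules)])  # ['the', 'car', 'red']
```

Because the source side now already follows target-language word order, the downstream phrase-based decoder can search monotonically, which is exactly the simplification the approach exploits.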

Keywords   Phrase-based statistical MT • Reordering • Preprocessing • Morphological transformation • Syntactic transformation • Pharaoh decoder • BLEU score

 


Machine Translation, vol.20, no.3, 2006, pp.167-197

 

EXTRA: a system for example-based translation
assistance

Federica Mandreoli • Riccardo Martoglia •
Paolo Tiberio

Received: 25 September 2006 / Accepted: 24 May 2007 / Published online: 20 July 2007
© Springer Science+Business Media B.V. 2007

Abstract In this paper we present EXTRA (EXample-based Translation Assistant), a translation memory (TM) system. EXTRA is able to propose effective translation suggestions by relying on syntactic analysis of the text and on a rigorous, language-independent measure; the search is performed efficiently in large amounts of bilingual text thanks to its advanced retrieval techniques. EXTRA does not use external knowledge requiring the intervention of users, and it is completely customizable and portable as it has been implemented on top of a standard database management system. The paper provides a thorough evaluation of both the effectiveness and the efficiency of our system. In particular, in order to quantify the benefits offered by EXTRA-assisted translation over manual translation, we introduce a simulator implementing specifically devised statistical, process-oriented, discrete-event models. As far as we know, this is the first time statistical simulation experiments have been used to address the nontrivial problem of evaluating TM systems, particularly for comparing the time that could be saved by performing assisted translation versus "manual" translation and for optimally tuning the system behaviour with respect to differently skilled users. In our experiments, we considered three scenarios: manual translation with one translator, manual translation with two translators, and assisted translation with one translator. The time needed for one translator to do an assisted translation is significantly closer to that of a team of two translators than to that of a single translator working manually. The mean sentence translation time is by far the lowest for the assisted scenario, corresponding to the highest per-translator productivity. We also estimate the total translation time when the number of query sentences, the maximum number of suggestions to be read, and the probability of look-up are varied: the best trade-off is given by reading (and presenting) at most four or five suggestions.
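A minimal sketch of TM suggestion retrieval, assuming word-level edit distance as the language-independent measure; EXTRA's actual measure, syntactic analysis, and database-backed indexing are more sophisticated, and the memory entries below are invented.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def suggest(query, memory, max_suggestions=4):
    """Rank stored (source, target) pairs by similarity to `query`.

    Similarity = 1 - distance / max(lengths). Capping at four
    suggestions echoes the finding that presenting at most four or
    five is the best trade-off.
    """
    q = query.split()
    scored = []
    for src, tgt in memory:
        s = src.split()
        sim = 1 - edit_distance(q, s) / max(len(q), len(s))
        scored.append((sim, src, tgt))
    scored.sort(reverse=True)
    return scored[:max_suggestions]

memory = [("the cat sleeps", "il gatto dorme"),
          ("the dog sleeps", "il cane dorme")]
best = suggest("the cat sleeps soundly", memory)[0]
print(best[2])  # 'il gatto dorme'
```

A translator would then post-edit the retrieved target side rather than translate from scratch, which is the time saving the paper's simulator quantifies.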

Keywords   Textual data management • Text search and retrieval • Translation memory • Effectiveness and efficiency of translation assistance

 


Machine Translation, vol.20, no.3, 2006, pp.199-215

 

Improving statistical MT by coupling reordering
and decoding

Josep Maria Crego • José B. Mariño

Received: 11 October 2006 / Accepted: 24 May 2007 / Published online: 12 July 2007
© Springer Science+Business Media B.V. 2007

Abstract In this paper we describe an elegant and efficient approach to coupling reordering and decoding in statistical machine translation, where the n-gram translation model is also employed as a distortion model. The reordering search problem is tackled through a set of linguistically motivated rewrite rules, which are used to extend a monotonic search graph with reordering hypotheses. The extended graph is traversed in the global search when a fully informed decision can be taken. Further experiments show that the n-gram translation model can be successfully used as a reordering model when estimated with reordered source words. Experiments are reported on the Europarl task (Spanish-English and English-Spanish). Results are presented regarding translation accuracy and computational efficiency, showing significant improvements in translation quality with respect to monotonic search for both translation directions, at a very low computational cost.
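The idea of extending a monotonic search graph with rule-generated reordering hypotheses can be sketched as enumerating the input paths a decoder may traverse. The single tag-pattern rule below is a hand-written stand-in for the paper's linguistically motivated rewrite rules, and the flat set of paths stands in for the actual search graph.

```python
def extend_with_reorderings(words, tags, rules):
    """Build the set of input paths offered to the decoder.

    Starts from the monotonic order and adds one hypothesis per rule
    match; `rules` maps a tag pattern to a permutation of the matched
    window.
    """
    paths = {tuple(words)}  # the monotonic path is always available
    for pattern, perm in rules.items():
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                reordered = list(words)
                reordered[i:i + n] = [words[i + j] for j in perm]
                paths.add(tuple(reordered))
    return paths

# Spanish noun-adjective rewritten to English adjective-noun order:
rules = {("NOUN", "ADJ"): (1, 0)}
paths = extend_with_reorderings(["el", "coche", "rojo"],
                                ["DET", "NOUN", "ADJ"], rules)
assert ("el", "rojo", "coche") in paths   # reordering hypothesis
assert ("el", "coche", "rojo") in paths   # monotonic path kept
```

The decoder then scores both paths with its full models and picks the order only when that fully informed decision can be taken, rather than committing to a reordering in preprocessing.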

Keywords   Statistical MT • Reordering • Decoding • n-gram language models

 


Machine Translation, vol.20, no.4, 2006, pp.227-245

 

Automatic induction of bilingual resources from aligned
parallel corpora: application to shallow-transfer
machine translation

Helena M. Caseli • Maria das Graças V. Nunes •
Mikel L. Forcada

Received: 28 May 2007 / Accepted: 14 November 2007 / Published online: 4 January 2008
© Springer Science+Business Media B.V. 2007

Abstract The availability of machine-readable bilingual linguistic resources is crucial not only for rule-based machine translation but also for other applications such as cross-lingual information retrieval. However, the building of such resources (bilingual single-word and multi-word correspondences, translation rules) demands extensive manual work, and, as a consequence, bilingual resources are usually more difficult to find than "shallow" monolingual resources such as morphological dictionaries or part-of-speech taggers, especially when they involve a less-resourced language. This paper describes a methodology to build automatically both bilingual dictionaries and shallow-transfer rules by extracting knowledge from word-aligned parallel
corpora processed with shallow monolingual resources (morphological analysers and part-of-speech taggers). We present experiments for Brazilian Portuguese-Spanish and Brazilian Portuguese-English parallel texts. The results show that the proposed methodology can enable the rapid creation of valuable computational resources (bilingual dictionaries and shallow-transfer rules) for machine translation and other natural language processing tasks.
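The dictionary-induction step can be sketched as counting the word pairs licensed by the alignments and keeping the frequent ones. This toy version omits the morphological analysis and part-of-speech filtering the methodology relies on, and the corpus fragment is invented.

```python
from collections import Counter, defaultdict

def induce_dictionary(aligned_corpus, min_count=2):
    """Collect single-word correspondences from word-aligned sentence pairs.

    `aligned_corpus` is a list of (source_tokens, target_tokens, alignment)
    triples, where `alignment` is a set of (i, j) index pairs. Keeping
    only pairs seen at least `min_count` times is a crude confidence
    filter.
    """
    counts = Counter()
    for src, tgt, alignment in aligned_corpus:
        for i, j in alignment:
            counts[(src[i], tgt[j])] += 1
    dictionary = defaultdict(set)
    for (s, t), c in counts.items():
        if c >= min_count:
            dictionary[s].add(t)
    return dict(dictionary)

corpus = [(["casa", "grande"], ["big", "house"], {(0, 1), (1, 0)}),
          (["casa", "azul"], ["blue", "house"], {(0, 1), (1, 0)})]
print(induce_dictionary(corpus))  # {'casa': {'house'}}
```

The same alignment evidence, generalized over POS tags instead of surface words, is what allows shallow-transfer rules (such as the adjective-noun swap visible in the toy corpus) to be induced alongside the dictionary.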

Keywords    Machine translation • Automatic induction • Transfer rule • Bilingual dictionary • Shallow transfer

 


Machine Translation, vol.20, no.4, 2006, pp.247-266

 

Finding translations for low-frequency words
in comparable corpora

Viktor Pekar • Ruslan Mitkov • Dimitar Blagoev •
Andrea Mulloni

Received: 3 July 2007 / Accepted: 5 December 2007 / Published online: 23 February 2008
© Springer Science+Business Media B.V. 2008

Abstract Statistical methods to extract translational equivalents from non-parallel corpora hold the promise of ensuring the required coverage and domain customisation of lexicons, as well as accelerating their compilation and maintenance. A challenge for these methods is posed by rare and less common words and expressions, which have low corpus frequencies; yet it is precisely rare words, such as newly introduced terminology and named entities, that present the main interest for practical lexical acquisition. In this article, we study possibilities for improving the extraction of low-frequency equivalents from bilingual comparable corpora. Our work is carried out within the general framework that discovers equivalences between words of different languages using similarities between their occurrence patterns in the respective monolingual corpora. We develop a method that aims to compensate for insufficient corpus evidence on rare words: prior to measuring cross-language similarities, the method uses same-language corpus data to model the co-occurrence vectors of rare words, predicting their unseen co-occurrences and smoothing rare, unreliable ones. Our experimental evaluation demonstrates that the proposed method delivers a consistent and significant improvement over the conventional approach to this task.
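The general framework of comparing occurrence patterns across languages can be sketched as mapping a source word's co-occurrence vector through a seed dictionary and measuring cosine similarity against candidate target-word vectors. The paper's actual contribution, predicting and smoothing unseen co-occurrences for rare words before this comparison, is not shown; all vectors and dictionary entries below are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse co-occurrence vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def translate_vector(vec, seed_dict):
    """Map a source-language vector into target-language dimensions
    using a seed dictionary, the standard step before cross-language
    comparison."""
    out = {}
    for word, weight in vec.items():
        for t in seed_dict.get(word, ()):
            out[t] = out.get(t, 0.0) + weight
    return out

seed = {"perro": ["dog"], "gato": ["cat"]}
src_vec = {"perro": 2.0, "gato": 1.0}   # source word's context counts
cand_good = {"dog": 2.0, "cat": 1.0}    # likely equivalent's contexts
cand_bad = {"bank": 3.0}                # unrelated candidate
mapped = translate_vector(src_vec, seed)
assert cosine(mapped, cand_good) > cosine(mapped, cand_bad)
```

For a genuinely rare source word, `src_vec` would be sparse and noisy; the proposed method would first enrich and smooth it using same-language data, and only then run the comparison above.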

 

Keywords    Lexical acquisition • Translational equivalents • Comparable corpora • Distributional similarity • Data sparseness

 


Machine Translation, vol.20, no.4, 2006, pp.267-289

 

Implementing NLP projects for noncentral languages:
instructions for funding bodies, strategies for developers

Oliver Streiter • Kevin P. Scannell • Mathias Stuflesser

Received: 6 April 2006 / Accepted: 15 August 2007 / Published online: 6 December 2007
© Springer Science+Business Media B.V. 2007

Abstract This research begins by distinguishing a small number of "central" languages from the "noncentral" languages, where centrality is measured by the extent to which a given language is supported by natural language processing tools and research. We analyse the conditions under which noncentral language projects (NCLPs) and central language projects are conducted, and establish a number of important differences that have far-reaching consequences for NCLPs. In order to overcome the difficulties inherent in NCLPs, traditional research strategies have to be reconsidered. Successful styles of scientific cooperation, such as those found in open-source software development or in the development of Wikipedia, provide alternative views of how NCLPs might be designed. We elaborate the concepts of free software and software pools, and argue that NCLPs, in their own interest, should embrace an open-source approach for the resources they develop and pool these resources together with other similar open-source resources. The expected advantages of this approach are so important that we suggest funding organizations make it a sine qua non condition of project contracts.

Keywords   Minority languages • Open-source • Free software • Software pools