IAI Working Paper No. 36

[From: http://www.iai.uni-sb.de/iaien/iaiwp/index.htm#content]

Hybrid Approaches to Machine Translation

Oliver Streiter, Michael Carl, Johann Haller (Editors)

Introduction

In order to offer a general overview of current trends in Machine Translation, IAI, the Institute of Applied Information Sciences in Saarbrücken, Germany, has invited researchers to contribute to a collection of research papers on the topic of "Hybrid Approaches to Machine Translation". While the 1970s and 1980s were dominated by the Rule-Based approach to Machine Translation (henceforth RBMT), Corpus-Based approaches (CBMT) such as Translation Memories (TMs), Statistical MT approaches (SBMT) and Example-Based approaches to MT were proposed and developed in the early 1990s. Electronic dictionaries and termtools, a third type of translation tool, took off at the same time, as translators used the computer more and more in their daily work.
Although the supporters of RBMT came under pressure because of the advantages that CBMT approaches have over RBMT, they found additional support in the newly developing web-based information technologies, such as multilingual information retrieval, on-line translation and multilingual visualization. It thus became apparent that every approach has its own strengths and weaknesses, and that no system is ideal unless it is specified for a narrowly defined translation setting. Moreover, if different approaches have different advantages, they can possibly be combined to produce more satisfactory MT systems.

The RBMT Approach

The RBMT approach is introduced in this collection by a historical document entitled “Linguistic and computational Motivations for the LOGOS Machine Translation System”. In this contribution, Scott restates the belief of Rule-Based MT that humans can develop an axiomatic representation of the translation process through intense reflection.

The contribution “Reusability of wide coverage linguistic resources in the construction of an English-Basque MT system” (in Spanish, with an English extended abstract) [1] by A. Diaz de Ilarraza, A. Mayor and K. Sarasola also presents a Rule-Based MT system (English-Basque). This system is notable for being assembled from different existing resources and tools, which may be one approach to follow when different MT engines have to be combined into one system. For reasons such as the treatment of translation ambiguities, the authors intend to continue this assembly by integrating methods based on statistics or examples.

Translation Memories

In his contribution “Towards a Closer Integration of Termbases, Translation Memories, and Parallel Corpora - A Translation-Oriented View”, Reinke adopts the translator's point of view and criticizes the current state of the art in TM technology. He argues for a stronger integration of termtools into TMs. Termtools should be extended with the 'implicit' translation knowledge contained in the texts. This knowledge should then be used in the alignment step to find subsentential translation units, which can increase the performance of TMs. He criticizes the skeleton approach of Langé et al. as too shallow and proposes an approach of his own. However, current TM technologies not only fail to use the 'explicit' knowledge in termbases but also have no means of discovering implicit terminological information or of performing subsentential alignment.

In "Evaluierung der linguistischen Leistungsfähigkeit von Translation Memory-Systemen - Ein Erfahrungsbericht" (English Abstract), Reinke tackles the difficult question of similarity in TMs. Starting from the GLDV workshop statement that evaluation criteria should be different for MT and TM technology, he argues that TMs have more in common with IR systems than with MT technology. Accordingly, TM evaluation can be handled as retrieval performance, measured in terms of precision and recall. He shows that the TM evaluation proposals in the EAGLES report are too vague and proposes to consider semantic, pragmatic and conceptual similarities rather than only the orthographic similarity used in current TMs. In an original experiment he feeds TMs with lemmatized forms in order to come closer to a well-founded notion of similarity.
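As a rough illustration of treating TM evaluation as a retrieval task, the following Python sketch scores a lookup in a toy memory with precision and recall. The example segments, the hand-made relevance judgments, the 0.7 threshold and the use of `difflib.SequenceMatcher` as the orthographic similarity measure are all assumptions made for this example; none of them is taken from Reinke's paper. The `normalize` parameter is a hypothetical hook where a lemmatizer could be plugged in to mimic the lemmatized-TM experiment.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str, normalize=str.lower) -> float:
    """Orthographic similarity in [0, 1] after normalizing both strings.

    `normalize` defaults to lower-casing; passing a lemmatizer here would
    compare lemmatized forms instead (assumption, not part of the source).
    """
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def retrieve(query: str, memory: list[str], threshold: float = 0.7,
             normalize=str.lower) -> set[str]:
    """Return every stored source segment whose similarity reaches the threshold."""
    return {seg for seg in memory
            if similarity(query, seg, normalize) >= threshold}


def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall of one lookup against hand-made relevance judgments."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy memory and judgments, invented for illustration only.
    memory = ["Close the window.", "Close all windows.", "Print the document."]
    relevant = {"Close the window.", "Close all windows."}
    found = retrieve("Close the window", memory)
    print(found, precision_recall(found, relevant))
```

Raising the threshold typically trades recall for precision, which is the kind of trade-off a retrieval-style evaluation makes visible.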

Comparing and Linking different Architectures

Carl and Hansen pick up that idea in “Linking Translation Memories with Example-Based Machine Translation” [2] and compare the translation outcome of a lemmatized TM and a string-based TM (STM) with the translation outcome of an Example-Based Machine Translation (EBMT) system. Their intention is to find a setting where a linkage of the different translation paradigms is most reasonable. For each system, strengths and shortcomings are discussed and it is shown in what way translation ambiguities are produced. Based on a fully automatic evaluation method, they show that a combination of a TM and EBMT is more successful than any single system and that the linkage of STM and EBMT performs best.

In “Towards a Model of Competence for Corpus-Based Machine Translation” [3] Michael Carl defines translation as a multi-layered mapping of the source text description into an equivalent target text description. As different levels of description may require conflicting realizations in the target language, an ideal translation may not always be found. Accordingly, he claims that an unrestricted all-purpose MT system is infeasible. A number of corpus-based MT systems are examined according to the granularity, molecularity and representational richness they have. The expected translation performance is classified in terms of coverage and reliability. It is concluded that coarser representations are likely to achieve higher reliability, while richer representations are required for broad coverage.

The contribution “Towards a Dynamic Linkage of Example-Based and Rule-Based Machine Translation” [4] reports a number of translation experiments that link TM and EBMT with RBMT. The authors (Carl, Iomdin, Pease and Streiter) start by examining the potential of each MT paradigm in terms of translation quality and coverage, adaptability of the system and recall of translation units (TUs). The paper argues for a dynamic linkage of different MT paradigms, in which the decision on how to share the translation task among the systems is taken at runtime. The paper motivates the communication of rich data structures between system components and shows that such a linkage leads to better translation results and improves customization of the system.
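The idea of deciding at runtime which component handles a segment can be sketched, in a deliberately simplified form, as a router driven by the best TM match score. This is not the architecture of the paper, which exchanges rich data structures between components; the thresholds, the function names and the stub EBMT and RBMT engines passed in as callables are hypothetical and serve illustration only.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Tuple


def best_tm_match(segment: str, memory: Dict[str, str]) -> Tuple[str, float]:
    """Return the stored target of the most similar source segment and its score."""
    best_target, best_score = "", 0.0
    for source, target in memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_target, best_score = target, score
    return best_target, best_score


def translate(segment: str,
              memory: Dict[str, str],
              ebmt: Callable[[str], str],
              rbmt: Callable[[str], str],
              exact: float = 0.95,
              fuzzy: float = 0.6) -> Tuple[str, str]:
    """Choose a component at runtime and return (component name, translation).

    Near-exact matches are served from the TM, partial matches are passed to
    the example-based engine, and everything else falls back to the
    rule-based engine. Both thresholds are arbitrary assumptions.
    """
    target, score = best_tm_match(segment, memory)
    if score >= exact:
        return "TM", target
    if score >= fuzzy:
        return "EBMT", ebmt(segment)
    return "RBMT", rbmt(segment)
```

A call such as `translate(text, memory, my_ebmt, my_rbmt)` returns the name of the chosen component together with its output, so the routing decision can be logged and the thresholds tuned for a given customer or domain.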

In the paper “A Unified Example-Based and Lexicalist Approach to Machine Translation” [5], Turcato, McFetridge, Popowich and Toole propose a knowledge representation format and a system architecture that allow an effective integration of Example-Based and Lexicalist approaches to MT into a unified approach, which the authors call Example-Based Lexicalist Machine Translation (EBLMT). This approach tries to combine the advantages of both. From the point of view of Lexicalist MT (LMT), it uses bilingual knowledge to drive parsing, providing additional information to resolve syntactic ambiguities and prioritizing the parsing agenda more efficiently. From the point of view of an EBMT system like [Sato & Nagao 1990], for instance, it removes the redundancy in the bilingual database that arises from overlapping examples.

Statistics and Learning

“Learning, Forgetting and Remembering: Statistical Support for Rule-Based MT” [6] and “Learning from Parallel Corpora: Experiments in Machine Translation” [7] represent two consecutive studies in which the authors (Streiter, Iomdin, Hong and Hauck, and Iomdin and Streiter, respectively) investigate how far Rule-Based Machine Translation systems can benefit from monolingual and bilingual corpora. Although monolingual corpora have obvious limitations, they can support the translation process if the corpora and the translation task can be classified as belonging to a subject domain. Bilingual corpora not only allow a better estimation of translation probabilities, but also the acquisition of new single-word and multi-word translations. Two alignment techniques are proposed which do not require sentence alignment before sub-sentential chunks are aligned.

“Automatic Acquisition of a High-Precision Translation Lexicon from Parallel Chinese-English Corpora” by Zhao-Ming Gao presents a hybrid approach to deriving a translation lexicon from unaligned parallel Chinese-English corpora. Two types of information, namely proximity and document-external distributions of word pairs, are proposed to enhance the precision of the translation lexicon derived from statistical and dictionary-based methods. The former can identify translations of Chinese compounds, while the latter can filter out spurious word correspondences.
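To give an impression of what a purely statistical baseline for lexicon extraction can look like, the sketch below scores word pairs by the Dice coefficient of their co-occurrence. It is a generic textbook technique swapped in for illustration, not Gao's method: it assumes text units that are already paired and whitespace-tokenized, whereas the paper works on unaligned Chinese-English corpora and adds proximity and document-external distribution information. The function name and the cut-off value are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_lexicon(pairs, min_score=0.5):
    """Score source/target word pairs by the Dice coefficient of co-occurrence.

    `pairs` is an iterable of (source_text, target_text) tuples that are
    assumed to be translations of each other. For a source word s and a
    target word t: dice = 2 * count(s, t) / (count(s) + count(t)), where
    every count is the number of paired units containing the item.
    """
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        joint.update(product(src_words, tgt_words))

    lexicon = {}
    for (s, t), both in joint.items():
        score = 2 * both / (src_freq[s] + tgt_freq[t])
        if score >= min_score:  # drop weakly associated (likely spurious) pairs
            lexicon[(s, t)] = score
    return lexicon
```

A high cut-off favours precision over coverage: pairs that co-occur only by accident receive low scores and are filtered out, at the cost of missing rarer genuine translations.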

The special linguistic phenomena of spoken language found in so-called closed captions are the subject of the last paper. In “Explanation-based Learning for Machine Translation” [8], Toole, Popowich, Nicholson, Turcato and McFetridge present an application of explanation-based learning (EBL) in the parsing module of a real-time English-Spanish machine translation system designed to translate closed captions. They discuss the efficiency/coverage trade-offs available in EBL and introduce techniques to increase coverage while maintaining a high level of space and time efficiency.

We hope to have assembled a first representative set of approaches to the research paradigm of hybrid Machine Translation, which is an interesting field from both a theoretical and a practical point of view.

April, 2000

Oliver Streiter

Michael Carl

Johann Haller

Academia Sinica
Institute of Information Science
Nankang, Taipei, Taiwan 115

IAI
Institute of Applied Information Sciences
66111 Saarbrücken, Germany

oliver@iis.sinica.edu.tw

carl@iai.uni-sb.de

hans@iai.uni-sb.de

Table of Contents

Linguistic and computational Motivations for the LOGOS Machine Translation System
Bernard E. Scott (in English)

Reusability of wide coverage linguistic resources in the construction of an English-Basque MT system
Ilarraza, Mayor and Sarasola [see MT2000, Exeter University]

Towards a Closer Integration of Termbases, Translation Memories, and Parallel Corpora - A Translation-Oriented View
Uwe Reinke (in English)

Evaluierung der linguistischen Leistungsfähigkeit von Translation Memory-Systemen - Ein Erfahrungsbericht
Uwe Reinke (in German, English Abstract)

Linking Translation Memories with Example-Based Machine Translation
Michael Carl and Silvia Hansen (in English) [see MT Summit VII]

Towards a Model of Competence for Corpus-Based Machine Translation
Carl (in English) [see Coling 2000]

Towards a Dynamic Linkage of Example-Based and Rule-Based Machine Translation
Michael Carl, Leonid L. Iomdin, Cathrine Pease and Oliver Streiter (in English) [see ESSLLI 1998]

A Unified Example-Based and Lexicalist Approach to Machine Translation
Davide Turcato, Paul McFetridge, Fred Popowich and Janine Toole (in English) [see TMI-99]

Learning, Forgetting and Remembering: Statistical Support for Rule-Based MT
Oliver Streiter, Leonid L. Iomdin, Munpyo Hong and Ute Hauck (in English) [see TMI-99]

Learning from Parallel Corpora: Experiments in Machine Translation
Leonid L. Iomdin and Oliver Streiter (in English, abstract in Russian)

Explanation-based Learning for Machine Translation
Janine Toole, Fred Popowich, Devlan Nicholson, Davide Turcato and Paul McFetridge (in English) [see TMI-99]

Automatic Acquisition of a High-Precision Translation Lexicon from Parallel Chinese-English Corpora
Zhao-Ming Gao (in English)

Bibliography

The copyright of the contributions remains with the authors.

[1] The full paper in English was presented at the BCS conference MT2000 at Exeter University.

[2] Reprinted from a paper given at MT Summit VII, 1999.

[4] Reprinted from a paper given at ESSLLI 1998.

[7] The original HTML file on the IAI website contains Cyrillic which cannot be transferred. However the older version presented in 1998 at ESSLLI is available here.
