Parallel Text Processing: Alignment and Use of Translation Corpora

Edited by Jean Véronis (Université de Provence, Aix-en-Provence, France)

Dordrecht/Boston/London: Kluwer, 2000

Abstracts

From the Rosetta stone to the information society: A survey of parallel text processing

Jean Véronis (Université de Provence, France)

Keywords: Parallel texts, translation, corpora, alignment techniques, applications, evaluation

Abstract: This introductory chapter provides a survey of the processing and use of parallel texts, i.e., texts accompanied by their translation. Throughout the chapter, the various authors' contributions to the book are considered and related to the state of the art in the field. Three themes are addressed, corresponding to the three parts of the book: (i) techniques and methodology for the alignment of parallel texts at various levels such as sentences, clauses or words; (ii) applications of parallel texts in fields such as translation, lexicography, and information retrieval; and (iii) available corpus resources and evaluation of alignment methods.

Pattern recognition for mapping bitext correspondence

I. Dan Melamed (West Group, U.S.A.)

Keywords: Bitext geometry, pattern recognition, signal-to-noise ratio, portability

Abstract: The problem of finding token-level correspondences (bitext maps) between the two halves of a bitext can be formulated in terms of pattern recognition. From this point of view, effective solutions hinge on three tasks: signal generation, noise filtering and search. The Smooth Injective Map Recognizer (SIMR) algorithm presented here integrates innovative approaches to each of these tasks. Objective evaluation has shown that SIMR's accuracy is consistently high for language pairs as diverse as French/English and Chinese/English. If necessary, SIMR's bitext maps can be efficiently converted into segment alignments using the Geometric Segment Alignment (GSA) algorithm, which is also presented here. SIMR has produced bitext maps for over 200 megabytes of French-English bitexts. GSA has converted these maps into alignments. Both the maps and the alignments are available from the Linguistic Data Consortium.

Multilingual text alignment: Aligning three or more versions of a text

Michel Simard (Université de Montréal, Canada)

Keywords: Sentence alignment, word alignment, sentence alignment techniques, multilingual texts, multilingual alignment, English, French, Spanish

Abstract: This chapter addresses a number of questions regarding multilingual texts, where multilingual texts is taken as meaning texts represented in more than two languages. In particular, it raises the question of whether there is any real use for mapping out multilingual translation equivalence. The view that is proposed is that multiple versions of a text can (and should) be seen as additional sources of information that can effectively be exploited to produce better bilingual alignments. A general multilingual alignment technique is presented, whose computational complexity, for a given number of texts, is the same as that of bilingual alignment. Experimental results show how this method improves the accuracy of bilingual alignments on a trilingual corpus (The Gospel According to John, in English, French and Spanish).

A comprehensive bilingual word alignment system: Application to disparate languages: Hebrew and English

Yaacov Choueka, Ehud S. Conley and Ido Dagan (Bar-Ilan University, Israel)

Keywords: Parallel texts, translation, bilingual alignment, word alignment, Hebrew, English

Abstract: This chapter describes a general, comprehensive and robust word-alignment system and its application to the Hebrew-English language pair. A major goal of the system architecture is to assume as little as possible about its input and about the relative nature of the two languages, while allowing the use of (minimal) specific monolingual pre-processing resources when required. The system thus receives as input a pair of raw parallel texts and requires only a tokeniser (and possibly a lemmatiser) for each language. After tokenisation (and lemmatisation if necessary), a rough initial alignment is obtained for the texts using a version of Fung and McKeown's DK-vec algorithm (Fung & McKeown, 1997; Fung, this volume). The initial alignment is given as input to a version of the word_align algorithm (Dagan, Church and Gale, 1993), an extension of Model 2 in the IBM statistical translation model. Word_align produces a word level alignment for the texts and a probabilistic bilingual dictionary. The chapter describes the details of the system architecture, the algorithms implemented (emphasising implementation details), the issues regarding their application to Hebrew and similar Semitic languages, and some experimental results.

A knowledge-lite approach to word alignment

Lars Ahrenberg, Mikael Andersson and Magnus Merkel (Linköping University, Sweden)

Keywords: Word alignment, parallel corpora, translation studies, lexicography, Swedish

Abstract: The most promising approach to word alignment is to combine statistical methods with non-statistical information sources. Some of the proposed non-statistical sources, including bilingual dictionaries, POS-taggers and lemmatizers, rely on considerable linguistic knowledge, while other knowledge-lite sources such as cognate heuristics and word order heuristics can be implemented relatively easily. While knowledge-heavy sources might be expected to give better performance, knowledge-lite systems are easier to port to new language pairs and text types, and they can give sufficiently good results for many purposes, e.g. if the output is to be used by a human user for the creation of a complete word-aligned bitext. In this paper we describe the current status of the Linköping Word Aligner (LWA), which combines the use of statistical measures of co-occurrence with four knowledge-lite modules for (i) word categorization, (ii) morphological variation, (iii) word order, and (iv) phrase recognition. We demonstrate the portability of the system (from English-Swedish texts to French-English texts) and present results for these two language pairs. Finally, we report observations from an error analysis of system output, and identify the major strengths and weaknesses of the system.
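
As an illustration of what a knowledge-lite cognate heuristic can look like, the following minimal sketch scores two word forms by their longest common subsequence ratio (LCSR). It is not taken from LWA: the function names and the 0.6 threshold are assumptions made for this example only.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer word."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def looks_cognate(a: str, b: str, threshold: float = 0.6) -> bool:
    """Hypothetical decision rule; the threshold is an arbitrary assumption."""
    return lcsr(a, b) >= threshold


if __name__ == "__main__":
    for pair in [("government", "gouvernement"), ("house", "maison")]:
        print(pair, round(lcsr(*pair), 3), looks_cognate(*pair))
```

A heuristic of this kind needs no lexicon or tagger, which is what makes it easy to port, but it only helps for language pairs that share a script and a stock of related word forms.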

From sentences to words and clauses

Stelios Piperidis, Harris Papageorgiou and Sotiris Boutsis (Institute for Language and Speech Processing, Greece; National Technical University of Athens, Greece)

Keywords: Sentence alignment, clause alignment, lexical equivalences extraction, lexical knowledge acquisition, translation memory, Greek, English

Abstract: This chapter addresses the issue of multilingual corpora alignment, presenting schemes which attempt alignment at sentence, clause, noun phrase and word level. Statistical inductive techniques are coupled with symbolic processing analysing specific language phenomena. Sentence alignment combines statistical techniques with the notion of semantic load of text units. Lexical equivalences are extracted based on morphosyntactic tagging and noun phrase recognition on each side of the parallel corpus. A statistical score then filters the most likely translation candidates of single and multi-word units. Similarly, clause alignment couples surface linguistic analysis with a probabilistic model based on word occurrence and co-occurrence probabilities, and word lengths. The best clause alignment is approximated by feeding all possible alignments into a dynamic programming framework. Word and clause alignment have been tested on English-Greek parallel corpora of different domains, yielding results exploitable in knowledge acquisition applications. Sentence alignment has been tested in several languages and integrated in a computer-aided translation platform maximizing translation reuse and consistency.
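
The general idea of searching over possible alignments with dynamic programming can be sketched as follows. This is a deliberately simplified, length-based illustration, not the probabilistic model of the chapter: the set of allowed alignment types, their penalties and the cost function are all assumptions chosen for the example.

```python
from typing import List, Tuple

# Allowed alignment types (source units, target units) with hypothetical penalties.
BEADS = {(1, 1): 0.0, (1, 0): 10.0, (0, 1): 10.0, (2, 1): 5.0, (1, 2): 5.0}

Span = Tuple[int, int]  # half-open index range [start, end)


def align(src_lens: List[int], tgt_lens: List[int]) -> List[Tuple[Span, Span]]:
    """Return the cheapest monotone alignment of two sequences of unit lengths."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), penalty in BEADS.items():
                if i < di or j < dj or cost[i - di][j - dj] == inf:
                    continue
                # Assumed cost: penalty for the bead type plus length mismatch.
                mismatch = abs(sum(src_lens[i - di:i]) - sum(tgt_lens[j - dj:j]))
                candidate = cost[i - di][j - dj] + penalty + mismatch
                if candidate < cost[i][j]:
                    cost[i][j] = candidate
                    back[i][j] = (di, dj)
    # Follow the back-pointers from the end to recover the alignment.
    result = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        result.append(((i - di, i), (j - dj, j)))
        i, j = i - di, j - dj
    result.reverse()
    return result


if __name__ == "__main__":
    # Three source units against two target units, lengths in characters.
    print(align([10, 20, 30], [10, 50]))
```

In this toy input the second and third source units are merged and aligned with the second target unit, because their combined length matches it exactly.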

Bracketing and aligning words and constituents in parallel text using stochastic inversion transduction grammars

Dekai Wu (Hong Kong University of Science and Technology, Hong Kong)

Keywords: Word alignment, constituent alignment, bilingual language modeling, stochastic inversion transduction grammars, bilingual parsing

Abstract: We introduce (1) a novel stochastic inversion transduction grammar formalism for bilingual language modeling of sentence-pairs, and (2) the concept of bilingual parsing with a variety of parallel corpus analysis applications. Aside from the bilingual orientation, three major features distinguish the formalism from the finite-state transducers more traditionally found in computational linguistics: it skips directly to a context-free rather than finite-state base, it permits a minimal extra degree of ordering flexibility, and its probabilistic formulation admits an efficient maximum-likelihood bilingual parsing algorithm. A convenient normal form is shown to exist. Analysis of the formalism's expressiveness suggests that it is particularly well-suited to model ordering shifts between languages, balancing needed flexibility against complexity constraints. We discuss a number of examples of how stochastic inversion transduction grammars bring bilingual constraints to bear upon problematic corpus analysis tasks such as segmentation, bracketing, phrasal alignment, and parsing.

The translation network: A model for a fine-grained description of translations

Diana Santos (SINTEF Telecommunications and Informatics, Norway)

Keywords: Translation, contrastive studies, tense and aspect, descriptive models, Portuguese, English, coercion

Abstract: In this paper, I argue for the need for more complex models to describe actual translations (in aligned corpora) and present a particular proposal designed to accomplish such a description, termed the "translation network". A translation network joins models of the two languages involved, inspired by the aspectual networks of Moens (1987), and attempts to cater, in a systematic way, for many situations that occur in actual translation. This chapter is divided into four main sections. In Section 1 a brief description of many problems that are not addressed by simpler models of translation is presented. Section 2 represents the core of this chapter, describing how translation networks are composed. This description is illuminated with examples from literary translations between English and Portuguese. Section 3 critically reviews some problems with the model, before Section 4 concludes with a short defense of it.

Parallel text alignment using crosslingual information retrieval techniques

Christian Fluhr, Frédérique Bisson and Faïza Elkateb (CEA/DIST, France)

Keywords: Cross-language information retrieval, weighted boolean model, sentence alignment, word alignment, bilingual corpora, French, English

Abstract: In this chapter, we demonstrate that aligning a sentence with its translation is not fundamentally different from finding a sentence on the same topic in the target corpus, using the source sentence as a query. The two processes are based on the semantic proximity of two sentences in different languages, and their major difference is that information retrieval only needs to ensure that the sentence found contains most of the information of the query, whereas sentence alignment requires that the parts that are not common to both languages be as small as possible. A crosslingual query system can be used to obtain candidates for sentence alignment, provided that the measure of semantic proximity is slightly modified. More classical techniques can be used, taking sequential order into account, but our approach is very robust to text desynchronization, such as missing text segments in one language, or texts such as glossaries or indexes that are not in the same order in different languages.

Parallel alignment of structured documents

Laurent Romary and Patrice Bonhomme (Laboratoire Loria, France)

Keywords: SGML, XML-structured documents, structural alignment, multi-level alignment, TEI

Abstract: Classical methods for parallel text alignment consider one specific level (e.g. sentences) at which two or more versions of a text are synchronised. This may lead to some problems when these documents are particularly long since alignment errors at some point in the text may, in the absence of any other linguistic information, propagate for some time without any chance of recovery. In this chapter we consider how multilingual parallel alignment can be based on the fact that more and more texts are now highly structured by means of tagging languages such as SGML. In particular we will describe recent efforts in multi-level alignment for which we will present the main advances as well as some of the difficulties to be dealt with, particularly when the text and its translation contain different encoding schemes or different encoding practices for the same scheme.

A statistical view on bilingual lexicon extraction: From parallel corpora to non-parallel corpora

Pascale Fung (Hong Kong University of Science and Technology, Hong Kong)

Keywords: Parallel corpora, bilingual lexicon extraction, Chinese, English

Abstract: We present two problems in the statistical extraction of bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contribution in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on the arrival distances of words in those corpora. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show a 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is in the extraction of bilingual lexicons from non-parallel corpora. We present a first such result in this area, from a new method, Convec. Convec is based on context information of a word to be translated. Even though the accuracy for the top translation candidate is about 30% for 3 months of English and Chinese newspaper material, we show a dramatic increase of accuracy when we use a larger evaluation corpus in English and French. We find a 75% precision for the top three candidate translations of 75 content words, on English Wall Street Journal and French European News from different years.
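
To make the notion of arrival distances concrete, the minimal sketch below computes, for one word, the gaps between its successive occurrences in a tokenised text. It illustrates only this one ingredient under simplifying assumptions (whitespace tokenisation, exact string match). The function name is hypothetical, and the comparison of such vectors across the two languages, on which the method relies, is not shown.

```python
from typing import List


def arrival_distances(tokens: List[str], word: str) -> List[int]:
    """Distances (in tokens) between successive occurrences of `word`."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


if __name__ == "__main__":
    text = "the cat saw the dog and the bird".split()
    print(arrival_distances(text, "the"))
    print(arrival_distances(text, "cat"))
```

A word and its translation tend to recur at roughly corresponding places in a pair of parallel texts, so their distance vectors should look alike even when the texts are noisy.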

Terminology extraction from parallel technical texts

Ingeborg Blank (University of Munich, Germany)

Keywords: Multilingual corpora, terminology extraction, lexical knowledge, alignment, French, German

Abstract: This chapter deals with the processing of a multilingual corpus of technical texts (patent documentation). As the relevant knowledge contained in such texts is concentrated in technical terms, the aim of the study is to extract special purpose terminology. A semi-automatic tool has been developed to help knowledge engineers, terminologists and professional translators not only to identify technical terms but also to detect possible translation equivalences and typical contexts of terms. Fully automatic bilingual term matching is not attempted. Language-specific terminology is defined by criteria suitable for an automatic procedure. Related studies in multilingual terminology extraction are also considered and the assumptions underlying these studies are examined on the corpus.

Term alignment in use: Machine-aided human translation

Éric Gaussier, David Hull and Salah Aït-Mokhtar (Xerox Research Centre Europe, France)

Keywords: Machine-aided human translation, translation memory, word alignment, terminology extraction, terminology alignment, English, French

Abstract: This chapter will describe how parallel text extraction algorithms can be used for machine-aided translation, focusing on two particular applications: semi-automatic construction of bilingual terminology lexicons and translation memory. Automatic word alignment and terminology extraction algorithms can be combined to substantially speed up the lexicon construction process. Using a highly accurate partial alignment of term constituents, a terminologist need only recognize and correct minor errors in the recognition of term boundaries. The next generation of translation memory systems will certainly use statistical alignment algorithms and shallow parsing technology to improve coverage of current systems, by allowing for linguistic abstraction and partial sentence matching. Abstracting away from lexical units to part-of-speech, number, term, or noun phrase classes will allow these systems to mix and match components.

Automatic dictionary extraction for cross-language information retrieval

Ralf D. Brown, Jaime G. Carbonell and Yiming Yang (Carnegie Mellon University Language Technologies Institute, U.S.A.)

Keywords: Bilingual dictionary extraction, cross-language information retrieval, Generalized Vector Space Model, Latent Semantic Indexing

Abstract: In experiments comparing a variety of different methods for cross-language information retrieval using a bilingual training corpus—methods based on both machine translation and "traditional" information-retrieval techniques—a fairly simple statistical technique for automatically extracting a bilingual dictionary from parallel text proved to have the best performance. Surprisingly, an improvement to the dictionary extraction method that significantly increases the accuracy of the dictionary proved to be slightly detrimental to overall performance even though it is highly beneficial for other applications. This chapter will describe the extraction method and its enhancement in detail, and compare the performance of a retrieval system using the automatically-generated dictionaries with other retrieval methods.

Parallel texts in computer-assisted language learning

John Nerbonne (University of Groningen, The Netherlands)

Keywords: Language learning, computer-assisted language learning, computer-aided instruction, vocabulary

Abstract: Parallel bilingual texts are a valuable source of information to advanced language learners, particularly in the area of lexis and subtle lexical dependencies. Typically this information is either not available or sporadically available only in very large dictionaries. To be most effective, the corpora in question should be indexed by lexeme (not string, or word form), and should be aligned into parallel sentences. This paper surveys use and prospects.

Japanese-English aligned bilingual corpora

Hitoshi Isahara and Masahiko Haruno (Communications Research Laboratory, Japan; ATR Human Information Processing Research Laboratories, Japan)

Keywords: Corpus development, parallel corpus, automatic sentence alignment, Japanese, English

Abstract: This chapter describes the bilingual corpora developed in Japan. First, we discuss problems of corpus development and some corpora which are, or will be, available in Japan. Next, we describe the bilingual corpus project of JEIDA (Japan Electronics Industry Development Association). The main purpose of this project is to develop a medium-sized aligned parallel corpus of English and Japanese. Also through this project, we are able to discuss various facets involved in the development of a bilingual corpus, to do research on the alignment of Japanese and English sentences and to investigate automatic acquisition of linguistic knowledge using the developed corpus. This chapter offers an overview of the automatic alignment system developed by NTT (Nippon Telegraph and Telephone Co. Ltd.), which includes the entire alignment algorithm in detail. It also describes the graphical alignment environment BACCS in which the user can see the alignment results, and easily modify the results and the user dictionary.

Building a parallel corpus of English/Panjabi

Sukhdave Singh, Tony McEnery and Paul Baker (Lancaster University, United Kingdom)

Keywords: Corpus building, parallel corpus construction, writing system variation, scarcity of electronic text resources, TEI encoding, sentence alignment, Panjabi, English

Abstract: In this chapter we will be concerned primarily with the development of new parallel corpora, specifically for English paired with Indic languages. The focus of our discussion here will be Panjabi, though the issues we explore apply fairly equally to other Indic languages and scripts. We want to highlight a range of difficulties which face those constructing parallel corpus resources for the exploration of these languages, especially in the context of parallel corpora. In order to do this, two corpora, one of 16th-century Panjabi and one of modern Panjabi, will be described, and some preliminary work on English/Panjabi alignment briefly presented.

Sharing of translation memory databases derived from aligned parallel text

Alan K. Melby (Brigham Young University Translation Research Group, U.S.A.)

Keywords: TMX, localization, translation memory, XML, data interchange, standards, OSCAR, LISA

Abstract: Translation memory databases are used in order to avoid unnecessary re-translation of previously translated segments of text by automatic lookup and retrieval. Various commercial and in-house translation memory lookup tools derive translation memory databases from aligned parallel texts, but each tool uses a different internal representation for its translation memory database. In response to end-user requests, several developers of translation memory lookup tools have co-operated to define a standard intermediate format for exchanging translation memory databases from one translation technology application to another. This intermediate format, called TMX (Translation Memory eXchange), is an XML application and is thus platform independent and inspectable using a text editor. TMX can also be used as an intermediate format for aligned parallel texts in general, supporting reconstruction of original texts and optional separation of text and markup thanks to meta-markup tags. TMX was developed by OSCAR (Open Standards for Container/content Allowing Re-use), which is the data exchange standards group of LISA (Localization Industry Standards Association). The chapter describes translation memory databases, explains how they are used in the translation industry, and comments on the standard itself.
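
For readers unfamiliar with the format, the sketch below builds a very small TMX-style document from aligned sentence pairs using Python's standard library. It is an illustration only: the element and attribute names follow TMX 1.4 as commonly described (a later version than the one current when the book was published), the header values and the helper name `build_tmx` are assumptions, and the output has not been validated against the official DTD.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# ElementTree serialises this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs: Iterable[Tuple[str, str]],
              src_lang: str = "en", tgt_lang: str = "fr") -> str:
    """Serialise aligned (source, target) segments as a TMX-style string."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    # Header values below are placeholders chosen for this example.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(build_tmx([("Hello.", "Bonjour."), ("Thank you.", "Merci.")]))
```

Because the result is ordinary XML, it can be opened in a text editor or parsed back with any XML library, which is the platform independence the abstract refers to.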

Evaluation of parallel text alignment systems: The ARCADE project

Jean Véronis and Philippe Langlais (Université de Provence, France; Université de Montréal, Canada)

Keywords: Parallel corpus, evaluation, sentence alignment, word alignment, English, French

Abstract: This chapter describes the ARCADE project, concerned with the evaluation of parallel text alignment systems. The project is composed of two tracks, devoted to the evaluation of alignment at sentence and word level respectively, and is planned for a four-year period. At the time of this report, twelve systems have participated in the sentence track, and five in the word track. Substantial progress has been made on the evaluation methodology, metrics and protocols, and a large reference corpus has been produced. The results show that sentence level alignment is quite satisfactory (over 98.5% accuracy on "normal" texts), although it degrades sharply for texts that do not match perfectly at the structural level (i.e., missing fragments, order differences, etc.). State-of-the-art word alignment systems can largely improve, since they reach only ca. 75% accuracy on the "translation spotting" task on which they were evaluated.
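
As a rough illustration of how alignment output can be scored against a reference, the sketch below computes precision, recall and F-measure over sets of alignment links. It is a generic formulation and not necessarily the measure used in ARCADE: the representation of a link as a pair of sentence indices and the function name are assumptions made for this example.

```python
from typing import Iterable, Tuple

Link = Tuple[int, int]  # (source sentence index, target sentence index)


def precision_recall_f(proposed: Iterable[Link],
                       reference: Iterable[Link]) -> Tuple[float, float, float]:
    """Score proposed alignment links against reference links."""
    proposed, reference = set(proposed), set(reference)
    correct = len(proposed & reference)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    reference = [(0, 0), (1, 1), (2, 2), (3, 3)]
    proposed = [(0, 0), (1, 1), (2, 3)]
    print(precision_recall_f(proposed, reference))
```

Here two of the three proposed links are correct and two of the four reference links are found, so precision is 2/3 and recall is 1/2.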