Parallel Text Processing: Alignment and Use of Translation Corpora
Edited by Jean Véronis
(Université de Provence,
Dordrecht/Boston/London: Kluwer,
From the Rosetta stone to the information society: A survey of parallel text processing
Jean Véronis (
Keywords: Parallel texts, translation, corpora, alignment techniques, applications, evaluation
Abstract: This introductory chapter provides a survey of the processing and use of parallel texts, i.e., texts accompanied by their translation. Throughout the chapter, the various authors' contributions to the book are considered and related to the state of the art in the field. Three themes are addressed, corresponding to the three parts of the book: (i) techniques and methodology for the alignment of parallel texts at various levels such as sentences, clauses or words; (ii) applications of parallel texts in fields such as translation, lexicography, and information retrieval; and (iii) available corpus resources and evaluation of alignment methods.
Pattern recognition for mapping bitext correspondence
I. Dan Melamed (West Group,
Keywords: Bitext geometry, pattern recognition, signal-to-noise ratio, portability
Abstract: The problem of finding token-level correspondences (bitext maps) between the two halves of a bitext can be formulated in terms of pattern recognition. From this point of view, effective solutions hinge on three tasks: signal generation, noise filtering and search. The Smooth Injective Map Recognizer (SIMR) algorithm presented here integrates innovative approaches to each of these tasks. Objective evaluation has shown that SIMR's accuracy is consistently high for language pairs as diverse as French/English and Chinese/English. If necessary, SIMR's bitext maps can be efficiently converted into segment alignments using the Geometric Segment Alignment (GSA) algorithm, which is also presented here. SIMR has produced bitext maps for over 200 megabytes of French-English bitexts. GSA has converted these maps into alignments. Both the maps and the alignments are available from the Linguistic Data Consortium.
Multilingual text alignment: Aligning three or more versions of a
Michel Simard (
Keywords: Sentence alignment, word alignment, sentence alignment techniques, multilingual texts, multilingual alignment, English, French, Spanish
Abstract: This chapter addresses a number of questions regarding multilingual texts, where multilingual texts is taken as meaning texts represented in more than two languages. In particular, it raises the question of whether there is any real use for mapping out multilingual translation equivalence. The view that is proposed is that multiple versions of a text can (and should) be seen as additional sources of information that can effectively be exploited to produce better bilingual alignments. A general multilingual alignment technique is presented, whose computational complexity, for a given number of texts, is the same as that of bilingual alignment. Experimental results show how this method improves the accuracy of bilingual alignments on a trilingual corpus (The Gospel According to John, in English, French and Spanish).
A comprehensive bilingual word alignment system: Application to disparate languages: Hebrew and English
Yaacov Choueka, Ehud S. Conley and Ido Dagan (
Keywords: Parallel texts, translation, bilingual alignment, word alignment, Hebrew, English
Abstract: This chapter describes a general, comprehensive and robust word-alignment system and its application to the Hebrew-English language pair. A major goal of the system architecture is to assume as little as possible about its input and about the relative nature of the two languages, while allowing the use of (minimal) specific monolingual pre-processing resources when required. The system thus receives as input a pair of raw parallel texts and requires only a tokeniser (and possibly a lemmatiser) for each language. After tokenisation (and lemmatisation if necessary), a rough initial alignment is obtained for the texts using a version of Fung and McKeown's DK-vec algorithm (Fung & McKeown, 1997; Fung, this volume). The initial alignment is given as input to a version of the word_align algorithm (Dagan, Church and Gale, 1993), an extension of Model 2 in the IBM statistical translation model. Word_align produces a word level alignment for the texts and a probabilistic bilingual dictionary. The chapter describes the details of the system architecture, the algorithms implemented (emphasising implementation details), the issues regarding their application to Hebrew and similar Semitic languages, and some experimental results.
A knowledge-lite approach to word alignment
Lars Ahrenberg, Mikael Andersson and Magnus Merkel (
Keywords: Word alignment, parallel corpora, translation studies, lexicography, Swedish
Abstract: The most promising approach to word alignment is to combine statistical methods with non-statistical information sources. Some of the proposed non-statistical sources, including bilingual dictionaries, POS-taggers and lemmatizers, rely on considerable linguistic knowledge, while other knowledge-lite sources such as cognate heuristics and word order heuristics can be implemented relatively easily. While knowledge-heavy sources might be expected to give better performance, knowledge-lite systems are easier to port to new language pairs and text types, and they can give sufficiently good results for many purposes, e.g. if the output is to be used by a human user for the creation of a complete word-aligned bitext. In this paper we describe the current status of the Linköping Word Aligner (LWA), which combines the use of statistical measures of co-occurrence with four knowledge-lite modules for (i) word categorization, (ii) morphological variation, (iii) word order, and (iv) phrase recognition. We demonstrate the portability of the system (from English-Swedish texts to French-English texts) and present results for these two language pairs. Finally, we report observations from an error analysis of system output, and identify the major strengths and weaknesses of the system.
From sentences to words and clauses
Stelios Piperidis, Harris Papageorgiou and Sotiris Boutsis (Institute for Language and Speech Processing,
Keywords: Sentence alignment, clause alignment, lexical equivalences extraction, lexical knowledge acquisition, translation memory, Greek, English
Abstract: This chapter addresses the issue of multilingual corpora alignment, presenting schemes which attempt alignment at sentence, clause, noun phrase and word level. Statistical inductive techniques are coupled with symbolic processing analysing specific language phenomena. Sentence alignment combines statistical techniques with the notion of semantic load of text units. Lexical equivalences are extracted based on morphosyntactic tagging and noun phrase recognition on each side of the parallel corpus. A statistical score then filters the most likely translation candidates of single and multi-word units. Similarly, clause alignment couples surface linguistic analysis with a probabilistic model based on word occurrence and co-occurrence probabilities, and word lengths. The best clause alignment is approximated by feeding all possible alignments into a dynamic programming framework. Word and clause alignment have been tested on English-Greek parallel corpora of different domains, yielding results exploitable in knowledge acquisition applications. Sentence alignment has been tested in several languages and integrated in a computer-aided translation platform maximizing translation reuse and consistency.
Bracketing and aligning words and constituents in parallel text using
stochastic inversion transduction grammars
Dekai Wu (
Keywords: Word alignment, constituent alignment, bilingual language modeling, stochastic inversion transduction grammars, bilingual parsing
Abstract: We introduce (1) a novel stochastic inversion transduction grammar formalism for bilingual language modeling of sentence-pairs, and (2) the concept of bilingual parsing with a variety of parallel corpus analysis applications. Aside from the bilingual orientation, three major features distinguish the formalism from the finite-state transducers more traditionally found in computational linguistics: it skips directly to a context-free rather than finite-state base, it permits a minimal extra degree of ordering flexibility, and its probabilistic formulation admits an efficient maximum-likelihood bilingual parsing algorithm. A convenient normal form is shown to exist. Analysis of the formalism's expressiveness suggests that it is particularly well-suited to model ordering shifts between languages, balancing needed flexibility against complexity constraints. We discuss a number of examples of how stochastic inversion transduction grammars bring bilingual constraints to bear upon problematic corpus analysis tasks such as segmentation, bracketing, phrasal alignment, and parsing.
The translation network: A model for a fine-grained description of
Diana Santos (SINTEF Telecommunications and
Keywords: Translation, contrastive studies, tense and aspect, descriptive models, Portuguese, English, coercion
Abstract: In this paper, I argue for the need for more complex models to describe actual translations (in aligned corpora) and present a particular proposal designed to accomplish such a description, termed the "translation network". A translation network joins models of the two languages involved, inspired by the aspectual networks of Moens (1987), and attempts to cater, in a systematic way, for many situations that occur in actual translation. This chapter is divided into four main sections. In Section 1 a brief description of many problems that are not addressed by simpler models of translation is presented. Section 2 represents the core of this chapter, describing how translation networks are composed. This description is illuminated with examples from literary translations between English and Portuguese. Section 3 critically reviews some problems with the model, before Section 4 concludes with a short defense of it.
Parallel text alignment using crosslingual information retrieval techniques
Christian Fluhr, Fréderique Bisson and Faïza Elkateb (CEA/DIST,
Keywords: Cross-language information retrieval, weighted boolean model, sentence alignment, word alignment, bilingual corpora, French, English
Abstract: In this chapter, we demonstrate that aligning a sentence with its translation is not fundamentally different from finding a sentence on the same topic in the target corpus, using the source sentence as a query. The two processes are based on the semantic proximity of two sentences in different languages, and their major difference is that information retrieval only needs to ensure that the sentence found contains most of the information of the query, whereas sentence alignment requires that the parts that are not common to both languages be as small as possible. A crosslingual query system can be used to obtain candidates for sentence alignment, provided that the measure of semantic proximity is slightly modified. More classical techniques can be used, taking sequential order into account, but our approach is very robust to text desynchronization, such as missing text segments in one language, or texts such as glossaries or indexes that are not in the same order in different languages.
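The contrast drawn in this abstract between a retrieval-style score and an alignment-style score can be illustrated with a small sketch. It is not the authors' measure: the chapter's keywords mention a weighted boolean model, whereas the code below swaps in plain set overlap (containment for retrieval, Jaccard for alignment). The toy lexicon, the sentences and all function names are assumptions made for illustration only.

```python
def translate(tokens, lexicon):
    """Map source tokens to target words using a toy bilingual lexicon (hypothetical)."""
    return {lexicon[t] for t in tokens if t in lexicon}

def retrieval_score(query, candidate):
    """Containment: share of the query found in the candidate (extra material is ignored)."""
    return len(query & candidate) / len(query) if query else 0.0

def alignment_score(query, candidate):
    """Jaccard overlap: material present on only one side lowers the score."""
    union = query | candidate
    return len(query & candidate) / len(union) if union else 0.0

# Illustrative data, not taken from the chapter.
lexicon = {"le": "the", "chat": "cat", "dort": "sleeps"}
query = translate("le chat dort".split(), lexicon)
short = set("the cat sleeps".split())
long = set("the cat sleeps on the old red sofa all day".split())

for name, sentence in (("short", short), ("long", long)):
    print(name, retrieval_score(query, sentence), round(alignment_score(query, sentence), 3))
```

Both candidates contain the whole query, so the containment score cannot separate them, while the symmetric score prefers the candidate with less unmatched material.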
Parallel alignment of structured documents
Laurent Romary and Patrice Bonhomme (
Keywords: SGML, XML-structured documents, structural alignment, multi-level alignment, TEI
Abstract: Classical methods for parallel text alignment consider one specific level (e.g. sentences) at which two or more versions of a text are synchronised. This may lead to some problems when these documents are particularly long since alignment errors at some point in the text may, in the absence of any other linguistic information, propagate for some time without any chance of recovery. In this chapter we consider how multilingual parallel alignment can be based on the fact that more and more texts are now highly structured by means of tagging languages such as SGML. In particular we will describe recent efforts in multi-level alignment for which we will present the main advances as well as some of the difficulties to be dealt with, particularly when the text and its translation contain different encoding schemes or different encoding practices for the same scheme.
A statistical view on bilingual lexicon extraction: From parallel corpora to non-parallel corpora
Pascale Fung (
Keywords: Parallel corpora, bilingual lexicon extraction, Chinese, English
Abstract: We present two problems in the statistical extraction of bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contribution in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on the arrival distances of words. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show a 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is in the extraction of bilingual lexicons from non-parallel corpora. We present a first such result in this area, from a new method, Convec. Convec is based on context information of a word to be translated. Even though the accuracy for the top translation candidate is about 30% for 3 months of English and Chinese newspaper material, we show a dramatic increase of accuracy when we use a larger evaluation corpus in English and French. We find a 75% precision for the top three candidate translations of 75 content words, on English Wall Street Journal and French European News from different years.
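As a rough illustration of the "arrival distances" mentioned in this abstract, the sketch below computes, for each word, the gaps between its successive occurrences in a token stream. This is only one plausible reading of the term. The abstract does not say how such vectors are compared across the two languages, so that step is not shown. The function name and the sample sentence are hypothetical.

```python
from collections import defaultdict

def arrival_distances(tokens):
    """Map each word to the list of gaps between its successive occurrences."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    # A word that occurs only once has no gaps, hence an empty list.
    return {
        word: [later - earlier for earlier, later in zip(occ, occ[1:])]
        for word, occ in positions.items()
    }

# Illustrative token stream, not taken from the chapter.
tokens = "the bank raised the rate and the bank fell".split()
gaps = arrival_distances(tokens)
print(gaps["the"], gaps["bank"])
```

Because the vectors record relative gaps instead of absolute positions, a word and its translation can show similar patterns even when the two texts drift out of step, which is presumably why such a representation suits noisy parallel corpora.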
Terminology extraction from parallel technical texts
Ingeborg Blank (
Keywords: Multilingual corpora, terminology extraction, lexical knowledge, alignment, French, German
Abstract: This chapter deals with the processing of a multilingual corpus of technical texts (patent documentation). As the relevant knowledge contained in such texts is concentrated in technical terms, the aim of the study is to extract special purpose terminology. A semi-automatic tool has been developed to help knowledge engineers, terminologists and professional translators not only to identify technical terms but also to detect possible translation equivalences and typical contexts of terms. Fully automatic bilingual term matching is not attempted. Language-specific terminology is defined by criteria suitable for an automatic procedure. Related studies in multilingual terminology extraction are also considered and the assumptions underlying these studies are examined on the corpus.
Term alignment in use: Machine-aided human translation
Éric Gaussier, David Hull and Salah Aït-Mokhtar
Research Centre Europe,
Keywords: Machine-aided human translation, translation memory, word alignment, terminology extraction, terminology alignment, English, French
Abstract: This chapter will describe how parallel text extraction algorithms can be used for machine-aided translation, focusing on two particular applications: semi-automatic construction of bilingual terminology lexicons and translation memory. Automatic word alignment and terminology extraction algorithms can be combined to substantially speed up the lexicon construction process. Using a highly accurate partial alignment of term constituents, a terminologist need only recognize and correct minor errors in the recognition of term boundaries. The next generation of translation memory systems will certainly use statistical alignment algorithms and shallow parsing technology to improve coverage of current systems, by allowing for linguistic abstraction and partial sentence matching. Abstracting away from lexical units to part-of-speech, number, term, or noun phrase classes will allow these systems to mix and match components.
Automatic dictionary extraction for cross-language information retrieval
Ralf D. Brown, Jaime G. Carbonell and Yiming Yang (Carnegie Mellon University Language Technologies Institute,
Keywords: Bilingual dictionary extraction, cross-language information retrieval, Generalized Vector Space Model, Latent Semantic Indexing
Abstract: In experiments comparing a variety of different methods for cross-language information retrieval using a bilingual training corpus—methods based on both machine translation and "traditional" information-retrieval techniques—a fairly simple statistical technique for automatically extracting a bilingual dictionary from parallel text proved to have the best performance. Surprisingly, an improvement to the dictionary extraction method that significantly increases the accuracy of the dictionary proved to be slightly detrimental to overall performance even though it is highly beneficial for other applications. This chapter will describe the extraction method and its enhancement in detail, and compare the performance of a retrieval system using the automatically-generated dictionaries with other retrieval methods.
Parallel texts in computer-assisted language learning
John Nerbonne (
Keywords: Language learning, computer-assisted language learning, computer-aided instruction, vocabulary
Abstract: Parallel bilingual texts are a valuable source of information to advanced language learners, particularly in the area of lexis, subtle lexical dependencies. Typically this information is either not available or sporadically available only in very large dictionaries. To be most effective, the corpora in question should be indexed by lexeme (not string, or word form), and should be aligned into parallel sentences. This paper surveys use and prospects.
aligned bilingual corpora
Hitoshi Isahara and Masahiko Haruno (Communications
Keywords: Corpus development, parallel corpus, automatic sentence alignment, Japanese, English
Abstract: This chapter describes the bilingual corpora developed in
Building a parallel corpus of English/Panjabi
Sukhdave Singh, Tony McEnery and Paul Baker (
Keywords: Corpus building, parallel corpus construction, writing system variation, scarcity of electronic text resources, TEI encoding, sentence alignment, Panjabi, English
Abstract: In this chapter we will be concerned primarily with the development of new parallel corpora, specifically for English paired with Indic languages. The focus of our discussion here will be Panjabi, though the issues we explore apply fairly equally to other Indic languages and scripts. We want to highlight a range of difficulties which face those constructing parallel corpus resources for the exploration of these languages, especially in the context of parallel corpora. In order to do this, two corpora, one of 16th-century Panjabi and one of modern Panjabi, will be described, and some preliminary work on English/Panjabi alignment briefly presented.
Sharing of translation memory databases derived from aligned parallel text
Alan K. Melby (Brigham Young University Translation Research Group,
Keywords: TMX, localization, translation memory, XML, data interchange, standards, OSCAR, LISA
Abstract: Translation memory databases are used in order to avoid unnecessary re-translation of previously translated segments of text by automatic lookup and retrieval. Various commercial and in-house translation memory lookup tools derive translation memory databases from aligned parallel texts, but each tool uses a different internal representation for its translation memory database. In response to end-user requests, several developers of translation memory lookup tools have co-operated to define a standard intermediate format for exchanging translation memory databases from one translation technology application to another. This intermediate format, called TMX (Translation Memory eXchange), is an XML application and is thus platform independent and inspectable using a text editor. TMX can also be used as an intermediate format for aligned parallel texts in general, supporting reconstruction of original texts and optional separation of text and markup thanks to meta-markup tags. TMX was developed by OSCAR (Open Standards for Container/content Allowing Re-use), which is the data exchange standards group of LISA (Localization Industry Standards Association). The chapter describes translation memory databases, explains how they are used in the translation industry, and comments on the standard itself.
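To make the idea of an XML exchange format for aligned segments concrete, here is a minimal sketch that writes one aligned sentence pair as a TMX-style document using Python's standard library. The element and attribute names (tmx, header, body, tu, tuv, seg, xml:lang) follow the published TMX 1.4 layout as commonly documented and are not taken from the abstract above. The header values, the sample pair and the function name are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name with the reserved "xml:" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_tmx(pairs, src_lang="en", tgt_lang="fr"):
    """Serialise aligned (source, target) segments as a TMX-style XML string."""
    root = ET.Element("tmx", version="1.4")
    # Header values are illustrative placeholders, not prescribed by the abstract.
    ET.SubElement(root, "header", {
        "creationtool": "example-sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(root, "body")
    for src, tgt in pairs:
        unit = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(root, encoding="unicode")

tmx_text = build_tmx([("Hello world.", "Bonjour le monde.")])
print(tmx_text)
```

The result is plain XML, so it can be opened in a text editor and parsed by any XML library, which is the platform independence the abstract refers to.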
Evaluation of parallel text alignment systems: The
Jean Véronis and Philippe Langlais
Keywords: Parallel corpus, evaluation, sentence alignment, word alignment, English, French
Abstract: This chapter describes the