Using partly multilingual patents to support research on multilingual IR by building translation memories and MT systems

Lingxiao Wang¹, Christian Boitet², Mathieu Mangeot³
¹GETALP, Laboratoire d'Informatique de Grenoble, ²UJF, Grenoble 1 (LIG-GETALP), ³GETALP-LIG Laboratory, Grenoble University

Abstract

In this paper, we describe the extraction of directional translation memories (TMs) from a partly multilingual corpus of patent documents, namely the CLEF-IP collection, and the subsequent production and gradual improvement of MT systems for the associated sublanguages (one for each language). The motivation is to support the work of researchers in the MUMIA community. First, we analysed the structure of the patent documents in this collection and extracted from it multilingual parallel segments (English-German, English-French, and French-German), taking care to identify the source language, as well as monolingual segments. We then used the extracted TMs to build statistical machine translation (SMT) systems. In order to obtain more parallel segments, we also imported the monolingual segments into our post-editing system and post-edited them with the help of SMT.