TSD 2012, September 3-7,
Word Translation Disambiguation is the task of selecting the best translation(s) for a source word in a certain context, given a set of translation candidates. Most approaches to this problem rely on large word-aligned parallel corpora, resources that are scarce and expensive to build. In contrast, the method presented in this paper requires only large monolingual corpora, from which it builds vector space models that encode sentence-level contexts of translation candidates as feature vectors in a high-dimensional word space. Experimental evaluation shows positive contributions of the models to overall quality in German-English translation.
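The idea of encoding sentence-level contexts of translation candidates as feature vectors can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the paper: the bag-of-words features, the cosine similarity measure, the function names, and the premise that the source context words have already been mapped into the target language (for example with a bilingual dictionary).

```python
from collections import Counter
import math


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_candidate_vectors(corpus, candidates):
    """Sum the bag-of-words contexts of every sentence containing a candidate.

    `corpus` is a list of target-language sentences (monolingual data only).
    """
    vectors = {c: Counter() for c in candidates}
    for sentence in corpus:
        tokens = sentence.lower().split()
        for c in candidates:
            if c in tokens:
                # The candidate itself is not part of its own context.
                vectors[c].update(t for t in tokens if t != c)
    return vectors


def rank_candidates(context_tokens, vectors):
    """Order candidates by similarity between their vector and the context."""
    ctx = Counter(context_tokens)
    return sorted(vectors, key=lambda c: cosine(ctx, vectors[c]), reverse=True)


# Toy monolingual English corpus (hypothetical data).
corpus = [
    "the bank approved the loan",
    "money in the bank account",
    "the river shore was muddy",
    "we walked along the shore of the river",
]
vectors = build_candidate_vectors(corpus, ["bank", "shore"])
ranking = rank_candidates(["river", "walked"], vectors)
```

In this toy setting the context words "river" and "walked" overlap only with sentences containing "shore", so that candidate is ranked first. A realistic system would need far larger corpora and some form of feature weighting.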
We describe the structure of a space-efficient phrase table for phrase-based statistical machine translation with the Moses decoder. The new phrase table can be used in-memory or be partially mapped on-disk. Compared to the standard Moses on-disk phrase table implementation, it is smaller by a factor of 6. The focus of this work is the source phrase index, which is implemented using minimal perfect hash functions. Two methods are discussed that reduce the memory consumption of a baseline implementation.
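To make the notion of a source phrase index built on a minimal perfect hash function more concrete, the following sketch maps n distinct phrases to the integers 0..n-1 without collisions, using a simple hash-and-displace construction. This is an illustrative assumption, not the construction used in the paper or in Moses; the function names and the MD5-based seeded hash are hypothetical choices made for brevity.

```python
import hashlib


def seeded_hash(seed, key, size):
    """Deterministic hash of `key` under `seed`, reduced to range(size)."""
    digest = hashlib.md5(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def build_mph(keys):
    """Return a seed table defining a minimal perfect hash over distinct keys."""
    size = len(keys)
    buckets = [[] for _ in range(size)]
    for key in keys:
        buckets[seeded_hash(0, key, size)].append(key)

    seeds = [0] * size
    taken = [False] * size
    # Place the largest buckets first; they are the hardest to fit.
    order = sorted(range(size), key=lambda b: len(buckets[b]), reverse=True)
    for b in order:
        bucket = buckets[b]
        if len(bucket) <= 1:
            break
        seed = 1
        while True:
            # Try seeds until every key in the bucket hits a distinct free slot.
            slots = [seeded_hash(seed, k, size) for k in bucket]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
            seed += 1
        seeds[b] = seed
        for s in slots:
            taken[s] = True

    # Single-key buckets take the remaining slots directly; a negative
    # entry encodes the slot number instead of a seed.
    free = [s for s in range(size) if not taken[s]]
    for b in order:
        if len(buckets[b]) == 1:
            seeds[b] = -free.pop() - 1
    return seeds


def lookup(seeds, key):
    """Map a key to its slot; only meaningful for keys used in build_mph."""
    size = len(seeds)
    seed = seeds[seeded_hash(0, key, size)]
    if seed < 0:
        return -seed - 1
    return seeded_hash(seed, key, size)


phrases = ["das haus", "ein haus", "das ist", "ist ein", "haus", "das"]
seeds = build_mph(phrases)
indices = [lookup(seeds, p) for p in phrases]
```

The resulting index can address an array of target phrase data without storing the source phrases themselves, which is where the space saving comes from. Because the keys are not stored, an unknown phrase is mapped to an arbitrary slot; one common remedy, not shown here, is to keep a short fingerprint per slot and compare it at lookup time.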
In this paper, we explore several methods of improving the estimation of translation model probabilities for phrase-based statistical machine translation given in-domain data sparsity. We introduce a hierarchical variant of maximum a posteriori (MAP) adaptation for domain adaptation with an arbitrary number of out-of-domain models. We note that domain adaptation can have a smoothing effect, and we explore the interaction between smoothing and the incorporation of out-of-domain data. We find that the relative contributions of smoothing and interpolation depend on the datasets used. For both the IWSLT 2011 and WMT 2011 English-French datasets, the MAP adaptation method we present improves on a baseline system by 1.5+ BLEU points.
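As a rough illustration of MAP adaptation for translation probabilities, the sketch below uses the widely used count-plus-prior form, in which an out-of-domain estimate acts as a prior that is smoothed into the in-domain relative frequency. The chaining of several models from most general to in-domain is an assumption made here to suggest what a hierarchical variant could look like; it is not claimed to be the paper's exact formulation, and the names and the weight `tau` are hypothetical.

```python
def map_estimate(count_st, count_s, prior, tau):
    """MAP estimate of p(t|s): observed counts smoothed toward a prior.

    count_st: count of the phrase pair (s, t) in the current domain
    count_s:  count of the source phrase s in the current domain
    prior:    p(t|s) from the more general model
    tau:      prior weight; larger values trust the prior more
    """
    return (count_st + tau * prior) / (count_s + tau)


def hierarchical_map(counts, base_prior, tau):
    """Chain MAP estimates, each level serving as the prior of the next.

    counts: list of (count_st, count_s) pairs, ordered from the most
            general out-of-domain model to the in-domain model.
    """
    p = base_prior
    for count_st, count_s in counts:
        p = map_estimate(count_st, count_s, p, tau)
    return p


# Hypothetical counts: a large out-of-domain model, then sparse in-domain data.
out_of_domain = (50, 100)
in_domain = (1, 2)
adapted = hierarchical_map([out_of_domain, in_domain], base_prior=0.1, tau=10.0)
```

With sparse in-domain counts the adapted estimate stays close to the out-of-domain prior, and it moves toward the in-domain relative frequency as in-domain evidence grows. This is the smoothing effect of domain adaptation noted in the abstract.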
This paper presents some problems involved in the machine translation of proper names (PNs) from English into Vietnamese. We build an English-Vietnamese comparable corpus of texts containing numerous PNs, extracted from online BBC News and translated by four machine translation (MT) systems, and use it to classify and analyze PN errors. Some pre-processing solutions for reducing and limiting errors are also proposed and tested with a manually annotated corpus in order to significantly improve MT quality.