First International Joint Conference on Natural Language Processing

Hainan Island, China, March 22-24, 2004

Revised selected papers

Berlin/Heidelberg: Springer, 2005

(Lecture Notes in Computer Science, vol. 3248; ISSN: 0302-9743)

ISBN: 978-3-540-24475-2


Selected abstracts

[For copyright reasons these papers cannot be reproduced in full in the archive.]


pp.110-119: Phoneme-based transliteration of foreign names for OOV problem – Wei Gao, Kam-Fai Wong and Wai Lam (Chinese University of Hong Kong)

            A proper noun dictionary is never complete, rendering name translation from English to Chinese ineffective. One way to solve this problem is not to rely on a dictionary alone but to adopt automatic translation according to pronunciation similarities, i.e. to map the phonemes comprising an English name to the phonetic representations of the corresponding Chinese name. This process is called transliteration. We present a statistical transliteration method. An efficient algorithm for aligning phoneme chunks is described. Unlike rule-based approaches, our method is data-driven. Compared to source-channel based statistical approaches, we adopt a direct transliteration model, i.e. the direction of probabilistic estimation conforms to the transliteration direction. We demonstrate performance comparable to that of a source-channel based system.
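
            The contrast between the two modelling directions mentioned in this abstract can be written out as follows. This is a generic textbook formulation, not taken from the paper; the symbols E (the English phoneme sequence) and C (the Chinese phonetic representation) are assumed here purely for illustration.

```latex
% Source-channel model: estimate the reverse direction P(E | C) and combine it with a prior P(C)
\hat{C} = \arg\max_{C} \; P(E \mid C)\, P(C)

% Direct model: estimate in the same direction as the transliteration itself
\hat{C} = \arg\max_{C} \; P(C \mid E)
```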


pp.168-176: Building a parallel bilingual syntactically annotated corpus – Jan Cuřin, Martin Čmejrek, Jiří Havelka and Vladislav Kuboň (Charles University Prague)

            This paper describes the process of building a bilingual syntactically annotated corpus, the PCEDT (Prague Czech-English Dependency Treebank). The corpus is being created at Charles University, Prague, and its release as a Linguistic Data Consortium data collection is scheduled for the spring of 2004. The paper discusses important decisions made prior to the start of the project and gives an overview of all kinds of resources included in the PCEDT.


pp.177-186: Acquiring bilingual named entity translations from content-aligned corpora – Tadashi Kumano, Hideki Kashioka, Hideki Tanaka and Takahiro Fukusima (ATR, NHK, Otemon Gakuin University)

            We propose a new method for acquiring bilingual named entity (NE) translations from non-literal, content-aligned corpora. It first recognizes NEs in each document of a bilingual pair using an NE extraction technique, then finds NE groups whose members share the same referent, and finally establishes correspondences between the bilingual NE groups. The exhaustive detection of NEs can potentially acquire translation pairs with broad coverage. The correspondences between bilingual NE groups are estimated based on the similarity of the appearance order in each document, and the correspondence performance reached F(β=1) = 71.0% when a small bilingual dictionary was also used. The total performance for acquiring bilingual NE pairs through the overall process of extraction, grouping, and corresponding was F(β=1) = 58.8%.


pp.206-215: Example-based machine translation without saying inferable predicate – Eiji Aramaki, Sadao Kurohashi, Hideki Kashioka and Hideki Tanaka (University of Tokyo, ATR, NHK)

            For natural translations, a human being does not express predicates that are inferable from the context in a target language. This paper proposes a method of machine translation which handles these predicates. First, to investigate how to translate them, we build a corpus in which predicate correspondences are annotated manually. Then, we observe the corpus, and find alignment patterns including these predicates. In our experimental results, the machine translation system using the patterns demonstrated the basic feasibility of our approach.


pp.216-223: Improving back-transliteration by combining information sources – Slaven Bilac and Hozumi Tanaka (Tokyo Institute of Technology)

            Transliterating words and names from one language to another is a frequent and highly productive phenomenon. Transliteration loses information, since important distinctions are not preserved in the process. Hence, automatically converting transliterated words back into their original form is a real challenge. However, due to its wide applicability in MT and CLIR, it is a computationally interesting problem. Previously proposed back-transliteration methods are based either on phoneme modeling or grapheme modeling across languages. In this paper, we propose a new method, combining the two models in order to enhance the back-transliterations of words transliterated in Japanese. Our experiments show that the resulting system outperforms single-model systems.


pp.224-232: Bilingual sentence alignment based on punctuation statistics and lexicon – Thomas C. Chuang, Jian-Cheng Wu, Tracy Lin, Wen-chie Shei and Jason S. Chang (Vanung University, National Tsing Hua University, National Chiao Tung University)

            This paper presents a new method of aligning bilingual parallel texts based on punctuation statistics and lexical information. It is demonstrated that punctuation statistics are an effective means of achieving good results. The task of sentence alignment of bilingual texts written in disparate language pairs like English and Chinese is reportedly more difficult. We examine the feasibility of using punctuation for high-accuracy sentence alignment. An encouraging precision rate is demonstrated in aligning sentences in bilingual parallel corpora based solely on punctuation statistics. Improved results were obtained when both punctuation statistics and lexical information were employed. We have experimented with an implementation of the proposed method on the parallel corpora of Sinorama Magazine and Records of the Hong Kong Legislative Council with satisfactory results.


pp.233-243: Automatic learning of parallel dependency treelet pairs – Yuan Ding and Martha Palmer (University of Pennsylvania)

            Induction of synchronous grammars from empirical data has long been an unsolved problem, even though generative synchronous grammars theoretically suit the machine translation task very well. This fact is mainly due to pervasive structural divergences between languages. This paper presents a statistical approach that learns dependency structure mappings from parallel corpora. The new algorithm automatically learns parallel dependency treelet pairs from loosely matched non-isomorphic dependency trees while keeping computational complexity polynomial in the length of the sentences. A set of heuristics is introduced and specifically optimized for parallel treelet learning purposes using Minimum Error Rate training.


pp.244-253: Practical translation pattern acquisition from combined language resources – Mihoko Kitamura and Yuji Matsumoto (Nara Institute of Science and Technology, Oki Electric Industry Co. Ltd.)

            Automatic extraction of translation patterns from parallel corpora is an efficient way to automatically develop translation dictionaries, and therefore various approaches have been proposed. This paper presents a practical translation pattern extraction method that greedily extracts translation patterns based on co-occurrence of English and Japanese word sequences, which can also be effectively combined with manual confirmation and linguistic resources, such as chunking information and translation dictionaries. Use of these extra linguistic resources enables it to acquire results of higher precision and broader coverage regardless of the amount of documents.


pp.254-262: An English-Hindi statistical machine translation system – Raghavendra Udupa U. and Tanveer A. Faruquie (IBM India Research Lab)

            Recently, statistical methods for natural language translation have become popular and found reasonable success. In this paper we describe an English-Hindi statistical machine translation system. Our machine translation system is based on IBM Models 1, 2, and 3. We present experimental results on an English-Hindi parallel corpus consisting of 150,000 sentence pairs. We propose two new algorithms for the transfer of fertility parameters from Model 2 to Model 3. Our algorithms have a worst-case time complexity of O(m^3), improving on the exponential-time algorithm proposed in the classical paper on IBM Models. When the maximum fertility of a word is small, our algorithms are O(m^2) and hence very efficient in practice.
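
            For readers unfamiliar with the IBM Models named in this abstract, the following is a minimal sketch of EM training for Model 1 only, the simplest of the three. It is a generic textbook version, not the authors' system, and it does not cover the fertility-transfer algorithms the paper proposes. The function name `train_ibm_model1`, the `<NULL>` token, and the toy romanized sentence pairs are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate lexical translation probabilities t(f | e) with EM.

    `pairs` is a list of (source_tokens, target_tokens) tuples.
    Returns a dict mapping (f, e) to a probability.
    """
    # Prepend a NULL source token so target words may align to nothing.
    pairs = [(["<NULL>"] + list(src), list(tgt)) for src, tgt in pairs]
    target_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked
        # E-step: spread each target word's unit mass over the source words.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)


# Toy data (invented for illustration, not from the paper's corpus).
toy_pairs = [
    (["red", "book"], ["laal", "kitaab"]),
    (["red", "house"], ["laal", "ghar"]),
    (["big", "house"], ["bada", "ghar"]),
]
table = train_ibm_model1(toy_pairs)
for f in ("kitaab", "laal"):
    print(f"t({f} | book) = {table[(f, 'book')]:.3f}")
```

            On such a tiny corpus the co-occurrence pattern alone is enough for EM to prefer pairing "book" with "kitaab" rather than with "laal"; Models 2 and 3 add alignment-position and fertility parameters on top of this lexical table.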


pp.280-289: Natural language database access using semi-automatically constructed translation knowledge – In-Su Kang, Jae-Hak J. Bae and Jong-Hyeok Lee (POSTECH, University of Ulsan)

            In most natural language database interfaces (NLDBI), translation knowledge acquisition heavily depends on human specialties, consequently undermining domain portability. This paper attempts to semi-automatically construct translation knowledge by introducing a physical Entity-Relationship schema, and by simplifying translation knowledge structures. Based on this semi-automatically produced translation knowledge, a noun translation method is proposed in order to resolve NLDBI translation ambiguities.


pp.358-366: Influence of WSD on cross-language information retrieval – In-Su Kang, Seung-Hoon Na and Jong-Hyeok Lee (POSTECH)

            Translation ambiguity is a major problem in dictionary-based cross-language information retrieval (CLIR). This paper proposes a statistical word sense disambiguation (WSD) approach for translation ambiguity resolution. Then, with respect to CLIR effectiveness, the pure effect of a disambiguation module is explored on two issues: the contribution of disambiguation weight to target term weighting, and the influence of WSD performance on CLIR retrieval effectiveness. In our investigation, we do not use pre-translation or post-translation methods, in order to exclude any mixing effects on CLIR.


pp.416-425: Bilingual chunk alignment based on interactional matching and probabilistic latent semantic indexing – Feifan Liu, Qianli Jin, Jun Zhao and Bo Xu (Institute of Automation, Chinese Academy of Sciences)

            An integrated method for bilingual chunk partition and alignment, called Interactional Matching, is proposed in this paper. Unlike earlier work, our method tries to get as much of the necessary information as possible from the bilingual corpora themselves, and through bilingual constraints it can automatically build one-to-one chunk-pairs associated with chunk-pair confidence coefficients. Also, unlike collocation extraction methods, our method partitions bilingual sentences entirely into chunks with no fragments left. Furthermore, with the technology of Probabilistic Latent Semantic Indexing (PLSI), this method can deal with not only compositional chunks, but also non-compositional ones. The experiments show that, for the overall process (including partition and alignment), our method can obtain 85% precision with 57% recall for the written language chunk-pairs and 78% precision with 53% recall for the spoken language chunk-pairs.
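
            This abstract reports precision and recall separately. To compare them with results stated as a single F-measure, the two can be combined as sketched below. The combined values are derived here from the reported percentages and are not figures given by the authors; the helper name `f_measure` is an assumption for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta = 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both scores are zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract above.
written = f_measure(0.85, 0.57)
spoken = f_measure(0.78, 0.53)
print(f"written: {written:.3f}, spoken: {spoken:.3f}")
# prints "written: 0.682, spoken: 0.631"
```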