The 5th Workshop on

Building and Using Comparable Corpora

Special Theme: “Language Resources for

Machine Translation in Less-Resourced Languages and Domains”

LREC 2012 Workshop

26 May 2012

Istanbul, Turkey


Table of Contents:


Reinhard Rapp, Marko Tadić, Serge Sharoff, Pierre Zweigenbaum

Preface vii


Philipp Petrenz, Bonnie Webber

Robust Cross-Lingual Genre Classification through Comparable Corpora 1


Qian Yu, François Yvon, Aurélien Max

Revisiting sentence alignment algorithms for alignment visualization and evaluation 10


Inguna Skadiņa

Analysis and Evaluation of Comparable Corpora for Under-Resourced Areas of Machine Translation 17


Andrejs Vasiļjevs

LetsMT! – Platform to Drive Development and Application of Statistical Machine Translation 20


Núria Bel, Vassilis Papavasiliou, Prokopis Prokopidis, Antonio Toral, Victoria Arranz

Mining and Exploiting Domain-Specific Corpora in the PANACEA Platform 24


Adam Kilgarriff, George Tambouratzis

The PRESEMT Project 27


Béatrice Daille

Building bilingual terminologies from comparable corpora: The TTC TermSuite 29


Aimée Lahaussois, Séverine Guillaume

A viewing and processing tool for the analysis of a comparable corpus of Kiranti mythology 33


Nancy Ide

MultiMASC: An Open Linguistic Infrastructure for Language Research 42


Elena Irimia

Experimenting with Extracting Lexical Dictionaries from Comparable Corpora for English-Romanian language pair 49


Iustina Ilisei, Diana Inkpen, Gloria Corpas, Ruslan Mitkov

Romanian Translational Corpora: Building Comparable Corpora for Translation Studies 56


Angelina Ivanova

Evaluation of a Bilingual Dictionary Extracted from Wikipedia 62


Quoc Hung-Ngo, Werner Winiwarter

A Visualizing Annotation Tool for Semi-Automatical Building a Bilingual Corpus 67


Lene Offersgaard, Dorte Haltrup Hansen

SMT systems for less-resourced languages based on domain-specific data 75


Magdalena Plamada, Martin Volk

Towards a Wikipedia-extracted Alpine Corpus 81


Sanja Štajner, Ruslan Mitkov

Using Comparable Corpora to Track Diachronic and Synchronic Changes in Lexical Density and Lexical Richness 88


Dan Ştefănescu

Mining for Term Translations in Comparable Corpora 98


George Tambouratzis, Michalis Troullinos, Sokratis Sofianopoulos, Marina Vassiliou

Accurate phrase alignment in a bilingual corpus for EBMT systems 104


Kateřina Veselovská, Ngụy Giang Linh, Michal Novák

Using Czech-English Parallel Corpora in Automatic Identification of It 112


Manuela Yapomo, Gloria Corpas, Ruslan Mitkov

CLIR- and ontology-based approach for bilingual extraction of comparable documents 121


Amir Hazem, Emmanuel Morin

ICA for Bilingual Lexicon Extraction from Comparable Corpora 126


Hiroyuki Kaji, Takashi Tsunakawa, Yoshihiro Komatsubara

Improving Compositional Translation with Comparable Corpora 134


Nikola Ljubešić, Špela Vintar, Darja Fišer

Multi-word term extraction from comparable corpora by combining contextual and constituent clues 143


Robert Remus, Mathias Bank

Textual Characteristics of Different-sized Corpora 148