[Extracted from: www.meta-net.eu/events/meta-forum-2012/index.html]
19-21 June 2012. More than 200 participants from research, various industries and politics. 57 speaker contributions. 12 award winners. Two days of intense discussions about the current state and future of language technology in Europe.
META-FORUM 2012 was organised by META-NET, a Network of Excellence consisting of 60 research centres from 34 countries. META-NET is dedicated to building the technological foundations of a multilingual European information society. META-NET is forging META, the Multilingual Europe Technology Alliance, with currently more than 640 members.
This year's META-FORUM was the third edition, after two successful events in Brussels 2010 and Budapest 2011. The timing of META-FORUM 2012 was ideal: currently, there is a lot of discussion about topics in upcoming, long-term research programmes. The META-FORUM event, the Language White Paper series and the Strategic Research Agenda (SRA) developed within META-NET, all aim at presenting a very strong message from the language technology community.
In what follows, after a summary of the key messages, you will find short descriptions of all presentations. Links are provided to videos of the presentations and to the slides used. We recommend also watching the videos: they are short and contain much more detail than these summaries.
What follows is an analysis and synthesis of ideas brought out during META-FORUM 2012. It is very high level, and you should watch the presentations to get a better understanding of the points made.
Thibaut Kleiner opened the event with a description of the challenges the language technology community is facing. The Europe and its languages session discussed the current situation for Europe's languages in the digital age.
The Industry and Innovation session was opened by Serge Gladkoff. He presented the viewpoint of GALA on the efforts of META-NET and the needs of language service providers. Tomás Pariente brought up the topic of (big) open data. Language technology can support the creation of high quality big data and provide useful means for its consumption. Radu Soricut presented a view on machine translation from an industry perspective: machine translation needs to be part of many industrial ecosystems and big data sources like the Web. George Wright introduced the BBC World Service archive and how speech technology is used to foster access to thousands of radio programmes. Florence Beaujard showed how Airbus is using controlled language to assure high quality in the life-critical area of cockpit design. Lori Thicke discussed the future of machine translation. The connection with other technologies like controlled languages, and the integration into industrial workflows, can help to achieve better quality and to bootstrap the creation of new machine translation systems. A short discussion touched, among other things, on the availability of language resources and issues that need to be resolved with regard to licensing.
Language Technologies for Europe 2020 was a joint session organised by META-NET and LT-Innovate. Georg Rehm introduced the approach of META-NET towards the topic, followed and complemented by Jochen Hummel of LT-Innovate. Hans Uszkoreit summarized the META-NET Strategic Research Agenda (SRA). The SRA encompasses three priority themes: “Translation Cloud”, presented by Andrejs Vasiljevs, “Social Intelligence and eParticipation”, presented by Marko Grobelnik, and “Socially Aware Interactive Assistants”, presented by Joseph Mariani. Seven key participants from LT-Innovate reported on the current state of the LT-Innovate innovation themes. These themes are an important input for framing how “innovation” is described within the SRA. The final joint discussion focused on the needs of end users of language technology.
The first day closed with a fireworks display of twelve winners: eleven META Seals of Recognition were given to products in various areas of language technology: information extraction (Jakub Zavrel); speech processing (Alessandro Tescari, Siegfried Kunzmann, Radim Kudla and Joseph Mariani); machine translation (Heidi Depraetere, Bernardo Magnini, Radu Soricut, Kirti Vashee and Dion Wiggins, and Tony O'Dowd); and basic tools for natural language processing (Lauri Karttunen). The META Prize was awarded to the JRC Optima Activity for Language Technology, represented by Erik van der Goot. JRC develops the Europe Media Monitor. EMM is a language-technology-enabled service that gathers information from news portals in 60 languages, classifies the articles, analyses news texts, issues alerts and produces visual presentations of the news.
The second day started with an overview of short and long term opportunities for LT Research and Innovation on the European Level. Kimmo Rossi introduced two new programmes that are currently being planned: Horizon 2020 and Connecting Europe Facility (CEF). Roberto Cencioni provided details about topics in current calls for project proposals.
The topic of exchanging and re-using language resources was the primary focus of the META-SHARE session. META-SHARE is an open infrastructure to foster LR/LT sharing and re-use. Stelios Piperidis described general aspects of META-SHARE, related, e.g., to licensing. The following three presenters focused on contributions to META-SHARE: Tamás Váradi mostly for Slavic languages, Antonio Branco for south European languages, and Andrejs Vasiljevs for north European languages, with a specific focus on smaller languages. In the final discussion it became clear that, given this input, META-SHARE is already populated with many resources that are useful for research and industry. Now the sustainability of META-SHARE and the availability of resources under adequate licenses need to be assured.
The final panel session on LT research in EU member states and regions complemented the previous session on support on the European level. The situation was explained for the countries Hungary (Károly Gruber), Bulgaria (Diana Popova), Czech Republic (Karel Oliva), France (Edouard Geoffrois), the Netherlands (Alice Dijkstra), and Slovenia (Simona Bergoč). Joseph Mariani explained various political instruments to support joint research between member states. The panel discussion focused on benefits of coordinated programmes on the European level and methods to create these.
The keynote lecture of META-FORUM 2012 was given by Fernando Pereira. He talked about language technology efforts at Google. Here the focus is on language technology workflows that scale to the Web and inter-relate external knowledge bases with Web content. The good news is that language technology has its role in this workflow and achieves better results than the simple matching of patterns in texts; however, to be able to compete in such industrial scenarios, language technology must be robust and scalable.
Hans Uszkoreit gave a short closing presentation, coming back to the challenges for the LT community mentioned by Thibaut Kleiner. META-FORUM 2012 presented impressive results that European language technology has achieved so far, and issues that need to be addressed in the coming months. The community is in good shape to deal with these challenges and will present the outcomes at META-FORUM 2013.
Thibaut Kleiner, Member of the Cabinet of Neelie Kroes, Commissioner for the Digital Agenda and Vice-President of the European Commission, opened the conference with a presentation entitled “Technological Challenges of the Multilingual European Society”. He gave a warning message to the language technology community: future funding for language technology in Europe is not guaranteed. Language technology can help to master the vast amount of information on the Web in various languages, with applications like news and opinion mining or business intelligence. Such application areas for language technology are obvious; the question is mostly who will take the lead.
Other communities, e.g. around open data, have managed to win the interest of policy makers and generate output in many SMEs – language technology, too, needs a strong voice at the policy-maker level.
Hans Uszkoreit, from DFKI (the German Research Center for Artificial Intelligence), continued with a presentation on the challenges for the European language technology community.
The three major challenges for language technology are to preserve multilingual diversity, to secure the cross-lingual flow of information, and to provide all language communities with means for communication, information and knowledge management. In META-NET, the European language technology community has worked for more than two years in three lines of action to address these challenges.
META-FORUM 2012 presents various outcomes and the current state of this work, like the language white paper series, the META-VISION process involving about 100 experts and leading to a draft SRA, and the META-SHARE repositories covering more than 1300 language resources.
As a bridge to the opening keynote from Thibaut Kleiner, Hans Uszkoreit reminded us that the decision about future funding for language technology in Europe has yet to be made.
Algirdas Saudargas, Member of the European Parliament from Lithuania, emphasized that it is important to translate this “message” into political language, so that it will be taken up outside the language technology community. Only in this way will we be able to convince policy makers. In this conversation, language technology should not be described as science fiction that puts machines between humans, but as a means to support communication between humans.
András Kornai took a look at Wikipedia in his presentation. An overall analysis shows that only a small percentage of languages are in the comfort zone. Many languages are vital in terms of speakers, but not well represented in the digital world. We need enabler projects for building basic tools for these languages, and also for “heritage languages”, so that they can achieve a passive Web presence of lexicons, classical literature etc.
A specific effort was put into the cross-lingual ranking of language technology support, with broad categories like “excellent”, “good”, “moderate”, “fragmentary”, and “weak or no support”. The results vary with the technology area. For example, support for voice technologies is slightly better than for machine translation. Nevertheless, “excellent” support is not available for any language, and even the level “good” is rarely reached for languages other than English. The language white papers have detected major gaps in language technology support for each European language that need to be addressed in the near future, to assure the competitiveness of the languages in the digital market.
Panel with Representatives of the EFNIL Language Communities
Gerhard Stickel, Institute for the German Language and EFNIL (European Federation of National Institutions for Language) president, opened the panel. He introduced EFNIL as an organization, its history, the EFNIL conference series and collaborations with META-NET. META-NET can contribute new ICT related developments to EFNIL; EFNIL provides needs and use cases for language technology from the perspectives of language research, planning and teaching. Progress in language technology still has to be made, but there are more and more applications of language technology in EFNIL related areas. In the future this might lead to more exchange of knowledge between the fields, but maybe also to concrete, joint projects.
Arnfinn Muruvik Vonen, from the Language Council of Norway, spoke about the situation of Norwegian in the digital age and the role of local politicians in Norway.
Ray Fabri, National Council for the Maltese Language, described the complex bilingual situation in Malta.
Peter Spyns, De Nederlandse Taalunie, referred back to the Wikipedia analysis made by András Kornai, saying that Dutch has a good position in the digital age.
Arvi Tavast, Institute of the Estonian Language, congratulated the authors of the white paper series on their results. He added a small warning with regard to the message for smaller languages: politicians look at the (economic) outcomes of language technology research. To make these visible, projects are necessary that put the languages into the “comfort zone”, in the sense of András Kornai. For these projects, and in general, a copyright law is needed that eases the re-use of language resources.
Algirdas Saudargas, Member of the European Parliament from Lithuania, also contributed to the panel.
During the panel discussion, potential additional input to the analysis of language technology support was discussed. This includes, e.g., insights from professional translators. Additional languages, like sign languages, have to be taken into account. From an end user perspective, a need like “we want good quality speech recognition” was articulated. We also need to make clear to policy makers that such a request involves the development of the underlying language technology “food chain”, including many components, e.g. for morphological and phonological analysis.
Relating the Wikipedia analysis from András Kornai to the panel, most official European languages were categorized as “vital” in the digital world. However, to achieve wide uptake of language technology, the technology also has to be marketed in the right manner, e.g. via “cool, easy to understand apps”. Finally, it was emphasized that a European copyright law that eases the re-use of language resources for research purposes is urgently needed.
Serge Gladkoff, GALA Standards Director and GALA Board member, President of Logrus International Corporation, opened the “Industry and Innovation – Language Technology made in Europe” session. GALA is the umbrella organization for language service providers and users across the world.
Tomás Pariente, Atos Research & Innovation, took up the topic of (big) open data. In many areas, more and more unstructured data has to be processed. Atos is involved in projects that deal with such information in specific domains, e.g. finance. This is, however, only one aspect of big data: in the BIG project, the aim is to analyse big data from various perspectives, e.g. technology, business and policy, and in many different domains: health, public sector, finance & insurance, media & entertainment, manufacturing, retail, energy and transport. BIG’s main idea is to gather the relevant stakeholders; the language technology community now has the opportunity to be recognized as one of them and to contribute ideas and methodologies for handling big data.
Radu Soricut, manager of application science & engineering and senior research scientist, SDL International, gave a talk entitled “Changing the Perspective on Machine Translation”. In the past, the MT community was concerned with the MT technology itself, e.g. approaches towards MT (statistical MT vs. rule-based MT), integration with translation memories etc. However, the end customer mostly cares about the value in a customer-specific ecosystem. Hence, MT needs to be part of many infrastructures, including e.g. CMS or ERP systems, and needs to be able to make use of the large data source that is the Web.
With MT in large industrial ecosystems, the exploitation of the massive parallel data available in translation memories and connected to the Web becomes possible. MT systems can easily be tailored to customer-relevant domains. MT engines can take various information sources, like user feedback, into account. Challenges for the future of industrial-strength MT include scalability of customization, adaptation to customers and automatic learning from user feedback.
George Wright, head of the Internet Research & Future Services Team, BBC Research, gave a presentation entitled “Speech analysis and Archive research at the BBC”. The background is the BBC World Service archive, which covers 70,000 radio programmes but has only sparse metadata available for accessing them. Language technology can help to (re-)categorize the content and create links between content items and the Web.
As a result, the system e.g. generates suggestions about topics covered in a programme or identifies individual speakers. The accuracy of the results is still a challenge. One issue is the availability of adequate language resources, e.g. tools that can handle British English adequately. The development of these and other language resources must be supported, so that the vast volumes of multimedia content become accessible to a global audience.
Florence Beaujard, head of the Linguistics and Physiology Group, Airbus, gave a talk entitled “Linguistic Activities of Airbus Design Office”. In cockpit design, the special purpose language of pilots and many other constraints, like the size of displays, have to be taken into account to create clear messages and labels. This is why Airbus has defined a controlled language: it helps to reduce potential ambiguities and to improve text comprehensibility for non-native English speakers.
There are some general principles, like “one word, one meaning” or “one meaning, one word order”, underlying the controlled language. In addition, there are lexica and rules on how to write messages or labels. Collaboration with pilots and instructors is crucial for the development of the controlled language. Outcomes so far are various tools, e.g. to extract display text from the designer's specification and to automatically check its adequacy as a message or label. A desire for the future is to ease specification writing for system designers, via dedicated controlled language(s) to guide the designers.
Lori Thicke, CEO, Lexcelera Localization, and representative of Translators without Borders, talked about “Why Do We Need Language Technology”. Language technology is needed to deal with a contradictory situation: more and more content has to be translated faster, with demand for higher quality and at lower cost. Translation also plays a societal role: e.g., access to translated information in developing countries can be critical even for survival. Language technology like machine translation can also help to resolve the mismatch between the digital content available and the number of speakers in developing countries.
For the future of machine translation, it is important to see the technology as a process, including pre-production, the actual processing, post-editing etc. Quality in the source content is key to delivering quality MT. The ACCEPT project is dedicated to developing controlled language rules, which will help to manage content in social forums and ultimately lead to better quality machine translation. Work areas for the future of MT include post-editing, terminology control and the integration of MT with translation memories.
The discussion touched on the recurring issue of copyright and language resources. Both the corpus created in the “Translators without Borders” project and the archives created by the BBC are valuable resources for research purposes. But they can only be re-used if the thin but important line between distributing resources freely and making them available for research is drawn.
Controlled language was also discussed in terms of re-use. The presentation from Florence Beaujard demonstrated that a concrete controlled language is quite specific to its application scenarios. Nevertheless, there is the opportunity to re-use controlled language resources, e.g. criteria to reduce synonyms, rules to create acronyms or to generate abbreviations etc. This could be achieved by creating a standardized specification for some aspects of controlled language.
Machine translation is facing questions like what metrics to use for its evaluation. It was proposed that the same metrics should be used as in human translation, e.g. the LISA quality metrics. An issue that has no general solution is machine translation for languages with a limited amount of language resources. There is no silver bullet to solve this problem; in the end, human translators need to create the resources.
META-NET and LT-Innovate started this session with a joint slot.
Georg Rehm, META-NET and DFKI, gave a presentation entitled “Introduction and Presentation of Partnership”. After a short history of META-NET, the focus was on the META-VISION line of action for “building a community with a shared vision and strategic research agenda”: as of mid-2012, META-NET has 60 members in 34 countries. Collaboration agreements have been created with 46 other EU-funded projects.
Jochen Hummel, ESTeam and chairman of LT-Innovate, gave an “Introduction to LT Innovate”. LT-Innovate aims at promoting European language technology, unifying the industrial community, and articulating its position towards investors and policy makers.
Language Technology is the missing piece in the puzzle of the digital single market. LT-Innovate is creating an innovation agenda to fill this gap. This agenda complements the META-NET strategic research agenda (SRA), with the aim to foster adoption of research results in the market. About 150 people from the language technology industry participated in the LT-Innovate summit that took place just before META-FORUM. They discussed the “innovation agenda”, showcased their language technology applications, and demonstrated a strong voice of the European LT industry.
Hans Uszkoreit, DFKI and META-NET, gave a presentation entitled “The META-NET Strategic Research Agenda: Overview, Preparation, Dissemination”. Creating the Strategic Research Agenda (SRA) is one main task of META-NET. In the SRA, on the basis of the state of IT technology, a broad vision for the year 2020 and various strategic considerations, three interconnected priority themes have been developed. These will be accompanied by an innovation model, to be developed in close collaboration with LT-Innovate.
Various new topics will influence the SRA: big data, services & cloud computing, and shared infrastructures. Language technologies are prime candidates for “sky computing”, a new area that encompasses the federation of several clouds for creating complex services. A sky computing based, European language technology service platform can be the basis for uniting LT providers, language service providers, researchers, and providers of other services, citizens and corporate users.
Andrejs Vasiljevs, Tilde, presented the SRA priority theme “Translation Cloud”. Many applications needed by EU citizens and businesses require specific or generic translation services: eCommerce, cross-language subtitling, education etc. The translation cloud will be a ubiquitous online platform to provide these services, including various methods like machine translation or automatic language checking, for usage in and delivery to many devices. This will have huge impact, like facilitating job opportunities and creating new business opportunities in the huge global market of language services.
The current state is promising: more data and tooling for machine translation are available. Nevertheless, we still need research breakthroughs in areas like high quality MT, and research needs to be organized in close integration with industry.
Marko Grobelnik, Institut “Jožef Stefan”, presented the SRA priority theme “Social Intelligence and eParticipation”. He started with a review of various trends, like the importance of language related technologies in the Gartner hype cycle, increasing time spent on the social Web, and increasing importance of content aggregators over content creators, leading to more interlinked content and huge amounts of big data.
From this review, various recommendations for topics in a technology and research roadmap emerged: social influence and incentives, information tracking & dynamics, multimodal data processing, visualization and user interaction, and algorithmic fundamentals. An important task is now to present these topics to decision makers and show their relevance for the European citizens and eParticipation.
Joseph Mariani, CNRS-LIMSI/IMMI, presented the SRA priority theme “Socially Aware Interactive Assistants”. The aim is to create multilingual assistants which support human interaction, acting naturally and personalized in various environments, in any language and anywhere. Global abilities are needed for these assistants, like natural interaction with agents (e.g. terminals or robots). In addition, there are domain specific abilities like personalized training in computer aided language learning.
The roadmap for this priority theme encompasses these global and domain-specific aspects, as well as the creation of language resources and evaluation tasks. Other countries are active in this area as well.
The LT-Innovate Innovation Themes
Key participants of LT-Innovate presented aspects of the “innovation themes” which are under development.
Rubén Riestra, INMARK International Area, provided a general introduction to the envisaged “LT innovation agenda”. The aim is to produce a vision statement on how innovation should enable LT providers to deliver value, that is: new products and services for the digital single market. LT-Innovate has identified five main “innovation clusters”: iEnterprise, iHealth, iHelpers, iServices, and iSkills.
Rose Lockwood, INMARK International Area, presented the approach for writing the innovation agenda. The aim was to create a consolidated view of the software market and the potential “LT market”. This should also include a commercialized LT view that will influence both LT companies and the research community. LT-Innovate has tracked LT-related news intensively, leading to the five innovation clusters.
Philippe Wacker, EMF, emphasized the importance of innovation for bringing European language technology to the market.
Paul Welham, CereProc Ltd., presented findings from a panel discussion at the LT-Innovate summit about language technology for people with disabilities and special needs. The aging population creates many challenges, but it also leads to many opportunities for language technology applications. An example is avatars to support communication of elderly people.
Claude de Loupy, Syllabs, presented opportunities for user and product analysis. Language technology can create more value in areas like eCommerce or the travel industry.
Adriane Rinsche, Language Technology Centre Ltd., presented promises of language technology in the health care market. Language technology can help to save costs and improve services, e.g. for patient related information management or health monitoring. There are also multilingual aspects like medical information in tourism. Language technology tools that interface easily with each other and medical infrastructure will lead to excellent opportunities in this market.
The joint session between META-NET and LT-Innovate was wrapped up by a short discussion. One topic was the gap between what language technology can already achieve and the needs of the end user. Some types of language technology are seeing more and more uptake, e.g. speech interfaces. But widespread adoption is yet to come. The overall usability of language technology has to become a focus of efforts; or, in other words: we have solutions, but what was the problem?
Nicoletta Calzolari, CNR, and Georg Rehm, DFKI, chaired the “LT Fireworks” session. Georg Rehm briefly introduced the background of the META Seal of Recognition awards and the META Prize: these awards are given annually at the META-FORUM event, and winners are chosen by the META Technology Council: around 30 experts of the European LT landscape who provide the main input to the Strategic Research Agenda (SRA).
Alessandro Tescari received the seal of recognition for Pervoice. Pervoice provides speech recognition using large vocabularies and handling multiple languages for specific sectors. Solutions based on Pervoice include a remote transcription system, transcription workflow and subtitling solutions.
Siegfried Kunzmann received the seal of recognition for European Media Lab. The EML transcription platform helps to bring automatic transcription to various markets. One important usage scenario is the automatic transcription of voicemails to SMS, e-mail or mobile devices.
Jakub Zavrel received the seal of recognition for Textkernel. Extract! and other Textkernel products use language technologies and machine learning for the extraction of information from CVs. This saves time when processing CVs into recruitment systems and eases the aggregation of searchable information.
Heidi Depraetere, on behalf of Paraic Sheridan, received the seal of recognition for IPTranslator created within the PLuTO project. PLuTO is developing an online translation solution for patent translation. It helps the patent researcher to decide quickly whether a text in a foreign language is relevant for a given topic.
Bernardo Magnini, on behalf of Marcello Federico, received the seal of recognition for FBK, where the IRSTLM toolkit for statistical language models is developed. It provides a variety of features for creating language models, is integrated e.g. into the Moses platform, and has been used in various industrial applications.
Radu Soricut received the seal of recognition for SDL. SDL's machine translation system eases access to language pairs, integration with customer systems or control over corporate terms and brandings. High quality translation results can be delivered across 30 languages via post editing.
Radim Kudla received the seal of recognition for PHONEXIA s.r.o. PHONEXIA provides speech technologies for identifying various pieces of information from speech, e.g. different speaker, gender, language, keywords, transcription etc. The technologies are applied for example in multilingual speech transcription and keyword spotting systems.
Kirti Vashee and Dion Wiggins received the seal of recognition for Asia Online. Initially Asia Online focused on using machine translation for bringing English content into Asian languages. The scope then was extended to various domains and language pairs. Now also language pairs involving Asian and European languages are being included.
Joseph Mariani, on behalf of Bernard Prouts, received the seal of recognition for Vocapia. Vocapia has created VoxSigma, a software suite with large vocabulary speech-to-text capabilities. VoxSigma has been developed for transcribing large quantities of audio and video. It is used in many applications like media monitoring or speech analytics.
Tony O’Dowd received the seal of recognition for Xcelerator. KantanMT developed by Xcelerator is a cloud based machine translation system. It is based on the Moses platform and provides machine translation to mid-sized language service providers. KantanMT responds to the need of high-quality and low-cost machine translation.
Lauri Karttunen received the seal of recognition for XFST, developed within Xerox. XFST is a finite-state toolkit for text processing, e.g. rewriting, tokenization or morphological analysis. Since 1993, it has been used for dozens of languages and in large corporations. The source code of XFST is planned to be made available soon under an open source license.
The members of the META Technology Council decided that the scope of the META Prize 2012 should be “Outstanding products or services supporting the European Multilingual Information Society”. There have been 19 nominations, and one clear winner: The prize was given to the JRC Optima Activity for Language Technology, represented at META-FORUM 2012 by Erik van der Goot.
JRC, the Joint Research Centre, is the EC’s in-house science service. One major application developed within JRC is the Europe Media Monitor (EMM). Started in 2002, EMM today processes 150,000 news articles per day, in 50 languages. The articles are classified according to hundreds of subjects and countries.
JRC also has created language resources of enormous value, e.g. multilingual parallel corpora in 22 languages, multilingual multi-label categorisation software, and the multilingual named entity resource JRC-Names. These resources and EMM itself are of high importance for multilingual information gathering.
Kimmo Rossi, European Commission, DG for Communications Networks, Content and Technology (CONNECT), gave the opening talk for the first session of the second day. He presented the current state of planning for two new programmes: “Horizon 2020” and the “Connecting Europe Facility (CEF)”.
In Horizon 2020, language technology is planned to be part of the industrial leadership topic, with dedicated funding instruments for SMEs. Relevant topics are related to content technologies and information management, e.g. the creation of tools for handling content in any language, or the modelling, analysis and visualization of big data.
CEF, unlike Horizon 2020, is not about research or innovation but about infrastructure. Digital service platforms in areas like eGovernment or eHealth are to be developed. Language technology comes into play via the requirement of multilingual access to online services. A core platform should provide basic language technology building blocks for free, accompanied by various generic services like machine translation.
Roberto Cencioni, of the European Commission’s DG for Communications Networks, Content and Technology (CONNECT), gave a presentation about the final 2012/2013 calls in FP7.
Themes in these calls include global content processing, mining of unstructured information and natural interaction. There are two calls: one dedicated to language, and one especially for SMEs, covering the areas of language and the handling of big data.
Three research lines are formulated in the language-related call: analytics, focusing e.g. on the interplay of text, speech, audio and video; translation, aiming at high-quality MT; and interaction, with the goal of integrating the processing of speech and additional modalities into ICT platforms. In addition there are roadmapping actions, which should target specific sectors, common tools, data sets and standards, integration and evaluation.
The SME call focuses on analytics and open data. There are project lines for the re-use of open data, the transfer and uptake of LT, and software focusing on open data and its applications.
Stelios Piperidis, ILSP, started the session on the open resource exchange infrastructure META-SHARE with a presentation entitled “Overview, Current State, Towards Version 3 of META-SHARE”.
Language resources (LRs) are needed everywhere in language-related technology. META-SHARE is a network of distributed repositories (so-called “nodes”) for sharing and exchanging LRs, aiming to match LR providers and consumers.
In META-SHARE, LRs are described via a dedicated metadata schema. It supports all services of the infrastructure, such as storage, browsing and metadata harvesting. The metadata schema describes the LR itself and also provides additional information, e.g. related to licensing.
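The actual META-SHARE schema is a detailed XML specification; as a toy sketch of what a resource description and a search or harvesting filter over such records might look like, consider the following Python fragment (every field name and value here is hypothetical, not the real schema):

```python
# Toy sketch of a language-resource metadata record of the kind a
# repository network like META-SHARE stores for each resource.
# The real META-SHARE schema is XML-based and far more detailed;
# all field names and values below are hypothetical.
record = {
    "resourceName": "Example English-German Parallel Corpus",
    "resourceType": "corpus",
    "languages": ["en", "de"],
    "licence": "CC-BY",              # licensing info travels with the record
    "distributionMedium": "downloadable",
}

def matches(record, language, resource_type):
    """Toy search/harvesting filter over metadata records."""
    return (record["resourceType"] == resource_type
            and language in record["languages"])

print(matches(record, "de", "corpus"))   # True
print(matches(record, "fr", "corpus"))   # False
```

The point is that rich, uniform metadata is what makes distributed services like cross-node search and harvesting possible at all.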
Such metadata is important for the legal framework used in META-SHARE. Various licensing templates are provided; they encompass a mix of open and openness-inspired models.
In the coming months the META-SHARE software will be improved in various areas, such as search engine optimisation and data migration. More META-SHARE nodes will be created, and ELRA-supported initiatives will be included, to achieve full deployment of META-SHARE by ELRA and its members.
Tamás Váradi, Research Institute for Linguistics, presented the contribution of the CESAR project.
One major aim is to contribute resources for these languages to META-SHARE. This encompasses monolingual corpora as well as speech corpora, lexica and language technology tools. In addition, cross-linked resources between the six languages (e.g. multilingual parallel corpora) have been developed. A long-term perspective behind these efforts is important: CESAR is going to set up a META-SHARE repository/node for hosting these language resources.
The project also contributed to the development of META-SHARE, which was the focus of the presentation. This includes among others input to the metadata model, legal or licensing aspects, and various technical areas.
METANET4U has populated its repositories/nodes with resources; seven nodes have been set up. 100% of the resources available via these nodes are new, that is, they have not been available via other distribution channels before. A future topic is the interoperability between META-SHARE and other platforms.
Andrejs Vasiljevs, Tilde, gave a presentation entitled “The contribution of META-NORD”.
META-NORD covers the Baltic and Nordic languages.
The focus of its contribution to META-SHARE was European languages with fewer than 10 million speakers. As the analysis in the language white paper series reveals, for many of these languages the amount of high-quality language resources is very limited.
META-NORD worked on filling gaps, especially in the areas of WordNets, treebanks and terminology resources. As in the other projects that presented contributions to META-SHARE, the sustainability of the repositories is of high importance, and META-NORD has committed to providing support at least for a given time frame.
META-SHARE in 2013 and beyond – Q/A and Panel Discussion
The Q/A and panel discussion first focused on concerns about the future of META-SHARE: what will happen when the underlying projects come to an end? ELRA and others involved have committed to guaranteeing support for META-SHARE for at least two years, and probably longer.
A second concern was the role of META-SHARE with regard to high-quality language resources. META-SHARE is not a means to create these resources, which are needed by the SMEs that constitute the majority of the language technology industry in Europe.
Various questions concerned licensing. META-SHARE has also been set up to be attractive to the open source community, and to this end it provides the necessary licenses. Nevertheless, the language technology community itself has expressed the need for restricted licenses. In this respect, the META-SHARE licensing options reflect the current thinking of the community.
Gruber, Hungarian Ambassador to
Diana Popova, senior expert, Science Directorate, Ministry of Education, Youth and Science, presented the situation in Bulgaria. Language technology is part of the ICT vertical research and has received funding for 20 years. Nevertheless, compared to other countries, the level of funding is still low.
Karel Oliva, member of the Council of Research, Development and Innovations of the Czech Republic, presented the situation in the Czech Republic.
Edouard Geoffrois, Ministry of Defense and French National Research Agency, presented the situation in France. Various national agencies cooperate to support language technology related topics. There are large, dedicated programs like Quaero and programs run in cooperation with other countries.
Alice Dijkstra, The Netherlands Organisation for Scientific Research (NWO), presented the situation in the Netherlands. A joint Dutch and Flemish programme for language technology that ran from 2005 to 2012 will have no successor. Nevertheless, language technology can be funded via an “LT inside” approach: it can be part of other themes such as the humanities or the creative industry. In addition, funding as part of infrastructure programmes can be acquired rather easily.
Simona Bergoč, Department for Slovene Language, Ministry of Education, Science, Culture and Sport, presented the situation of language technology activities in Slovenia.
Joseph Mariani, CNRS-LIMSI/IMMI, presented the European Commission's collaborative research instruments.
Member states and the EC need more coordination. Of the various existing coordination instruments, “Article 185” seems to be well suited for language technology. The 2008 European Council Resolution on a “European strategy on Multilingualism” provides important arguments towards policy makers for the development of language technology in Europe.
The panel discussion brought up mainly two questions: what are the benefits of coordinated programs on the European level, and what is the best approach to create them.
As an answer to the first question, several national projects that targeted similar goals were mentioned. Running such projects without coordination leads to duplication of efforts, and basic tasks like data sharing are hard to achieve. As a result, it is difficult to reach critical mass compared to other regions of the world.
Regarding the second question, establishing dedicated funding on the national or the European level requires both a bottom-up approach, involving the leading experts in the field, and a political, top-down approach. A major argument towards politicians is that multilingualism is a crucial asset of Europe.
Fernando Pereira, Google, gave the closing keynote of META-FORUM 2012, entitled “Low-Pass Semantics.”
At Google, a lot of effort is put into natural language processing. Nevertheless, the aim is not sophisticated automatic processing of small pieces of content, but language technology workflows that scale to the whole web. Here, the web serves both as a data source and as target content.
The presentation exemplified this approach with “Low-Pass Semantics”: its aim is to create links between natural language text, external knowledge bases like the so-called “knowledge graph” and other types of data.
Web pages often contain useful pieces of information, but they are hard to identify. The external knowledge base contains keys, or identifiers, of concepts. In the low-pass semantics approach, these are linked to the text, which improves consistency in interpreting Web content.
The motivation for this approach is not a research topic but a user problem: the low precision of Web search. Methodologies from natural language processing play an important role: grammar parsing or named entity recognition (NER), applied in a robust and scalable manner, help to create better links to the knowledge base than pure matching of text patterns. But language technology alone is not sufficient: at web scale, computational power is extremely important, even more than the advancement of algorithms.
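As a toy illustration of the linking idea (this is not Google's actual pipeline; all mentions, identifiers and descriptions below are invented), the following sketch disambiguates a mention by preferring knowledge-base candidates whose description mentions another linked entity occurring in the same text:

```python
# Toy sketch of entity linking in the spirit of "low-pass semantics":
# mentions found in text are mapped to identifiers in an external
# knowledge base. All mentions, IDs and descriptions are hypothetical;
# real systems combine robust NER, context models and web-scale compute.
KNOWLEDGE_BASE = {
    "paris": {"/m/paris_france": "capital of France",
              "/m/paris_texas": "city in Texas"},
    "france": {"/m/france": "country in western Europe"},
}

def link_entities(tokens):
    """Greedy linking: for each token that is a known mention, pick the
    candidate whose description mentions another known mention from the
    same text (a crude form of contextual disambiguation)."""
    links = {}
    for tok in tokens:
        candidates = KNOWLEDGE_BASE.get(tok.lower())
        if not candidates:
            continue
        def score(kb_id):
            desc = candidates[kb_id].lower()
            return sum(other.lower() in desc
                       for other in tokens
                       if other != tok and other.lower() in KNOWLEDGE_BASE)
        links[tok] = max(candidates, key=score)
    return links

print(link_entities(["Paris", "is", "in", "France"]))
# {'Paris': '/m/paris_france', 'France': '/m/france'}
```

Here the co-occurring mention "France" pulls "Paris" towards the French city rather than the Texan one; at web scale this kind of joint evidence, not string matching, is what makes the linkage consistent.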
Hans Uszkoreit, DFKI and META-NET, summarized in a brief closing session the next steps for the language technology community in Europe. The coming months will decide about the shape of language technology, including the financial support provided in Europe.
There is a lot of competition with other research fields, and language technology is just one of them. If the community wants to secure support in the future, it needs to spread a positive message widely. In addition to the SRA, next year’s META-FORUM 2013 will be one of the main instruments for conveying that message to everybody.