The tasks became increasingly difficult over the years (first voice command, using an artificial language of 1,000 words, then voice dictation 6 (20,000 words), radio/TV broadcast news transcription (in English, Arabic and Mandarin Chinese), transcription of telephone conversations (also in English, Arabic and Mandarin), meeting transcription…), with variable conditions (real time or not, different qualities of sound recording). We see that for some tasks the performance of systems is comparable to that of a human listener, making these systems operational and marketable (such as command languages). On the other hand, it is clear that for more complex tasks performance improves more slowly, justifying the continuation of the research effort. Knowledge of these performance levels helps us to determine the feasibility of an application, based on the quality level it requires. Thus, contrary to voice dialogue systems, an information retrieval system for audiovisual data does not require error-free performance in the transcription of speech.
6 http://itl.nist.gov/iad/mig/publications/ASRhistory/index.html
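Benchmark campaigns of this kind (such as the NIST evaluations cited above) typically score transcription quality by word error rate (WER) : the edit distance, in words, between the system's hypothesis and a reference transcript, divided by the reference length. A minimal sketch of this standard measure :

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of six reference words -> WER of 1/6
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

A WER of zero means a perfect transcription ; values near or above human disagreement between two transcribers mark the point where a task can be considered operational.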
A similar approach was used to monitor progress in machine translation (MT), using the BLEU metric, proposed in 2001 [PAPINENI ET AL. 2001], whereas research in MT had been conducted for about fifty years without systematically measuring the quality of results to guide future research. This measure is based on a rudimentary comparison between the output of the systems and the translations of human translators.
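The core of the comparison can be sketched in a few lines : BLEU counts how many n-grams of the system output also appear in the human reference translation (clipped so repeated words get no extra credit), combines the precisions geometrically, and applies a brevity penalty against overly short output. The following is a simplified single-reference, unsmoothed illustration of the idea, not the full metric as specified by Papineni et al. :

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: clipped n-gram precisions
    combined geometrically, times a brevity penalty.
    Single reference, no smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:          # unsmoothed: any empty level zeroes the score
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_avg)

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
score = simple_bleu(candidate, reference)
```

The scores quoted in this chapter are this ratio scaled to 0-100, which is why a human translator scores "around 80" rather than 0.8.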
Performance of machine translation systems in 22 official languages of the EU (Ph. Koehn et al. 2009)
This table gives the best performance obtained for the 462 pairs of official languages of the European Union (lacking Irish Gaelic), in terms of their BLEU score (the higher the score, the better the translation, a human translator scoring around 80). The best results correspond to the languages that benefit from research efforts in coordinated programmes and from the availability of many parallel corpora (English, French, Dutch, Spanish, German…) ; the worst are languages that have not seen similar efforts, or that are very different from the others (Hungarian, Maltese, Finnish…).
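The figure of 462 pairs follows from simple combinatorics : with n languages there are n × (n − 1) ordered source→target pairs, since translation direction matters (a French→English system is distinct from an English→French one). A quick check :

```python
def translation_pairs(n_languages: int) -> int:
    """Number of ordered source->target pairs among n languages.
    Direction matters: FR->EN and EN->FR are distinct systems."""
    return n_languages * (n_languages - 1)

# 22 official EU languages (Irish Gaelic excluded) -> 462 directed pairs
print(translation_pairs(22))  # → 462
```

The same formula accounts for the Google figures quoted later in this chapter : 52 languages give 2,652 pairs and 58 languages give 3,306 pairs.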
Referring to the initial issues, we can pick out the two key elements that are necessary for a language technology policy : the availability of monolingual resources and technologies in each language, in order to ensure the preservation of culture (and therefore of languages), and, at the same time, the availability of crosslingual resources (such as parallel corpora) and technologies for each pair of languages to be processed, in order to enable communication between humans.
But there is also an interest in developing monolingual technologies for each language concurrently, in order to better address crosslingual technologies. This facilitates the coordination of efforts : standards for data exchange and tools, feedback from experience, collections of best practices. It is also a necessity for applications such as speech translation (speech recognition in the source language, translation, then speech synthesis in the target language), crosslingual information retrieval (in order to produce a summary of the information found, regardless of the source language) or, more generally, the localisation of documents, which requires both crosslingual technologies (translation…) and monolingual ones (such as spelling and grammar checkers…). And it facilitates a shared effort between laboratories around the world, which too often work mainly on their national language, or on English only.
THE DIGITAL DIVIDE AND LANGUAGE COVERAGE
There is currently a two-speed situation and a “digital divide” between the languages for which technologies exist and the others. This is related to the “weight of languages” 7 [GASQUET-CYRUS, PETITJEAN 2009]. It should be noted that 95 % of languages are spoken by only 6 % of the world population, and some linguists believe that 90 % of languages will have disappeared within a century. We can therefore classify languages according to the data and automatic processing systems that exist for them : whether they are well, less or not at all “resourced”, or indeed whether they have only an oral tradition and no writing system at all. The availability of data is crucial for the development of usable systems, which are often based on statistical approaches.
7 See in this book : Daniel Prado, Language Presence in the Real World and Cyberspace.
Machine translation therefore requires parallel corpora, which are few in number. We try to overcome this gap by developing methods that use noisy parallel corpora, comparable corpora (texts dealing with the same topic in different languages) or quasi-comparable corpora, which are more readily available, thanks especially to the growth of the Web.
In order to resolve this digital divide, how can we take into account “minority” languages, regional languages, languages spoken by migrants, foreign or regional accents ? Who bears the cost when these languages are of no economic or political interest, or are unrelated to armed conflicts or natural disasters that would justify addressing them ? How can we ensure that the citizens of a community of states are able to communicate among themselves ? How can we reduce the risk of conflicts and crises by allowing exchanges between people ? This is now a major social and political issue, which is the subject of much debate. Thus, the International Forum of Bamako, organised in January 2009 in pursuit of the outcomes of the World Summits on the Information Society in Geneva (2003) and Tunis (2005), concluded with a commitment to promote an ethical use of information in its linguistic dimension, allowing mother-tongue education and ensuring the existence of a multilingual cyberspace, both in terms of the availability of content on the Web and of the technologies to access it.
RESEARCH EFFORTS IN THE DOMAIN
To produce the language resources and technologies needed to address multilingualism, different initiatives can be identified :
– those of big companies like Google or Microsoft ;
– national programmes in some countries, with different objectives : to process internal multilingualism (TDIL in India, NHN in South Africa) ; to understand foreign languages for geopolitical reasons (GALE or EARS in the United States, funded by the Department of Defense – DARPA) ; to ensure the use and promotion of a national or transnational language (TechnoLangue for French, STEVIN for Dutch/Flemish) ; or to maintain a place in an economic and cultural competition (Quaero in France) ;
– efforts to support R&D programmes of the European Commission ;
– international efforts to network the actors of the field, to better coordinate activities and promote greater sharing of resources (Oriental-COCOSDA, CLARIN, FLaReNet, META-NET…), and the establishment of distribution agencies for language resources, such as LDC in the United States or ELRA in Europe.
These various initiatives to address multilingualism have their advantages and drawbacks : sustainability, links with the scientific community, links with existing applications, quality control…
Producers of Information Technology
First, it must be underlined that large U.S. companies in the information technology sector make a major effort in multilingualism and crosslingualism. The Google search engine works in 145 languages (national and regional), and Google has made “free” tools for machine translation and crosslingual information retrieval available online : in April 2011, 52 languages (including Catalan and Galician) and 2,652 language pairs were available on the internet, and 58 languages and 3,306 language pairs on smartphones (including 16 languages with voice input and 24 languages with voice output). The Google Book Search library contained 7 million documents in 44 languages, and in December Google provided statistics on the evolution of human language from a corpus of 500 billion words (including 361 billion words in English and 45 billion words in French and Spanish). Microsoft, for its part, provides the MS Word spelling checker in 126 languages (233 if we consider regional variants) and a grammar checker in 6 languages (61 if we consider regional variants).
National programmes addressing language technologies to help multilingualism : TDIL in India, NHN in South Africa
Major programmes have been launched as part of public policy. The TDIL 8 programme (Technology Development for Indian Languages) is an important programme, one of the ten priorities of the Indian national programme on the information society. The target is to process (Indian) English and eighteen “recognised” Indian languages 9, with several language technologies : machine translation, text-to-speech synthesis, speech recognition, search engines, optical character recognition (OCR), spelling checkers and language resource production, all for the full group of nineteen languages. A comparable programme (NHN 10 : National Human Language Network) is taking place in South Africa for the automatic processing of the eleven national languages 11.
TechnoLangue : a programme for processing the French language
In France, TechnoLangue [CHAUDIRON, MARIANI, 2006] 12, conducted from 2002 to 2006, was a national programme aimed at producing language resources (monolingual, specialised and bilingual dictionaries, lexicons, corpora, terminology databases, language processing tools, etc.) and at conducting evaluation campaigns for written and spoken language processing. Campaigns were conducted for the processing of French on parsing, automatic terminology extraction, question-answering search engines (Q&A), text-to-speech synthesis, spoken dialogue and speech transcription (for the automatic indexing of radio and television broadcasts). In this framework, an important corpus of 1,600 hours of speech was produced, including 100 hours of transcriptions, representing a million words and 350 registered speakers. A corpus of this size had not previously existed for languages other than American English. It was therefore important to establish one for the French language, and it remains important to do the same for most languages in the world if we are to develop systems that process those languages automatically with sufficient quality.
8 http://tdil.mit.gov.in
9 Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu.
12 http://www.technolangue.net
TechnoLangue also conducted two evaluation studies on crosslingual technology. One concerned the alignment of parallel texts, first between French and English, German, Italian and Spanish, and secondly between languages with different alphabets : French and Arabic, Mandarin, Greek, Japanese, Persian and Russian. The other was an evaluation of automatic translation between English and French and between Arabic and French, including a study of the evaluation metrics employed in machine translation.
QUAERO : a French programme for processing multilingual and multimedia documents
The Quaero 13 programme was launched in France in May 2008. It covers the processing of multilingual and multimedia documents. The programme is structured around the development of about thirty technologies involving different media (text, speech, image, video, music…), which meet the needs of a group of five application areas (digitisation platforms ; media monitoring and social impact ; personalised video ; search engines ; communication portals). It is based on the use of corpora and on systematic performance assessment, and it is expected to handle more than twenty languages. The programme, consisting of public and private partners, has a budget of 200 million euros, with 100 million euros of public funding provided through the OSEO agency, over five years (2008-2013). Initial results have been achieved in the audiovisual area 14 (radio, television, online video…) : the Voxalead search engine, working in six languages (English, French, Spanish, Arabic, Mandarin and Russian) and developed by Exalead ; an aggregator of plurimedia news (text, radio, television) developed by Orange ; and a system to read e-books developed by Jouve.
13 http://www.quaero.org
14 http://voxaleadnews.labs.exalead.com
Actions of the European Union
From 2007 to 2010 the European Union benefited from having a commissioner specifically in charge of multilingualism 15, who established a High Level Group on Multilingualism that produced a report 16, and who made a presentation to the Parliament and the European Council in September 2008 17. Holding the presidency of the European Union, France organised in September 2008 the États Généraux du Multilinguisme (Multilingualism Summit) at La Sorbonne (Paris), which was followed in November 2008 by a resolution of the European Council of Ministers on multilingualism, taken up by the European Parliament in March 2009 18. The idea of a “Single European Information Space” was highlighted.
The European Commission has supported several important projects on multilingual technologies under the 6th Framework Programme for Research and Development (CLEF, TC-STAR, CHIL, etc.). In particular, the TC-STAR 19 Integrated Project covered speech translation in three languages : English, Spanish and Chinese, through an application performing automatic translation of the speeches at the European Parliament.