First, to take care of preserving cultures and languages, i.e. to allow citizens to express themselves in their first language. This question takes on a particular depth in the context of the construction of Europe, given the strong linguistic diversity within a single political entity. Thus, 75% of German citizens surveyed prefer to find websites in their own language rather than in a foreign one. One can also note that less than 30% of the web is currently estimated to be in English, a proportion that has declined sharply from a rough estimate of 50% in 2000 1. Half of European citizens speak only one language, and when they speak a second, it is not necessarily English. Only 3% of Japanese people speak a foreign language. In India, less than 5% of the population speaks English fluently. Preserving languages and, through them, their corresponding cultures responds to a strong demand from citizens.
The second challenge is to enable communication among humans, usually in the framework of common democratic structures. We are facing it in the European Union, where there are now 27 member countries and 23 official languages, representing 506 language pairs. If one considers all the European languages, one can count over 60, representing almost 4,000 pairs of languages to translate! The European Commission employs more than 2,500 translators, who in 2007 translated over a million and a half pages. This covers only a fraction of the needs: covering them all would require 8,500 translators processing 6.8 million pages annually. Accommodating the EU's linguistic diversity accounts for 30% of the budget of the European Parliament, or about 300 million euros per year, employing 500 translators and interpreters. The estimated total cost of multilingualism for the European Union is a little over one billion euros per year; but considering the number of Europeans, that represents only 2.2 euros per citizen per year, which ultimately is not prohibitive. A similar situation exists within some nations, such as India, and also internationally, with about 6,000 major languages spoken, or 36 million pairs of languages to translate… And a simple statistic: at present, thirty-two hours of new video, in all languages, are uploaded to YouTube every minute.

1. See in this book: Michael Oustinov, English Won't Be the Internet's Lingua Franca.
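The pair counts cited above follow from simple combinatorics: n languages yield n × (n − 1) ordered source-target pairs. A minimal check in Python (the function name is ours, for illustration):

```python
def translation_pairs(n_languages: int) -> int:
    """Number of ordered source-target translation pairs among n languages."""
    return n_languages * (n_languages - 1)

# The figures cited in the text:
print(translation_pairs(23))    # 23 official EU languages -> 506 pairs
print(translation_pairs(60))    # ~60 European languages -> 3,540 pairs
print(translation_pairs(6000))  # ~6,000 world languages -> 35,994,000 pairs
```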
NEEDS RELATED TO MULTILINGUALISM

At the European level, the needs related to multilingualism are very numerous: needs for the establishment of the European Digital Library (Europeana, which included, in January 2011, 14.6 million documents in 26 languages), for which crosslingual and multilingual tools must be provided to enable access for all; for the realisation of the multilingual platform for alerts and information exchange planned by the European Network and Information Security Agency (ENISA) for the Member States; for the European Patent Office, where the London Protocol has reduced the number of official languages to three (English, German and French) for reasons of cost, whereas, with more automated tools, more languages could be handled; for meetings of the European Commission, the European Parliament or the European Court of Justice, where English increasingly tends to become the only working language… Such needs respond to a real democratic necessity, to be met more generally at the international level. If we take the example of Internet governance within the UN Internet Governance Forum (IGF), only English is accepted as a working language, and a lively debate concerns the possibility of using different spellings and different accents in domain names. UNESCO's World Digital Library had 1,500 documents filed in 7 languages at its inception in April 2009 2.
Dubbing and subtitling of audiovisual works; writing technical manuals for the aerospace or automotive industries, or consumer instruction manuals; live super-titling of works of performing art; translation of the innumerable texts, videos, and radio or television programmes, in all languages; simultaneous interpreting at the multiple meetings, conferences, workshops and courses which take place throughout the world: there are many applications where language technologies can offer opportunities. Think also of the urgent needs related to scientific articles written in a mother tongue, whose number is declining markedly owing to the premium that bibliometrics places on English, at the risk of losing specialised terminology in other languages.
Add to this picture the many needs related to the accessibility of information for the visually or hearing impaired, which require translating information from one medium to another: written to oral, oral to written, oral to gesture (sign language); and, more generally, to the accessibility of information for people who do not fluently speak the language in which it was encoded, including, notably, migrants 3.
FINDINGS

The extent of these needs shows very well that they cannot all be covered by the existing, or even future, human resources of the professions dealing with language.
Taking multilingualism into account is not a top priority in any economic sector: if you ask the head of a big company what his or her priority is, none will say it is multilingualism. But if we add up the priorities in each area where it must be taken into account, we reach a very large sum. This therefore requires, in our opinion, political thought and action to raise this awareness and provide appropriate responses.
Even when multilingualism is seen as a necessity, its cost remains very high. It is this gap that calls for the development of language technologies, and for their use wherever their performance meets the needs of the target applications.
2. http://www.wdl.org/fr
3. See in this book: Viola Krebs & Vicent Climent-Ferrando, Languages, Cyberspace, Migrations.
It should be noted that language technologies have not yet reached maturity for all languages, with strong imbalances among languages. Nor do they eliminate the need for human intervention: machine translation is not good enough to translate literary works or, in general, texts that require high-quality translation. This must be said clearly. On the other hand, it can help human translators in their work, and its quality is sufficient to give an approximate translation, of web pages for example, thus meeting the needs of the general public.
Language technologies can contribute much more fully to solving the problem of multilingualism, which justifies drawing attention to their merits, especially in the funding of research programmes.
LANGUAGE TECHNOLOGIES

Language technologies are said to be monolingual when they handle a single language, multilingual when the same technology processes several (individual) languages, and crosslingual when they allow for switching and transferring from one language to another.
Language technologies cover the processing of written language, whether monolingual (morphosyntactic and syntactic analysis; text understanding; text generation; automatic summarisation; terminology extraction; information retrieval; question-and-answer systems, etc.) or crosslingual (automatic or computer-aided translation; crosslingual information retrieval, etc.).
For the processing of spoken language, there are likewise monolingual technologies (speech recognition and understanding; speech-to-text transcription, i.e. the textual transcription of what has been said; speech synthesis; spoken dialogue; speaker recognition, etc.) and crosslingual ones (identification of a spoken language; speech translation; real-time interpretation, etc.).
Finally, we must not forget gestural communication, particularly the processing of Sign Languages (recognition, synthesis and translation) 4.
These technologies can also be intermedia, i.e. translating from one medium to another, with numerous applications enabling accessibility for the disabled: Text-To-Speech synthesis for the visually impaired; automatic transcription (subtitles or supertitles), aids to lip reading and Sign Language processing for the hearing impaired; voice commands for the motor-impaired…

4. See in this book: Annelies Braffort & Patrice Dalle, Accessibility in Cyberspace: Sign Languages.
In language science and technology, research initially covered two areas pursued by two different scientific communities:
• the processing of written language (also called automatic language processing, or natural language processing, NLP), coming from linguistics and artificial intelligence;
• the processing of spoken language (called “speech communication”), coming from acoustics, signal processing and pattern recognition.
These two communities have gradually come together, due to a political will and to the use of complementary methods based on machine learning with statistical modelling.
Research in these two major areas has made great progress on the lower levels of language processing : regarding written language processing, in text segmentation, lexical analysis, morpho-syntactic and syntactic analysis ; and regarding spoken language processing, in speech recognition, Text-To-Speech synthesis, or speaker recognition.
Numerous resulting applications are now in everyday use: in written language processing, spelling and grammar checkers, monolingual and crosslingual search engines, online machine translation…; in spoken language processing, talking GPS systems, dictation systems, and the transcription and automatic indexing of audiovisual content… This list shows that many existing applications link spoken and written language (transcription of speech into text, speech synthesis from text). Spoken dialogue systems, combining speech recognition and synthesis, are also growing, but in very specific applications: voice commands on mobile phones, call centres, tourist or public-transport information, etc.
Basic architecture of a natural language processing system 5

5. META-NET White Paper Series, 2011.
Research in the field of machine translation illustrates particularly well the meeting of these two communities. The area has traditionally been studied by NLP researchers using a rule-based approach combining rules and linguistic knowledge (bilingual dictionaries, grammars, etc.). Researchers from the spoken-communication field have, for their part, applied to machine translation the machine-learning methods they had successfully used in speech recognition: aligning the same text in two languages (parallel corpora), with the same approach used to align a speech signal with its written transcription. This statistical approach has yielded significant progress, leading to the recent development of hybrid translation systems that mix statistical approaches and linguistic knowledge.
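As a concrete illustration of this statistical approach, the sketch below runs the classic IBM Model 1 expectation-maximisation loop on a tiny invented French-English parallel corpus. The corpus, variable names and iteration count are assumptions for illustration only; real systems work on millions of sentence pairs:

```python
from collections import defaultdict

# Toy French-English parallel corpus (invented for illustration)
corpus = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("une maison bleue".split(), "a blue house".split()),
]

f_vocab = {f for fs, _ in corpus for f in fs}   # source (French) words
e_vocab = {e for _, es in corpus for e in es}   # target (English) words

# IBM Model 1: word-translation probabilities t(e|f), initialised uniformly
t = {(e, f): 1.0 / len(e_vocab) for e in e_vocab for f in f_vocab}

for _ in range(20):                              # EM iterations
    count = defaultdict(float)                   # expected counts c(e, f)
    total = defaultdict(float)                   # expected counts c(f)
    for fs, es in corpus:
        for e in es:                             # E-step: distribute each target
            norm = sum(t[(e, f)] for f in fs)    # word over possible source words
            for f in fs:
                c = t[(e, f)] / norm
                count[(e, f)] += c
                total[f] += c
    for (e, f) in t:                             # M-step: re-estimate t(e|f)
        t[(e, f)] = count[(e, f)] / total[f]

best = max(e_vocab, key=lambda e: t[(e, "maison")])
print(best)  # 'house': the only English word present in both 'maison' sentences
```

These lexical probabilities are only the bottom layer; statistical systems add alignment and language models on top, and hybrid systems then inject linguistic knowledge into the same pipeline.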
The challenge now is to process information related to meaning, at the semantic and pragmatic levels, in order to establish natural dialogue between humans and machines, or to give machines the ability to participate in communication between humans. To do so, we must take into account other communication modalities (multimodal communication, processing of multimedia documents), as well as paralinguistic information (prosody, expressions of emotion, opinion and sentiment analysis).
LANGUAGE RESOURCES AND EVALUATION

To conduct research aimed at developing language technologies, it is crucial to provide a base that includes both language resources and methods for evaluating the technologies developed.
With regard to language resources, data (corpora, lexicons, dictionaries, terminology databases, etc.) are necessary both for conducting research in linguistics and for training automatic language processing systems, which are in most cases based on statistical methods. The greater the amount of data, the better the statistical model, and therefore the better the system's performance. The interoperability of language resources also invites us to think more deeply about the standards to be put in place for organising, browsing and transmitting data.
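To make the point about statistical training concrete, here is a minimal sketch of a maximum-likelihood bigram language model estimated from a toy corpus (the sentences are invented); the more text such counts are drawn from, the more reliable the resulting estimates become:

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(w2 | w1) = count(w1 w2) / count(w1) from raw sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]   # sentence boundary markers
        unigrams.update(tokens[:-1])              # history words only
        bigrams.update(zip(tokens, tokens[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

p = train_bigram(["the cat sat", "the cat ran", "the dog sat"])
print(p("the", "cat"))  # 2/3: 'cat' follows 'the' in 2 of its 3 occurrences
```

Real systems smooth these estimates to cope with unseen word pairs, which is precisely where larger corpora pay off.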
It is also necessary to have means of evaluating these technologies in order to compare system performance, using a common protocol and common test data in the context of evaluation campaigns. This allows different approaches to be compared and serves as an indicator of the quality of the research and of the advances of the technology. We now speak of “coopetition” – a mix of international competition and cooperation – which has become a standard way to carry out technological research. The Defense Advanced Research Projects Agency (DARPA) of the United States Department of Defense initiated this approach in the mid-1980s, through the National Institute of Standards and Technology (NIST) [MARIANI 1995].
History of speech recognition since 1987 according to the NIST evaluation campaigns

This chart shows the progress of automatic speech recognition over the years, through the international evaluation campaigns conducted by NIST. Plotted are the best performances obtained each year, in terms of Word Error Rate (WER) on a logarithmic scale: the effort needed to go from 100% error (where the system does not recognise any word) to 10% is comparable to that required to go from 10% to 1%.
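The WER metric used in these campaigns is conventionally computed as the word-level Levenshtein distance between the reference transcription and the system's hypothesis, divided by the reference length. A minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference
    words, computed via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution (sat/sit) and one deletion (the) over 6 reference words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 1/3
```

Note that WER can exceed 100% when the system inserts more words than the reference contains, which is why the chart's vertical axis starts at 100% error.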