Multilingual Knowledge
Our study reveals a notable gap between languages in the availability of practical and professional knowledge.
We analyzed the availability of 100 science and engineering terms in the online encyclopedia Wikipedia. One term, nitrogen, was found in different languages. All 100 terms were found in English; among the European languages, the average number of terms found was about 30. In Asian languages, an average of only 18 terms was available, while in African languages, the average was just 7. It is often said that the internet revolution will universalise access to the knowledge society; in reality, however, we find significant gaps between languages in access to professional knowledge. More details on this study can be found in [HTUN ET AL. 2010].
Mikami Yoshiki & Shigeaki Kodama
Table 1. Availability of 100 science and engineering terms on Wikipedia, by language
Number of available terms (N)        European*   Asian   African   Others
0-9                                  0           2       0
10-19                                31          30      6
20-29                                10          8       1
30-39                                4           4       1
40-49                                3           2       0
50-59                                11          1       0
60-69                                7           4       0
70-79                                5           4       0
80-89                                6           1       0
90-99                                2           0       0
100                                  1           0       0
Average number of available terms    30          18      7
Number of languages checked          80          56      8
*European languages include English.
Unesco Initiatives
Since language is a key purveyor of culture, language diversity cannot be avoided in the wider discussion of cultural diversity. Unesco has been engaged in the preservation of cultural diversity since its founding on the grounds that “intercultural dialogue and respect for cultural diversity and tolerance are essential to building lasting peace” [UNESCO 2003]. Unesco has since repeatedly published declarations and recommendations relating to cultural diversity.
In October 2003, the Unesco Member States adopted the Cyberspace Recommendation, affirming the importance of cultural diversity and emphasizing Unesco’s responsibility in maintaining it. Unesco is now responsible for developing multilingual content and systems as well as public domain content, and for facilitating access to networks and services. Under this recommendation, various activities, including WSIS (World Summit on the Information Society), IGF (Internet Governance Forum), and IMLD (International Mother Language Day), were planned and carried out.
THE LANGUAGE OBSERVATORY
The Language Observatory Project was founded in 2003 after Unesco’s Cyberspace Recommendation. The project’s main objective is to observe and provide data on the real state of language use on the web, in order to examine language diversity there.
How it Works
The Language Observatory is designed to measure the use of each language on the World Wide Web. Measurement is effected by counting the number of pages on the Web written in each language.
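The counting step described above can be sketched in a few lines of Python. This is only an illustration: the function name `language_shares` and the sample language codes are assumptions, and the input is taken to be a list holding one already-identified language code per crawled page.

```python
from collections import Counter


def language_shares(page_languages):
    """Return each language's share of pages, given one language code per page.

    `page_languages` is a hypothetical input: the output of some language
    identification step, one entry per crawled Web page.
    """
    counts = Counter(page_languages)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Share of a language = pages in that language / all pages counted.
    return {lang: n / total for lang, n in counts.items()}


if __name__ == "__main__":
    sample = ["en", "en", "th", "mn"]  # illustrative data only
    print(language_shares(sample))
```

Counting pages, rather than bytes or words, follows the measurement described in the text; other units would need a different tally.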
The project consists of two major components. The first is an instrument for collecting data from the Web using crawler robots; running high-performance parallel crawler software developed at the University of Milan [MIKAMI ET AL. 2005], it can collect millions of Web pages per day.
The second component is a language identification instrument. The Language Observatory has developed software to identify the language, script and encoding properties of Web pages with high accuracy and maximum coverage. The first version of the identification algorithm, LIM (Language Identification Module), was developed by Suzuki et al. in 2002 [SUZUKI ET AL. 2002] and implemented by Chubachi et al. in 2004.
Chew improved it in 2008 to produce the second version, G2LI, which is the one currently in use.
G2LI is capable of identifying 184 languages listed in the ISO language codes (ISO 639-1), with an average accuracy of 94% according to a recent verification test. In addition to its wide coverage of languages, it can identify various types of legacy encodings 3, which are still extensively used by many non-Latin-script user communities, as mentioned in the first part of this article. The second version employs improved preprocessing procedures and is capable of properly handling HTML entity encoding 4, which is also extensively used for many non-Latin scripts. Because of these special features, the authors believe that G2LI is the most suitable language identification instrument for measuring language use on the Web.
3 Legacy encodings are non-standardised, and often proprietary, encodings.
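To make the preprocessing concern concrete, the following Python sketch decodes a page from a non-Unicode encoding and then resolves HTML entities, so that later steps see plain text. It is not the authors’ preprocessing procedure: the function name `decode_page` is hypothetical, the encoding is assumed to be known in advance, and KOI8-R merely stands in as an example of an older non-Unicode encoding.

```python
import html


def decode_page(raw: bytes, encoding: str = "utf-8") -> str:
    """Turn raw page bytes into plain Unicode text.

    Assumption: the page's encoding has already been detected and is passed
    in by the caller; detecting it is a separate problem not shown here.
    """
    # Undecodable bytes become U+FFFD instead of raising an error.
    text = raw.decode(encoding, errors="replace")
    # Replace named and numeric entities (e.g. &alpha;, &#946;) with characters.
    return html.unescape(text)


if __name__ == "__main__":
    print(decode_page(b"&alpha;&#946;"))
    print(decode_page("Привет".encode("koi8_r"), "koi8_r"))
```

Resolving entities before identification matters because a page written entirely in entities would otherwise look like ASCII text to a character-based identifier.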
A Hidden Component: The Universal Declaration of Human Rights
Hidden inside the language identification instrument is a set of training texts for the software. The technical details are provided in [SUZUKI ET AL. 2002], but it should be mentioned that the richness and quality of the training texts are the most critical factors in a language identification task. A set of translations of the Universal Declaration of Human Rights (UDHR) provided by the UN Higher Commission for Human Rights (UNHCHR) was used for this purpose because of its wide coverage of the world’s languages.
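The role of training texts can be illustrated with a generic character n-gram classifier. This is a swapped-in textbook technique, not the LIM or G2LI algorithm, whose details are given in [SUZUKI ET AL. 2002]. The names `TRAINING_TEXTS`, `ngram_profile`, `cosine_similarity` and `identify_language` are hypothetical, and the training set is assumed to be just Article 1 of the UDHR in three languages.

```python
import math
from collections import Counter

# Hypothetical miniature training set: Article 1 of the UDHR.
TRAINING_TEXTS = {
    "en": "All human beings are born free and equal in dignity and rights. "
          "They are endowed with reason and conscience and should act "
          "towards one another in a spirit of brotherhood.",
    "fr": "Tous les êtres humains naissent libres et égaux en dignité et en "
          "droits. Ils sont doués de raison et de conscience et doivent agir "
          "les uns envers les autres dans un esprit de fraternité.",
    "de": "Alle Menschen sind frei und gleich an Würde und Rechten geboren. "
          "Sie sind mit Vernunft und Gewissen begabt und sollen einander im "
          "Geiste der Brüderlichkeit begegnen.",
}


def ngram_profile(text, n=3):
    """Count overlapping character n-grams after normalising case and spaces."""
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_language(text, training_texts=TRAINING_TEXTS):
    """Return the training language whose profile is most similar to `text`."""
    target = ngram_profile(text)
    profiles = {lang: ngram_profile(t) for lang, t in training_texts.items()}
    return max(profiles, key=lambda lang: cosine_similarity(target, profiles[lang]))


if __name__ == "__main__":
    print(identify_language("Everyone has the right to life, liberty and security of person."))
```

With training texts this short the classifier is fragile, which is one way to see why the richness and quality of the training texts matter so much.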
Of note is that not all translated UDHR texts are provided as encoded text; some are available only as image files. Image files can be read by humans but not directly by computers, so we had to transcribe the images into text data. Table 2 shows how many of the translated texts are provided in each format (322 languages were available at the date of the first search, in early 2004). More than two hundred languages use Latin script, with or without diacritics, and only three of them were given in PDF or GIF file format. In contrast, among languages using Abugida scripts 5, not a single language was presented in the form of encoded text.
4 HTML entities represent characters using only ASCII letters (e.g. the entity &alpha; represents the Greek character α).
5 Abugida scripts are syllabic scripts, most of which are derived from Indian Brahmi scripts and are currently used in South and Southeast Asia. Another important Abugida script is Amharic.
Table 2. Number of available UDHR texts from UNHCHR website by format
          Latin   Cyrillic   Other alphabet   Abjad   Abugida   Hanzi   All others   Total
Encoded   253     10         1                1       0         3       0
PDF       2       4          2                3       10        0       4
GIF       1       3          0                9       15        0       1
Total     256     17         3                13      29        3       5
NOTE: Other alphabets: Greek, Armenian and Georgian; Abjad: Arabic and Hebrew; Abugida: Amharic and all Brahmi-origin scripts used in South and Southeast Asia; Hanzi: Chinese, Japanese and Korean; All others: Assyrian, Canadian syllabics, Ojibwa, Cree, Mongolian and Yi.
This fact might itself point to the existence of a digital language divide or, in this particular case, a “digital script divide”. Upon first encountering this problem, one of the authors wrote in an essay for the Indian journal Vishbha Bharat:
“Recently I visited a website of the United Nations Higher Commission for Human Rights 6 which introduces more than three hundred different language versions – from Abkhaz to Zulu – of the Universal Declaration of Human Rights. The site claims that this text is the most widely translated text in the world, and has been awarded the Guinness World Record for having done this great job. Thus the Universal Declaration is “the most universal text” in the world.
Try now! And you can find all eighteen Indian official language versions of the 1,778 words text, with only two exceptions – Konkani and Manipuri. But really disappointing for you would be the fact that all Indian language versions are just posted as “gif” files, not in the form of encoded texts. And actually many other non Latin scripts users in the world have to feel the same kind of sadness after visiting”. [MIKAMI 2002]
Since then, many collaborators have voluntarily helped us to create text versions of these image files 7. For certain languages, we are still seeking appropriate collaborators and have had to forgo the inclusion of training texts in those languages.
6 http://www.unhchr.ch/udhr
7 Sinhala, Vietnamese, Bahasa Melayu, Lao, Persian (Farsi), Mongolian, Tamil, Uyghur, Nepali, Malayalam, Hindi, Magahi, Marathi, Sanskrit, Bengali, Saraiki, Punjabi, Gujarati, Kannada, Myanmar, Vietnamese in TCVN5712, VIQR, VPS, Assamese, Azeri, Dari, Kyrgyz, Marwari, Sindhi, Tajiki, Tamang, Telugu, Turkmen, Urdu, Uzbek. Unless otherwise noted, texts were prepared in UTF-8 encoding. UTF-8 text is not enough for our purpose in some languages which use non-standard, legacy encodings. For more details and contributors’ names, visit our site: http://gii2.nagaokaut.ac.jp/gii/lopdiary.php itemid=
Around the same time as we launched the Language Observatory Project, Eric Miller launched UDHR-in-Unicode. The objective of this project was to demonstrate the use of Unicode for a wide variety of languages, using the Universal Declaration of Human Rights (UDHR) as a representative text. Currently, UDHR-in-Unicode is housed on the Unicode Consortium website and the texts are used in the study of natural language processing 8.
8 NLTK (Natural Language Tool Kit) by Steven Bird et al. is one example.
Sponsors and Collaborators
The Language Observatory project was initiated by the authors and received funding from the Japan Science and Technology Agency (JST) through its RISTEX program from 2003 to 2007. The kick-off event, held at Nagaoka University of Technology on February 21, 2004, included as a guest Paul Hector, then director of Unesco’s Communication and Information (CI) section.
The project interacted and collaborated with many partners from various parts of the world, and joined with the African Academy of Languages (ACALAN) at a session on African languages at WSIS in Tunis in November 2005. Among the attendees were the President of ACALAN, Adama Samassekou, Daniel Pimienta of Funredes, and Daniel Prado of Union Latina.
We agreed to organise a joint African web language survey project. The project’s initial target was the African ccTLDs. In 2006, we held a workshop in Bamako, Mali, with the cooperation of ACALAN and the support of JST. Many African researchers interested in language diversity and the digital divide on the web attended the workshop.
After a fruitful workshop in Bamako, we planned a workshop to publicise our project and the digital language divide. We also held workshops at Unesco headquarters in Paris for International Mother Language Day in 2007 and 2008.
The first complete observation report, published in 2008 [NANDASARA ET AL. 2008], was the first article addressing language distribution on the Asian web. The report confirmed a significant digital language divide. English was used in more than 60% of web pages in South Asian and Southeast Asian countries. In West Asia, English dominance was less pronounced, and in some countries Arabic was the most widely used language. In Central Asia, Russian was the dominant language, except in Turkmenistan, where English was used in 90% of web pages. A minority of indigenous languages, including Turkish, Hebrew, Thai, Indonesian, Vietnamese and Mongolian, were the most used languages in their country domains. The study signified a breakthrough in understanding online language disparity, and provided a basis for future work.
LINGUISTIC DIVERSITY ON THE WEB
This section introduces some results of the Language Observatory’s language surveys.
Lieberson’s Diversity Index
Lieberson’s Diversity Index (LDI) [LIEBERSON 1981] is a widely used index of linguistic diversity. It is defined by the following formula, where $P_i$ represents the share of speakers of the i-th language in a community:
$$\mathrm{LDI} = 1 - \sum_i P_i^2$$
If everyone in a community speaks the same language, then $P_i = 1$ for that language and $P_i = 0$ for every other language. Thus the LDI of a completely monolingual community is zero. If four languages are spoken by an equal number of people, then $P_1 = P_2 = P_3 = P_4 = 0.25$ and the LDI of this multilingual community can be calculated as $\mathrm{LDI} = 1 - (0.25)^2 \times 4 = 0.75$.
Thus a higher LDI means greater linguistic diversity and a lower LDI means lower diversity.
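The index is simple to compute. The following Python sketch implements the basic formula only (one minus the sum of squared speaker shares), ignoring the bilingual case; the function name `lieberson_diversity_index` is an assumption, and the input may be either shares or raw speaker counts, since it is normalised first.

```python
def lieberson_diversity_index(speakers):
    """Return 1 minus the sum of squared language shares.

    `speakers` holds one non-negative number per language: either shares
    or raw speaker counts (they are normalised to shares here).
    """
    total = sum(speakers)
    if total <= 0:
        raise ValueError("at least one language must have speakers")
    return 1.0 - sum((s / total) ** 2 for s in speakers)


if __name__ == "__main__":
    print(lieberson_diversity_index([1.0]))                     # monolingual community
    print(lieberson_diversity_index([0.25, 0.25, 0.25, 0.25]))  # four equal languages
```

The two calls reproduce the worked cases in the text: a monolingual community and four equally spoken languages.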
Lieberson also took into account the fact that bilingual or multilingual speakers would render the formula a bit more complicated. But the basic idea of LDI can be explained by the illustration in Figure 1. The square of $P_i$ is the probability that two randomly chosen members of the community both speak the i-th language. The sum of the squares of $P_i$ therefore represents the probability that any two members of the community who meet speak the same language. Finally, this sum is subtracted from 1, giving the probability that any speaker will encounter a speaker of a different language. The dark-colored areas of the square in Figure 1 correspond to this probability.
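The probabilistic reading can be checked directly by summing over every ordered pair of distinct languages, which corresponds to the off-diagonal cells of a square whose sides are divided according to the language shares. The sketch below is illustrative: the function name `probability_of_different_languages` is an assumption, and the two people who meet are assumed to be drawn independently from the same population.

```python
from itertools import permutations


def probability_of_different_languages(shares):
    """Probability that two independently drawn people speak different languages.

    Sums P_i * P_j over all ordered pairs of distinct languages (i != j).
    `shares` must sum to 1.
    """
    return sum(p * q for p, q in permutations(shares, 2))


if __name__ == "__main__":
    shares = [0.5, 0.3, 0.2]  # illustrative shares only
    pairwise = probability_of_different_languages(shares)
    formula = 1.0 - sum(p * p for p in shares)
    print(pairwise, formula)  # the two values agree up to rounding
```

The pairwise sum and the formula agree because the squared shares are exactly the diagonal cells that the pairwise sum leaves out.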
Figure 1. A graphic interpretation of LDI
Ethnologue provides a complete list of LDI data for each country or region, together with population size and the number of indigenous and immigrant languages. Based on this data 9, the authors prepared Figure 2 to show how LDI varies across countries and across continents.
Each circle in this chart represents a country. The circle’s size corresponds to the country’s population, and its vertical position represents the country’s LDI. The two large circles on the axis for Asia correspond to India (LDI = 0.94) and China (LDI = 0.51).
9 Based on the web version, an equivalent of the 16th edition of Ethnologue.
Figure 2. Lieberson’s Diversity Index of countries by continent (based on data from Ethnologue)
As the chart illustrates, countries on the African continent have the highest language diversity among the continents, followed by Asia, Europe, America (North and South America included) and Oceania.
The highest LDI in Africa is that of the Central African Republic (LDI = 0.96); nine other countries have an LDI over 0.90 (the Democratic Republic of Congo, Tanzania, Cameroon, Chad, Mozambique, Uganda, Benin, the Ivory Coast and Liberia). Thirteen countries have an LDI above 0.80 (Togo, Zambia, Kenya, South Africa, Mali, Guinea-Bissau, Nigeria, Ethiopia, Congo, Sierra Leone, Angola, Namibia and Ghana), and seventeen have an LDI of over 0.5. The countries with the lowest LDI on the African continent are Rwanda and Burundi, with 0.004.
In Asia, the highest diversity is observed in Papua New Guinea (0.99).