This country is known for its abundant language resources, and its LDI is the highest of all countries on Earth ; in Asia it is followed by India, East Timor, Bhutan, the Philippines, Iran and Indonesia. These seven countries have an LDI of over 0.8, and another 22 Asian countries have an LDI of over 0.5. At the opposite end, Korea (0.003), the Maldives (0.01) and Japan (0.03) appear to be quite monolingual societies.
In Europe, the highest LDI belongs to Belgium (0.75). It is followed by Bosnia (0.66), Serbia (0.63), Moldova (0.59), Italy (0.59), Latvia (0.58), Georgia (0.58), Macedonia (0.58), Switzerland (0.58), Albania (0.57), Andorra (0.57), Austria (0.54), Monaco (0.52), and Spain (0.51). These fifteen countries have an LDI of over 0.5. Countries with a dominant mother language, such as Germany (0.37), Russia (0.33), the Netherlands (0.29), and France (0.27), generally have lower LDIs. The lowest in Europe is Hungary (0.02).
On the American continent, only three countries have an LDI of over 0.5 : Belize (0.77), Trinidad and Tobago (0.70), and Canada (0.60). Spanish-dominant countries generally have a low LDI.
In Oceania, the separation of peoples on small isolated islands has meant that islands tend to develop unique languages. Countries composed of multiple islands accordingly tend to display a higher LDI. The LDI of Vanuatu is 0.97, the highest among the countries of Oceania ; many languages are spoken on its islands. Other archipelago countries also show a high LDI : the Solomon Islands (0.97), New Caledonia (0.83), Micronesia (0.77), Fiji (0.61), and Nauru (0.60).
Local Language Ratio
In the previous section, we reviewed the overall state of linguistic diversity in the world based on data provided by Ethnologue, data that reflect the situation in the real world. We now move on to the main theme of this article, language diversity in the cyber world.
Since being launched, the Language Observatory has focused its attention on two continents, Asia and Africa. As mentioned above, the first observation results were reported during a workshop organised at Unesco headquarters in February 2005 ; they are fully documented in an article published in 2008. Recently, the project has completed another round of surveys of Asia, Africa and the Caribbean region based on 2009 data. The following sections will introduce an overview of this most recent study.
Here we propose a two-dimensional chart, tentatively named the LL-chart because it has the Local Language Ratio on the horizontal axis and the LDI on the vertical axis. The purpose of this chart is to solve a problem we encountered when preparing an LDI chart based on data from cyberspace. It often happens that languages used on the Web are completely different from languages spoken in the real world. In many cases, the latter are local languages while the former are mainly global languages like English, French or Russian. In those cases, a high LDI in cyberspace and a high LDI in the real world cannot be considered to mean the same thing. We therefore have to take into account some measure of the presence of local languages, as presented in Figure 3.
Figure 3. Schematic diagram of the LL-chart
Notice that all countries with a local language ratio $P$ fall within the area between the two curves $1-[P^2+(1-P)^2]$, Lieberson’s index in the case of two languages, and $1-P^2$, which gives the maximum value of Lieberson’s index (see note 10). When $P$ becomes larger than 0.5, the LDI becomes smaller and the plotted point moves towards the bottom-right corner. When $P$ is small, there are two possibilities : either the space left by the local language is filled by a single dominant foreign language, in which case the LDI shrinks and the point moves down and to the left ; or it is filled by multiple foreign languages, in which case the LDI grows and the point moves up and to the left.
10. The two curves provide the lower and upper limits. The curve $1-[P^2+(1-P)^2]$ gives the LDI of a two-language community. As the addition of a third-language speaker to this community increases the average probability of encountering a speaker of a different language, this value is the minimum LDI for communities with more than two languages. The curve $1-P^2$ gives the LDI of a very special case, in which every member outside the local language group speaks a different language ; this is the maximum LDI.
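The two limiting curves can be checked numerically. The following is a minimal sketch, assuming that the LDI is Lieberson's index computed as one minus the sum of squared language shares, which is the form implied by the two curve formulas. The function names `lieberson_index` and `ll_chart_bounds` are illustrative and do not come from the Language Observatory software.

```python
def lieberson_index(shares):
    """Return 1 - sum(p_i^2) for the given language shares.

    Shares are normalised first, so raw page counts may be passed as well.
    """
    total = float(sum(shares))
    if total <= 0:
        raise ValueError("shares must have a positive sum")
    return 1.0 - sum((s / total) ** 2 for s in shares)


def ll_chart_bounds(p):
    """Return (two-language curve, maximum curve) for a local language ratio p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    two_language = 1.0 - (p ** 2 + (1.0 - p) ** 2)  # remainder is one language
    maximum = 1.0 - p ** 2  # remainder spread over very many languages
    return two_language, maximum


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        low, high = ll_chart_bounds(p)
        print(f"P = {p:.1f}: LDI between {low:.2f} and {high:.2f}")
```

At P = 0.9 the two limits are 0.18 and 0.19, which illustrates why domains dominated by a local language are confined to the bottom-right corner of the chart, whereas at P = 0.1 the admissible range runs from 0.18 to 0.99.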
Mikami Yoshiki & Shigeaki Kodama
Comparison by Region : Asia, Africa, Europe and the Caribbean
Based on data collected in November 2009, the LDI and local language ratio were calculated for all country domains in Asia and Africa. As we do not have data for European countries, we used Google’s page count by language. Figures 4, 5 and 6 show the Local Language Ratio – LDI chart for these three regions.
Asian LDIs are plotted in Figure 4. China, Japan, Korea and some Arabic-speaking countries (Iraq, Saudi Arabia and Jordan) are found in the bottom-right corner, while Vietnam, Thailand, Indonesia, Israel, Turkey, Georgia and Mongolia show a relatively high local language presence.
Of note here is the situation of the Central Asian countries. Their web spaces are composed of local languages together with major components of English and Russian, although the emphasis changes by country. Kazakhstan, Kyrgyzstan, Tajikistan, and Uzbekistan have a major emphasis on Russian, while only Turkmenistan has an emphasis on English.
On the other hand, web contents in the Indian subcontinent have a nearly negligible local language presence. More than 70 % of these web contents are written in English.
The case of Laos is particular and deserves mention here. According to Ethnologue, the country’s LDI is only 0.674. Why then does it have such a high LDI on the Web ? The major reason is that the “.la” domain is actively marketed to foreigners, including customers connected to Los Angeles. As the domain is sold mainly to foreign businesses and individuals, just 8 % of web pages in the “.la” domain are in Lao.
The LDIs of African domains are plotted in Figure 5. The presence of local languages in African domains is far rarer than in Asian domains. For Arabic-speaking countries, the local language claims the majority only in Sudan and Libya ; Egypt, Mauritania, Tunisia and Tanzania, along with the rest of Africa, show very little local language presence on the Web.
Several countries nevertheless show high Web LDIs.
The LDIs of European and some Anglophone domains are plotted in Figure 6. Local language presence is above 50 % with the exception of Slovenia and Denmark, whose web spaces are dominated by English, resulting in a lower LDI. At the opposite extreme is the United Kingdom, which joins other Anglophone countries (USA, Australia and New Zealand) in displaying a characteristically low LDI.
Table 3. Language composition of the Asian and African web domains

African Domains                           Asian Domains
Language         # of pages          %    Language         # of pages          %
English          30,327,396    78.40 %    Chinese           7,832,521    20.46 %
French            2,737,455     7.08 %    Japanese          5,287,655    13.82 %
Afrikaans           660,510     1.71 %    English           4,867,355    12.72 %
Arabic              592,746     1.53 %    Russian           1,611,339     4.21 %
Chinese             391,745     1.01 %    Korean            1,100,232     2.87 %
Portuguese          348,131     0.90 %    Vietnamese          710,048     1.86 %
Russian             307,178     0.79 %    Thai                544,561     1.42 %
Spanish             276,126     0.71 %    Indonesian          308,894     0.81 %
Japanese            158,992     0.41 %    Hebrew               89,076     0.23 %
Others              879,605     2.27 %    Others           14,055,334    36.72 %
Not identified    2,005,311     5.18 %    Not identified    1,867,355     4.88 %
Total            38,685,195   100.00 %    Total            38,274,370   100.00 %

NOTE : Web data was obtained from the country-code domains of Asia and Africa in November 2009. For a list of domains, see the ANNEX.
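The page counts in Table 3 can be turned into a rough region-wide diversity figure. The sketch below is illustrative only : it assumes that the LDI is Lieberson's index (one minus the sum of squared shares), and it treats the "Others" and "Not identified" rows as a residual written in languages distinct from those listed, which is an assumption and not something the table states. Because the make-up of the residual is unknown, the hypothetical function `ldi_range` returns a lower value (residual counted as a single language) and an upper value (residual spread over very many small languages). These are aggregates over whole regions, not the per-domain values plotted in the figures.

```python
# Page counts for the named languages, copied from Table 3.
AFRICAN = {
    "English": 30_327_396, "French": 2_737_455, "Afrikaans": 660_510,
    "Arabic": 592_746, "Chinese": 391_745, "Portuguese": 348_131,
    "Russian": 307_178, "Spanish": 276_126, "Japanese": 158_992,
}
ASIAN = {
    "Chinese": 7_832_521, "Japanese": 5_287_655, "English": 4_867_355,
    "Russian": 1_611_339, "Korean": 1_100_232, "Vietnamese": 710_048,
    "Thai": 544_561, "Indonesian": 308_894, "Hebrew": 89_076,
}
# "Others" + "Not identified" rows of Table 3.
AFRICAN_RESIDUAL = 879_605 + 2_005_311
ASIAN_RESIDUAL = 14_055_334 + 1_867_355


def ldi_range(named_counts, residual):
    """Return (lower, upper) values of 1 - sum(p_i^2).

    Assumption: the residual pages are in languages other than the named ones.
    lower: the residual is one single language.
    upper: the residual is split among so many languages that its
           contribution to the sum of squares is negligible.
    """
    total = sum(named_counts.values()) + residual
    named_sq = sum((count / total) ** 2 for count in named_counts.values())
    residual_sq = (residual / total) ** 2
    return 1.0 - named_sq - residual_sq, 1.0 - named_sq


if __name__ == "__main__":
    for label, counts, rest in (("African", AFRICAN, AFRICAN_RESIDUAL),
                                ("Asian", ASIAN, ASIAN_RESIDUAL)):
        low, high = ldi_range(counts, rest)
        print(f"{label} domains: LDI between {low:.2f} and {high:.2f}")
```

Under these assumptions the African domains come out between about 0.37 and 0.38, a narrow range because English alone accounts for 78.40 % of pages, while the Asian domains fall between about 0.75 and 0.92, a wide range because more than 40 % of pages sit in the residual rows.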
Figure 4. Local Language Ratio and LDI of language composition for Asian domains
Figure 5. Local Language Ratio and LDI of language composition for African domains
Figure 6. Local Language Ratio and LDI of language composition on the European Web and for selected Anglophone countries
Challenges and Directions
The most serious challenge to measurement efforts comes from the sheer size of the growing Web. Nobody knows exactly how many web pages exist on the entire Web. In 1997, the number was estimated to be only in the millions ; by 2002, it had grown to 8 billion [MILLER 2007]. In 2008, Google announced 1 trillion URLs on the Web, but it has since stopped providing such data. Nor do other search engines provide such data, which leads us to conclude that it is not currently possible to count all the existing pages on the Web.
A different strategy is therefore needed : a sampling method that selects pages in a way that reflects the entire Web. We are currently developing what we believe to be a promising method using ANOVA (Analysis of Variance).
A further advantage of sampling is that it allows the research target to be extended to ccTLDs that have not yet been covered because of their huge size.
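To illustrate why sampling makes very large domains tractable, the sketch below estimates the share of one language from a sample of pages. It is not the ANOVA-based method mentioned above, whose details are not given here ; it is a simpler stand-in that assumes a simple random sample of pages, each already labelled with a language code, and uses a normal-approximation confidence interval. The function `estimate_share` and its parameters are hypothetical.

```python
import math


def estimate_share(sample_languages, language, z=1.96):
    """Estimate the share of `language` among pages from a random sample.

    Returns (estimate, lower, upper), where lower/upper form a
    normal-approximation interval (z = 1.96 gives roughly 95 %).
    Assumes the sample was drawn uniformly at random from the domain.
    """
    n = len(sample_languages)
    if n == 0:
        raise ValueError("sample must not be empty")
    hits = sum(1 for lang in sample_languages if lang == language)
    p_hat = hits / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


if __name__ == "__main__":
    # Made-up sample of 100 labelled pages, for illustration only.
    sample = ["en"] * 80 + ["fr"] * 15 + ["ar"] * 5
    share, low, high = estimate_share(sample, "en")
    print(f"English share: {share:.2f} (interval {low:.2f} to {high:.2f})")
```

The width of the interval depends on the sample size and not on the size of the domain, which is what would allow very large ccTLDs to be surveyed without crawling them in full.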
We have provided yearly reports on the statistics of language use on the Web at events held by Unesco or the IGF, and will continue to provide them in the future, with the following improvements :
– Extension of target ;
– Extension of identifiable languages ;
– Diversification of analysis method.
The first improvement was mentioned above. Our research target areas currently include Asia, Africa, the Caribbean, and Europe. Many ccTLDs are still missing from our research because of the limited storage capacity of our system.
The second improvement will help draw a more accurate picture of language use on the Web. Our identification engine can identify more than 300 languages, but by Ethnologue’s estimation, over 7,000 languages exist on Earth. Many of these have no written form and are only spoken. Even so, as shown in Table 3, our identifier could not identify about 5 % of collected pages, which leads us to conclude that we are overlooking many languages. As mentioned in Section 1.1, we also need to collect local encodings to investigate problems with legacy encoding.
A prototype of the third improvement was displayed in Section 3. The most basic data we can provide is a list of the number of pages in each language on each ccTLD. But these data do not tell us much about language use on the Web. We need to increase the sophistication of our interpretation to enable deeper reflection on digital language use.
With those improvements, we hope to increase the usefulness of statistics as fundamental data for considering language usage and diversity on the Web.
BIBLIOGRAPHY
[PRIOLKAR 1958] A. K. Priolkar. 1958. The Printing Press in India. Bombay : Marathi Samsodhana Mandala.
[MIKAMI 2002] Yoshiki Mikami. 2002. Global digital-divide among scripts. VishwaBharat, October 2002 Issue, p. 1.
[PIMIENTA ET AL. 2010] Daniel Pimienta, Daniel Prado and Álvaro Blanco. 2010. Twelve Years of Measuring Linguistic Diversity in the Internet. Paris : Unesco.
[HTUN ET AL. 2010] Ohnmar Htun, Shigeaki Kodama and Yoshiki Mikami. 2010. Analysis of Terminology Terms in Multilingual Terminology Dictionary. Proceedings of the 8th International Conference on Computer Applications 2010, pp. 122-128.
[UNESCO 2003] Recommendation Concerning the Promotion and Use of Multilingualism and Universal Access to Cyberspace. Paris : Unesco.
[ISC 2010] Internet Software Consortium. 2010. Internet Domain Host Count. http://www.isc.org/solutions/survey
[UNESCO 2001] Universal Declaration on Cultural Diversity. Paris : Unesco.
[MIKAMI ET AL. 2005] Yoshiki Mikami, Zavarsky Pavol, Mohd Zaidi abd Rozan, Izumi Suzuki, Masayuki Takahashi, Tomohide Maki, Irwan Nizam, Massimo Santini, Paolo Boldi, and Sebastiano Vigna. 2005. The Language Observatory Project. Proceedings of the 14th International World Wide Web Conference, p. 990.
[SUZUKI ET AL. 2002] Izumi Suzuki, Yoshiki Mikami and Ario Ohsato. 2002. A Language and Character Set Determination Method Based on N-gram Statistics. ACM Transactions on Asian Language Information Processing, Vol. 1, No. 3, pp. 270-279.
[NANDASARA ET AL. 2008] S. T. Nandasara, Shigeaki Kodama, Chew Yew Choong, Rizza Caminero, Ahmed Tarcan, Hammam Riza, Robin Lee Nagano and Yoshiki Mikami. 2008. An Analysis of Asian Language Web Pages. The International Journal on Advances in ICT for Emerging Regions (ICTer), Vol. 1, No. 1, pp. 12-23.
[LIEBERSON 1981] Stanley Lieberson and Anwar S. Dil. 1981. Language Diversity and Language Contact : essays. California : Stanford University Press.
[MILLER 2007] Colleen Miller. 2007. Web Sites : Number of Pages. NEC Research, IDC. 6 June 2007.
JOSEPH MARIANI
HOW LANGUAGE TECHNOLOGIES SUPPORT MULTILINGUALISM
The issues of multilingualism are many, and the need for it is great, both in Europe and internationally. Language technologies can help us respond, but it is necessary to develop infrastructures and generate the resources needed to conduct research on the different languages.
Some programs support this domain, but suffer from a lack of scale, continuity and cohesion. This effort deserves to be coordinated among nations and international agencies to facilitate multilingualism in Europe and globally.
Original article in French.
Translated by John Rosbottom.
JOSEPH MARIANI is currently director of the French-German Institute for Multilingual and Multimedia Information (IMMI).
He was Director of LIMSI-CNRS and Head of its Human-Machine Communication department, then Director of the Information and Communication Technology Department at the French Ministry of Research.
Since the divine punishment of Babel, mankind has had to live with the wealth of a multitude of languages and cultures. The difficulty and costs of sharing information and communicating across language barriers, while preserving these languages, could be reduced with the support of automatic language processing systems (which we will call language technologies). These are the object of a major research effort, although one that is still insufficient and insufficiently coordinated.
THE ISSUES OF MULTILINGUALISM
The issues of multilingualism are twofold :