The standardisation of such identifiers was no bed of roses. Everything related to languages is extremely sensitive: considering a speech variety a language or a dialect, for instance, is not a neutral act, and can lead to anger and misunderstanding. It is partly to limit this risk that language tags build, wherever possible, on other standards such as ISO 639 for languages.
RFC 5646 adds to these underlying standards free availability, the possibility of combination (as in the example above) and stability: unlike ISO identifiers, a tag remains valid even if the ISO removes or reassigns the identifier.
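To illustrate how such tags combine subtags drawn from several registries, here is a minimal sketch; the `parse_tag` helper is hypothetical, and the real RFC 5646 grammar allows many more subtag types (extended language subtags, variants, extensions, private use):

```python
# Minimal sketch of the RFC 5646 tag shape: language[-script][-region].
# parse_tag is a hypothetical helper, not part of any standard library.
def parse_tag(tag):
    subtags = tag.split("-")
    parsed = {"language": subtags[0].lower()}   # ISO 639 code, e.g. "zh"
    for sub in subtags[1:]:
        if len(sub) == 4 and sub.isalpha():
            parsed["script"] = sub.title()      # ISO 15924 code, e.g. "Hant"
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            parsed["region"] = sub.upper()      # ISO 3166-1 or UN M.49 code
    return parsed

print(parse_tag("zh-Hant-TW"))
# {'language': 'zh', 'script': 'Hant', 'region': 'TW'}
```

This is precisely the combination the text describes: each subtag comes from a different pre-existing standard, and the tag as a whole stays valid even if the underlying registries evolve.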
HISTORY
If the status of multilingualism on the internet today is quite good (almost perfect in terms of standardisation, less so for implementation and deployment), it was not always the case. All users of a certain age can remember when simply sending a message containing an accented character required good computer skills and the reading of much documentation. The author remembers reading about the first MIME software 22 and has fond memories of the long struggle of the 1990s to enable two French people to send each other messages in correct French. Let us therefore congratulate, a posteriori, the members of the GERET group 23, who did such a great job of consciousness-raising and training.
This long struggle has left a legacy, particularly the persistent urban legend that “the internet does not support accents”, which has led some French speakers to limit themselves, even today, to using only ASCII. Of course, their attitude is justified by the fact that we cannot, today, guarantee 100 % success, but, unfortunately, this blocks progress: in 2011, users should no longer tolerate a system that prevents them from using all the characters of their language!
22 It was named metamail, a program that even the most dinosaur-like computer scientist of today would not enjoy.
23 Groupe d’Exploitation des Réseaux Ethernet TCP/IP (Ethernet and TCP/IP Network Operations Group). It was a working group whose aim was to provide a forum for the exchange of experience among engineers operating predominantly Ethernet and TCP/IP networks.
Stéphane Bortzmeyer
And what does the future hold? In early 2011, the standardisation work is 95 % finished 24 and the problem henceforth concerns above all programming, deployment and content. Concrete work awaits those who really want to help multilingualism on the internet!
GLOSSARY
American Standard Code for Information Interchange (ASCII) An old (but still widely used) character set standardised in the U.S.A., having only the characters needed for English. As it was one of the first, and as it was born in a founding country of computing, it has long been used as the basis for many network protocols.
Domain Name System (DNS) This term refers both to the system of domain names, the tree structure for creating identifiers such as cooptel.qc.ca or véliplanchiste.com, and to the protocol enabling the retrieval of information such as the IP address, the name of the mail server, etc. from such a name.
Internationalized Domain Names (IDN) The term IDN designates domain names expressed in Unicode, such as, for example, […] 25. The acronym IDNA (Internationalised Domain Names in Applications) is sometimes used for the specific technique, in current use, that performs a local conversion to ASCII before sending to the DNS.
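As a sketch of that local conversion, Python’s built-in “idna” codec (which implements the original IDNA of RFC 3490) turns a Unicode label into the ASCII “xn--” form that is actually sent to the DNS; the label “münchen” below is just an illustrative example:

```python
# Convert a Unicode label to the ASCII-compatible form used on the wire.
label = "münchen"
ascii_form = label.encode("idna")    # Punycode form, prefixed with xn--
print(ascii_form)                    # b'xn--mnchen-3ya'

# The conversion is reversible, so software can display the Unicode name
# to the user while the DNS itself only ever sees ASCII.
print(ascii_form.decode("idna"))
```

The DNS infrastructure thus needs no change at all: only the applications at each end have to know about the conversion.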
Internet Engineering Task Force (IETF) The main standards organization for the Internet, notably in charge of layers 3 (routing) to 7 (applications). It is distinguished by its great openness, its debates and its standards (the famous RFCs) being public. http://www.ietf.org
Indian Script Code for Information Interchange (ISCII) An old (but still widely used) character set standardised in India that covers its many official scripts (a very rare case in the world: India, like the European Union, has not only several official languages but also several alphabets).
Multipurpose Internet Mail Extensions (MIME) An IETF standard giving structure to the content of an email message. This opens the possibility of including in a mail message sound, images, files of any format, and also text in any character set.
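As a small illustration of this (using Python’s standard email library; the message text here is invented for the example), MIME is what lets a message declare its character set, so that accented text survives transport:

```python
from email.message import EmailMessage

# Build a message whose body contains accented French text; the email
# library records the charset in the MIME Content-Type header for us.
msg = EmailMessage()
msg["Subject"] = "Essai"
msg.set_content("Reçu : le café est arrivé.")

print(msg["Content-Type"])   # text/plain; charset="utf-8"
print(msg.get_content())
```

Before MIME, there was no standard place to declare the charset, which is exactly why sending “correct French” by email was such a struggle.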
Requests for Comments (RFC) A numbered series of official documents describing the technical aspects of the Internet, or of various associated hardware (routers, DHCP servers). Note that not all RFCs are official standards; some are qualified as “for information only”, others as “experimental”.
24 The two major remaining gaps in the standard are FTP (File Transfer Protocol) in Unicode and the passing of email addresses in Unicode.
25 Using the national top-level domain of Tunisia.
Standards Development Organization (SDO) An organization, usually not-for-profit, that develops and maintains standards. The term is generally reserved for relatively open organizations (like the IETF, ITU or W3C), as opposed to those representing a cartel of businesses.
World Wide Web Consortium (W3C) The standards organization for Web-related formats such as HTML (format of web pages), XML (format for structured data) or CSS (layout of web pages). http://www.w3.org
BIBLIOGRAPHY
[ANDRIES 2008] Patrick Andries. Unicode en pratique. Dunod, 2008.
[UNICODE STANDARD] The Unicode Consortium. The Unicode Standard, Version 6.0.0. The Unicode Consortium, 2010.
[GILLAM 2002] Richard Gillam. Unicode Demystified: A Practical Programmer’s Guide to the Encoding Standard. Addison-Wesley, 2002.
[KORPELA 2006] Jukka K. Korpela. Unicode Explained. O’Reilly, 2006.
RFC
[RFC 1341] N. Borenstein, N. Freed. MIME (Multipurpose Internet Mail Extensions): Mechanisms for Specifying and Describing the Format of Internet Message Bodies. 1992.
[RFC 2277] H.T. Alvestrand. IETF Policy on Character Sets and Languages. 1998.
[RFC 3490] P. Faltstrom, P. Hoffman, A. Costello. Internationalizing Domain Names in Applications (IDNA). 2003.
[RFC 4952] J. Klensin, Y. Ko. Overview and Framework for Internationalized Email. 2007.
[RFC 5198] J. Klensin, M. Padlipsky. Unicode Format for Network Interchange. 2008.
[RFC 5646] A. Phillips, M. Davis. Tags for Identifying Languages. 2009.
MIKAMI YOSHIKI & SHIGEAKI KODAMA
MEASURING LINGUISTIC DIVERSITY ON THE WEB
The issue of localization in the information society has aroused curiosity, but also great concern, among many researchers. This interest prompted the authors of this paper to create the Language Observatory Project in 2003, with the intention of measuring the extent of utilisation of each language. While everyone agrees on the need for such an assessment, the methodology of the observatory and its findings deserve the attention of anyone who wants to understand the state of linguistic diversity in the digital world.
MIKAMI YOSHIKI is the director of the Language Observatory Project at Nagaoka University of Technology (Japan).
The project was initiated in 2003.
SHIGEAKI KODAMA joined as a researcher in 2006. The project studies linguistic diversity in cyberspace and periodically surveys its current status.
With the collaboration of CHEW YEW CHOONG, PANN YU MON, OHNMAR HTUN, TIN HTAY HLAING, KATSUKO T. NAKAHIRA, YOKO MITSUNAGA
Rapid development in information technology is drastically changing communication around the world, extending its reach and enriching its mode. However, new technologies do not evenly benefit all language communities, thus creating the possibility of a “digital language divide”. Let us consider an episode from the era of the printing revolution. In 1608, while stationed on the southwestern coast of India, Thomas Stephens, a Jesuit friar, wrote to Rome:
“Before I end this letter I wish to bring before Your Paternity’s mind the fact that for many years I very strongly desired to see in this Province some books printed in the language and alphabets of the land, as there are in Malabar with great benefit for that Christian community. And this could not be achieved for two reasons; the first because it looked impossible to cast so many moulds, amounting to six hundred, whilst ours in Europe amount to twenty-four”. [PRIOLKAR 1958]
At the time the friar wrote this letter, more than one hundred and fifty years had passed since Gutenberg’s innovation, but the new printing technology would not reach his parish until the XIXth century. As he mentioned, the main obstacle was the difficulty of introducing printing technology for the regional languages. Current terminology interprets this as a “localization problem”. The difficulty of casting a large number of metal typefaces would take on a different form in the age of computers and the Internet.
The question of the localization problem in an information society has aroused interest and concern for a number of researchers, including the authors. In 2003, we launched the Language Observatory Project, intending to measure the use of each language in cyberspace.
The first section of this article describes why such measurements are necessary. The second section introduces the Language Observatory Project, and the third section provides recent results obtained from our observations.
WHY MEASURE
Localization Still Matters
What the type mould was to printing technology, the character code is to today’s computer technology. We now have an international standard character code for information interchange, the ISO/IEC 10646 Universal Coded Character Set, abbreviated as UCS, or Unicode 1.
As the name implies, it covers an entire universe of character codes, from ancient writing systems such as Egyptian hieroglyphs and cuneiform, to minority scripts like those used in the deep mountainous regions of Southeast Asia by speakers of Tai-Kadai languages.
But many problems in language processing remain. The most fundamental of them is that the UCS, contrary to its name, does not include the entirety of the character sets used by humankind; according to our study, many language users still face the same obstacles encountered by the Jesuit friar in XVIth century India.
The Mongolian language, for example, is written either in Cyrillic script or in its own historical and traditional script, for which at least eight different codes and fonts have been identified 2. No standardised code or font exists, causing inconsistency, even textual garbling, from one computer to another. As a result, some Mongolian web pages are made up of image files, which take much longer to load.
Indian web pages face the same challenge. On Indian newspaper sites, proprietary fonts for Hindi scripts are often used, and some sites provide their news as image files. These technological limitations prevent information from being interchangeable and lead to a digital language divide. Our research shows that the use of UCS Hindi fonts is spreading, but that many web pages still depend on image files or proprietary fonts.
1 Unicode is a standard created by the Unicode Consortium Inc., but its development and revisions are completely synchronised with the de jure standard ISO/IEC 10646. The two standards can be treated as a single one. See in this book: Stéphane Bortzmeyer, Multilingualism and the Internet’s Standardisation.
2 In addition to UCS/Unicode: BeiDaFangZheng, GB18030, GB8045, Menksoft, Sayinbilig, Boljoo and SUDAR. Most of them are proprietary, local codes used only by a limited group of users.
Such technical challenges maintain gaps not only between languages but also between scripts. The authors’ initial motivation stems from this issue.
When the Language Observatory Project was launched, one of its founders wrote the following statement :
“My recent study based on statistical data provided by ITU and UNESCO gives a rough sketch of the global digital divide ‘among scripts’. Latin alphabet users, 39 % of the global population, consume 72 % of the world’s writing/printing paper and enjoy 84 % of access to the internet. Hanzi (Chinese ideograph) users in China/Japan/Korea, 22 % of the global population, consume 23 % of paper and have 13 % of internet access. Arabic users, 9 % of the population, consume 0.5 % of paper and have 1.2 % of internet access. Cyrillic script users, 5 % of the population, consume 1.1 % of paper and have 1.6 % of internet access. Then how about Indic script users? If all Brahmi-origin scripts widely used in Southeast Asia (Myanmar, Thai, Lao, Khmer, etc.) are included, Indic script users make up 22 % of the world population, consume 2.2 % of paper and have just 0.3 % of internet access” [MIKAMI 2002].
The Language Observatory was launched to address and close these divides to ensure equality and diversity online.
English Dominance on the Web The second reason for the digital language divide is the dominance of the English language on the Web, which may also reflect the economic aspect of the Web’s evolution. This topic was first referred to in 1995 at the Francophone Summit in Cotonou. At that summit, the presence of English was publicly quoted as being above 90 %. Funredes (Fundación Redes y Desarrollo), reacting to the figure, attempted to obtain accurate measurements of several languages including English [PIMIENTA ET AL. 2010].
Our research shows that in Asian and African ccTLD domains, English continued to dominate a full ten years after the summit. From 2006 to 2009, we conducted annual surveys of language presence on the web. A direct comparison is impossible because the number of pages collected differs between studies (we are currently attempting to identify a methodology to normalise size disparity in sample collection using analysis of variance), but we can nonetheless observe that in all surveys English was the language most widely used in Asian and African domains.
The 2010 study found that English was used in 82.7 % of pages collected from the African ccTLD domains; French came second with 5.5 %. For Asian domains, English also placed at the top, but with a smaller proportion, about 39 %. This is because in the Asian domains some regional languages, including Hebrew, Thai and Turkish, are strongly dominant in the ccTLD of their respective countries.
In 2010, we extended the survey to include Caribbean domains, and found that Spanish was the most frequently used language with a ratio of about 55 %. English came in second with a ratio of about 33 %.