
Stéphane Bortzmeyer

away everything to try to remake the internet better, any more than you can erase a town to improve urban planning.

STANDARDS THAT MAY INFLUENCE MULTILINGUALISM

In what ways can a standard, generally a rather technical and dry document, help or hinder multilingualism? Consider a few examples of the importance of standards. One of the best-known cases is that of character sets. To write a human language on a computer, we need to represent each character in the text by a number, since numbers are the only objects that programs can handle. We must therefore establish a list of characters (this is already a formalisation step that is not obvious, and has not yet been done for all alphabets) and assign each one a number. If two people who write in Tamil use two different character sets [5], their messages will be incomprehensible to one another. A standard character set is needed. But even then, it has to cover all the alphabets of the world! One of the first such standards, ASCII (American Standard Code for Information Interchange), designed in 1969 for English usage, does not include characters bearing diacritical marks (such as é) or, a fortiori, the characters of the Arabic, Devanagari or Hangul alphabets. After ASCII, several incompatible standards [6] were designed, none of them covering all the alphabets of the world, until the advent of Unicode, presented later.
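The incompatibility described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: instead of the Tamil case (Unicode versus ISCII), it swaps in two legacy single-byte character sets, ISO 8859-1 and Windows-1251, and the helper name `fits_in_ascii` is hypothetical.

```python
# Illustrative sketch: the two legacy character sets are assumptions chosen
# for the demonstration, not the Tamil/ISCII case mentioned in the text.
raw = "é".encode("latin-1")           # the sender writes one byte: 0xE9

as_latin1 = raw.decode("latin-1")     # a reader who assumes ISO 8859-1
as_cp1251 = raw.decode("cp1251")      # a reader who assumes Windows-1251 (Cyrillic)


def fits_in_ascii(text: str) -> bool:
    """Return True if every character of `text` exists in ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(as_latin1, as_cp1251)           # same byte, two different characters
print(fits_in_ascii("Stephane"), fits_in_ascii("Stéphane"))
```

The same byte is read as two different letters depending on the character set each party assumes, which is why a shared standard is needed.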

Once there is a standard for characters, the various formats must still permit its use. Thus, in its infancy, e-mail on the Internet only allowed the use of ASCII. The users of another alphabet could either give up certain characters, as was the case for those using the Latin alphabet [7], or develop a system for transliterating their alphabet into ASCII [8]. It took considerable discussion within the IETF (Internet Engineering Task Force), the body that manages standards for the Internet, to establish a standards document (an RFC: Request For Comments) to expand the range of character sets. Other character sets were accepted into e-mail in 1992 with the release of RFC 1341.

[5] Supposing that one is written in Unicode and the other in ISCII.

[6] And covering only the large alphabets. The small ones are those that benefit most from Unicode.

[7] For example, in French, you can replace accented characters with their ASCII equivalents and the text remains relatively readable in most cases.

[8] Such as Japanese romaji, which in practice has hardly been successful.

Another case where standards and multilingualism have stirred up a hornets' nest: identifiers. All Internet protocols define which identifiers are legal and which are not. As these identifiers are very visible (for example on advertisements, business cards, etc.), the embarrassment can be considerable if they are misused. Thus, in mail addresses such as stephane@bortzmeyer.org, only ASCII was permitted until the release of RFC 4952. Now (but beware, this RFC is still considered experimental; only its successor, under development, will be a real standard), one can have addresses like stéphane@bortzmeyer.org.

Another case that has generated much discussion [9] is that of Unicode domain names, IDN (Internationalised Domain Names). For various reasons [10], names were traditionally restricted to US-ASCII only. The IDN standard of 2003, in RFC 3490, marked the beginning of names in Unicode, which have become commonplace today. Unlike other internationalisation standards, it was implemented very quickly in software and deployed in several domain name registries.
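As a small illustration of the 2003 IDN standard mentioned above, Python's built-in `idna` codec (which implements RFC 3490) converts a Unicode name to the ASCII-compatible form actually used in the DNS, and back. The domain name is a made-up example, not one taken from the text.

```python
# Sketch using Python's standard "idna" codec (IDNA 2003, RFC 3490).
# "bücher.example" is a hypothetical name chosen for the illustration.
unicode_name = "bücher.example"

ascii_form = unicode_name.encode("idna")   # ASCII-compatible form for the DNS
round_trip = ascii_form.decode("idna")     # back to the form shown to users

print(ascii_form)    # b'xn--bcher-kva.example'
print(round_trip)    # bücher.example
```

The user sees the Unicode form, while software that only understands ASCII names keeps working with the `xn--` form.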

A FIRST EXAMPLE OF A STANDARD: UNICODE

Historically, every standard character set was limited to one writing system, one alphabet, or a set of similar alphabets [11]. One of the consequences of this Babel of character sets was that it was very difficult to write a text using multiple alphabets (a Hindi course written in Spanish, for example).

Some character sets provided a way to include ASCII but not, in general, characters of the Latin alphabet outside the ASCII set, not to mention other (non-Latin) alphabets. Moreover, managing a collection of texts written in different alphabets (even when each text uses only one alphabet) was difficult. For example, without Unicode, it would not be possible to configure a single global Charset parameter on a web server to indicate the character encoding for an entire multilingual site.

[9] And not necessarily in a justified way.

[10] None of these reasons is a limitation of the DNS itself, which has accepted all characters from the beginning. But the DNS is used in many contexts, some of which have problems with names in Unicode.

[11] For example, ISCII covers all the alphabets that have official status in India and that are derived from the Brahmi script.

Unicode has changed all that: Unicode [12] is a character set that includes all the alphabets of the world [13]. What is the content of the Unicode standard? Firstly, Unicode is a list of characters. Compiling such a list doesn't seem like much, but is in fact a difficult task. Where an alphabet is already highly standardised, it is sufficient to reuse that standard. In other cases, there is no official list. The authors of the standard must therefore ask themselves questions such as "Does a capital form of the German letter ß exist?" [14] or "Are the Japanese and Chinese characters the same?" [15]. Once these questions have been answered, the list can be published [16]. It currently consists of more than 109,000 characters, from the most mundane, such as the Latin a, to the most astonishing, such as the sunrise over the mountains [17].

Once the list is established, Unicode gives each character a unique number, which facilitates communication: when two characters resemble each other visually, or when the necessary fonts are not installed, this number enables an exchange without ambiguity. To give an example, the two characters cited above, "a" and "sunrise over the mountains", are respectively numbers U+0061 and U+1F304 [18].
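The two numbers quoted above can be checked with Python's standard `unicodedata` module. This is a minimal sketch; the helper name `u_plus` is hypothetical and simply formats a code point in the conventional U+ notation.

```python
import unicodedata


def u_plus(char: str) -> str:
    """Format a single character's number in the conventional U+ notation."""
    return f"U+{ord(char):04X}"


latin_a = "a"
sunrise = "\U0001F304"   # written by number, so no special font is needed here

print(u_plus(latin_a), unicodedata.name(latin_a))   # U+0061 LATIN SMALL LETTER A
print(u_plus(sunrise), unicodedata.name(sunrise))   # U+1F304 SUNRISE OVER MOUNTAINS
```

Even on a machine where the second character cannot be displayed, its number and official name identify it without ambiguity.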

These numbers also serve as the basis for the subsequent encoding of these characters. Indeed, we have to represent these characters as a sequence of bits in files or on a network. There are several methods for doing this, known by names such as UTF-8, UTF-32, etc., which all start from the character's number and represent it in a manner suited to particular uses. In practice, this last point only concerns computer programmers.
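For programmers, the difference between the number of a character and its encoding can be seen directly. The sketch below encodes the same two characters with UTF-8 and UTF-32 (big-endian, an arbitrary choice made here for readability) and prints the resulting bytes in hexadecimal.

```python
# Same code points, two different byte representations.
latin_a = chr(0x0061)
sunrise = chr(0x1F304)

for char in (latin_a, sunrise):
    utf8 = char.encode("utf-8")         # variable length: 1 to 4 bytes
    utf32 = char.encode("utf-32-be")    # fixed length: always 4 bytes
    print(f"U+{ord(char):04X}  UTF-8: {utf8.hex(' ')}  UTF-32: {utf32.hex(' ')}")
```

UTF-8 is compact for ASCII text (one byte for "a", four for the sunrise), whereas UTF-32 spends four bytes on every character but makes each one the same size.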

Just as technical, but perhaps more necessary to understand, is the concept of canonicalisation: there are several ways to represent the same visual character in Unicode. For example, the é in my name can be represented by U+00E9 (e with acute accent) or by U+0065 U+0301 (e followed by a combining acute accent). Common computing operations such as comparison (imagine a user whose login name was the first name Stéphane) would fail if they were applied naively to Unicode characters. It is therefore necessary to canonicalise character strings, reducing them to a canonical form. The most common standard for canonicalisation on the Internet (see RFC 5198) is known as NFC and, in the case presented above, would reduce every é to the form U+00E9.

[12] http://www.unicode.org

[13] In November 2010, the current version of Unicode is 6.0; some alphabets are still missing, but these are almost all dead alphabets, the lack of which affects only researchers.

[14] No up to Unicode 5.0, yes from Unicode 5.1 onwards.

[15] The answer was yes; this is called Han unification and was certainly the most controversial decision of Unicode.

[16] One of the important points about Unicode is that not only the text of the standard but also the data, such as the lists of characters, are publicly distributed.

[17] If you have the right configuration, you'll see it here: 🌄

http://www.fileformat.info/info/unicode/char/1f304/index.htm

[18] The numbers are conventionally preceded by U+ and written in hexadecimal.
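The login-name problem described above, with the two possible representations of é, can be reproduced with Python's standard `unicodedata.normalize`. This is a minimal sketch; the function name `same_login` is hypothetical.

```python
import unicodedata

precomposed = "St\u00e9phane"     # é as the single character U+00E9
decomposed = "Ste\u0301phane"     # e (U+0065) followed by combining acute (U+0301)


def same_login(a: str, b: str) -> bool:
    """Compare two names after reducing both to the NFC canonical form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)             # False: the naive comparison fails
print(same_login(precomposed, decomposed))   # True: both reduce to U+00E9
```

Both strings look identical on screen, yet only the comparison made after canonicalisation treats them as the same name.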

So, who writes and maintains this standard? The Unicode Consortium is a coalition of several organizations, including major computing companies (Google, Apple, IBM, Microsoft, etc.). Recently, some nonprofit organizations have begun to take an interest in the issue and have joined the consortium. There is a very interesting public discussion list, unicode@unicode.org, but most of the work is done in private; only the results are public.

For those who want to deepen their understanding of Unicode, I recommend Unicode Explained by Jukka Korpela (published by O'Reilly) for authors of documents, and Unicode Demystified by Richard Gillam (published by Addison-Wesley) for programmers [ANDRIES 2008].

AN EXAMPLE OF AN SDO: IETF

Let's take a detour through an SDO (Standards Developing Organization), and a particularly important one: the IETF (Internet Engineering Task Force).

This organization is responsible for, among other things, the e-mail standards, the instant messaging protocol XMPP (eXtensible Messaging and Presence Protocol), the HTTP protocol, the DNS protocol, etc. One of the peculiarities of the IETF is its extreme openness: there is no formal membership, and so no fee; anyone, whether an individual or a company, can participate. If a participant cannot travel to the physical meetings (which are themselves quite expensive), it is not a big deal; some IETF working groups have never met face to face [19]. Even when a group meets physically, most of the work is done online, via public (and publicly archived) mailing lists and working documents that are also public. Wikileaks would not have much to do to ensure the transparency of the IETF :-)

[19] Such as the LTRU working group, which created the language tags described below.

What is the IETF policy on multilingualism? It is explained in RFC 2277.

In brief, the IETF separates the protocol elements, internal to the operation of the protocol, from the text that is shown to users. The former are visible only to programmers. Thus, a web browser requesting the resource /faq.html from a server sends the command GET /faq.html.

The verb GET is indeed derived from English, but it is not really an English word; it is rather an element of the HTTP dialogue. The ordinary user never sees it, so there is no reason to translate it. On the other hand, the text of the web page is retrieved in order to be read by a user. Here, RFC 2277 lays down the principle that this text must be able to be encoded in any character set and must certainly not be restricted to ASCII.
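The distinction between protocol elements and user-visible text can be sketched as follows. Everything here is an assumption made for illustration: the host name, the canned response, the French page content, the fallback charset and the helper `decode_body` are all hypothetical, and the parsing is deliberately simplistic rather than a real HTTP implementation.

```python
# Protocol elements: fixed ASCII tokens that the user never sees.
request = b"GET /faq.html HTTP/1.1\r\nHost: www.example.org\r\n\r\n"

# A canned (hypothetical) response: ASCII headers, then text meant for a human.
response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    + "<p>Questions fréquentes</p>".encode("utf-8")
)


def decode_body(raw: bytes) -> str:
    """Decode the body using the charset announced in the Content-Type header."""
    head, _, body = raw.partition(b"\r\n\r\n")
    charset = "iso-8859-1"  # fallback chosen for this sketch only
    for line in head.split(b"\r\n")[1:]:          # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-type":
            for param in value.split(b";")[1:]:   # parameters after the media type
                key, _, val = param.strip().partition(b"=")
                if key.lower() == b"charset":
                    charset = val.decode("ascii").strip('"')
    return body.decode(charset)


print(decode_body(response))
```

The verb and the header names stay in ASCII whatever the language of the site, while the body is decoded according to whatever character set the server declares.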

These are excellent principles, but the reality is obviously more complex.

Some cases are not directly addressed by this RFC. One, very sensitive, is that of identifiers (such as the domain names or e-mail addresses mentioned above), which are both protocol elements and text read by users. Much of the controversy around the IDN system (Internationalised Domain Names), for example, comes from the clash between two points of view: those who see a domain name as a formal identifier, devoid of any semantics (which can therefore be written in an alphabet foreign to the user), and those who consider it an identity marker, which must be readable by the user.

Note that the W3C, the organization responsible for the standardisation of web technologies, works relatively closely with the IETF and has a similar policy [20].

A SECOND EXAMPLE OF A STANDARD: EMAIL

Email is one of the less visible and yet most used applications on the Internet. Despite predictions that its traffic would be diverted to instant messaging, or to communication tools controlled by closed services such as Facebook, millions of messages continue to be exchanged every day. How does email handle internationalisation? There are two separate problems: the content of messages and their addresses.

Originally, the only content accepted was plain text, exclusively in US-ASCII. That changed in 1992 with the publication of the MIME standard (Multipurpose Internet Mail Extensions). This allowed a message to contain instructions for formatting text, and also sound, images, etc.

[20] http://www.w3.org/International/getting-started

Another aspect of MIME was less noticed at the time: the character set of the text was no longer required to be US-ASCII; any character set was accepted, provided it was properly defined and the receiving software could handle it. Since then, we can say that the email standards allow messages to be written in any language, but it has taken a long time for all software authors to adapt to them. Here we see that setting standards is only the first pillar of a language policy in cyberspace. Incentives for users, or programmers, to apply them are also part of an informed decision.
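The way MIME labels the character set of a message can be illustrated with Python's standard `email` package. This is only a sketch: the addresses, subject and body text are invented for the example, and the message is serialised and re-parsed locally rather than actually sent.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a message whose body is not restricted to US-ASCII.
msg = EmailMessage()
msg["From"] = "stephane@example.org"      # hypothetical addresses
msg["To"] = "reader@example.org"
msg["Subject"] = "Coopération"
msg.set_content("Un message accentué, signé Stéphane\n", charset="utf-8")

wire = msg.as_bytes()                      # what would travel on the network

# The receiving software reads the declared charset and decodes accordingly.
parsed = message_from_bytes(wire, policy=policy.default)
print(parsed["Content-Type"])
print(parsed.get_content_charset())
print(parsed.get_content())
```

The Content-Type header carries the charset parameter, which is what lets the receiving software decode the text correctly instead of guessing.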

Until very recently, one gap remained: email addresses themselves were not internationalised. There was no question of putting stéphane@coopération.com on a business card. This limitation was eased in 2008 with the experimental RFCs built on RFC 4952. Scarcely deployed at the moment, this possibility (of people using their own name, in their own language and alphabet, as a mailbox name) should become more widespread once it is promoted to a full standard. One can easily imagine the interest for those writing in non-Latin alphabets, for whom the transliteration of names would no longer be necessary.


LANGUAGE TAGS

Much less well known than MIME or IDNs, because less visible, language tags are short identifiers used to indicate the language of a document. They are essential for librarians and archivists, and for linguists who exchange documents, but also for authors of web sites who want to indicate the language of a document, whether for presentation purposes (typographical rules are not the same for all languages) or for search (to facilitate the work of a search engine when it is asked only for documents in Portuguese).

A format such as XML allows the language of a document to be specified, which saves editing software from having to guess it, a complex and not always reliable operation. While language tags are today unfortunately little used on the Web (both by the authors of pages and by search engines), they are widely present in large documentary catalogues.

Standardised in RFC 5646, language tags [21] can indicate not only the language but also the writing system used, national or regional variations, etc. Thus, while the tag el simply denotes modern Greek, without further detail, the more complex tag yue-Latn-HK indicates Cantonese as used in Hong Kong and written in the Latin alphabet.

[21] http://www.langtag.net
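The structure of the two tags cited above can be made visible with a small parser. This is a deliberately reduced sketch, not an implementation of RFC 5646: the function `parse_simple_tag` is hypothetical and only handles tags of the form language, optional script, optional region, rejecting everything else (variants, extensions, private-use subtags, and so on).

```python
def parse_simple_tag(tag: str) -> dict:
    """Split a simple tag of the form language[-Script][-REGION].

    Assumption: only these three kinds of subtag are recognised; any other
    subtag allowed by RFC 5646 raises ValueError in this sketch.
    """
    subtags = tag.split("-")
    result = {"language": subtags[0].lower(), "script": None, "region": None}
    for sub in subtags[1:]:
        is_script = len(sub) == 4 and sub.isalpha()
        is_region = (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit())
        if is_script and result["script"] is None and result["region"] is None:
            result["script"] = sub.title()    # conventional form: Latn
        elif is_region and result["region"] is None:
            result["region"] = sub.upper()    # conventional form: HK
        else:
            raise ValueError(f"unsupported subtag: {sub!r}")
    return result


print(parse_simple_tag("el"))
print(parse_simple_tag("yue-Latn-HK"))
```

A search engine or a typesetting tool could use such a decomposition to match, for example, all documents in a given language regardless of script or region.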
