Similarly, we compare English queries with queries in one Asian language, Myanmar. In English language search, it is common practice for search engines to return pages that contain a subset of the words used in the user’s query. For instance, when a user enters the words “chocolate ice-cream”, the search engine returns not only the web pages that include the exact phrase “chocolate ice-cream” but also pages that contain the word “chocolate” or “ice-cream” alone. This is made possible by tokenization of the input query. In contrast, a general search engine treats a Myanmar query like a “phrase search” in English. That is the equivalent of putting double quotes (“ ”) around the query, telling the search engine to consider the exact words in the exact order without any changes. For example, document A contains a Myanmar compound word XYZ, and another document B contains every component of the word XYZ in a non-consecutive manner, like “X…Y…Z”. If a user searches for XYZ, the search engine retrieves document A but does not retrieve document B, because the search engine does not perform word segmentation. That is why the Myanmar language needs special treatment in a search engine: word segmentation is required both at the indexing stage and at the query processing stage.
Pann Yu Mon & Madhukara Phatak
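The difference between the two behaviours can be illustrated with a minimal Python sketch. It reuses the placeholder components X, Y and Z from the example above; the tiny lexicon, the filler characters and the greedy longest-match segmenter are assumptions made for illustration only and do not describe how any particular search engine works.

```python
def phrase_match(query, doc):
    """Phrase-search behaviour: the query must occur as one contiguous string."""
    return query in doc


def segment(text, lexicon):
    """Greedy longest-match segmentation against a word list (toy segmenter)."""
    words, i = [], 0
    max_len = max(map(len, lexicon))
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                words.append(piece)
                i += size
                break
        else:
            i += 1  # skip a character that no lexicon entry covers
    return words


def segmented_match(query, doc, lexicon):
    """Match if every segmented query word occurs somewhere in the document."""
    doc_words = set(segment(doc, lexicon))
    return all(word in doc_words for word in segment(query, lexicon))


LEXICON = {"X", "Y", "Z"}   # hypothetical components of the compound XYZ
doc_a = "XYZ"               # document A: the compound word itself
doc_b = "X--Y--Z"           # document B: components in non-consecutive positions

print(phrase_match("XYZ", doc_a), phrase_match("XYZ", doc_b))
print(segmented_match("XYZ", doc_a, LEXICON), segmented_match("XYZ", doc_b, LEXICON))
```

With phrase matching only document A is found; once both the query and the documents are segmented, document B is found as well.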
For majority languages such as Chinese or Japanese, search engines perform the word breaking process properly. In the following, we compare the operation of search engines on majority and minority languages, using Japanese as an example of a majority language and Myanmar as an example of a minority language.
Figure 2. Manner of Search Engine on Majority Languages
Figure 3. Operation of Search Engines on Minority Languages
Figure 4. Google Result for keyword “ ”
As shown in Figure 2, for majority languages, search engines return pages containing the exact phrase as first priority. As second priority, web pages that include the segmented words are retrieved. This shows that search engines work properly for majority languages.
Figure 3 shows the operation of a search engine on a minority language.
When a user searches for “”, the result should be the pages that include “” or “”. Instead, the search engine looks only for pages that match the input query exactly. If no page contains the exact words in the exact order, it simply reports that “Your search “” did not match any documents”, as shown in Figure 4. This indicates that most general search engines do not carry out natural language processing for minority languages.
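The two-priority behaviour described for majority languages (exact phrase first, then pages containing the segmented words) can be sketched as follows. This is a simplified illustration: the `rank` function, the document texts and the character-level stand-in segmenter are all hypothetical, and a real engine would use a proper segmenter and a scoring model.

```python
def rank(query, docs, segmenter):
    """Return document ids: exact-phrase hits first, then partial hits."""
    words = segmenter(query)
    exact, partial = [], []
    for doc_id, text in docs.items():
        if query in text:
            exact.append(doc_id)        # first priority: exact phrase
        elif any(word in text for word in words):
            partial.append(doc_id)      # second priority: any segmented word
    return exact + partial


# Placeholder data: "XY" stands for a two-component query in an unspaced script.
docs = {
    "d1": "..Y..",    # contains only one component
    "d2": "..XY..",   # contains the exact phrase
    "d3": ".....",    # contains neither
}

# Stand-in segmenter for the demo: treat each character as one word.
print(rank("XY", docs, list))
```

Without the fallback branch, a query whose exact form appears in no document would return nothing, which is the minority-language behaviour shown in Figure 4.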
Thai Language
A recent study of web search engines on Thai queries shows that word segmentation is still challenging. According to “Evaluation of Web Search Engines with Thai Queries” by Virach Sornlertlamvanich et al. 5, when the word “” is submitted to several search engines, most of them return results that do not contain that word. This shows that most search engines do not handle Thai word segmentation properly.
Compound words
Compound words are another major issue for the indexing of web pages in major search engines. Some Asian languages make extensive use of compound words. Below we give examples of compound words in the Myanmar, Kannada (Indian) and Thai languages.
Myanmar Language
In Myanmar, two simple words meaning “lead” and “rod” (or “stick”) combine to form a compound word meaning “pencil”. Similarly, the word for “healthy” combines with the word for “happy” to form a compound meaning “healthy and happy”. Although compound words are found in every language and are not a specific feature of the Myanmar language, they differ greatly from one language to another.
5 Virach Sornlertlamvanich, Shisanu Tongchim and Hitoshi Isahara, “Evaluation of Web Search Engines with Thai Queries”, Proceedings of Workshop on NTCIR-6 and EVIA-1, NII, National Center of Sciences, Tokyo, Japan, May 15-18, 2007. http://research.nii.ac.jp/ntcir/ntcir-ws6/OnlineProceedings/EVIA/15.pdf
Indian Language
Similarly, in the Indian Kannada language, two simple words, “” (good) and “” (thinking), combine to become a compound word “” (good thinker).
Thai Language
According to the study by Virach Sornlertlamvanich et al. (op. cit.), some Thai queries are indivisible units, although each query can be considered as a set of words. For example, a query found in the query log is “”, which is the “Thai Meteorological Department”.
This word can be considered as two words: “” (Department) and “” (Meteorology). Since it represents a unique entity, it may be recognized as an indivisible unit. There are also some queries that resemble this word but are ill-written. For example, at least three queries can be considered to refer to this word: “”, “ ” and “”. The use of these keywords usually leads to websites that contain improper forms of the word “”, rather than to the website of the “Thai Meteorological Department”.
Pre-processing of different writing systems
The same word may be written in different ways. Here, we give an example from the Myanmar language.
Myanmar Language
The Myanmar writing system has been strongly influenced by Pali and Sanskrit. In ancient times, words were written on stone in subscripted form because space was limited. Later, some of the subscripted words were changed to an expanded form, but they can still be written in subscripted form. Writers are free to use whichever form is convenient.
More complex still, some Myanmar words can be written in two forms, with no character omitted in either, even though they do not belong to the same consonant group. The two forms have exactly the same meaning and the same pronunciation but different representations. An example is “” in subscripted form, which can be expanded as “” (rice). Similarly, the word “” is sometimes an abbreviation of “” (daughter). These words are not found as native Myanmar words, except for the purpose of abbreviation.
Hence, if those kinds of words are given as a query in a search engine, the expanded form should be treated as a phrase.
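One way to picture this pre-processing step is a variant table applied to both the query and the indexed text, followed by a phrase match on the expanded form. The sketch below is only an assumption about how such a step might look: the table entry uses Latin placeholders instead of real Myanmar forms, and the function names are hypothetical.

```python
# Hypothetical variant table: subscripted (short) form -> expanded form.
VARIANTS = {"SUBFORM": "EXPANDED FORM"}


def normalize(text, variants=VARIANTS):
    """Rewrite every subscripted form into its expanded form."""
    for short, full in variants.items():
        text = text.replace(short, full)
    return text


def search(query, docs):
    """Phrase search on the expanded form, applied to query and documents alike."""
    wanted = normalize(query)
    return [doc_id for doc_id, text in docs.items() if wanted in normalize(text)]


docs = {
    "a": "text with SUBFORM inside",
    "b": "text with EXPANDED FORM inside",
    "c": "EXPANDED and unrelated FORM",   # components present, but not as a phrase
}

print(search("SUBFORM", docs))
print(search("EXPANDED FORM", docs))
```

Both spellings of the query retrieve documents a and b, while document c is excluded because the expanded form is treated as a phrase.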
Stemming
Word stemming is one of the practices used to make search engine retrieval more effective. However, morphological variants differ from one language to another.
Indonesian Language
The Indonesian language is morphologically rich. There are around thirty-five standard affixes (prefixes, suffixes, circumfixes, and some infixes inherited from Javanese) 6. In Indonesian, affixes can be attached to virtually any word and can be combined iteratively. The wide use of affixes seems to have created a trend among Indonesian speakers to invent new affixes and affixation rules 7.
Malay affixes consist of four different types, which are the prefix, suffix, prefix-suffix pair, and infix. Unlike an English stemmer which works quite well just by removing suffixes alone to obtain the stems, an effective and powerful Malay stemmer must not only be able to remove suffixes, but also prefixes, prefix-suffix pairs, and infixes as well 8. Without removing all these affixes, stems cannot be efficiently used to index Malay documents.
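To make the four affix types concrete, here is a minimal dictionary-checked affix stripper in Python. It is a toy sketch under stated assumptions, not the stemmer described in the cited work: the affix lists are a small illustrative subset, the root list is a hypothetical five-word dictionary, and spelling changes at morpheme boundaries are ignored.

```python
PREFIXES = ("meng", "mem", "men", "me", "ber", "per", "ter", "di", "ke", "pe")
SUFFIXES = ("kan", "an", "i")
INFIXES = ("el", "em", "er")


def stem(word, roots):
    """Strip affixes from word, accepting a candidate only if it is a known root."""
    if word in roots:
        return word
    # Try every prefix/suffix combination, including "no prefix" and "no suffix".
    # A prefix together with a suffix covers the prefix-suffix pair case.
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        for suffix in ("",) + SUFFIXES:
            if suffix and not word.endswith(suffix):
                continue
            candidate = word[len(prefix):len(word) - len(suffix)]
            if candidate in roots:
                return candidate
    # Infixes are assumed to sit right after the first letter of the root.
    for infix in INFIXES:
        if word[1:3] == infix and word[0] + word[3:] in roots:
            return word[0] + word[3:]
    return word  # no known root found: leave the word unchanged


ROOTS = {"makan", "baca", "main", "ajar", "guruh"}  # hypothetical root dictionary

for w in ("makanan", "dibaca", "permainan", "mengajar", "gemuruh", "komputer"):
    print(w, "->", stem(w, ROOTS))
```

Checking candidates against a root dictionary is one simple way to avoid over-stemming; a suffix-only stripper of the kind that works for English would miss “dibaca”, “permainan”, “mengajar” and “gemuruh” here.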
Myanmar Language
In the case of the Myanmar language, stemming focuses on the removal of inflectional suffixes, derivational suffixes, inflectional prefixes and derivational prefixes for a given Myanmar word. According to a Myanmar grammar book, there appear to be ninety-one different affixes for the four main word classes: verb, noun, adjective and adverb. A Myanmar language stemmer involves the straightforward removal of the affixes to get the correct stem. More details on a Myanmar language stemmer are given in one research thesis 9.
6 Kridalaksana, Harimurti, Pembentukan Kata Dalam Bahasa Indonesia. P.T. Gramedia, Jakarta, 1989.
7 Tim Penyusun Kamus, Kamus Besar Bahasa Indonesia. 2ed. Balai Pustaka, 1999.
8 F. Ahmad, A Malay Language Document Retrieval System: An Experimental Approach and Analysis, Universiti Kebangsaan Malaysia, Bangi, 1995.
The stemming algorithm is different for each language. The search engine should pay attention to each individual language.
CONCLUSION
Our conclusions are that search engines would be more effective if they took more account of the properties of individual languages, and that more studies of real user behaviour in practical situations are needed.
As the Web continues to become more multilingual, and as languages other than English continue to gain ground on the Web, the need to develop search engines to handle all these languages has become more apparent. It might be unrealistic at this point to suggest that search engines should be equipped with an entire linguistic toolkit capable of handling all languages, but a gradual progress towards this goal is not far-fetched.
The developers of search engines should start looking at implementing the basic requirements of truly multilingual engines. Web pages published in languages that do not share the linguistic characteristics of English are more likely to be missed or improperly indexed by major search engines than English web pages.
Overall, it can be argued that the processing and searching of non-English text poses additional difficulties that are not faced in English texts. Search engines need to be localised, in a local language.
9 San Ko Oo, Yoshiki Mikami, Development of Myanmar Language Stemmer, Master thesis of Management of Information system engineering department, Nagaoka University of Technology, Japan, 2010.
HERVÉ LE CROSNIER
DIGITAL LIBRARIES
How can we preserve cultures and traces of different languages in digital libraries? How can we add value through many translations of the same work so that users, especially the young, can understand the diversity and wealth of human thought? How can one participate locally, with one’s own language and culture, in the construction of a huge interconnected library offering everyone access to the works of the entire world?
Original article in French.
Translated by Laura Kraftowitz.
HERVÉ LE CROSNIER is a senior lecturer at the University of Caen Basse-Normandie, where he teaches Internet technologies and digital culture. He is currently working with ISCC, the Institute for Communication Sciences of the CNRS. His research focuses on the impact of the Internet on social and cultural organization, and extending knowledge in the public domain. He is one of the founders of C&F éditions.
From the Egyptian papyrus collections, to the clay tablets of Mesopotamia covered with cuneiform writing, whenever and wherever knowledge and culture have been able to be transcribed onto a carrier, documents have been found. The famous Library of Alexandria’s loss to flames remains a founding moment for those invested in the transmission of knowledge so that future generations can benefit from the advances of the preceding ones. The desire to collect documents, organise them and make them available is one of the main preoccupations of scholars. When Europe witnessed the birth of movable type printing, the number of documents available to the public grew rapidly, which catalysed book exchange, and the need was perceived for legal deposit libraries, as a way to store and accumulate knowledge constituted in this way. The first sound recordings, reels, and later disks from the beginning of the twentieth century, were deposited into audio archives.
For oral languages, use and conservation go hand in hand : they are transmitted directly by their speakers. The entry of oral-only languages into libraries is recent, and owes its existence to the spread of audio and video recording 1. Within such spaces, multimedia technology encounters writing cultures whose existence and history have endured by being engraved onto a medium, and can inherit the organizational savoir-faire that has been constructed around the book.
Storing, organising and making available all records are the founding pillars of the institution of the library, which, by adhering to these fundamentals, has been able to incorporate each new process of recording knowledge and emotions.
1 See in this book : Tunde Adegbola, Multimedia and Signed, Written or Oral Languages.
Today, documents are witnessing an essentially “digital moment”, which is changing our approach towards their durability and transmission. Word processing is eclipsing the manuscript. Web pages are often regarded as services offering news or commerce, feedback and comments, too often in a continuous flow that runs contrary to the accumulation logic of libraries. On the other hand, performances (concerts, local and global events, readings), and even quotidian life (through the proliferation of digital photography and home video) are recorded. The range of media is growing, allowing the traces left by cultural and scientific activities to be made permanent and transformed into audio, video or multimedia recordings.
Digital libraries are at this new crossroads, between the proliferation of new documents linked to digital media’s ease of production and distribution, and traditional libraries. In the present discussion, I will attempt to define digital libraries as distinguished from other forms of document access, and evaluate approaches and needs as they relate to multilingualism. Finally, I will expand upon the legal and technical constraints, as well as new cultural practices, that frame digital library activity.
LIBRARIES AND ARCHIVES
Traditionally, we have distinguished between three types of organizations:
• Libraries preserve and make available “duplicates”, that is, existing works in multiple copies that have been publicly released, usually by a publisher, and sometimes via reprography for reports or scientific papers (dissertations, “grey literature”). Before materials find their place in the library, the editorial circuit has already selected them for content; the materials entering the library are thus broadly homogenous in their approach to the writer-reader relationship, and in the industrial circuit that carries the work of the former to the latter;