Far from being a populist trend, the reinvestment and circulation through the internet of resources maintained for “traditionals” are often the doing of cosmopolitan literati, if not expatriates. In the aforementioned cases of Facebook groups, we found that the founders and moderators/facilitators were, respectively: a white Cameroonian, a Cameroonian expatriate in the United States, and a Cameroonian intellectual. It could not be otherwise, for the simple reason that Facebook cannot be used on many of the outdated computer hubs in Cameroon. But beyond that, and as we can see in the codification of European popular culture by the folklorists of the nineteenth and twentieth centuries, the desire and ability to acquire technological resources and to transpose linguistic resources from one system to another, as from oral to writing, are socially determined and enshrine the decisive role of the “cosmopolitan elites”, especially those who “join the other side” (e.g. the diaspora). Such a detour is likely necessary to ensure the vitality of an oral language within the written context of new technologies.

3 Ewondo is a language spoken in the southern part of Cameroon, especially in the capital, Yaoundé.
BIBLIOGRAPHY

[ABÉLÈS 2008] Abélès, Marc, Anthropologie de la globalisation, Paris, Payot, 2008.

[AMSELLE 1999] Amselle, Jean-Loup et Mbokolo, Elikia (dir.), Au cœur de l’ethnie, Paris, La Découverte, 1999.

[BOURDIEU 1994] Bourdieu, Pierre, « Esprits d’État – Genèse et structure du champ bureaucratique », In : Raisons pratiques, Seuil, Paris, 1994, pp. 99-135.

[GOODY 1994] Goody, Jack P., Entre l’oralité et l’écriture, PUF, Paris, 1994.

[GUICHARD 2003] Guichard, Éric, « Does the “Digital Divide” Exist? », In : Globalization and its New Divides: Malcontents, Recipes, and Reform (dir. Paul van Seters, Bas de Gaay Fortman & Arie de Ruijter), Dutch University Press, Amsterdam, 2003.

[GUYER 2000] Guyer, Jane I., « La tradition de l’invention en Afrique équatoriale », Politique africaine, n° 79, octobre 2000, pp. 101-139.

[SAYAD 1985] Sayad, Abdelmalek, « Du message oral au message sur cassette : la communication avec l’absent », Actes de la recherche en sciences sociales, n° 59, 1985, pp. 61-72.

[THIESSE 1999] Thiesse, Anne-Marie, La création des identités nationales (Europe XVIIIe-XXe siècle), Seuil (coll. Univers Historique), Paris, 1999.

[VAN VELDE 2006] Van Velde, Mark, A Description of Eton: Phonology, Morphology, Basic Syntax and Lexicon, thèse de doctorat, 2006.
PANN YU MON & MADHUKARA PHATAK

SEARCH ENGINES AND ASIAN LANGUAGES

Although many search engines are available in the languages most used in the digital world, they do not work well with less computerised languages. In recent years, the number of non-English resources on the Web has grown rapidly, especially in Asian languages. This article examines the difficulties search engines face in this situation.
Original article in English.
PANN YU MON holds a PhD from the Department of Management and Information Systems Engineering of Nagaoka University of Technology, Japan. Her research interests are indexing, archiving and Web requests.
MADHUKARA PHATAK holds a Bachelor of Engineering in computer science from JSSATE, India. His research interests are cloud computing and distributed systems.
In today’s world, search engines play a critical role in retrieving information from the borderless Web. Although many search engines are available in major languages, they are not functional when it comes to less computerised languages. Over recent years, the number of non-English resources on the Web has been growing rapidly, and it has been estimated that English is not the native language of more than 60 % of Web users. Even if these numbers are not exact, it is clear that non-English pages and users cannot be ignored. Although current popular search engines accept non-English queries, they perform mere pattern matching: the sequence of symbols entered by the user must appear somewhere in the web document. More sophisticated methods based on natural language analysis 1 (e.g. stemming, word breaking, stop word removal) are not used.
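The limitation can be illustrated with a toy example (the suffix-stripping rule below is a deliberately crude assumption of ours, not a real stemmer):

```python
# Raw pattern matching vs. matching after a (very) crude stemming step.
# The suffix list is illustrative only.

def raw_match(query, document):
    # pattern matching: the exact symbol sequence must occur
    return query in document

def crude_stem(word):
    # naive suffix stripping, for illustration only
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stemmed_match(query, document):
    terms = {crude_stem(w) for w in document.lower().split()}
    return crude_stem(query.lower()) in terms

doc = "users searching the web"
print(raw_match("searched", doc))      # False: no exact occurrence
print(stemmed_match("searched", doc))  # True: both stem to "search"
```

A pattern-matching engine misses the page even though it is clearly relevant; the stemmed comparison finds it.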
For the convenience of users speaking different languages, Google has developed more than 136 interface languages and proposes around 180 local search engines. Among these, only 20 % are dedicated to Asian languages, although more than half of all web pages use Asian languages. Several articles have been written about the difficulties of search engine queries in Western languages, but only a few have focused on Asian-language queries. Our main aim is to discuss the additional problems faced by non-English web queries and to suggest techniques to improve the response of search systems. In this paper, we study the difficulties met by search engines when they handle queries in Asian languages. We give examples from five languages or language families: Indian languages, Malay, Myanmar, Indonesian and Thai.
1 See in this book : Joseph Mariani, How Language Technologies Support Multilingualism.
INTRODUCTION

Search engines have to crawl billions of web pages in order to index the constantly changing hypertext, which contains information in a variety of languages and all sorts of formats. The size of the Web is growing exponentially, and the number of indexable pages 2 on the Web is considered to be near one hundred billion. It has become more and more difficult for search engines to keep an up-to-date and comprehensive search index, resulting in low precision and low recall rates. Users often find it difficult to search for useful, high-quality information on the Web using general-purpose search engines, especially when searching for information on a specific topic or in a non-English language. Search engines should also support users who have different computer-handling abilities and cultural backgrounds and, most importantly, who speak different languages. The majority of current popular search tools support only English and ignore the diacritics and special features of non-English languages. No single search tool can be perfect for all languages; for that reason, search engines need to be localised for each language. Many domain-specific or language-specific search engines have been built to facilitate more efficient searching in different areas. Thus, the main aim of this article is to identify the difficulties that search engines encounter when handling Asian-language queries.
Although comprehensive software tools enabling the creation of search engines exist, most of them cannot function with many non-English languages, whether European, Asian or Middle Eastern.
The major modules of a Web search engine are:
– Crawler ;
– Natural Language Processing Module (nlp) ;
– Indexer ;
– Query Engine Module ;
– Ranking Engine Module.
A crawler is a small program that browses the World Wide Web and downloads web pages. These programs are given a starting set of seed urls to visit, from which they copy the pages and identify other hyperlinked urls to be visited. In order to implement a language-specific search engine, we only need the small part of the World Wide Web which is related to our interest.

2 The surface web (also known as the visible web or indexable web) is that portion of the World Wide Web that is indexed by conventional search engines.
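The crawler’s traversal can be sketched as a breadth-first loop over a frontier of URLs. In the sketch below, fetch() and extract_links() are stubs over a toy three-page web, standing in for a real HTTP client and HTML link extractor:

```python
from collections import deque

# A tiny fake web: each URL maps to the URLs it links to (assumption
# for the sketch; a real crawler downloads and parses real pages).
WEB = {
    "u1": ["u2", "u3"],
    "u2": ["u3"],
    "u3": [],
}

def fetch(url):
    # stub for an HTTP download
    return "page at " + url

def extract_links(url):
    # stub for HTML link extraction
    return WEB.get(url, [])

def crawl(seeds):
    # breadth-first traversal starting from the seed URLs
    seen, queue, pages = set(seeds), deque(seeds), {}
    while queue:
        url = queue.popleft()
        pages[url] = fetch(url)
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

print(sorted(crawl(["u1"])))  # → ['u1', 'u2', 'u3']
```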
Figure 1. Architecture of a general Web search engine.

In order to download only the interesting parts of the World Wide Web, specific crawling criteria are needed. The next step, html parsing, is relatively easy. After parsing come the nlp processing tasks; non-English web pages are what makes this step complicated. The scope of this module varies depending on the language: it includes transliteration, word tokenisation, stemming, preprocessing of input compound words, stop word removal and so on.
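One possible crawling criterion for a language-specific engine (our illustrative choice, not the only option) is to keep a page only when enough of its text falls inside the target script’s Unicode block, here Myanmar (U+1000–U+109F):

```python
# Keep a page only if the share of characters in the Myanmar Unicode
# block (U+1000-U+109F) reaches a threshold. The 0.3 threshold is an
# arbitrary illustrative value.

def myanmar_ratio(text):
    if not text:
        return 0.0
    in_block = sum(1 for ch in text if 0x1000 <= ord(ch) <= 0x109F)
    return in_block / len(text)

def is_relevant(text, threshold=0.3):
    return myanmar_ratio(text) >= threshold
```

A crawler would apply is_relevant() to each downloaded page before following its links further.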
The next module is the indexer. This module extracts all the words from each page and records the url where each word occurs. The result is a generally very large mapping from words to the pages where they occur. It includes tasks such as transcoding, word breaking, stemming and stop word removal.
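A minimal inverted index along these lines can be sketched as follows (the whitespace tokenizer and stop-word list are placeholder assumptions; a real indexer plugs in the language-specific nlp module):

```python
# Map every word to the set of URLs on which it occurs.

STOP_WORDS = {"the", "a", "of"}   # illustrative stop-word list

def build_index(pages):
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():   # naive word breaking
            if word in STOP_WORDS:
                continue
            index.setdefault(word, set()).add(url)
    return index

idx = build_index({"u1": "The search engine", "u2": "engine room"})
print(sorted(idx["engine"]))  # → ['u1', 'u2']
```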
The query engine module is responsible for receiving and answering users’ search requests. The task given to the ranking engine is to sort results so that highly ranked results are presented at the top of the list. All search engine modules work the same way for different languages except the nlp module, which depends on the specific features of the language.
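A toy version of these two modules (term-frequency ranking is our illustrative stand-in; real engines combine many signals):

```python
# Query engine + ranking sketch: find pages containing the query word,
# then order them by how often the word occurs.

def search(word, pages):
    word = word.lower()
    freq = {url: text.lower().split().count(word) for url, text in pages.items()}
    hits = [url for url, n in freq.items() if n > 0]
    return sorted(hits, key=lambda url: -freq[url])

pages = {
    "u1": "web search and search engines",
    "u2": "search engine",
    "u3": "no relevant words",
}
print(search("search", pages))  # → ['u1', 'u2']
```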
PROBLEM IN EACH LANGUAGE

In this section, we explain the tasks of the nlp processing module. These tasks may vary from language to language. Here, we illustrate the different kinds of task with examples from different language families.
Encoding handling

One issue that should be taken into account during indexing is the existence of different encodings for Web documents. This is especially relevant in the case of Asian languages. Here, we give examples from Indian languages and the Myanmar language: Indian languages have a different encoding for each language, while Myanmar has several different encodings for a single language. A detailed explanation of each case follows.
Indian Languages

More than 95 % of Indian-language content on the web is not searchable due to multiple encodings of web pages. Most of these encodings are incompatible, and hence some kind of standardisation is needed to make the content accessible via a search engine 3.
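The standardisation step boils down to transcoding: bytes arrive in different encodings (declared, for instance, in HTTP headers) and must be decoded to one internal representation before indexing. A minimal sketch, with a fallback behaviour that is our assumption rather than any real engine’s logic:

```python
# Decode raw bytes to one internal representation (Python str, i.e.
# Unicode), falling back to UTF-8 with replacement characters when the
# declared encoding is wrong or unknown, rather than dropping the page.

def to_internal(raw_bytes, declared_encoding):
    try:
        return raw_bytes.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        return raw_bytes.decode("utf-8", errors="replace")

print(to_internal("नमस्ते".encode("utf-16"), "utf-16"))  # → नमस्ते
```

Legacy Indian-language encodings would need per-encoding mapping tables at the same point in the pipeline; the stdlib codecs shown here only illustrate the mechanism.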
Indic scripts are phonetic in nature. There are vowel and consonant symbols; a consonant becomes a syllable after the addition of a vowel sound. To compound the problem, there are also “compound syllables”, also referred to as ligatures. For instance, if we consider “tri” in “triangle”, there are three letters corresponding to three sounds, “ta”, “ra”, “yi”. In Indic scripts, the three are built together into a single compound consonant with a non-linear structure, unlike Latin-based scripts 4.
3 See in this book : Stéphane Bortzmeyer, Multilingualism and the Internet’s Standardisation.
4 Prasad Pingali, Jagadeesh Jagarlamudi, Vasudeva Varma, “WebKhoj : Indian language IR from Multiple Character Encodings”, In : WWW '06, Proceedings of the 15th International Conference on World Wide Web, 2006. http://dl.acm.org/citation.cfm?doid=1135777.

In India, many languages use the same script, called the Devanagari script, so language detection becomes more complex. For example, the query “” means “honey pot” in Hindi but “rifle” in Oriya.
Webkhoj is a search engine which gives users the choice to search in ten different Indian languages: Hindi, Telugu, Tamil, Malayalam, Marathi, Kannada, Bengali, Punjabi, Gujarati and Oriya. In order to search Indian-language websites, Webkhoj transliterates all the encodings into one standard encoding (Unicode/ucs), accepts the user’s queries in the same encoding, and builds the search results.
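Even once everything is in Unicode, the same text can still be represented by more than one code-point sequence, so canonical normalisation must fold them together before indexing. A minimal sketch (we use Latin “é” so the example survives any font; the same applies to Indic combining signs):

```python
import unicodedata

def normalise(text):
    # NFC: one canonical code-point sequence per abstract character
    return unicodedata.normalize("NFC", text)

composed = "caf\u00e9"      # é as a single code point
decomposed = "cafe\u0301"   # e + combining acute accent

print(composed == decomposed)                        # False
print(normalise(composed) == normalise(decomposed))  # True
```

Applying normalise() to both documents and queries ensures that visually identical strings match.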
Myanmar Language

The Myanmar language uses various encodings. The problem is that when a user enters keywords in one specific encoding, the search engine searches for pages using that same encoding, so pages written in different encodings may be missed: relevant pages can be excluded from the results simply because of their encoding.
Several alternative ucs/Unicode encodings have also been implemented by different groups of people to encode Myanmar web pages. These can be divided into three groups.
Graphic encodings : These pretend to be English (technically Latin 1, or Windows Code Page 1252) fonts, substituting Myanmar glyphs for Latin glyphs. This means they use the code points allocated to the Latin alphabet to represent Myanmar characters.
Partially compliant ucs/Unicode encodings : These encodings have different mappings, and none of them fully follows the Universal Coded Character Set (ucs)/Unicode standard. They partially follow the ucs/Unicode standard but are not yet supported by Microsoft and other major software vendors.
ucs/Unicode encodings : These fonts contain not only Unicode code points and glyphs but also the Open Type Layout (otl) logic and rules.
Some Myanmar web pages are made using the so-called Mixture Encoding Style format: a mixture of ucs/Unicode code points and html entities coded in decimal value, such as &#4150; ( ံ ) or &#4156; ( ြ ). For these kinds of web pages, the html entities should be converted back to ucs/Unicode code points (the decimal values are simply the code points written in decimal rather than hexadecimal notation). Some web page publishing software automatically encodes Myanmar words in this Mixture Encoding Style format, and because of it, current popular search engines cannot search Myanmar words properly.
Word segmentation issues on input keywords

Each language has its own characteristics regarding word segmentation, so special attention must be paid to the segmentation method in the indexing process. Appropriate segmentation is still a problem for search engines.
Myanmar Language

Word segmentation is even harder in Asian languages such as Chinese or Myanmar, since words are not separated by spaces. Foo and Li (2004) conducted experiments on the impact of Chinese word segmentation on information retrieval (ir) effectiveness: accuracy varied from 0.34 to 0.47 (on a scale from 0 to 1) depending on the segmentation method.
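A common baseline for unsegmented scripts (our illustrative choice; the methods compared by Foo and Li are statistical) is greedy longest-match segmentation against a lexicon:

```python
# Greedy longest-match segmentation: at each position, take the longest
# lexicon word that matches; fall back to a single character otherwise.

def longest_match_segment(text, lexicon):
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# toy demonstration with Latin letters standing in for an unsegmented script
lexicon = {"search", "engine", "engines", "web"}
print(longest_match_segment("websearchengines", lexicon))
# → ['web', 'search', 'engines']
```

Greedy matching is fast but can commit to a wrong boundary, which is one reason segmentation accuracy varies so much across methods.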