This paper presents an overview and evaluation of Asian-language Web pages, in particular pages in languages that have received little attention so far. The authors collected over 100 million Asian Web pages downloaded from 42 Asian country domains, identified their languages using n-gram statistics, and analyzed their language properties. The presence of a language is measured primarily by the number of pages written in it. The survey reveals that the digital language divide exists at a serious level in the region. The state of multilingualism and the dominant presence of cross-border languages, English in particular, are analyzed. The paper also sheds light on script and encoding issues of Asian-language texts on the Web. To promote language resource collection and sharing, the authors envision an observation-collection instrument for Asian language resources on the Web. The survey results show the feasibility of this vision and give a clearer idea of the steps needed to realize it.
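To illustrate the kind of n-gram statistics such a survey relies on, here is a minimal sketch, not the authors' actual pipeline, of building a character n-gram frequency profile in the classic rank-order style; the function name and parameters are illustrative assumptions:

```python
from collections import Counter

def ngram_profile(text: str, n: int = 3, top_k: int = 300) -> list[str]:
    """Build a rank-ordered character n-gram profile of a text.

    The most frequent n-grams, listed in rank order, act as a
    fingerprint of the language the text is written in; profiles of
    texts in the same language share many top-ranked n-grams.
    """
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [gram for gram, _ in counts.most_common(top_k)]

# Toy usage on an English sentence.
print(ngram_profile("the quick brown fox jumps over the lazy dog")[:5])
```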
Language identification technology is widely used in machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages, but the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n-gram based language identification algorithm on 68 written languages used in the European, African, and Asian regions. The secondary objective is to evaluate how n-gram orders and a mixed n-gram model affect the relative performance and accuracy of language identification. The n-gram based algorithm used in this paper does not depend on n-gram frequencies; instead, it uses a Boolean method to determine whether each target n-gram matches a training n-gram. The algorithm is designed to automatically detect the language, script, and character encoding scheme of a written text. Identifying all three properties is important because a language can be written in different scripts and encoded with different character encoding schemes. The experimental results show that in one test the algorithm achieved up to a 99.59% correct identification rate on selected languages. The results also show that the performance of language identification can be improved by using a mixed n-gram model of bigrams and trigrams, which consumed less disk space and computing time than a trigram model.
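A minimal sketch of the Boolean matching idea described above, assuming set membership as the matching test and a union of bigrams and trigrams as the mixed model; the training texts and function names are illustrative, not the authors' implementation:

```python
def char_ngrams(text: str, n: int) -> set[str]:
    """Return the set of character n-grams in a text (no frequencies)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def mixed_model(text: str) -> set[str]:
    """Mixed n-gram model: the union of the bigram and trigram sets."""
    return char_ngrams(text, 2) | char_ngrams(text, 3)

def identify(target: str, training: dict[str, set[str]]) -> str:
    """Boolean identification: each target n-gram either occurs in a
    training model or it does not; the language whose model covers the
    most target n-grams wins. No frequency information is used."""
    target_grams = mixed_model(target)
    return max(training, key=lambda lang: len(target_grams & training[lang]))

# Hypothetical training texts; a real system would train on corpora
# covering language, script, and encoding variants.
training = {
    "english": mixed_model("the quick brown fox jumps over the lazy dog"),
    "malay":   mixed_model("saya suka makan nasi goreng pada waktu pagi"),
}
print(identify("the fox is quick", training))  # -> english
```

Storing only the set of bigrams and trigrams, rather than a full trigram frequency table, is consistent with the space and time savings the abstract reports for the mixed model.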
The "digital divide" is the gap in technology usage and access. The digital divide has been investigated by scholars [1] and policy makers [2] mainly as an economy-specific issue that permeates the population across all demographic profiles, such as income, gender, age, education, race, and region, but not specific to the languages of different communities. The lack of native language driven ICT is a major conducive factor in digital divide.Sinhala writing system used in Sri Lanka is a syllabic writing system derived from Brahmi which consist of vowels, consonants, diacritical marks and special symbols constructs. Several of these constructs are combined to form complex ligatures. The total number of different glyphs is almost close to 2300 in Sinhala language. Thus, all computer equipments that support Sinhala language needs to support a greater degree of complexity in both display and printing with near minimal changes to the keyboard or the input systems. In this paper we discuss (1) historical background of the Sinhala writing system, (2) Sinhala scripts' characteristics and complexities and illustrate (3) how Sinhala computing technology has evolved over the last quarter century. Major steps are marked by the design of character code standards as a corner stone of whole architecture for text processing. A case described in this article of "Digital Inclusion" shows how small communities of non-Roman script users can connect to the Romanized system dominated cyberspace.