This paper gives an overview and evaluation of Asian-language Web pages, in particular pages in languages that have received little attention so far. The authors collected over 100 million Asian Web pages downloaded from 42 Asian country domains, identified their languages using n-gram statistics, and analyzed the properties of those languages. The presence of a language is measured primarily by the number of pages written in it. The survey reveals that a serious digital language divide exists in the region. The state of multilingualism and the dominant presence of cross-border languages, English in particular, are analyzed. The paper also sheds light on script and encoding issues of Asian-language texts on the Web. To promote language resource collection and sharing, the authors envision an instrument for observing and collecting Asian language resources on the Web. The survey results demonstrate the feasibility of this vision and give a clearer idea of the steps needed to realize it.
Language identification technology is widely used in machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages, but the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n-gram-based language identification algorithm on 68 written languages used in the European, African and Asian regions. The secondary objective is to evaluate how n-gram orders and a mixed n-gram model affect the relative performance and accuracy of language identification. The n-gram-based algorithm used in this paper does not depend on n-gram frequency. Instead, it uses a Boolean method to decide whether each n-gram of the target text matches a training n-gram. The algorithm is designed to automatically detect the language, script and character encoding scheme of a written text. Identifying all three properties is important because a language can be written in different scripts and encoded with different character encoding schemes. The experimental results show that in one test the algorithm achieved a correct identification rate of up to 99.59% on selected languages. The results also show that language identification performance can be improved by using a mixed n-gram model of bigrams and trigrams, which consumes less disk space and computing time than a trigram-only model.
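To make the Boolean matching idea concrete, the following is a minimal sketch of frequency-independent, character-level n-gram language identification with a mixed bigram-and-trigram profile, as described in the abstract above. The function names, n-gram orders and toy training snippets are illustrative assumptions, not the paper's actual data or implementation, and script or encoding detection is not shown.

```python
# Minimal sketch of Boolean n-gram language identification (assumed design,
# not the paper's code). Each language profile is a set of character n-grams;
# a target text is scored by set membership only, ignoring n-gram frequency.

def char_ngrams(text, n):
    """Return the set of character n-grams in text (Boolean presence only)."""
    text = f" {text.strip().lower()} "          # pad so word edges form n-grams
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_profiles(training_texts, orders=(2, 3)):
    """Build one n-gram set per language, mixing the given orders (here bigram + trigram)."""
    profiles = {}
    for lang, text in training_texts.items():
        grams = set()
        for n in orders:
            grams |= char_ngrams(text, n)
        profiles[lang] = grams
    return profiles

def identify(text, profiles, orders=(2, 3)):
    """Score each language by the fraction of the target's n-grams found in its profile."""
    target = set()
    for n in orders:
        target |= char_ngrams(text, n)
    scores = {lang: len(target & grams) / len(target)
              for lang, grams in profiles.items()}
    return max(scores, key=scores.get), scores

# Toy usage with hypothetical training snippets.
training = {
    "english": "the quick brown fox jumps over the lazy dog",
    "malay":   "bahasa melayu ialah bahasa kebangsaan malaysia",
}
profiles = build_profiles(training)
lang, scores = identify("language identification is useful", profiles)
print(lang, scores)
```

Because the score is based only on whether an n-gram is present, the profile for each language can be stored as a plain set, which is one way the mixed bigram-and-trigram model can stay smaller and faster to query than a full trigram frequency table.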