Proceedings of the Fifth International Workshop on Information Retrieval With Asian Languages 2000
DOI: 10.1145/355214.355235

On the use of words and n-grams for Chinese information retrieval

Abstract: In the processing of Chinese documents and queries in information retrieval (IR), one has to identify the units that are used as indexes. Words and n-grams have been used as indexes in several previous studies, which showed that both kinds of indexes lead to comparable IR performances. In this study, we carry out more experiments on different ways to segment documents and queries, and to combine words with n-grams. Our experiments show that a combination of the longest-matching algorithm with single characters…
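
To make the two kinds of index units named in the abstract concrete, here is a minimal Python sketch of overlapping character n-grams and of forward longest-matching segmentation. It is an illustration only, not the authors' implementation: the function names, the toy dictionary, and the `max_len` parameter are all assumptions. Because the abstract is truncated, how the paper actually combines longest-matching words with single characters is not known here; `index_units` simply emits both.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def longest_match_segment(text, dictionary, max_len=4):
    """Forward longest-matching segmentation (hypothetical helper).

    At each position, take the longest dictionary entry of at most
    `max_len` characters; fall back to a single character when no
    entry matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


def index_units(text, dictionary):
    """Emit segmented words plus single characters, without duplicates.

    This is one assumed way to combine the two; the paper may differ.
    """
    words = longest_match_segment(text, dictionary)
    chars = list(text)
    return list(dict.fromkeys(words + chars))


# Toy dictionary, invented for illustration.
toy_dictionary = {"信息", "检索", "信息检索", "系统"}
text = "信息检索系统"

bigrams = char_ngrams(text, 2)
words = longest_match_segment(text, toy_dictionary)
units = index_units(text, toy_dictionary)
```

With this toy dictionary, the segmenter prefers the four-character entry "信息检索" over the shorter "信息" and "检索", whereas the bigram index keeps every adjacent character pair, including ones that straddle word boundaries.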

Cited by 49 publications
(35 citation statements)
References 6 publications
“…Table 4 shows the test results on TREC Chinese corpus. The results on 1-gram and 1,2-gram are comparable to other researches [5,6]. Wu [9] applied suffix tree for Chinese information retrieval but he did not mention how to rank the documents retrieved.…”
Section: Experiments and Future Work
supporting
confidence: 61%
“…Abstract [5,6], word segmentation and its effect on information retrieval [3]. These studies show that using either words or n-grams leads to comparable performances.…”
mentioning
confidence: 99%
“…It has been used, for example, with most European languages (McNamee and Mayfield, 2004a; Savoy, 2003; Hollink et al, 2004; Vilares et al, 2006), whether Romance, Germanic or Slavic languages, and others like Greek, Hungarian and Finnish; it being particularly accurate for compounding and highly inflectional languages. Moreover, although n-grams have been successfully applied to many other languages such as Farsi (Persian) (McNamee, 2009), Turkish (Ekmekçioglu et al, 1996), Arabic (Khreisat, 2009; Darwish and Oard, 2002; Savoy and Rasolofo, 2002) and several Indian languages (Dolamic and Savoy, 2008), they are particularly popular and effective in Asian IR (Nie and Ren, 1999; Foo and Li, 2004; Nie et al, 2000; Kwok, 1997; Ogawa and Matsuda, 1999; Ozawa et al, 1999; Lee and Ahn, 1996; McNamee, 2002). The reason for this is the nature of these languages.…”
Section: The N-gram Based Approach
mentioning
confidence: 99%
“…It is found that approaches using either characters (bigrams) or words can lead to comparable retrieval effectiveness [6,11,12]. In [14], it is further found that the retrieval effectiveness using a character-based language model is highly competitive to, and on several collections, is even higher than that using words and bigrams.…”
Section: Related Work
mentioning
confidence: 95%
“…Two general families of approaches have been proposed in the literature: using characters (mainly character unigrams and bigrams) and using words. It has been found in several studies that it is beneficial to combine different types of index [5,6,11]. Indeed, while a word can represent precisely a meaning, the meaning can also be expressed by other words and characters.…”
Section: Introduction
mentioning
confidence: 99%