Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2008
DOI: 10.1145/1390334.1390367
Enhancing text clustering by leveraging Wikipedia semantics

Abstract: Most traditional text clustering methods are based on a "bag of words" (BOW) representation built from frequency statistics over a set of documents. BOW, however, ignores important information about the semantic relationships between key terms. To overcome this problem, several methods have been proposed to enrich text representation with external resources, such as WordNet. However, many of these approaches suffer from some limitations: 1) WordNet has limited coverage and lacks effective word-se…
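As a concrete illustration of the BOW limitation the abstract describes, here is a minimal tf-idf sketch (the toy documents and whitespace tokenisation are assumptions for illustration, not the paper's setup):

```python
from collections import Counter
import math

def bow_tfidf(docs):
    """Build a simple bag-of-words tf-idf representation.

    Each document becomes a dict mapping term -> tf-idf weight.
    Note that synonyms (e.g. "car" vs "automobile") occupy
    unrelated dimensions, which is the limitation the paper
    addresses with Wikipedia semantics.
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()  # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency in this document
        vectors.append({
            term: count * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

docs = ["the car is fast", "the automobile is quick"]
vecs = bow_tfidf(docs)
# "car" and "automobile" share no dimension, so the two
# documents look dissimilar despite meaning the same thing.
```

Terms appearing in every document (like "the") get weight log(n/n) = 0, so only discriminative terms contribute, yet semantically equivalent terms still never align.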

Cited by 154 publications (97 citation statements) · References 13 publications
“…We performed lemmatisation, part-of-speech annotation, named entity tagging, and dependency parsing using the Stanford CoreNLP toolkit. We used the Jan. 30, 2010 English version of Wikipedia and processed it according to the method described by Hu et al. (2008).…”
Section: Experimental Settings
confidence: 99%
“…Examples include information retrieval [4,14,18], named entity disambiguation [1,2,7,8,11,12], text classification [25] and entity ranking [10]. To extract the content of an entity context, many studies directly used the Wikipedia article describing the entity [1,2,8,9,14,[25][26][27]; some works extended the article with all the other Wikipedia articles linked to the article describing the entity [6,7,12]; while some only considered the first paragraph of the Wikipedia article describing the entity [2]. Different from these approaches, our graph-based approach not only employs in-links and language links to broaden the set of articles likely to mention the entity, but also performs a finer-grained process: extracting the sentences that mention the entity, so that all the sentences in our context are closely related to the target entity.…”
Section: Related Work
confidence: 99%
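The finer-grained context extraction this excerpt describes, keeping only sentences that mention the target entity, can be sketched roughly as follows (the regex sentence splitter and alias list are simplifying assumptions, not the authors' implementation):

```python
import re

def entity_context(articles, entity, aliases=()):
    """Collect sentences that mention the entity (or an alias)
    from a set of article texts. A toy sketch of sentence-level
    context extraction; real systems use proper sentence
    segmentation and link structure to find candidate articles."""
    names = [entity, *aliases]
    pattern = re.compile("|".join(re.escape(n) for n in names),
                         re.IGNORECASE)
    sentences = []
    for text in articles:
        # naive split on sentence-ending punctuation
        for sent in re.split(r"(?<=[.!?])\s+", text):
            if pattern.search(sent):
                sentences.append(sent)
    return sentences

articles = [
    "Paris is the capital of France. It has museums.",
    "The Eiffel Tower is in Paris.",
]
ctx = entity_context(articles, "Paris")
# keeps only the two sentences that mention "Paris"
```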
“…As to the context-based representation vector of the entity, [1,11] defined it as the tf-idf/word count/binary occurrence values of all the vocabulary words in the context content; [2,19] defined it as the word count/binary occurrence values of other entities in the context content; [5,6,9,14,25] defined it as the tf-idf similarity values between the target entity's context content and other entities' context contents from Wikipedia; [27] defined it as the visiting probability from the target entity to other entities in Wikipedia; [7,26] used a measure based on the common entities linked to the target entity and other entities in Wikipedia. Different from all earlier work, we employ aspect weights that interpret frequency and selectivity differently from typical tf-idf values and take co-occurrence and language specificity of the aspects into account.…”
Section: Related Work
confidence: 99%
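The tf-idf similarity between context contents mentioned in this excerpt is conventionally computed as cosine similarity over sparse term-weight vectors; a minimal sketch (the dict-based sparse representation is an assumption for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term -> weight dicts,
    the standard way to compare tf-idf context vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Identical vectors score 1.0, vectors with no shared terms score 0.0, matching the intuition that entities with overlapping context vocabulary are related.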
“…Various approaches have been proposed [3,15,5]. We take the same route as [9], and use Wikipedia's vocabulary of anchor texts to connect words and phrases to Wikipedia articles.…”
Section: Selecting Relevant Wikipedia Concepts
confidence: 99%
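The anchor-text lookup this excerpt describes can be sketched as a greedy longest-match against a phrase-to-article dictionary (the toy vocabulary below is an assumption for illustration, not the actual Wikipedia-mined anchor vocabulary):

```python
def link_phrases(text, anchor_vocab):
    """Map phrases in `text` to article titles via an anchor-text
    vocabulary (phrase -> article title), preferring the longest
    matching phrase at each position."""
    tokens = text.lower().split()
    links, i = [], 0
    while i < len(tokens):
        # try the longest candidate phrase first
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in anchor_vocab:
                links.append((phrase, anchor_vocab[phrase]))
                i = j
                break
        else:
            i += 1  # no phrase starts here; move on
    return links

vocab = {
    "machine learning": "Machine_learning",
    "clustering": "Cluster_analysis",
}
links = link_phrases("machine learning improves clustering", vocab)
# -> [("machine learning", "Machine_learning"),
#     ("clustering", "Cluster_analysis")]
```

Preferring the longest match keeps multi-word anchors like "machine learning" from being split into their less specific parts.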