2006
DOI: 10.1007/s10791-006-9004-6

Extending WHIRL with background knowledge for improved text classification

Abstract: Intelligent use of the many diverse forms of data available on the Internet requires new tools for managing and manipulating heterogeneous forms of information. This paper uses WHIRL, an extension of relational databases that can manipulate textual data using statistical similarity measures developed by the information retrieval community. We show that although WHIRL is designed for more general similarity-based reasoning tasks, it is competitive with mature systems designed explicitly for inductive classifica…
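To make the idea of classification by statistical text similarity concrete, here is a minimal sketch of a nearest-neighbor classifier built on TF-IDF weights and cosine similarity. It is not WHIRL and does not reproduce the paper's method: the function names (`tfidf_vectors`, `cosine`, `classify`), the whitespace tokenizer, the particular TF-IDF weighting, and the score-weighted vote are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document.

    The weighting (1 + log tf) * log(1 + N / df) is an illustrative choice,
    not necessarily the one used by WHIRL.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (1 + math.log(c)) * math.log(1 + n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def classify(query, training, k=3):
    """Label `query` by a similarity-weighted vote of its k nearest examples.

    `training` is a list of (text, label) pairs. The query is included when
    computing document frequencies, which is a simplification.
    """
    docs = [text.lower().split() for text, _ in training]
    docs.append(query.lower().split())
    vecs = tfidf_vectors(docs)
    q = vecs[-1]
    scored = sorted(
        ((cosine(q, v), label) for v, (_, label) in zip(vecs, training)),
        reverse=True,
    )[:k]
    votes = Counter()
    for score, label in scored:
        votes[label] += score
    return votes.most_common(1)[0][0]


# Hypothetical toy data, invented for this example only.
training = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sports"),
    ("player scores late goal", "sports"),
]
print(classify("football team scores goal", training))
```

The background-knowledge extension named in the title is outside the scope of this sketch; only the basic similarity-based labeling step is shown.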

Cited by 14 publications (9 citation statements)
References 23 publications
“…Because we have seen that we have a reasonably effective solution to the many‐candidates problem, we can use impostors to reduce the verification problem to the many‐candidates problem. The use of impostors as a background set is a well‐established practice in the speaker‐identification community (e.g., Reynolds, ) and has also been applied to information‐retrieval problems (Zelikovitz, Cohen, & Hirsh, ), but, as far as we know, has not been previously used for authorship attribution.…”
Section: The Impostors Methods
Citation type: mentioning
confidence: 99%
“…The semantics could be derived from a collection of much longer documents in a similar domain as the short texts [14], by sending the input short texts as queries to a search engine to retrieve a set of most relevant results [11], or by exploiting external resources such as Wikipedia [9] and WordNet [5]. Once the auxiliary data is obtained, it is often used to expand the original texts, which are then processed by traditional text mining models.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Most existing approaches try to enrich the representation of a short text using additional semantics. The semantics could be derived internally from the short text collection [3], externally from a collection of much longer documents in a similar domain as the short texts [5], or from much larger external sources such as Wikipedia and WordNet [1,3,4]. In [1,4], the classification accuracy is significantly improved by enriching short text feature vector with relevant hidden topics derived from Wikipedia pages using topic model.…”
Section: Introduction
Citation type: mentioning
confidence: 99%