Proceedings of the Eighth International Conference on Information and Knowledge Management 1999
DOI: 10.1145/319950.319956
Extracting significant time varying features from text

Abstract: We propose a simple statistical model for the frequency of occurrence of features in a stream of text. Adoption of this model allows us to use classical significance tests to filter the stream for interesting events. We tested the model by building a system and running it on a news corpus. By a subjective evaluation, the system worked remarkably well: almost all of the groups of identified tokens corresponded to news stories and were appropriately placed in time. A preliminary objective evaluation was also use…
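The abstract describes filtering a token-frequency stream with classical significance tests. A minimal sketch of that idea, assuming a chi-square test on a 2×2 contingency table (this token vs. all other tokens, on the target day vs. the rest of the corpus); the function names and the p < 0.05 critical value are illustrative choices, not details taken from the paper:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table:
        a = occurrences of the token on the target day
        b = occurrences of all other tokens on the target day
        c = occurrences of the token on all other days
        d = occurrences of all other tokens on all other days
    """
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def is_significant(a, b, c, d, critical=3.84):
    # 3.84 is the critical value for p < 0.05 with 1 degree of freedom;
    # tokens exceeding it are candidate "events" in the stream.
    return chi_square_2x2(a, b, c, d) > critical
```

For example, a token that jumps from background level to 50 occurrences in 1,000 on one day is flagged, while a token whose daily rate matches its overall rate is not.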

Cited by 81 publications (56 citation statements)
References 4 publications
“…Intuitively, burst detection can be achieved by identifying a burst region where the data value exceeds a certain threshold. The threshold can be determined from heuristics [20], assumptions about the data distribution [21], or statistical tests [22]. The Cumulative Sum (CUSUM) method [23] is one of the most popular statistical approaches for change point detection.…”
Section: Related Work (mentioning, confidence: 99%)
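The excerpt above names CUSUM as a popular change-point method. A minimal one-sided CUSUM sketch for a stream of values such as daily token counts; the slack parameter `k` and decision threshold `h` follow common textbook usage, and the values below are illustrative:

```python
def cusum_detect(values, target_mean, k=0.5, h=5.0):
    """One-sided CUSUM: accumulate deviations above target_mean
    (minus the slack k); signal a change when the cumulative sum
    exceeds the threshold h. Returns the index of the first alarm,
    or None if no change is detected."""
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target_mean - k))
        if s > h:
            return i
    return None
```

A stream that holds steady near the target mean never triggers an alarm, while a sustained upward shift accumulates quickly and is flagged within a few observations of the change point.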
“…Significant solutions range from extracting time-varying features from texts (Swan and Allan, 1999) to constructing timelines for event classification based on word usage statistics (Swan and Jensen, 2000) and personalized newsfeeds based on information novelty (Gabrilovich et al., 2004). In the latter, the inter- and intra-document dynamics of documents are considered to model how information evolves over time from article to article, as well as within individual articles.…”
Section: Previous Research (mentioning, confidence: 99%)
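The excerpt above mentions newsfeeds ranked by information novelty. One simple way to operationalize a novelty score, assuming a smoothed unigram model of previously seen text; this is an illustrative stand-in, not the method of the cited work:

```python
import math
from collections import Counter

def novelty_score(article_tokens, seen_counts, alpha=1.0):
    """Score an article by the average negative log probability of its
    tokens under a unigram model of previously seen text, with
    add-alpha smoothing. Higher scores indicate more novel content."""
    vocab = len(seen_counts) + 1  # +1 reserves mass for unseen tokens
    total = sum(seen_counts.values())
    score = 0.0
    for tok in article_tokens:
        p = (seen_counts.get(tok, 0) + alpha) / (total + alpha * vocab)
        score += -math.log(p)
    return score / max(len(article_tokens), 1)
```

An article that repeats familiar vocabulary scores low; one dominated by previously unseen terms scores high, and a feed could surface the highest-scoring articles first.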
“…For example, concept drift in textual data streams can be identified by monitoring word frequencies (Swan and Allan 1999) and the formation of new word clusters (Hsiao and Chang 2008; Spinosa et al. 2007). Kifer et al. (2004) introduce a more generic approach that uses a two-window paradigm to detect changes in feature distribution.…”
Section: Triggered Rebuild (mentioning, confidence: 99%)
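The two-window paradigm mentioned above can be sketched as comparing the word distributions of two adjacent windows of the stream; the Jensen-Shannon-style divergence and the threshold below are illustrative choices, not those of the cited work:

```python
import math
from collections import Counter

def window_divergence(window_a, window_b):
    """Jensen-Shannon divergence between the word distributions of two
    token windows. Zero for identical distributions; up to ln(2) for
    disjoint ones. A large value suggests the distribution has drifted."""
    ca, cb = Counter(window_a), Counter(window_b)
    na, nb = sum(ca.values()), sum(cb.values())
    js = 0.0
    for w in set(ca) | set(cb):
        pa, pb = ca[w] / na, cb[w] / nb
        m = 0.5 * (pa + pb)
        if pa:
            js += 0.5 * pa * math.log(pa / m)
        if pb:
            js += 0.5 * pb * math.log(pb / m)
    return js

def drift_detected(stream, window=50, threshold=0.1):
    """Slide two adjacent windows over the token stream and return the
    index where their divergence first exceeds the threshold."""
    for i in range(window, len(stream) - window + 1, window):
        ref, cur = stream[i - window:i], stream[i:i + window]
        if window_divergence(ref, cur) > threshold:
            return i
    return None
```

A stream whose vocabulary is stable yields near-zero divergence everywhere; an abrupt topic shift pushes the divergence toward ln(2) at the boundary and triggers detection.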