An Effective Pre-Processing Algorithm for Information Retrieval Systems

Singh, Vikram; Saini, Balwinder

doi:10.5121/ijdms.2014.6602

Cited by 15 publications

(3 citation statements)

References 16 publications

“…The first component of EpiNews deals with the preprocessing of HealthMap articles through a series of preprocessing steps, such as removal of non-textual elements, tokenization 28,29, lemmatization 30 and removal of stop words via BASIS Technologies’ Rosette Language Processing (RLP) tools 31,32. For more details on these steps, see Supplementary Section ‘HealthMap preprocessing’.…”

Section: Methods (mentioning)

confidence: 99%

“…Tokenization and lemmatization. Tokenization 25,26 is the process of segmenting a textual content into words, phrases, symbols or other meaningful elements commonly referred to as tokens. Lemmatization 27 is performed after tokenization and can be defined as the normalization process in which various inflected forms of a word are converted to the same underlying lemma so that they can be analyzed as a single term.…”

Section: /21 (mentioning)

confidence: 99%
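
The tokenization and lemmatization steps described in the statement above, together with the removal of non-textual elements and stop words mentioned in the other statements, can be illustrated with a minimal Python sketch. This is not the RLP toolchain or the cited authors' implementation: the stop-word set, the lemma lookup table, and all function names (`strip_non_text`, `tokenize`, `lemmatize`, `remove_stop_words`, `preprocess`) are hypothetical, and the dictionary lookup merely stands in for a real lemmatizer.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
# A real system would use a full stop-word list and a proper lemmatizer.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "was", "were", "is", "are"}
LEMMA_TABLE = {
    "cases": "case",
    "reports": "report",
    "reported": "report",
    "outbreaks": "outbreak",
}


def strip_non_text(raw):
    """Drop HTML-like tags and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def tokenize(text):
    """Segment lower-cased text into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lemmatize(tokens):
    """Map each inflected form to its lemma; unknown tokens pass through."""
    return [LEMMA_TABLE.get(token, token) for token in tokens]


def remove_stop_words(tokens):
    """Discard tokens that appear in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(raw):
    """Apply the four steps in the order the statements list them."""
    return remove_stop_words(lemmatize(tokenize(strip_non_text(raw))))


if __name__ == "__main__":
    print(preprocess("<p>The outbreaks were reported in India</p>"))
```

Here lemmatization runs before stop-word removal, matching the order in which the statements list the steps. Because both inflected forms map to one lemma, "reports" and "reported" end up as the single term "report".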

“…We translated the textual content of these articles to English for ease of analysis. The articles were preprocessed by removing non-textual elements, tokenization 25,26, lemmatization 27 and removal of stop words via BASIS Technologies' Rosette Language Processing (RLP) tools 28,29. For more details on these steps, see subsection 'HealthMap preprocessing' within the section 'Supplementary Information' at the end of the manuscript.…”

Section: Introduction (mentioning)

confidence: 99%

Temporal Topic Modeling to Assess Associations between News Trends and Infectious Disease Outbreaks

Ghosh, Chakraborty, Nsoesie et al. 2017, Sci Rep

In retrospective assessments, internet news reports have been shown to capture early reports of unknown infectious disease transmission prior to official laboratory confirmation. In general, media interest and reporting peaks and wanes during the course of an outbreak. In this study, we quantify the extent to which media interest during infectious disease outbreaks is indicative of trends of reported incidence. We introduce an approach that uses supervised temporal topic models to transform large corpora of news articles into temporal topic trends. The key advantages of this approach include: applicability to a wide range of diseases and ability to capture disease dynamics, including seasonality, abrupt peaks and troughs. We evaluated the method using data from multiple infectious disease outbreaks reported in the United States of America (U.S.), China, and India. We demonstrate that temporal topic trends extracted from disease-related news reports successfully capture the dynamics of multiple outbreaks such as whooping cough in U.S. (2012), dengue outbreaks in India (2013) and China (2014). Our observations also suggest that, when news coverage is uniform, efficient modeling of temporal topic trends using time-series regression techniques can estimate disease case counts with increased precision before official reports by health organizations.
