Abstract. This paper examines the retrieval of OCR-degraded text using n-gram formulations within a probabilistic retrieval system. Direct retrieval of documents using n-gram databases of 2- and 3-grams, or of 2-, 3-, 4- and 5-grams, improved retrieval performance over standard (word-based) queries on the same data when the degradation level was 10 percent or worse. A second method is also described, which uses n-grams to identify matching and near-matching terms for query expansion; it also performed better than standard queries. This method was less effective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web-based retrieval application is described that uses n-gram retrieval of OCR text and displays the source document image with query term highlighting.
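The second method, finding near-matching terms through shared n-grams, can be illustrated with a minimal sketch. The abstract does not say which similarity measure or threshold was used, so the Dice coefficient, the 0.5 threshold, and the names `char_ngrams`, `dice_similarity` and `expand_query_term` are all assumptions made here for illustration only.

```python
from collections import Counter


def char_ngrams(term, sizes=(2, 3)):
    """Return a multiset of the character n-grams of `term` for each size."""
    text = term.lower()
    grams = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams


def dice_similarity(a, b, sizes=(2, 3)):
    """Dice coefficient over shared n-grams (an assumed measure, not the paper's)."""
    grams_a = char_ngrams(a, sizes)
    grams_b = char_ngrams(b, sizes)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    return 2.0 * shared / total


def expand_query_term(term, vocabulary, threshold=0.5, sizes=(2, 3)):
    """List vocabulary entries similar enough to `term`, best match first."""
    scored = [(dice_similarity(term, v, sizes), v) for v in vocabulary]
    return [v for score, v in sorted(scored, reverse=True) if score >= threshold]


if __name__ == "__main__":
    # Hypothetical index vocabulary containing OCR-style corruptions.
    vocab = ["retrieva1", "retrieval", "banana", "rctrieval"]
    print(expand_query_term("retrieval", vocab))
```

With this toy vocabulary, the corrupted forms `retrieva1` and `rctrieval` still share most of their 2- and 3-grams with `retrieval` and are kept as expansion terms, while an unrelated word is dropped.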
Abstract.

I. OVERVIEW

The Topic Detection and Tracking (TDT) research community investigates information retrieval methods for organizing a constantly arriving stream of news articles by the events that they discuss. TDT is explored in an open and cooperative evaluation sponsored by DARPA and run by NIST; the evaluations have run every year since 1998.

One of the organization tasks included in TDT is topic detection, where systems cluster arriving stories into bins depending on the topic (event) being discussed. For example, stories that discuss the same bombing should be grouped together, but other bombings at the same or different locations should be grouped separately. Systems are typically required to process each story before considering the next, and do not have any knowledge of the topics (bins) that will be appearing in the news.

In this paper, we describe our experience deploying a TDT detection system in two real-world applications, the unexpected changes we had to make in the research system for it to be usable in a real setting, and how those changes have resulted in substantive changes in the TDT evaluation program (starting with TDT 2004). The point of this paper is not to serve as an indictment of TDT, nor to criticize the evaluations of TDT that have taken place. Rather, this paper serves as a cautionary note for technology evaluation communities: highlighting the possibility of a mismatch between evaluation abstractions and the real world, and reinforcing the mantra that both evaluations and applications can benefit from a cyclic relationship between the two.

TDT began as a technology development and evaluation program [1]. In the DARPA-sponsored TDT evaluations, detection systems are compared by their ability to put all stories in a single topic together. The official measure is a cost function that combines system miss and false alarm rates on a per-topic basis [8].
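As a rough illustration of how a miss rate and a false alarm rate can be folded into a single cost, the sketch below uses a normalized detection cost of the kind commonly associated with the TDT evaluations. The text here does not give the formula or its constants, so the form of the function and the default values `c_miss=1.0`, `c_fa=0.1` and `p_target=0.02` are assumptions, and the per-topic averaging is omitted.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized detection cost for one topic.

    The constants are assumed defaults, not values stated in the text.
    """
    # Expected cost: misses weighted by how often on-topic stories occur,
    # false alarms weighted by how often off-topic stories occur.
    raw = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Normalize by the cheaper of the two trivial systems
    # (accept everything or reject everything).
    trivial = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return raw / trivial


if __name__ == "__main__":
    # Illustrative rates: a 28% miss rate and a 0.3% false alarm rate.
    print(round(detection_cost(0.28, 0.003), 4))
```

Under these assumed constants, a 28% miss rate and a 0.3% false alarm rate give a normalized cost of about 0.29.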
Currently, the best systems achieve about a 0.3 cost value, typified by one system that had a 28% miss rate and a 0.3% false alarm rate on a randomly selected topic.

We initially fielded our TDT detection technology based on the best parameter values as determined by the formal evaluation. However, it quickly became apparent that the resulting clusters were of sufficiently poor quality that they could not be used: they were either too focused or, more commonly, far too broad. We also found that relationships between topics were less crisp than in the TDT evaluation data, and that algorithmic selection of topic granularity was almost never correct. For example, the system tended to group topics from the same geographical area rather than break them into events. These effects were very clear in both newswire and Web news environments.

Our failure analysis of TDT technology has contributed to several changes in the TDT evaluation. For example, starting with TDT 2004, topics will be expected to overlap, to be hierarchical, and to be less rooted in a single "seminal" event. Also partly inspired by deploying the technology, other tasks in TDT ar...