Building a test collection for complex document information processing

Lewis, David; Agam, Gady; Argamon, Shlomo; Frieder, Ophir; Grossman, David A.; Heard, Jeff

doi:10.1145/1148170.1148307

Cited by 204 publications

(129 citation statements)

References 4 publications

Statement types: Supporting; Mentioning (129); Contrasting

“…IIT CDIP was created at the Illinois Institute of Technology (Lewis et al 2006; ) and is based on documents released under the Master Settlement Agreement (MSA) between the Attorneys General of several U.S. states and seven U.S. tobacco companies and institutes. The University of California San Francisco (UCSF) Library, with support from the American Legacy Foundation, has created a permanent repository, the Legacy Tobacco Documents Library (LTDL), for tobacco documents (Schmidt et al 2002), of which IIT CDIP is a cleaned up snapshot generated in 2005 and 2006.…”

Section: The IIT CDIP Collection (mentioning)

confidence: 99%

Evaluation of information retrieval for E-discovery

Oard, Baron, Hedin et al. 2010

Artif Intell Law

The effectiveness of information retrieval technology in electronic discovery (E-discovery) has become the subject of judicial rulings and practitioner controversy. The scale and nature of E-discovery tasks, however, has pushed traditional information retrieval evaluation approaches to their limits. This paper reviews the legal and operational context of E-discovery and the approaches to evaluating search technology that have evolved in the research community. It then describes a multi-year effort carried out as part of the Text Retrieval Conference to develop evaluation methods for responsive review tasks in E-discovery. This work has led to new approaches to measuring effectiveness in both batch and interactive frameworks, large data sets, and some surprising results for the recall and precision of Boolean and statistical information retrieval methods. The paper concludes by offering some thoughts about future research in both the legal and technical communities toward the goal of reliable, effective use of information retrieval in E-discovery.

The first three sections of this article draw upon material in the introductory sections of two papers presented at events associated with the 11th and 12th International Conferences on Artificial Intelligence and Law (ICAIL) (Baron and Thompson 2007; Zhao et al. 2009) as well as material first published in (Baron 2008), with permission.

“…25 non-distorted images in this dataset are taken from two freely available datasets - University of Washington Dataset [5] and Tobacco Database [9]. For each document, multiple photos were taken from a fixed distance to capture the whole document, but the camera was focused at varying distance to generate a series of images with focal blur.…”

Section: Dataset (mentioning)

confidence: 99%

Real-Time No-Reference Image Quality Assessment Based on Filter Learning

Kumar, Kang et al. 2013

2013 IEEE Conference on Computer Vision and Pattern Recognition

This paper addresses the problem of general-purpose No-Reference Image Quality Assessment (NR-IQA) with the goal of developing a real-time, cross-domain model that can predict the quality of distorted images without prior knowledge of non-distorted reference images and types of distortions present in these images. The contributions of our work are two-fold: first, the proposed method is highly efficient. NR-IQA measures are often used in real-time imaging or communication systems, therefore it is important to have a fast NR-IQA algorithm that can be used in these real-time applications. Second, the proposed method has the potential to be used in multiple image domains. Previous work on NR-IQA focus primarily on predicting quality of natural scene image with respect to human perception, yet, in other image domains, the final receiver of a digital image may not be a human. The proposed method consists of the following components: (1) a local feature extractor; (2) a global feature extractor and (3) a regression model. While previous approaches usually treat local feature extraction and regression model training independently, we propose a supervised method based on back-projection, which links the two steps by learning a compact set of filters which can be applied to local image patches to obtain discriminative local features. Using a small set of filters, the proposed method is extremely fast. We have tested this method on various natural scene and document image datasets and obtained state-of-the-art results.

“…The collection used for the experiments is the Complex Document Information Processing (CDIP) test collection [6]. CDIP includes 7 million scanned documents and over 42 million pages, received from tobacco company lawsuits.…”

Section: CDIP Tobacco Dataset (mentioning)

confidence: 99%

Scalable ranked retrieval using document images

2013

Despite the explosion of text on the Internet, hard copy documents that have been scanned as images still play a significant role for some tasks. The best method to perform ranked retrieval on a large corpus of document images, however, remains an open research question. The most common approach has been to perform text retrieval using terms generated by optical character recognition. This paper, by contrast, examines whether a scalable segmentation-free image retrieval algorithm, which matches sub-images containing text or graphical objects, can provide additional benefit in satisfying a user's information needs on a large, real world dataset. Results on 7 million scanned pages from the CDIP v1.0 test collection show that content based image retrieval finds a substantial number of documents that text retrieval misses, and that when used as a basis for relevance feedback can yield improvements in retrieval effectiveness.
