Abstract-This paper reports on a study undertaken to explore the problem of volume in searching large scale digital collections. An experiment is conducted using elimination terms as a method to reduce the number of non-relevant documents in the information retrieval (IR) result. The goal is to provide insight into how elimination terms can be used as a sorting method to reduce volume. The results of the experiment demonstrate that modifying the search structure with an elimination term component can significantly reduce the number of non-relevant documents in the retrieval set to address the problem of high volume in electronic document sorting and searching tasks.

Index Terms-Information retrieval, knowledge discovery, search methods, document sorting.
I. INTRODUCTION

Volume is one of the "three Vs" associated with large scale digital information (volume, variety and veracity) [1]. In this paper we focus on the volume problem by asking the question: How can we apply sorting methods to reduce the volume of electronic documents for sorting and searching tasks?

A current problem is that, given the large volumes of information contained in electronic stores, search tools need to support the retrieval of relevant documents from large collections without producing too many non-relevant documents [2]-[4].

A relevant document is defined as a document that meets the user's needs (see van Rijsbergen, 1979). A non-relevant document is defined as a document that does not meet the user's needs. To address the problem of high volume in digital collections, an effective sorting method should return relevant documents, and withhold non-relevant documents, in the IR result [5].

An additional concern is the retrieval of unauthorized documents, such as privileged or private documents (treated as non-relevant for sorting purposes). Also, the high cost of manual review of documents, both paper and electronic, increases the priority of using an automated method for sorting relevant and non-relevant documents, and for reducing the number of documents in the retrieval set that must ultimately be settled by human review, the most expensive part of the process.

Manuscript received October 16, 2015; revised February 4, 2016.
Harvey Hyman is with New College of Florida, 5800 Bay Shore Road, Sarasota, Florida 34243, USA (e-mail: hhyman@NCF.edu).
Terry Sincich and Rick Will are with University of South Florida, 4202 E. Fowler Avenue, Tampa, Florida 33620, USA (e-mail: tsincich@USF.edu, rwill@usf.edu).
Warren Fridy III is with H2 & WF3 Research, LLC, 701 S. Howard Avenue, Tampa, Florida 33606, USA (e-mail: warren@h2wf3.com).

Precision, which we define in this paper as the percentage of documents returned in the search result that are relevant, is the measure used in this reported experiment to determine efficiency in a search result. Our goal here is that, if precision can be improved, then a smaller collection, with a greater percentage of relevant documents, can be produced by the automated system for huma...