Managing Gigabytes: Compressing and Indexing Documents and Images

Witten, Ian H.; Moffat, Alistair; Bell, Timothy C.

doi:10.1109/tit.1995.476344

Cited by 700 publications

(1,219 citation statements)

References 1 publication

Supporting

Mentioning

1,199

Contrasting

Unclassified

“…In our approach, we use an inverted file [33] to index images. The inverted index consists of two components: one includes indexed visual words and visual phrases, and the other includes vectors containing the information about the spatial weighting of the visual words and the occurrence of the visual phrases.…”

Section: Image Indexing (mentioning)

confidence: 99%

Toward a higher-level visual representation for content-based image retrieval

Sayad

Martinet

Urruty

et al. 2010

Multimed Tools Appl

Having effective methods to access the desired images is essential nowadays with the availability of a huge amount of digital images. The proposed approach is based on an analogy between content-based image retrieval and text retrieval. The aim of the approach is to build a meaningful mid-level representation of images to be used later on for matching between a query image and other images in the desired database. The approach is based firstly on constructing different visual words using local patch extraction and fusion of descriptors. Secondly, we introduce a new method using multilayer pLSA to eliminate the noisiest words generated by the vocabulary building process. Thirdly, a new spatial weighting scheme is introduced that consists of weighting visual words according to the probability of each visual word to belong to each of the n Gaussian. Finally, we construct visual phrases from groups of visual words that are involved in strong association rules. Experimental results show that our approach outperforms the results of traditional image retrieval techniques.
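The inverted file mentioned in the citation statement above can be illustrated with a minimal sketch. It assumes each image has already been reduced to a list of visual-word labels; the names `build_inverted_index` and `candidates` and the sample data are hypothetical, and the spatial weighting and visual phrases described by the authors are not modeled here.

```python
from collections import defaultdict


def build_inverted_index(images):
    """Map each visual word to a postings list of (image_id, occurrence_count)."""
    index = defaultdict(list)
    for image_id in sorted(images):
        counts = {}
        for word in images[image_id]:
            counts[word] = counts.get(word, 0) + 1
        for word, count in counts.items():
            index[word].append((image_id, count))
    return dict(index)


def candidates(index, query_words):
    """Return the ids of images sharing at least one visual word with the query."""
    found = set()
    for word in set(query_words):
        for image_id, _count in index.get(word, []):
            found.add(image_id)
    return sorted(found)


# Hypothetical toy collection: image id -> visual words found in that image.
images = {1: ["w3", "w7", "w3"], 2: ["w7", "w9"], 3: ["w1"]}
index = build_inverted_index(images)
matches = candidates(index, ["w7", "w1"])
```

Only the postings of the query's visual words are touched at query time, which is the property that makes an inverted file attractive for large collections.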

“…Gap compression [22] is effective when the gaps between sorted docIDs are small. To reduce the gap size, we propose to periodically remap docIDs from 160-bit hashes to dense numbers from 1 to the number of documents.…”

Section: Gap Compression (mentioning)

confidence: 99%

On the Feasibility of Peer-to-Peer Web Indexing and Search

Li

Loo

Hellerstein

et al. 2003

Lecture Notes in Computer Science

204

180

This paper discusses the feasibility of peer-to-peer full-text keyword search of the Web. Two classes of keyword search techniques are in use or have been proposed: flooding of queries over an overlay network (as in Gnutella), and intersection of index lists stored in a distributed hash table. We present a simple feasibility analysis based on the resource constraints and search workload. Our study suggests that the peer-to-peer network does not have enough capacity to make naive use of either of search techniques attractive for Web search. The paper presents a number of existing and novel optimizations for P2P search based on distributed hash tables, estimates their effects on performance, and concludes that in combination these optimizations would bring the problem to within an order of magnitude of feasibility. The paper suggests a number of compromises that might achieve the last order of magnitude.
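The gap compression idea in the citation statement above can be sketched as follows. The sketch assumes docIDs have already been remapped to dense positive integers and uses variable-byte coding for the gaps; the choice of code and the function names are assumptions made for illustration, not details taken from the cited paper.

```python
def to_gaps(doc_ids):
    """Replace a strictly increasing docID list by differences between neighbours."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        if doc_id <= prev:
            raise ValueError("docIDs must be positive and strictly increasing")
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: rebuild docIDs with a running sum."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Encode integers 7 bits per byte; the high bit marks the last byte of a number."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


dense = [3, 7, 8, 150, 152]              # docIDs after remapping to small integers
sparse = [1000000, 5000000, 9000000]     # widely spaced ids give large gaps
encoded = vbyte_encode(to_gaps(dense))
decoded = from_gaps(vbyte_decode(encoded))
```

Small gaps fit in a single byte each, whereas widely spaced identifiers need several bytes per gap, which is why remapping to dense numbers helps.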

“…Output-sensitive data structures are at the heart of text searching [13], geometric searching [5], database searching [28], and information retrieval in general [3,31]. They are the result of preprocessing n items (these can be textual data, geometric data, database records, multimedia, or any other kind of data) into O(n polylog(n)) space in such a way, as to allow quickly answering on-line queries in O(t(n) + ℓ) time, where t(n) = o(n) is the cost of querying the data structure (typically t(n) = polylog(n)).…”

Section: Introduction (mentioning)

confidence: 99%

“…While ranking itself has been the subject of intense theoretical investigation in the context of search engines [17,18,24], we could not find any explicit study pertaining to ranking in the context of data structures. The only published data structure of this kind is the inverted lists [31] in which the documents are sorted according to their rank order. McCreight's paper on priority search trees [19] refers to enumeration in increasing order along the y-axis but it does not indeed discuss how to report the items in sorted order along the y-axis.…”

Section: Introduction (mentioning)

confidence: 99%

Rank-Sensitive Data Structures

Bialynicka-Birula

Grossi

2005

String Processing and Information Retrieval

Abstract. Output-sensitive data structures result from preprocessing n items and are capable of reporting the items satisfying an on-line query in O(t(n) + ℓ) time, where t(n) is the cost of traversing the structure and ℓ ≤ n is the number of reported items satisfying the query. In this paper we focus on rank-sensitive data structures, which are additionally given a ranking of the n items, so that just the top k best-ranking items should be reported at query time, sorted in rank order, at a cost of O(t(n) + k) time. Note that k is part of the query as a parameter under the control of the user (as opposed to ℓ which is query-dependent). We explore the problem of adding rank-sensitivity to data structures such as suffix trees or range trees, where the ℓ items satisfying the query form O(polylog(n)) intervals of consecutive entries from which we choose the top k best-ranking ones. Letting s(n) be the number of items (including their copies) stored in the original data structures, we increase the space by an additional term of O(s(n) lg^ε n) memory words of space, each of O(lg n) bits, for any positive constant ε < 1. We allow for changing the ranking on the fly during the lifetime of the data structures, with ranking values in 0 … O(n). In this case, query time becomes O(t(n) + k) plus O(lg n / lg lg n) per interval; each change in the ranking and each insertion/deletion of an item takes O(lg n) time; the additional term in space occupancy increases to O(s(n) lg n / lg lg n).
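The rank-ordered inverted lists mentioned in the citation statement above are the simplest case of rank-sensitive reporting, and a small sketch may make the idea concrete. It assumes a static ranking in which a lower value means a better rank; the function names and sample data are hypothetical, and this is not the suffix-tree or range-tree construction of the cited paper.

```python
def build_rank_ordered_index(docs, rank):
    """Build term -> list of doc ids, each list sorted by rank (lower is better)."""
    index = {}
    for doc_id, terms in docs.items():
        for term in set(terms):
            index.setdefault(term, []).append(doc_id)
    for postings in index.values():
        postings.sort(key=lambda doc_id: rank[doc_id])
    return index


def top_k(index, term, k):
    """Report the k best-ranked documents containing term, already in rank order."""
    return index.get(term, [])[:k]


# Hypothetical toy data: document id -> terms, and document id -> rank value.
docs = {"a": ["x", "y"], "b": ["x"], "c": ["x", "z"]}
rank = {"a": 2, "b": 0, "c": 1}
index = build_rank_ordered_index(docs, rank)
best_two = top_k(index, "x", 2)
```

Because each list is stored in rank order, a query reads only a prefix of length k after the dictionary lookup, rather than collecting all matching items and sorting them.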
