Mauricio Oyarzún scite author profile

2010

Information Processing & Management

Abstract. We prove that a document collection, represented as a unique sequence T of n terms over a vocabulary Σ, can be represented in nH_0(T) + o(n)(H_0(T) + 1) bits of space, such that a conjunctive query t_1 ∧ ··· ∧ t_k can be answered in O(kδ log log |Σ|) adaptive time, where δ is the instance difficulty of the query, as defined by Barbay and Kenyon in their SODA'02 paper, and H_0(T) is the empirical entropy of order 0 of T. As a comparison, using an inverted index plus the adaptive intersection algorithm by Barbay and Kenyon takes O(kδ log(n_M/δ)) time, where n_m and n_M are the lengths of the shortest and longest occurrence lists, respectively, among those of the query terms. Thus, we can replace an inverted index by a more space-efficient in-memory encoding, outperforming the query performance of inverted indices when the ratio n_M/δ is ω(log |Σ|).
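For readers unfamiliar with adaptive intersection, the following is a minimal sketch of a conjunctive (AND) query over plain sorted occurrence lists, using doubling (galloping) search so that the work depends on how interleaved the lists are rather than on their total length. It is an illustration only: it is neither the compressed representation proposed in the paper nor a faithful reproduction of the Barbay and Kenyon algorithm, and the names `gallop` and `intersect` are hypothetical. It assumes integer docIDs in strictly increasing lists.

```python
from bisect import bisect_left
from typing import List


def gallop(lst: List[int], target: int, lo: int) -> int:
    """Return the smallest index i >= lo with lst[i] >= target (len(lst) if none).

    Probes at exponentially growing distances from lo, then finishes with a
    binary search inside the last bracket.
    """
    step = 1
    hi = lo
    while hi < len(lst) and lst[hi] < target:
        lo = hi + 1      # everything up to hi is known to be < target
        hi += step
        step *= 2
    return bisect_left(lst, target, lo, min(hi, len(lst)))


def intersect(lists: List[List[int]]) -> List[int]:
    """Intersect k sorted integer lists (illustrative sketch, not the paper's method)."""
    if not lists or any(len(l) == 0 for l in lists):
        return []
    k = len(lists)
    pos = [0] * k                      # current cursor in each list
    result = []
    candidate = max(l[0] for l in lists)
    while True:
        agreed = True
        for i in range(k):
            pos[i] = gallop(lists[i], candidate, pos[i])
            if pos[i] == len(lists[i]):
                return result          # one list is exhausted: no more matches
            if lists[i][pos[i]] != candidate:
                candidate = lists[i][pos[i]]   # larger value becomes the new candidate
                agreed = False
                break
        if agreed:
            result.append(candidate)
            candidate += 1             # integer docIDs assumed


print(intersect([[1, 3, 5, 7, 9, 11], [3, 4, 5, 9, 10], [0, 3, 9, 12]]))
```

Each time a list disagrees with the current candidate, the candidate jumps forward to that list's next value, so long stretches with no possible match are skipped in logarithmic rather than linear time.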

Distributed search based on self-indexed compressed text

Arroyuelo

Gil-Costa

González

et al. 2012

Document identifier reassignment and run-length-compressed inverted indexes for improved search performance

Arroyuelo

et al. 2013

Text search engines are a fundamental tool nowadays. Their efficiency relies on a popular and simple data structure: the inverted index. Currently, inverted indexes can be represented very efficiently using index compression schemes. Recent investigations also study how an optimized document ordering can be used to assign document identifiers (docIDs) to the document database. This yields important improvements in index compression and query processing time. In this paper we follow this line of research, yet from a different perspective. We propose a docID reassignment method that allows one to focus on a given subset of inverted lists to improve their performance. We then use run-length encoding to compress these lists (as many consecutive 1s are generated). We show that by using this approach, not only is the performance of the particular subset of inverted lists improved, but also that of the whole inverted index. Our experimental results indicate a reduction of about 10% in the space usage of the whole index (just regarding docIDs), and up to 30% if we regard only the particular subset of lists on which the docID reassignment was focused. Also, decompression speed is up to 1.22 times faster if the runs must be explicitly decompressed and up to 4.58 times faster if implicit decompression of runs is allowed. Finally, we also improve the Document-at-a-Time query processing time of AND queries (by up to 12%), WAND queries (by up to 23%) and full (non-ranked) OR queries (by up to 86%).

To index or not to index: Time–space trade-offs for positional ranking functions in search engines

Arroyuelo

et al. 2020

Information Systems

Hybrid compression of inverted lists for reordered document collections

Arroyuelo

et al. 2018

Information Processing & Management