Large search engines process thousands of queries per second over billions of documents, making query processing a major performance bottleneck. An important class of optimization techniques called early termination achieves faster query processing by avoiding the scoring of documents that are unlikely to be in the top results. We study new algorithms for early termination that outperform previous methods. In particular, we focus on safe techniques for disjunctive queries, which return the same result as an exhaustive evaluation over the disjunction of the query terms. The current state-of-the-art methods for this case, the WAND algorithm by Broder et al. [11] and the approach of Strohman and Croft [30], achieve great benefits but still leave a large performance gap between disjunctive and (even non-early terminated) conjunctive queries. We propose a new set of algorithms by introducing a simple augmented inverted index structure called a block-max index. Essentially, this is a structure that stores the maximum impact score for each block of a compressed inverted list in uncompressed form, thus enabling us to skip large parts of the lists. We show how to integrate this structure into the WAND approach, leading to considerable performance gains. We then describe extensions to a layered index organization, and to indexes with reassigned document IDs, that achieve additional gains that narrow the gap between disjunctive and conjunctive top-k query processing.
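To make the block-max idea concrete, the following is a minimal Python sketch (not the authors' implementation; all class and method names are hypothetical): postings are grouped into fixed-size blocks, and each block's maximum impact score is kept uncompressed so that entire blocks whose best possible score cannot beat the current top-k threshold can be skipped without decompression.

```python
# Hypothetical sketch of a block-max inverted list. In a real engine the
# postings inside each block would be compressed; only the per-block
# maxima (docID and score) are stored uncompressed for fast skipping.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    max_docid: int                      # largest docID in the block (skip by docID)
    max_score: float                    # largest impact score in the block
    postings: List[Tuple[int, float]]   # (docID, score) pairs


class BlockMaxList:
    def __init__(self, postings: List[Tuple[int, float]], block_size: int = 64):
        # Postings are assumed to be sorted by docID.
        self.blocks: List[Block] = []
        for i in range(0, len(postings), block_size):
            chunk = postings[i:i + block_size]
            self.blocks.append(Block(
                max_docid=chunk[-1][0],
                max_score=max(score for _, score in chunk),
                postings=chunk,
            ))

    def next_candidate_block(self, from_docid: int, threshold: float) -> int:
        """Index of the first block at or after `from_docid` whose block-max
        score can still exceed `threshold`, or -1 if none remains."""
        for idx, blk in enumerate(self.blocks):
            if blk.max_docid < from_docid:
                continue                  # block lies entirely before the pivot
            if blk.max_score > threshold:
                return idx                # block may contain a top-k document
        return -1                         # all remaining blocks can be skipped
```

In the paper's block-max WAND variant, these per-block maxima are consulted after the usual WAND pivot selection, so skipping decisions combine block maxima across all query terms rather than one list at a time; the sketch above only shows the skip primitive that such an algorithm would build on.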
Web search engines rely on the full-text inverted index data structure. Because query processing performance is strongly affected by the size of the inverted index, a great deal of research has focused on fast and effective techniques for compressing this structure. Recently, researchers have proposed techniques for improving index compression by optimizing the assignment of document identifiers in the collection, leading to a significant reduction in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
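As an illustration of the reduced-graph idea (an assumed sketch, not the paper's code), MinHash-based locality sensitive hashing buckets similar documents together, so a Traveling-Salesman-style reordering only needs to consider candidate pairs within a bucket instead of all O(n^2) pairs. The parameters, bucketing scheme, and greedy tour heuristic below are illustrative choices.

```python
# Hypothetical sketch: LSH (MinHash banding) builds a sparse candidate
# graph, and a greedy nearest-neighbour tour over that graph produces a
# document ordering; docIDs would then be reassigned in tour order.

from collections import defaultdict
from typing import Dict, List, Set, Tuple

NUM_HASHES = 16              # length of each MinHash signature
BANDS = 4                    # signatures are split into bands for bucketing
ROWS = NUM_HASHES // BANDS


def minhash_signature(terms: Set[str]) -> List[int]:
    """One MinHash value per hash function; matching values suggest similar term sets."""
    return [min(hash((i, t)) & 0xFFFFFFFF for t in terms) for i in range(NUM_HASHES)]


def lsh_buckets(docs: Dict[int, Set[str]]) -> Dict[Tuple, List[int]]:
    """Group docIDs whose signatures agree on at least one band."""
    buckets: Dict[Tuple, List[int]] = defaultdict(list)
    for doc_id, terms in docs.items():
        sig = minhash_signature(terms)
        for b in range(BANDS):
            key = (b, tuple(sig[b * ROWS:(b + 1) * ROWS]))
            buckets[key].append(doc_id)
    return buckets


def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def greedy_tour(docs: Dict[int, Set[str]]) -> List[int]:
    """Greedy nearest-neighbour tour over the sparse candidate graph:
    repeatedly append the most similar not-yet-placed bucket neighbour."""
    neighbours: Dict[int, Set[int]] = defaultdict(set)
    for members in lsh_buckets(docs).values():
        for d in members:
            neighbours[d].update(m for m in members if m != d)

    remaining = set(docs)
    order = [remaining.pop()]
    while remaining:
        last = order[-1]
        cands = [d for d in neighbours[last] if d in remaining]
        nxt = (max(cands, key=lambda d: jaccard(docs[last], docs[d]))
               if cands else next(iter(remaining)))   # fall back if the graph is disconnected
        remaining.discard(nxt)
        order.append(nxt)
    return order
```

Placing similar documents next to each other keeps the docID gaps in each inverted list small, which is what allows gap-based compression schemes to shrink the index; the LSH step is what keeps the candidate graph sparse enough to scale to tens of millions of documents.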