Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the single-instruction, multiple-data (SIMD) instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded 32-bit integer while still providing state-of-the-art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decompression speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD GALLOPING algorithm. We exploit the fact that one SIMD instruction can compare four pairs of 32-bit integers at once. We experiment with two Text REtrieval Conference (TREC) text collections, GOV2 and ClueWeb09 (category B), using logs from the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach.
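To make the four-at-once comparison concrete, here is a minimal C sketch (our own illustration, not code from the paper): one SSE4.1 vector comparison tests a key against a block of four sorted 32-bit integers, the kind of primitive that a SIMD galloping search can build on. The function name block_contains and the SSE4.1 requirement are assumptions of this sketch.

#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Illustrative sketch: does 'key' appear among four 32-bit integers?
   A single _mm_cmpeq_epi32 compares four pairs of integers at once. */
static int block_contains(const uint32_t *block_of_four, uint32_t key) {
    __m128i keys = _mm_set1_epi32((int32_t)key);                 /* broadcast key to 4 lanes */
    __m128i data = _mm_loadu_si128((const __m128i *)block_of_four);
    __m128i eq   = _mm_cmpeq_epi32(keys, data);                  /* 4 comparisons in one instruction */
    return !_mm_testz_si128(eq, eq);                             /* nonzero if any lane matched */
}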
Counting the number of ones in a binary stream is a common operation in database, information-retrieval, cryptographic and machine-learning applications. Most processors have dedicated instructions to count the number of ones in a word (e.g., popcnt on x64 processors). Maybe surprisingly, we show that a vectorized approach using SIMD instructions can be twice as fast as using the dedicated instructions on recent Intel processors. The benefits can be even greater for applications such as similarity measures (e.g., the Jaccard index) that require additional Boolean operations. Our approach has been adopted by LLVM: it is used by its popular C compiler (Clang).

The x64 popcnt instruction was first available in the Nehalem microarchitecture, announced in 2007 and released in November 2008. The ARM cnt instruction was released as part of the Cortex-A8 microarchitecture, published in 2006 [16].

FIGURE 8 (excerpt). A C function implementing the Harley-Seal population count over an array of 64-bit words. The count function could be the Wilkes-Wheeler-Gill function.

uint64_t harley_seal(uint64_t *d, size_t size) {
    uint64_t total = 0, ones = 0, twos = 0, fours = 0, eights = 0, sixteens = 0;
    uint64_t twosA, twosB, foursA, foursB, eightsA, eightsB;
    for (size_t i = 0; i < size - size % 16; i += 16) {
        /* listing truncated in the source */
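For context, a complete Harley-Seal routine following the same carry-save-adder scheme as the excerpt above might look like the sketch below. Here GCC/Clang's __builtin_popcountll stands in for the Wilkes-Wheeler-Gill word count mentioned in the caption; that substitution, and the exact code layout, are assumptions of this sketch rather than the paper's figure.

#include <stddef.h>
#include <stdint.h>

/* Carry-save adder: combine three bit vectors into a "carry" vector (*h)
   and a "sum" vector (*l), so set bits are accumulated two levels at a time. */
static void CSA(uint64_t *h, uint64_t *l, uint64_t a, uint64_t b, uint64_t c) {
    uint64_t u = a ^ b;
    *h = (a & b) | (u & c);
    *l = u ^ c;
}

uint64_t harley_seal(const uint64_t *d, size_t size) {
    uint64_t total = 0, ones = 0, twos = 0, fours = 0, eights = 0, sixteens = 0;
    uint64_t twosA, twosB, foursA, foursB, eightsA, eightsB;
    size_t i = 0;
    for (; i < size - size % 16; i += 16) {      /* process 16 words per iteration */
        CSA(&twosA, &ones, ones, d[i + 0], d[i + 1]);
        CSA(&twosB, &ones, ones, d[i + 2], d[i + 3]);
        CSA(&foursA, &twos, twos, twosA, twosB);
        CSA(&twosA, &ones, ones, d[i + 4], d[i + 5]);
        CSA(&twosB, &ones, ones, d[i + 6], d[i + 7]);
        CSA(&foursB, &twos, twos, twosA, twosB);
        CSA(&eightsA, &fours, fours, foursA, foursB);
        CSA(&twosA, &ones, ones, d[i + 8], d[i + 9]);
        CSA(&twosB, &ones, ones, d[i + 10], d[i + 11]);
        CSA(&foursA, &twos, twos, twosA, twosB);
        CSA(&twosA, &ones, ones, d[i + 12], d[i + 13]);
        CSA(&twosB, &ones, ones, d[i + 14], d[i + 15]);
        CSA(&foursB, &twos, twos, twosA, twosB);
        CSA(&eightsB, &fours, fours, foursA, foursB);
        CSA(&sixteens, &eights, eights, eightsA, eightsB);
        total += __builtin_popcountll(sixteens);  /* each bit here stands for 16 input bits */
    }
    total = 16 * total + 8 * __builtin_popcountll(eights)
          + 4 * __builtin_popcountll(fours) + 2 * __builtin_popcountll(twos)
          + __builtin_popcountll(ones);
    for (; i < size; i++)                         /* leftover words, one at a time */
        total += __builtin_popcountll(d[i]);
    return total;
}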
On common processors, integer multiplication is many times faster than integer division. Dividing a numerator n by a divisor d is mathematically equivalent to multiplication by the inverse of the divisor (n/d = n × (1/d)). If the divisor is known in advance, or if repeated integer divisions will be performed with the same divisor, it can be beneficial to substitute a less costly multiplication for an expensive division. Currently, the remainder of the division by a constant is computed from the quotient by a multiplication and a subtraction. However, if just the remainder is desired and the quotient is unneeded, this may be suboptimal. We present a generally applicable algorithm to compute the remainder more directly. Specifically, we use the fractional portion of the product of the numerator and the inverse of the divisor. On this basis, we also present a new and simpler divisibility algorithm to detect nonzero remainders. We also derive new tight bounds on the precision required when representing the inverse of the divisor. Furthermore, we present simple C implementations that beat the optimized code produced by state-of-the-art C compilers on recent x64 processors (e.g., Intel Skylake and AMD Ryzen), sometimes by more than 25%. On all tested platforms, including 64-bit ARM and POWER8, our divisibility test functions are faster than state-of-the-art Granlund-Montgomery divisibility test functions, sometimes by more than 50%.
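A minimal C sketch of the direct-remainder idea for unsigned 32-bit numerators (our own illustration, with function names of our choosing; the precise constants and precision bounds are the subject of the paper): precompute a 64-bit approximation c of 2^64/d, keep only the fractional portion of c·n, and multiply it back by d to recover the remainder.

#include <stdbool.h>
#include <stdint.h>

/* For a fixed divisor d > 1, precompute an approximation of 2^64 / d:
   c = ceil(2^64 / d). */
static uint64_t compute_c(uint32_t d) {
    return UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;
}

/* Remainder computed directly: c * n (mod 2^64) approximates the fractional
   part of n/d scaled by 2^64; multiplying by d and keeping the high 64 bits
   yields n mod d. (__uint128_t is a GCC/Clang extension on 64-bit targets.) */
static uint32_t fastmod_u32(uint32_t n, uint64_t c, uint32_t d) {
    uint64_t lowbits = c * (uint64_t)n;               /* fractional portion */
    return (uint32_t)(((__uint128_t)lowbits * d) >> 64);
}

/* Divisibility test: n is a multiple of d exactly when the fractional
   portion is small enough, i.e., c * n (mod 2^64) <= c - 1. */
static bool is_divisible(uint32_t n, uint64_t c) {
    return n * (uint64_t)c <= c - 1;
}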
Arrays of integers are often compressed in search engines. Though there are many ways to compress integers, we are interested in the popular byte-oriented integer compression techniques (e.g., VByte or Google's VARINT-GB). Although not known for their speed, they are appealing due to their simplicity and engineering convenience. Amazon's VARINT-G8IU is one of the fastest byte-oriented compression techniques published so far. It makes judicious use of the powerful single-instruction-multiple-data (SIMD) instructions available in commodity processors. To surpass VARINT-G8IU, we present STREAM VBYTE, a novel byte-oriented compression technique that separates the control stream from the encoded data. Like VARINT-G8IU, STREAM VBYTE is well suited for SIMD instructions. We show that STREAM VBYTE decoding can be up to twice as fast as VARINT-G8IU decoding over real data sets. In this sense, STREAM VBYTE establishes new speed records for byte-oriented integer compression, at times exceeding the speed of the memcpy function. On a 3.4 GHz Haswell processor, it decodes more than 4 billion differentially-coded integers per second from RAM to L1 cache.
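To make the separation of control and data streams concrete, here is a scalar sketch of decoding one group of four integers: a single control byte holds four 2-bit length codes (each code plus one gives a byte length), and the data stream carries only the value bytes. The exact bit order within the control byte, the little-endian value layout, and the helper name decode_quad are assumptions of this illustration; a SIMD decoder would replace the loop with a single shuffle driven by a table indexed by the control byte.

#include <stdint.h>
#include <string.h>

/* Scalar sketch: decode four integers described by one control byte.
   Two bits per integer give its encoded length (1 to 4 bytes).
   Returns a pointer just past the consumed data bytes. */
static const uint8_t *decode_quad(uint8_t control, const uint8_t *data,
                                  uint32_t out[4]) {
    for (int i = 0; i < 4; i++) {
        int length = (control & 3) + 1;        /* length code 0..3 -> 1..4 bytes */
        uint32_t value = 0;
        memcpy(&value, data, (size_t)length);  /* little-endian partial read */
        out[i] = value;
        data += length;
        control >>= 2;
    }
    return data;
}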
Compressed bitmap indexes are used in systems such as Git or Oracle to accelerate queries. They represent sets and often support operations such as unions, intersections, differences, and symmetric differences. Several important systems such as Elasticsearch, Apache Spark, Netflix's Atlas, LinkedIn's Pinot, Metamarkets' Druid, Pilosa, Apache Hive, Apache Tez, Microsoft Visual Studio Team Services, and Apache Kylin rely on a specific type of compressed bitmap index called Roaring. We present an optimized software library written in C implementing Roaring bitmaps: CRoaring. It benefits from several algorithms designed for the single-instruction-multiple-data instructions available on commodity processors. In particular, we present vectorized algorithms to compute the intersection, union, difference, and symmetric difference between arrays. We benchmark the library against a wide range of competitive alternatives, identifying weaknesses and strengths in our software. Our work is available under a liberal open-source license.

• We present several nontrivial algorithmic optimizations (see Table 1). In particular, we show that a collection of algorithms exploiting SIMD instructions can enhance the performance of a data structure like Roaring in some cases, above and beyond what state-of-the-art optimizing compilers can achieve. To our knowledge, it is the first work to report on the benefits of advanced SIMD-based algorithms for compressed bitmaps. Although the approach we use to compute array intersections using SIMD instructions in Section 4.2 is not new [22, 23], our work on the computation of the union (Section 4.3), difference (Section 4.4), and symmetric difference (Section 4.4) of arrays using SIMD instructions might be novel and of general interest (a scalar baseline for the array case is sketched below).

• We benchmark our C library against a wide range of alternatives in C and C++. Our results provide guidance as to the strengths and weaknesses of our implementation.

We focus primarily on our novel implementation and the lessons we learned: we refer to earlier work for details regarding the high-level algorithmic design of Roaring bitmaps [18, 19]. Because our library is freely available under a liberal open-source license, we hope that our work will be used to accelerate information systems.
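As a point of reference for the vectorized kernels discussed above, the following is a plain scalar intersection of two sorted arrays of 16-bit values, the representation Roaring uses for sparse containers. This baseline merge is our own sketch, not code from CRoaring; it is the kind of routine that the SIMD intersection algorithms are designed to outperform.

#include <stddef.h>
#include <stdint.h>

/* Scalar baseline: intersect two sorted arrays of 16-bit values,
   writing the common values to 'out'. Returns the number of values written.
   'out' must have room for min(len_a, len_b) entries. */
static size_t intersect_u16(const uint16_t *a, size_t len_a,
                            const uint16_t *b, size_t len_b,
                            uint16_t *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < len_a && j < len_b) {
        if (a[i] < b[j]) {
            i++;
        } else if (a[i] > b[j]) {
            j++;
        } else {                  /* equal: keep the shared value */
            out[k++] = a[i];
            i++;
            j++;
        }
    }
    return k;
}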