Compression of collections, such as text databases, can both reduce space consumption and increase retrieval efficiency, through better caching and better exploitation of the memory hierarchy. A promising technique is relative Lempel-Ziv coding, in which a sample of material from the collection serves as a static dictionary; in previous work, this method demonstrated extremely fast decoding and good compression ratios, while allowing random access to individual items. However, there is a trade-off between dictionary size and compression ratio, motivating the search for a compact, yet similarly effective, dictionary. In previous work it was observed that, since the dictionary is generated by sampling, some of it (selected substrings) may be discarded with little loss in compression. Unfortunately, simple dictionary-pruning approaches are ineffective. We develop a formal model of our approach, based on generating an optimal dictionary for a given collection within a memory bound. We derive measures for identifying low-value substrings in the dictionary, and show, on text collections of a variety of sizes, that halving the dictionary size leads to only marginal loss in compression ratio. This is a dramatic improvement on previous approaches.
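As a concrete illustration of the technique this abstract builds on, the sketch below shows the core of relative Lempel-Ziv encoding and the independent, dictionary-only decoding it permits. The function names and the naive longest-match search are our own simplifications for clarity, not the paper's implementation; practical RLZ systems index the dictionary with a suffix array to find longest matches quickly.

```python
# A minimal sketch of relative Lempel-Ziv (RLZ) coding against a
# static dictionary. Names are illustrative, not from the paper.

def rlz_encode(text: str, dictionary: str):
    """Greedily factor `text` into (offset, length) phrases pointing
    into `dictionary`, emitting a literal character when no match
    exists. The naive scan is O(n*m); real systems use a suffix array."""
    factors = []
    i = 0
    while i < len(text):
        best_off, best_len = -1, 0
        for j in range(len(dictionary)):
            k = 0
            while (i + k < len(text) and j + k < len(dictionary)
                   and text[i + k] == dictionary[j + k]):
                k += 1
            if k > best_len:
                best_off, best_len = j, k
        if best_len == 0:
            factors.append(('literal', text[i]))  # no dictionary match
            i += 1
        else:
            factors.append((best_off, best_len))
            i += best_len
    return factors

def rlz_decode(factors, dictionary: str) -> str:
    """Decode a document independently of any other: every factor is
    resolved directly against the RAM-resident dictionary."""
    out = []
    for f in factors:
        if f[0] == 'literal':
            out.append(f[1])
        else:
            off, length = f
            out.append(dictionary[off:off + length])
    return ''.join(out)
```

Because every phrase points into the shared dictionary rather than into previously decoded text, any single document can be fetched and decoded in isolation, which is what makes RLZ attractive for random access into large compressed collections. Pruning, as studied in the abstract above, amounts to deleting dictionary substrings that few factors ever point into.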
Caching is an important optimization in search engine architectures. Existing caching techniques for search engine optimization are mostly biased towards reducing random accesses to disk, because random accesses are known to be much more expensive than sequential accesses on traditional magnetic hard disk drives (HDDs). Recently, the solid-state drive (SSD) has emerged as a new kind of secondary storage medium, and some search engines, such as Baidu, have already used SSDs to completely replace HDDs in their infrastructure. One notable property of SSDs is that their random access latency is comparable to their sequential access latency. Therefore, replacing HDDs with SSDs in a search engine infrastructure may void the cache management of existing search engines. In this article, we carry out a series of empirical experiments to study the impact of SSDs on search engine cache management. Based on the results, we give insights to practitioners and researchers on how to adapt the infrastructure and caching policies for SSD-based search engines.
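The device sensitivity this abstract describes can be made concrete with a toy cost model. The policy, the function names, and all latency and bandwidth figures below are hypothetical, chosen only to show why a policy tuned to avoid random I/O on HDDs loses its rationale once random and sequential latencies converge; they are not measurements from the article.

```python
# Illustrative device-aware cache admission score. All parameter
# values are hypothetical and chosen only to demonstrate the effect.

def miss_cost_ms(size_mb: float, seek_ms: float, mb_per_ms: float) -> float:
    """Latency to serve one cache miss: one random seek plus a
    sequential transfer of the item."""
    return seek_ms + size_mb / mb_per_ms

def admission_score(freq: float, size_mb: float, seek_ms: float,
                    mb_per_ms: float) -> float:
    """Expected latency saved per MB of cache spent on this item."""
    return freq * miss_cost_ms(size_mb, seek_ms, mb_per_ms) / size_mb

# On an HDD (seek ~8 ms), a small, hot object (e.g. a short posting
# list) scores far higher per cached MB than a large one, because the
# seek dominates its miss cost, so HDD-era policies favor small objects.
hdd_small = admission_score(freq=100, size_mb=0.05, seek_ms=8.0, mb_per_ms=0.15)
hdd_large = admission_score(freq=100, size_mb=5.0,  seek_ms=8.0, mb_per_ms=0.15)

# On an SSD, random latency is comparable to sequential latency
# (seek ~0.1 ms), which flattens that gap and changes which items
# are worth admitting to the cache.
ssd_small = admission_score(freq=100, size_mb=0.05, seek_ms=0.1, mb_per_ms=0.5)
ssd_large = admission_score(freq=100, size_mb=5.0,  seek_ms=0.1, mb_per_ms=0.5)

print(hdd_small / hdd_large)  # ~20x preference for the small object on HDD
print(ssd_small / ssd_large)  # ~2x on SSD: the bias largely disappears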
Compression is widely exploited in retrieval systems, such as search engines and text databases, to lower both retrieval costs and system latency. In particular, compression of repositories can reduce storage requirements and fetch times, while improving caching. One of the most effective techniques is relative Lempel-Ziv, RLZ, in which a RAM-resident dictionary encodes the collection. With RLZ, a specified document can be decoded independently and extremely quickly, while maintaining a high compression ratio. For terabyte-scale collections, this dictionary need only be a fraction of a per cent of the original data size. However, as originally described, RLZ uses a static dictionary, against which encoding of new data may be inefficient. An obvious alternative is to generate a new dictionary solely from the new data. However, this approach may not be scalable, because the combined RAM-resident dictionary will grow in proportion to the collection. In this paper, we describe effective techniques for extending the original dictionary to manage new data. With these techniques, a new auxiliary dictionary, relatively limited in size, is created by interrogating the original dictionary with the new data. Then, to compress this new data, we combine the auxiliary dictionary with some parts of the original dictionary (the latter in fact encoded as pointers into that original dictionary) to form a second dictionary. Our results show that excellent compression is available with only small auxiliary dictionaries, so that RLZ can feasibly transmit and store large, growing collections.
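A minimal sketch of the combined-dictionary idea follows, under strong simplifying assumptions: the helper names are ours, the new data is probed with fixed-length exact chunk lookups rather than the RLZ factorization a real implementation would use, and no attempt is made at the paper's actual selection criteria. It shows only the structural point that reused regions are stored as pointers into the unchanged original dictionary, so the net addition to RAM is the small auxiliary dictionary.

```python
# Hypothetical sketch: building an auxiliary dictionary for new data
# and materializing the second (combined) dictionary on demand.

def build_auxiliary(new_data: str, original: str, sample_len: int = 16):
    """Interrogate the original dictionary with fixed-length samples
    of the new data. Samples the original cannot reproduce verbatim
    become auxiliary material; samples it can are kept only as
    (offset, length) pointers into the original."""
    aux_parts, reused = [], []
    for i in range(0, len(new_data), sample_len):
        chunk = new_data[i:i + sample_len]
        pos = original.find(chunk)
        if pos >= 0:
            reused.append((pos, len(chunk)))   # pointer, costs ~2 ints
        else:
            aux_parts.append(chunk)            # genuinely new material
    return ''.join(aux_parts), reused

def combined_dictionary(aux: str, reused, original: str) -> str:
    """Form the second dictionary used to compress the new data:
    the auxiliary text plus the reused regions, the latter resolved
    through their pointers into the (unchanged) original dictionary."""
    return aux + ''.join(original[p:p + n] for p, n in reused)
```

The design point the abstract emphasizes is visible here: only `aux` and the pointer list must be stored or transmitted alongside the original dictionary, so the RAM-resident footprint grows with the novelty of the new data rather than with its size.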