Data deduplication is an essential and critical component of backup systems: essential because it reduces storage space requirements, and critical because the performance of the entire backup operation depends on its throughput. Traditional backup workloads consist of large data streams with high locality, which existing deduplication techniques rely on to provide reasonable throughput. We present Extreme Binning, a scalable deduplication technique for non-traditional backup workloads made up of individual files with no locality among consecutive files in a given window of time. Because of this lack of locality, existing techniques perform poorly on these workloads. Extreme Binning exploits file similarity instead of locality and makes only one disk access for chunk lookup per file, yielding reasonable throughput. Multi-node backup systems built with Extreme Binning scale gracefully with the amount of input data; more backup nodes can be added to boost throughput. Each file is allocated to exactly one node by a stateless routing algorithm, allowing maximum parallelization, and each backup node is autonomous, with no dependencies across nodes, making data management tasks robust with low overhead.
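As a rough illustration of the stateless, similarity-based routing described above, the sketch below picks a file's representative chunk ID as the minimum of its chunk hashes and maps it to a backup node. The fixed-size chunking, SHA-1 hashing, and modulo placement are illustrative assumptions, not the paper's exact design.

```python
import hashlib

def chunk_hashes(data: bytes, chunk_size: int = 1024):
    """Illustrative fixed-size chunking; real dedup systems typically use
    content-defined chunking so edits do not shift all later chunks."""
    return [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def route_file(data: bytes, num_nodes: int) -> int:
    """Stateless routing: take the minimum chunk hash as the file's
    representative ID and map it to a backup node. Similar files tend to
    share their minimum chunk hash, so they tend to land on the same node."""
    rep_id = min(chunk_hashes(data))
    return int(rep_id, 16) % num_nodes

# Two near-identical files are usually routed to the same node.
a = b"hello world " * 1000
b = a + b"appended tail"
print(route_file(a, num_nodes=8), route_file(b, num_nodes=8))
```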
Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction. The technical challenge is that, through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing. We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared by multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.
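A minimal sketch of the file-chunk-graph idea, under the assumption that fixed-size SHA-1 chunk signatures stand in for the paper's chunking scheme: files sharing at least one signature are merged into the same cluster via union-find, i.e., connected components of the bipartite file-chunk graph. The file names and contents are made up.

```python
import hashlib
from collections import defaultdict

def signatures(data: bytes, chunk_size: int = 2048):
    """Fixed-size chunk signatures; a real system would use content-defined chunking."""
    return {hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)}

def cluster_files(files: dict):
    """Cluster files that share any chunk signature (connected components
    of the file-chunk graph), using a small union-find over file names."""
    chunk_to_files = defaultdict(list)
    for name, data in files.items():
        for sig in signatures(data):
            chunk_to_files[sig].append(name)

    parent = {name: name for name in files}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)

    for members in chunk_to_files.values():
        for other in members[1:]:
            union(members[0], other)

    clusters = defaultdict(set)
    for name in files:
        clusters[find(name)].add(name)
    return list(clusters.values())

docs = {"v1.txt": b"alpha beta gamma " * 200,
        "v2.txt": b"alpha beta gamma " * 200 + b"minor edit",
        "other.txt": b"completely different content " * 50}
print(cluster_files(docs))  # v1 and v2 cluster together; other.txt stands alone
```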
Locality sensitive hash functions are invaluable tools for approximate near-neighbor problems in high dimensional spaces. In this work, we focus on LSH schemes where the similarity metric is the cosine measure. The contribution of this work is a new class of locality sensitive hash functions for the cosine similarity measure based on the theory of concomitants, which arises in order statistics. Concomitant theory captures the relation between the order statistics of X and Y in the form of a rank distribution given by Prob(Rank(Yi) = j | Rank(Xi) = k). We exploit properties of the rank distribution to develop a locality sensitive hash family with excellent collision rate properties for the cosine measure. The computational cost of the basic algorithm is high for large hash lengths. We introduce several approximations, based on the properties of concomitant order statistics and discrete transforms, that perform almost as well with significantly reduced computational cost. We demonstrate the practical applicability of our algorithms by using them to find similar images in an image repository.
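For context, the sketch below implements the classic random-hyperplane LSH family for cosine similarity, not the concomitant-based family introduced above; it only demonstrates the collision behavior such a family is meant to provide. The dimensionality, bit count, and test vectors are arbitrary.

```python
import numpy as np

def hyperplane_lsh(X: np.ndarray, num_bits: int, seed: int = 0) -> np.ndarray:
    """Classic random-hyperplane LSH for the cosine measure: each bit is the
    sign of a projection onto a random Gaussian direction, so two vectors
    agree on a bit with probability 1 - theta(x, y) / pi."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], num_bits))
    return (X @ planes > 0).astype(np.uint8)

rng = np.random.default_rng(1)
x = rng.standard_normal(128)
y = x + 0.1 * rng.standard_normal(128)   # nearly parallel to x
z = rng.standard_normal(128)             # unrelated direction
codes = hyperplane_lsh(np.vstack([x, y, z]), num_bits=64)
print("x~y matching bits:", int((codes[0] == codes[1]).sum()))  # close to 64
print("x~z matching bits:", int((codes[0] == codes[2]).sum()))  # around 32
```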
Kernel methods have been shown to be effective for many machine learning tasks such as classification and regression. In particular, support vector machines with the Gaussian kernel have proved to be powerful classification tools. The standard way to apply kernel methods is to use the kernel trick, where the inner product of vectors in the feature space is computed via the kernel function. Using the kernel trick for SVMs, however, leads to training that is quadratic in the number of input vectors and classification that is linear in the number of support vectors.
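The sketch below illustrates the kernel trick with the Gaussian kernel: the kernel function supplies feature-space inner products without constructing the feature space, and the decision function's cost grows linearly with the number of support vectors. The support vectors, dual coefficients, and bias are made-up placeholders, not a trained model.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=0.5):
    """K[i, j] = exp(-gamma * ||x_i - z_j||^2): the feature-space inner
    product, computed directly from the input vectors (the kernel trick)."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-gamma * sq)

def svm_decision(x_new, support_vecs, dual_coefs, bias, gamma=0.5):
    """Kernel SVM decision value: one kernel evaluation per support vector,
    so classification cost is linear in the number of support vectors."""
    k = gaussian_kernel(x_new[None, :], support_vecs, gamma)[0]
    return float(k @ dual_coefs + bias)

# Toy usage with placeholder support vectors and dual coefficients (alpha_i * y_i).
sv = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
alpha_y = np.array([0.7, -0.3, 0.9])
print(svm_decision(np.array([0.5, 0.5]), sv, alpha_y, bias=-0.1))
```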
To catch and reduce waste amid the exponentially increasing demand for disk storage, we have developed a very efficient technique to detect approximate duplication of large directory hierarchies. Such duplication can be caused, for example, by unnecessary mirroring of repositories by uncoordinated employees or departments. Identifying these duplicate or near-duplicate hierarchies allows appropriate action to be taken at a high level; for example, one could coordinate and consolidate multiple copies in one location.
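One way to make "approximate duplication of directory hierarchies" concrete, purely as an illustration and not the authors' technique, is to compare the Jaccard similarity of two trees' file-content hash sets. The in-memory trees below are made-up stand-ins for real filesystem walks.

```python
import hashlib

def content_hashes(tree: dict) -> set:
    """Hash each file's content; relative paths are ignored so renamed or
    reorganized copies still match."""
    return {hashlib.sha1(data).hexdigest() for data in tree.values()}

def hierarchy_similarity(tree_a: dict, tree_b: dict) -> float:
    """Jaccard similarity of the two trees' file-content hash sets:
    1.0 for exact mirrors, close to 1.0 for near-duplicate hierarchies."""
    a, b = content_hashes(tree_a), content_hashes(tree_b)
    return len(a & b) / len(a | b) if a | b else 1.0

repo = {"src/main.c": b"int main(void){return 0;}", "README": b"docs"}
mirror = {"copy/src/main.c": b"int main(void){return 0;}",
          "copy/README": b"docs", "copy/NOTES": b"extra file"}
print(hierarchy_similarity(repo, mirror))  # ~0.67: a near-duplicate hierarchy
```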