2015
DOI: 10.1016/j.cels.2015.08.004

Entropy-Scaling Search of Massive Biological Data

Abstract: Summary Many data sets exhibit well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here we introduce a framework for similarity search based on characterizing a data set’s entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the data set is low, and scales in space with the sum of metric entropy and information-theoretic entro…

Cited by 58 publications (91 citation statements)
References 62 publications
“…While a single sequence alignment against a reference database of a few million sequences may take a few milliseconds to 2 to 3 s on a given computer, the comparison of millions of sequences against millions of references will cause serious scalability problems and challenges. Faster heuristic comparative methods (56,57) will have to be developed and implemented in a cloud- or grid-based environment.…”
Section: Proposal For a Cloud-based Dynamic Database Network Platform (mentioning)
confidence: 99%
“…To build the community profile of a given gene/protein sequence (herein referred to as the query sequence), RAPSearch2 [29] (other fast similarity search tools including Diamond [30] and MICA [31] can also be utilized) is applied to search the sequence against a set of metagenomes. As described in details below, the set of metagenomes (herein referred to as the reference metagenomes) were downloaded from metagenomic data repositories.…”
Section: Methods (mentioning)
confidence: 99%
“…38 The first critical observation is that much biological data is highly redundant; if a computation is performed on one human genome, and a researcher wishes to perform the same computation on another human genome, most of the work has already been done. 22 When dealing with redundant data, clustering comes to mind.…”
Section: Structure Of Biological Data (mentioning)
confidence: 99%
“…38 That is, for a given cluster radius $r_c$ and a database $D$, the number $k$ of clusters needed to cover $D$ is bounded by $N_{r_c}(D)$, the metric entropy, which is relatively small compared to $|D|$, the number of entries in the database (Figure 3). In contrast, if the points were uniformly distributed about the Cartesian space, $N_{r_c}(D)$ would be larger.…”
Section: Structure Of Biological Data (mentioning)
confidence: 99%
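
The covering idea in the statement above can be illustrated with a small sketch. This is not the cited authors' algorithm: it is a simple greedy cover, written here only as an assumption-laden illustration, and the number of centers it returns is an upper bound on the minimum cover size rather than the metric entropy itself. The function names (greedy_cover, make_clustered, make_uniform), the 2-D Euclidean points, and every parameter value are hypothetical choices made for the example.

```python
import math
import random


def greedy_cover(points, radius):
    """Pick centers so that every point lies within `radius` of some center.

    Greedy, order-dependent; the count is an upper bound on the minimum
    number of covering balls, not an exact metric entropy.
    """
    centers = []
    for p in points:
        # A point becomes a new center only if no existing center covers it.
        if not any(math.dist(p, c) <= radius for c in centers):
            centers.append(p)
    return centers


def make_clustered(n, n_blobs, spread, rng):
    """Hypothetical redundant data: points packed tightly around a few blobs."""
    blobs = [(rng.random(), rng.random()) for _ in range(n_blobs)]
    pts = []
    for _ in range(n):
        bx, by = rng.choice(blobs)
        pts.append((bx + rng.uniform(-spread, spread),
                    by + rng.uniform(-spread, spread)))
    return pts


def make_uniform(n, rng):
    """Points spread uniformly over the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable
clustered = make_clustered(2000, n_blobs=5, spread=0.01, rng=rng)
uniform = make_uniform(2000, rng)

r_c = 0.05  # cluster radius (illustrative value)
k_clustered = len(greedy_cover(clustered, r_c))
k_uniform = len(greedy_cover(uniform, r_c))

print("clusters needed, redundant data:", k_clustered)
print("clusters needed, uniform data:  ", k_uniform)
```

With the same number of entries and the same radius, the tightly clustered set needs only a handful of centers, while the uniformly spread set needs many more, which is the contrast the quoted passage draws.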