2016
DOI: 10.1101/093898
Preprint

Shared Nearest Neighbor clustering in a Locality Sensitive Hashing framework

Abstract: We present a new algorithm to cluster high dimensional sequence data, and its application to the field of metagenomics, which aims to reconstruct individual genomes from a mixture of genomes sampled from an environmental site, without any prior knowledge of reference data (genomes) or the shape of clusters. Such problems typically cannot be solved directly with classical approaches seeking to estimate the density of clusters, e.g., using the shared nearest neighbors rule, due to the prohibitive size of contemp…

citations
Cited by 2 publications
(2 citation statements)
references
References 45 publications
“…Nearest neighbors' based methods for similarity matrix sparsification include k-nearest neighbor [7] and shared nearest neighbor [14]. k-nearest neighbor sparsification keeps only the k highest similarity scores for each text; the shared nearest neighbor approach adds a condition that texts retaining similarity values with a particular text should share a prescribed number of neighbors.…”
Section: Similarity Matrix Sparsification (mentioning)
confidence: 99%
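
The two sparsification rules described in this citation statement can be illustrated with a small sketch. This is not the cited authors' implementation or the preprint's algorithm; it assumes a symmetric similarity matrix given as a list of lists, and it interprets the shared nearest neighbor condition as "keep a k-nearest-neighbor entry (i, j) only if the k-nearest-neighbor sets of i and j overlap in at least `min_shared` items". The function names, the parameter `min_shared`, and the toy matrix `sim` are hypothetical and chosen for illustration only.

```python
def knn_sets(sim, k):
    """Return, for each item, the set of its k most similar other items."""
    n = len(sim)
    neighbors = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Sort by descending similarity; ties are broken by index.
        others.sort(key=lambda j: (-sim[i][j], j))
        neighbors.append(set(others[:k]))
    return neighbors


def knn_sparsify(sim, k):
    """Keep only the k highest off-diagonal scores in each row; zero the rest."""
    nbrs = knn_sets(sim, k)
    n = len(sim)
    return [[sim[i][j] if j in nbrs[i] else 0.0 for j in range(n)]
            for i in range(n)]


def snn_sparsify(sim, k, min_shared):
    """k-NN sparsification with an added shared-neighbor condition.

    Assumption: entry (i, j) survives only if j is among the k nearest
    neighbors of i AND the two k-NN sets share at least min_shared items.
    """
    nbrs = knn_sets(sim, k)
    n = len(sim)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in nbrs[i]:
            if len(nbrs[i] & nbrs[j]) >= min_shared:
                out[i][j] = sim[i][j]
    return out


# Toy data: items 0-2 form a tight group; items 3 and 4 are close to each
# other and only loosely attached to item 2.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.5, 0.4],
    [0.1, 0.2, 0.5, 1.0, 0.6],
    [0.1, 0.1, 0.4, 0.6, 1.0],
]

knn = knn_sparsify(sim, k=2)
snn = snn_sparsify(sim, k=2, min_shared=1)
```

In this toy example, k-nearest neighbor sparsification keeps the link from item 3 to item 2, while the shared-neighbor condition drops it because the neighbor sets of items 3 and 2 do not overlap. Other definitions of shared nearest neighbor similarity exist (for example, ones that require mutual neighbor membership), so the condition used here is one reasonable reading rather than a canonical one.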
“…That is, for any distinct trajectory pattern, when there are more than 150 patients in the cohort sharing this pattern, the prediction for new patients who also have this trajectory pattern will be accurate. Since patient numbers in EMR data can easily reach into the million range, it is promising to build a large database of trajectory patterns in existing patients and use locality sensitive hashing (28)(29)(30)(31) and approximate nearest neighbor query techniques (32,33) to predict disease progressions in new patients. clusters color-coded by kidney functions (blue: good, red: impaired); clusters enriched with encounters of patients of high- or low-risk APOL1 genotypes (middle and right, correspondingly).…”
Section: Prediction Of Chronic Disease Progression (mentioning)
confidence: 99%