Fully Understanding the Hashing Trick

Freksen, Casper Benjamin; Kamma, Lior; Larsen, Kasper Green

doi:10.48550/arxiv.1805.08539

Cited by 2 publications

(2 citation statements)

References 18 publications


“…As hash function h the 32-bit version of the MurmurHash3 algorithm [1], a popular noncryptographic hash function, is used. It can be proven that under moderate assumptions feature hashing approximately conserves the Euclidean norm [10], and hence, the cosine similarity between hashed vectors can be used to approximate the similarity between the original, high-dimensional vectors and spectra.…”

Section: Feature Hashing To Convert High-resolution Spectra To Low-di... (mentioning)

confidence: 99%

Large‐scale tandem mass spectrum clustering using fast nearest neighbor searching

Bittremieux, Laukens, Noble et al. 2021

Rapid Comm Mass Spectrometry


Rationale: Advanced algorithmic solutions are necessary to process the ever increasing amounts of mass spectrometry data that are being generated. Here we describe the falcon spectrum clustering tool for efficient clustering of millions of MS/MS spectra. Methods: falcon succeeds in efficiently clustering large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively perform all pairwise spectrum comparisons within the relevant precursor mass tolerance. Finally, density-based clustering is performed to group similar spectra into clusters. Results: Several state-of-the-art spectrum clustering tools were evaluated using a large draft human proteome dataset consisting of 25 million spectra, indicating that alternative tools produce clustering results with different characteristics. Notably, falcon generates larger highly pure clusters than alternative tools, leading to a larger reduction in data volume without the loss of relevant information for more efficient downstream processing. Conclusions: falcon is a highly efficient spectrum clustering tool. It is publicly available as open source under the permissive BSD license at https://github.com/bittremieux/falcon.
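The binning and feature hashing steps described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not falcon's actual code: the function names, the bin width of 0.05, and the output dimension are hypothetical, and a BLAKE2b digest from the Python standard library stands in for the 32-bit MurmurHash3 mentioned in the citation statement. The random sign per feature follows the signed variant commonly analyzed for the hashing trick; whether the cited tools apply a sign is not stated in this record.

```python
import hashlib
import math


def _hash32(key: int, seed: int) -> int:
    """Deterministic 32-bit hash of an integer key.

    Assumption: BLAKE2b is only a stand-in here for MurmurHash3.
    """
    data = f"{seed}:{key}".encode()
    digest = hashlib.blake2b(data, digest_size=4).digest()
    return int.from_bytes(digest, "little")


def bin_spectrum(peaks, bin_width=0.05):
    """Convert (m/z, intensity) peaks into a sparse vector keyed by bin index."""
    vec = {}
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        vec[idx] = vec.get(idx, 0.0) + intensity
    return vec


def hash_features(sparse_vec, dim):
    """Project a sparse high-dimensional vector onto `dim` hashed buckets."""
    out = [0.0] * dim
    for index, value in sparse_vec.items():
        bucket = _hash32(index, seed=0) % dim                  # bucket hash h
        sign = 1.0 if _hash32(index, seed=1) & 1 else -1.0     # sign hash
        out[bucket] += sign * value
    return out


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    # Two made-up spectra that share most of their peaks.
    spectrum_a = [(100.02, 10.0), (250.11, 5.0), (512.30, 8.0)]
    spectrum_b = [(100.02, 9.0), (250.11, 6.0), (780.45, 2.0)]
    vec_a = hash_features(bin_spectrum(spectrum_a), dim=400)
    vec_b = hash_features(bin_spectrum(spectrum_b), dim=400)
    print(cosine(vec_a, vec_b))
```

The hashed vectors have a fixed, small dimension regardless of how many m/z bins the original spectra span, which is what makes them suitable as input to a nearest neighbor index.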


“…It can be proven that under moderate assumptions feature hashing approximately conserves the Euclidean norm [19], and hence, the similarity between hashed vectors can be used to approximate the similarity between the original, high-dimensional vectors. An important consideration in choosing hash function h is that it must be unbiased in order to minimize the number of hash collisions.…”

Section: Feature Hashing To Vectorize High-resolution Mass Spectra (mentioning)

confidence: 99%

Extremely Fast and Accurate Open Modification Spectral Library Searching of High-Resolution Mass Spectra Using Feature Hashing and Graphics Processing Units

2019


Open modification searching (OMS) is a powerful search strategy to identify peptides with any type of modification. OMS works by using a very wide precursor mass window to allow modified spectra to match against their unmodified variants, after which the modification types can be inferred from the corresponding precursor mass differences. A disadvantage of this strategy, however, is the large computational cost, because each query spectrum has to be compared against a multitude of candidate peptides. We have previously introduced the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. Here we demonstrate how this candidate selection procedure can be further optimized using graphics processing units. Additionally, we introduce a feature hashing scheme to convert high-resolution spectra to low-dimensional vectors. Based on these algorithmic advances, along with low-level code optimizations, the new version of ANN-SoLo is up to an order of magnitude faster than its initial version. This makes it possible to efficiently perform open searches on a large scale to gain a deeper understanding about the protein modification landscape. We demonstrate the computational efficiency and identification performance of ANN-SoLo based on a large data set of the draft human proteome. ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license at https://github.com/bittremieux/ANN-SoLo.
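The norm-preservation property that both citation statements attribute to feature hashing can be written out in its standard form. The notation below (bucket hash h, sign function σ, input dimension n, output dimension m) is an assumption chosen for illustration and is not taken from this record; the derivation assumes the signs are independent, uniformly random, and independent of h.

```latex
% Signed feature hashing of x \in \mathbb{R}^n into m buckets,
% with h : [n] \to [m] and \sigma : [n] \to \{-1, +1\}.
\[
  (Ax)_i \;=\; \sum_{j \,:\, h(j) = i} \sigma(j)\, x_j ,
  \qquad i = 1, \dots, m .
\]
% Expanding the squared norm separates diagonal and collision terms:
\[
  \|Ax\|_2^2
  \;=\; \sum_{j=1}^{n} x_j^2
  \;+\; \sum_{\substack{j \neq k \\ h(j) = h(k)}} \sigma(j)\,\sigma(k)\, x_j x_k .
\]
% Since E[\sigma(j)\sigma(k)] = 0 for j \neq k, the collision terms vanish
% in expectation:
\[
  \mathbb{E}\!\left[\|Ax\|_2^2\right] \;=\; \|x\|_2^2 .
\]
```

The equality holds only in expectation. How tightly the hashed norm concentrates around the original depends on m and on how the mass of x is spread across its coordinates, which is the kind of tradeoff the "moderate assumptions" in the quoted statements refer to.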
