Motivation
Bloom filters are a popular data structure that allows rapid searches in large sequence datasets. So far, all tools work with nucleotide sequences; however, protein sequences are conserved over longer evolutionary distances, and only mutations on the protein level may have any functional significance.
Results
We present MetaProFi, a Bloom filter-based tool that, for the first time, offers the functionality to build indexes of amino acid sequences and query them with both amino acid and nucleotide sequences, thus bringing sequence comparison to the biologically relevant protein level. MetaProFi implements additional efficient engineering solutions, such as a shared memory system, chunked data storage, and efficient compression. In addition to its conceptual novelty, MetaProFi demonstrates state-of-the-art performance and excellent memory consumption-to-speed ratio when applied to various large datasets.
Availability
Source code in Python is available at https://github.com/kalininalab/metaprofi.
Technological advances of next-generation sequencing present new computational challenges to develop methods to store and query these data in time- and memory-efficient ways. We present MetaProFi (https://github.com/kalininalab/metaprofi), a Bloom filter-based tool that, in addition to supporting nucleotide sequences, can for the first time directly store and query amino acid sequences and translated nucleotide sequences, thus bringing sequence comparison to a more biologically relevant protein level. Owing to the properties of Bloom filters, it has a zero false-negative rate, allows for exact and inexact searches, and leverages disk storage and Zstandard compression to achieve high time and space efficiency. We demonstrate the utility of MetaProFi by indexing UniProtKB datasets at organism- and at sequence-level in addition to the indexing of Tara Oceans dataset and the 2585 human RNA-seq experiments, showing that MetaProFi consumes far less disk space than state-of-the-art-tools while also improving performance.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.