Amino acid alphabet reduction preserves fold information contained in contact interactions in proteins

Solis, Armando D.

doi:10.1002/prot.24936

Cited by 29 publications

(37 citation statements)

References 61 publications


“…To perform translated search, Kraken 2X first builds a database from a set of reference proteins in the same manner that Kraken 2 does for nucleotide sequences. The usual alphabet of 20 amino acids is reduced to 15 using the 15-character alphabet of Solis [31]; we add a single additional value representing selenocysteine, pyrrolysine, and translation termination (stop codons). This gives us 16 characters in our reduced alphabet, allowing us to represent a character with 4 bits.…”

Section: Translated Search (mentioning)

confidence: 99%
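
The 4-bit encoding described in the statement above can be illustrated with a minimal Python sketch. The grouping used here (PLACEHOLDER_GROUPS) is a hypothetical stand-in chosen only so that 20 amino acids collapse to 15 codes; it is not the alphabet of Solis and not the mapping used by Kraken 2X. The function names are likewise assumptions made for illustration.

```python
# Hypothetical 15-group reduction of the 20 amino acids.
# This is a placeholder for illustration, NOT the 15-character alphabet of Solis.
PLACEHOLDER_GROUPS = [
    "A", "C", "DE", "FY", "G", "H", "IV", "KR",
    "LM", "N", "P", "Q", "S", "T", "W",
]
# Selenocysteine (U), pyrrolysine (O) and stop (*) share one extra code,
# as the quoted statement describes.
SPECIAL = "UO*"
BITS_PER_CHAR = 4  # 16 codes fit in 4 bits


def build_code_table(groups, special):
    """Map each residue letter to the integer code of its group."""
    table = {}
    for code, group in enumerate(groups):
        for residue in group:
            table[residue] = code
    for residue in special:
        table[residue] = len(groups)  # the single additional 16th value
    return table


CODE = build_code_table(PLACEHOLDER_GROUPS, SPECIAL)


def pack_peptide(peptide):
    """Pack a peptide into one integer, 4 bits per reduced-alphabet character."""
    value = 0
    for residue in peptide:
        value = (value << BITS_PER_CHAR) | CODE[residue]
    return value
```

Under this placeholder grouping, peptides that differ only within a group (for example "DE" and "ED") pack to the same integer, which is the intended effect of searching in a reduced alphabet.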

“…For Kraken 2, we performed two parameter sweeps, with one focused on minimizer-based subsampling, and one focused on hash-based subsampling. The first parameter sweep looked at values for ℓ in the interval [25,31], values for k in the interval [ℓ, ℓ+10], and values for s in the interval [0, 7]; the second parameter sweep looked at values of ℓ in the interval [25,31], fixed k= ℓ, values for f in the set {0.125, 0.25, 0.5}, and values for s in the interval [0,7]. We also performed a third parameter sweep, focused on translated search (Kraken 2X), where we looked at values for ℓ in the interval [11,15], values for k in the interval [ℓ, ℓ+3], and values for s in the interval [0, 3].…”

Section: Parameter Sweeps (mentioning)

confidence: 99%
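
The three parameter grids in the statement above can be enumerated as in the following sketch. It assumes that ℓ, k and s take integer values and that the stated intervals include both endpoints; the quoted text does not say either explicitly, and the function names are hypothetical.

```python
from itertools import product


def sweep_minimizer():
    """First sweep: l in [25, 31], k in [l, l + 10], s in [0, 7]."""
    return [
        (l, k, s)
        for l in range(25, 32)
        for k in range(l, l + 11)
        for s in range(0, 8)
    ]


def sweep_hash():
    """Second sweep: l in [25, 31], k fixed to l, f in {0.125, 0.25, 0.5}, s in [0, 7]."""
    return [
        (l, l, f, s)
        for l, f, s in product(range(25, 32), (0.125, 0.25, 0.5), range(0, 8))
    ]


def sweep_translated():
    """Third sweep (translated search): l in [11, 15], k in [l, l + 3], s in [0, 3]."""
    return [
        (l, k, s)
        for l in range(11, 16)
        for k in range(l, l + 4)
        for s in range(0, 4)
    ]
```

Under these assumptions the grids contain 616, 168 and 80 parameter combinations respectively.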


Improved metagenomic analysis with Kraken 2

Wood

Langmead

2019

Preprint


Although Kraken's k-mer-based approach provides fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed five-fold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.

Assigning taxonomic labels to sequencing reads is an important part of many computational genomics pipelines for metagenomics projects. Recent years have seen several approaches to accomplish this task in a time-efficient manner [1-3]. Kraken [4] used a memory-intensive algorithm that associates short genomic substrings (k-mers) with lowest common ancestor (LCA) taxa. Kraken and related tools like KrakenUniq [5] have proven highly efficient and accurate in other tool comparisons [6,7]. But Kraken's high memory requirements force many researchers to either use a reduced-sensitivity MiniKraken database [8,9], or to build and use many indexes over subsets of the reference sequences [10,11]. Its memory requirements can easily exceed 100 GB [7], especially when the reference data includes large eukaryotic genomes [12,13]. Here we introduce Kraken 2, which provides a major reduction in memory usage as well as faster classification, a spaced-seed searching scheme, a translated search mode for matching in amino acid space, and continued compatibility with the Bracken [14] species-level quantification algorithm.

Kraken 2 addresses the issue of large memory requirements through two changes to Kraken 1's data structures and algorithms. While Kraken 1 used a sorted list of k-mer/LCA pairs indexed by minimizers [15], Kraken 2 introduces a probabilistic, compact hash table to map minimizers to LCAs. This table uses one-third of the memory of a standard hash table, at the cost of some specificity and accuracy. Additionally, Kraken 2 only stores minimizers (of length ℓ, ℓ ≤ k) from the reference sequence library in the data structure, whereas Kraken 1 stored all k-mers. Kraken 2's index for a reference database consisting of 9.1 Gbp of genomic sequence uses 10.6 gigabytes of memory at classification time. Kraken 1's index for the same reference set uses 72.4 gigabytes of memory for classification (Figure 1a, Supplementary Table S1). In general, a Kraken 2 database is about 15% as large as a Kraken 1 database over the same references (Supplementary Figure S1).

Kraken 2's approach is faster than Kraken 1's because only the distinct minimizers from the query (read) trigger accesses to the hash table. A similar minimizer-based approach has proven useful in accelerating read alignment [16]. Kraken 2 additionally provides a hash-based filtering approach that subsamples the set of minimizer/LCA pairs included in the table, allowing the user to specify a target hash table size; smaller hash tables yield lower memory footprint and higher classification throughput at the expens...
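
The idea of storing only minimizers of length ℓ ≤ k, rather than every k-mer, can be sketched as follows. This is a simplified illustration that takes the lexicographically smallest ℓ-mer of each k-mer as its minimizer; the ordering actually used by Kraken 2 is not given in the abstract, and the function names here are hypothetical.

```python
def minimizers(seq, k, l):
    """Yield (k-mer start, minimizer) for every k-mer in seq.

    The minimizer is taken to be the lexicographically smallest l-mer
    inside the k-mer. This ordering is an assumption for illustration only.
    """
    if not 0 < l <= k <= len(seq):
        raise ValueError("require 0 < l <= k <= len(seq)")
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        yield start, min(kmer[j:j + l] for j in range(k - l + 1))


def distinct_minimizers(seq, k, l):
    """Return the set of distinct minimizers over all k-mers of seq."""
    return {minimizer for _, minimizer in minimizers(seq, k, l)}
```

For the toy sequence "ACGTACGT" with k = 5 and ℓ = 3, the four k-mers share only two distinct minimizers ("ACG" and "CGT"), so a table keyed by minimizers needs fewer entries and fewer lookups than one keyed by k-mers.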


“…This idea is implicitly included in other simplifications. Practically, the sequence-structure mapping could be described with information gains comparing to the random sequence [103], the ability of family recognition [104], as well as the mutual information based on contact population [105], and so on. Due to the complexity of the concerned data, some optimization methods are employed to search the optimal groupings which Figure 5.…”

Section: Simplification Based On Features Of Protein Systems (mentioning)

confidence: 99%

“…Similarities are observed for the groupings with various methods. [105,115] Meanwhile, together with the simplifications, some features may not be correctly described after simplifications. For example, experiments and simulations suggest that the HP grouping could not describe all the features of natural proteins [92,95] though it is a common feature of proteins.…”

Section: Minimal Groupings Of Protein Systems (mentioning)

confidence: 99%

“…While further considering the dynamic and functional behaviors, 8-10 groups of amino acids would be necessary [50,79,90,92,93,102,119-121]. Together with all these considerations, the minimal set to fulfill various kinds of biological functions might be the alphabet with 10 amino acids, which are verified with sequence design and homolog recognition [79,93,102,105,110,122]. This measures the possible complexity of protein systems.…”

Section: Minimal Groupings Of Protein Systems (mentioning)

confidence: 99%


Simplification of complexity in protein molecular systems by grouping amino acids: a view from physics

Wang

2016

Advances in Physics: X


General Methodology to Identify the Minimum Alphabet Size for Heteropolymer Design

Cardelli

Nerattini

Tubiana

et al. 2019

Advanced Theory and Simulations


Understanding how to design the structure of heteropolymers through their monomer sequence will have a significant impact on the creation of novel artificial materials. According to mean‐field theories, the minimum number—or alphabet—of distinct monomers necessary to achieve such designability is directly related to the conformational entropy ω of compact polymer structures. Here, a computational strategy to calculate this conformational entropy is introduced and thus predict the minimum alphabet to achieve designability, for a generalized heteropolymer model. The comparison of the predictions with previous results proves the robustness of the approach. It is quantified for the first time how the number of directional interactions is critical for achieving the designability. The methodology that is introduced can be easily generalized to models representing specific polymers. A comparison between conventional polymer monomers is provided, and it is predicted that polyurea, polyamide, and polyurethane residues are optimal candidates to be functionalized for the experimental synthesis of designable heteropolymers. As such, our method can guide the engineering of new types of self‐assembling modular polymers that will open new possibilities for polymer‐based materials with unmatched versatility and control.
