Background
Clustering methods are essential for partitioning biological samples and are useful for reducing the information complexity of large datasets. Tools in this context usually generate data with greedy algorithms, which solve some data mining difficulties but can degrade biologically relevant information during the clustering process. The lack of standardized metrics and consistent databases also raises questions about the clustering efficiency of some methods. Benchmarks are needed to explore the full potential of clustering methods, among which alignment-free methods stand out, and a good choice of dataset is essential to such benchmarks.
Results
Here we present a new approach to data mining in large protein sequence datasets, the Rapid Alignment Free Tool for Sequences Similarity Search to Groups (RAFTS3G), a clustering method that aims to lose less biological information while generating groups. The strategy developed in our algorithm is optimized to be more stringent, which is reflected in increased accuracy and sensitivity when generating clusters across a wide range of similarity. RAFTS3G is the better choice compared to three main methods when the user wants a more reliable result, even without knowing the ideal clustering threshold.
Conclusion
In general, RAFTS3G is able to group up to millions of biological sequences in large datasets, which makes it a remarkably efficient option for clustering. Compared to other “gold-standard” methods for clustering large biological data, RAFTS3G maintains the balance between reducing the redundancy of biological information and creating consistent groups. We apply the binary search concept to grouped sequences, which maintains the sensitivity/accuracy relation and minimizes the time needed to generate data with the RAFTS3G process.
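To make the idea of threshold-based, alignment-free greedy clustering concrete, the following is a minimal sketch and not the RAFTS3G implementation. It assumes k-mer sets as sequence profiles and Jaccard similarity as the comparison measure; the function names `kmer_set`, `jaccard`, and `greedy_cluster`, the default `k`, and the default `threshold` are all hypothetical choices made for illustration. The binary search step mentioned in the abstract is not reproduced, since the abstract does not describe how it is applied.

```python
from typing import Dict, List, Set


def kmer_set(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(
    seqs: Dict[str, str], threshold: float = 0.5, k: int = 3
) -> List[List[str]]:
    """Greedy clustering without alignment (illustrative assumption).

    Each sequence joins the first cluster whose representative is at
    least `threshold` similar; otherwise it founds a new cluster.
    """
    reps: List[Set[str]] = []        # k-mer set of each cluster representative
    clusters: List[List[str]] = []   # sequence ids per cluster
    # Process longest sequences first (ties broken by id), so that the
    # longest member of a cluster becomes its representative.
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        profile = kmer_set(seqs[name], k)
        for idx, rep in enumerate(reps):
            if jaccard(profile, rep) >= threshold:
                clusters[idx].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            reps.append(profile)
            clusters.append([name])
    return clusters
```

Because a sequence is assigned to the first representative that passes the threshold and is never revisited, the result depends on both the processing order and the chosen threshold. This is the kind of sensitivity to the threshold that the abstract refers to.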
Electronic supplementary material
The online version of this article (10.1186/s12859-019-2973-4) contains supplementary material, which is available to authorized users.
Alternative splicing (AS) may increase the number of proteoforms produced by a gene. Alzheimer’s disease (AD) is a neurodegenerative disease with well-characterized AS proteoforms. In this study, we used a proteogenomics strategy to build a customized protein sequence database and identify orthologous AS proteoforms between humans and mice on publicly available shotgun proteomics (MS/MS) data of the corpus callosum (CC) and olfactory bulb (OB). Identical proteotypic peptides of six orthologous AS proteoforms were found in both species: PKM1 (gene PKM/Pkm), STXBP1a (gene STXBP1/Stxbp1), Isoform 3 (gene HNRNPK/Hnrnpk), LCRMP-1 (gene CRMP1/Crmp1), SP3 (gene CADM1/Cadm1), and PKCβII (gene PRKCB/Prkcb). These AS variants were also detected at the transcript level by publicly available RNA-Seq data and experimentally validated by RT-qPCR. Additionally, PKM1 and STXBP1a were detected at higher abundances in a publicly available MS/MS dataset of the AD mouse model APP/PS1 than its wild type. These data corroborate other reports, which suggest that PKM1 and STXBP1a AS proteoforms might play a role in amyloid-like aggregate formation. To the best of our knowledge, this report is the first to describe PKM1 and STXBP1a overexpression in the OB of an AD mouse model. We hope that our strategy may be of use in future human neurodegenerative studies using mouse models.