MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors

Bonidia, Robson Parmezan; Domingues, Douglas Silva; Sanches, Danilo Sipoli; Carvalho, André C. P. L. F. de

doi:10.1093/bib/bbab434

Cited by 50 publications

(35 citation statements)

References 88 publications


“…This module, which is the first feature engineering stage, extracts feature descriptors using the MathFeature package [ 28 ], e.g. Mathematical descriptors (Fourier, Shannon, Tsallis, among others) and Conventional descriptors [Nucleic Acid Composition (NAC), dinucleotide composition (DNC), trinucleotide composition (TNC), ORF Features, Xmer k-Spaced Ymer composition frequency (kGap), Fickett score, among others].…”

Section: BioAutoML Package (mentioning)

confidence: 99%
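The conventional descriptors named in the statement above (NAC, DNC, TNC) are all k-mer composition frequencies for k = 1, 2 and 3. The following is a minimal sketch of that idea, not the MathFeature implementation; the function name `kmer_composition` and its parameters are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k, alphabet="ACGT"):
    """Relative frequency of every k-mer over `alphabet` in `seq`.

    Hypothetical helper for illustration: k=1 corresponds to NAC,
    k=2 to DNC and k=3 to TNC for a DNA alphabet.
    """
    seq = seq.upper()
    total = len(seq) - k + 1  # number of overlapping windows
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    # Enumerate all possible k-mers so the feature vector has a fixed length.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return {kmer: counts[kmer] / total for kmer in kmers}


nac = kmer_composition("ACGTAC", 1)  # 4 features
dnc = kmer_composition("ACGTAC", 2)  # 16 features
```

Enumerating every possible k-mer, rather than only the observed ones, keeps the vector length constant across sequences, which is what most ML algorithms expect.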

“…These tasks can include different ways of preprocessing or feature engineering, as well as algorithms and optimization of its parameters (hyper-parameter tuning) [ 24–26 ]. In this study, BioAutoML calls the MathFeature package [ 27 , 28 ] to extract feature descriptors representing relevant numerical information from ncRNA sequences ( Feature Extraction module ). After receiving the feature values, BioAutoML, automatically recommends, using Bayesian Optimization [ 29 ], the best pair of selected features and predictive model.…”

Section: Introduction (mentioning)

confidence: 99%


BioAutoML: automated feature engineering and metalearning to predict noncoding RNAs in bacteria

Bonidia, Santos, Almeida et al. 2022, Briefings in Bioinformatics (self-citation)

Recent technological advances have led to an exponential expansion of biological sequence data and extraction of meaningful information through Machine Learning (ML) algorithms. This knowledge has improved the understanding of mechanisms related to several fatal diseases, e.g. cancer and coronavirus disease 2019, helping to develop innovative solutions, such as CRISPR-based gene editing, coronavirus vaccine and precision medicine. These advances benefit our society and economy, directly impacting people’s lives in various areas, such as health care, drug discovery, forensic analysis and food processing. Nevertheless, ML-based approaches to biological data require representative, quantitative and informative features. Many ML algorithms can handle only numerical data, and therefore sequences need to be translated into a numerical feature vector. This process, known as feature extraction, is a fundamental step for developing high-quality ML-based models in bioinformatics, by allowing the feature engineering stage, with design and selection of suitable features. Feature engineering, ML algorithm selection and hyperparameter tuning are often manual and time-consuming processes, requiring extensive domain knowledge. To deal with this problem, we present a new package: BioAutoML. BioAutoML automatically runs an end-to-end ML pipeline, extracting numerical and informative features from biological sequence databases, using the MathFeature package, and automating the feature selection, ML algorithm(s) recommendation and tuning of the selected algorithm(s) hyperparameters, using Automated ML (AutoML). BioAutoML has two components, divided into four modules: (1) automated feature engineering (feature extraction and selection modules) and (2) metalearning (algorithm recommendation and hyper-parameter tuning modules).
We experimentally evaluate BioAutoML in two different scenarios: (i) prediction of the three main classes of noncoding RNAs (ncRNAs) and (ii) prediction of the eight categories of ncRNAs in bacteria, including housekeeping and regulatory types. To assess BioAutoML predictive performance, it is experimentally compared with two other AutoML tools (RECIPE and TPOT). According to the experimental results, BioAutoML can accelerate new studies, reducing the cost of feature engineering processing and either keeping or improving predictive performance. BioAutoML is freely available at https://github.com/Bonidia/BioAutoML.


“…Many feature engineering tools that target DNA, RNA, proteins and ligands were released in recent years. They include PseAAC ( 12 ), PROFEAT ( 13 ), PseAAC-Builder ( 14 ), PyDPI ( 15 ), ChemoPy ( 16 ), propy ( 17 ), RDKit ( 18 ), PseAAC-General ( 19 ), Rcpi ( 20 ), ProFET ( 21 ), protr/ProtrWeb ( 22 ), BioTriangle ( 23 ), repRNA ( 24 ), POSSUM ( 25 ), PseKRAAC ( 26 ), iFeature ( 27 ), PyFeat ( 28 ), Seq2Feature ( 29 ), MRMD2.0 ( 30 ) and MathFeature ( 31 ). Besides these feature engineering tools, several platforms for the development of machine learning predictors, including BioSeq-Analysis2.0 ( 32 ), PFeature ( 33 ), iLearn ( 34 ) and iLearnPlus ( 5 ), also provide feature extraction facilities.…”

Section: Introduction (mentioning)

confidence: 99%

“…Only BioTriangle covers DNA, RNA, ligands and protein sequences, however, it does not consider protein structures. MathFeature ( 31 ) and some of the recent machine learning platforms, such as BioSeq-Analysis2.0 ( 32 ), iLearn ( 34 ) and iLearnPlus ( 5 ), provide a relatively rich collection of feature sets for nucleic acids and proteins, outperforming the older feature engineering tools, however, they do not consider ligands and protein structures, except for PFeature ( 33 ) that considers only protein sequences and structures. A few current tools, including PIC ( 46 ), PDBparam ( 47 ) and PFeature, encode features from protein structure, facilitating important applications, such as rational drug development ( 48 ) and prediction of protein functions ( 49–52 ).…”

Section: Introduction (mentioning)

confidence: 99%

iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets

Chen, Liu, Zhao et al. 2022, Nucleic Acids Research

The rapid accumulation of molecular data motivates development of innovative approaches to computationally characterize sequences, structures and functions of biological and chemical molecules in an efficient, accessible and accurate manner. Notwithstanding several computational tools that characterize protein or nucleic acids data, there are no one-stop computational toolkits that comprehensively characterize a wide range of biomolecules. We address this vital need by developing a holistic platform that generates features from sequence and structural data for a diverse collection of molecule types. Our freely available and easy-to-use iFeatureOmega platform generates, analyzes and visualizes 189 representations for biological sequences, structures and ligands. To the best of our knowledge, iFeatureOmega provides the largest scope when directly compared to the current solutions, in terms of the number of feature extraction and analysis approaches and coverage of different molecules. We release three versions of iFeatureOmega including a webserver, command line interface and graphical interface to satisfy needs of experienced bioinformaticians and less computer-savvy biologists and biochemists. With the assistance of iFeatureOmega, users can encode their molecular data into representations that facilitate construction of predictive models and analytical studies. We highlight benefits of iFeatureOmega based on three research applications, demonstrating how it can be used to accelerate and streamline research in bioinformatics, computational biology, and cheminformatics areas. The iFeatureOmega webserver is freely available at http://ifeatureomega.erc.monash.edu and the standalone versions can be downloaded from https://github.com/Superzchen/iFeatureOmega-GUI/ and https://github.com/Superzchen/iFeatureOmega-CLI/.


“…Nevertheless, ML algorithms applied to the analysis of biological sequences present challenges, such as feature extraction [ 10 ]. For non-structured data, as is the case of biological sequences, feature extraction is a key step for the success of ML applications [ 11 , 12 , 13 ].…”

Section: Introduction (mentioning)

confidence: 99%

Information Theory for Biological Sequence Classification: A Novel Feature Extraction Technique Based on Tsallis Entropy

Bonidia, Avila-Santos, Almeida et al. 2022, Entropy

In recent years, there has been an exponential growth in sequencing projects due to accelerated technological advances, leading to a significant increase in the amount of data and resulting in new challenges for biological sequence analysis. Consequently, the use of techniques capable of analyzing large amounts of data has been explored, such as machine learning (ML) algorithms. ML algorithms are being used to analyze and classify biological sequences, despite the intrinsic difficulty in extracting and finding representative biological sequence methods suitable for them. Thereby, extracting numerical features to represent sequences makes it statistically feasible to use universal concepts from Information Theory, such as Tsallis and Shannon entropy. In this study, we propose a novel Tsallis entropy-based feature extractor to provide useful information to classify biological sequences. To assess its relevance, we prepared five case studies: (1) an analysis of the entropic index q; (2) performance testing of the best entropic indices on new datasets; (3) a comparison made with Shannon entropy and (4) generalized entropies; (5) an investigation of the Tsallis entropy in the context of dimensionality reduction. As a result, our proposal proved to be effective, being superior to Shannon entropy and robust in terms of generalization, and also potentially representative for collecting information in fewer dimensions compared with methods such as Singular Value Decomposition and Uniform Manifold Approximation and Projection.
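To make the comparison between Shannon and Tsallis entropy concrete, the sketch below computes both over k-mer frequency distributions, using the standard definitions (Shannon: -Σ p log2 p; Tsallis: (1 - Σ p^q) / (q - 1), with the constant set to 1). It is an illustration under stated assumptions and not the authors' extractor: the helper names, the choice of one entropy value per k, and the example q are all hypothetical.

```python
import math
from collections import Counter


def kmer_probs(seq, k):
    """Observed k-mer probabilities of `seq` (hypothetical helper)."""
    total = len(seq) - k + 1  # number of overlapping windows
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    return [c / total for c in counts.values()]


def shannon_entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def tsallis_entropy(probs, q):
    """Tsallis entropy with entropic index q (constant k taken as 1)."""
    if math.isclose(q, 1.0):
        # In the limit q -> 1, Tsallis entropy reduces to Shannon entropy in nats.
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def entropy_features(seq, max_k, q):
    """One Tsallis entropy value per k in 1..max_k (an assumed feature layout)."""
    return [tsallis_entropy(kmer_probs(seq, k), q) for k in range(1, max_k + 1)]


features = entropy_features("ACGT", max_k=2, q=2.0)
```

The entropic index q is the tunable part: different values weight rare and frequent k-mers differently, which is why the abstract treats the choice of q as its first case study.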
