A riboswitch is a part of mRNA (50-250 nt in length) with two main domains: the aptamer and the expression platform. One of the main challenges in riboswitch classification is imbalanced data, a situation in which one group in a dataset has very few records compared with the others. Under such conditions a classifier tends to ignore the minority group and emphasize the majority class, which results in a skewed classification. To be consistent with recent riboswitch classification work, we considered sixteen riboswitch families whose sizes are highly imbalanced, ranging from 4,826 instances (RF00174) to 39 instances (RF01051). The dataset was divided into training and test sets using a newly developed pipeline. From 5,460 k-mers, 156 features were selected using CfsSubsetEval with BestFirst search. Statistical testing showed a significant difference between the balanced and imbalanced datasets (p < 0.05). In addition, each algorithm showed a significant difference in sensitivity, specificity, accuracy, and macro F-score between the two groups (p < 0.05). Several k-mers clustered in the heat map were found to have biological functions and to correspond to motifs at different positions, such as interior loops, terminal loops, and helices. They were validated to have a biological function, and some are riboswitch motifs. The analysis highlights the importance of addressing majority-class bias and overfitting. The results presented are a generalized evaluation of both the balanced and imbalanced models, which indicates their ability to classify novel riboswitches. The Python source code is available to the scientific community at https://github.com/Seasonsling/riboswitch and can contribute to the development of software packages.
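The figure of 5,460 k-mers is consistent with enumerating every k-mer over the four-letter RNA alphabet for k = 1 to 6 (4 + 16 + 64 + 256 + 1,024 + 4,096 = 5,460). That range is an assumption inferred from the total, not something stated here. Under that assumption, a minimal sketch of turning a sequence into a k-mer frequency vector might look as follows; the function names and the per-k normalisation are illustrative choices and do not reproduce the pipeline in the repository.

```python
from itertools import product

ALPHABET = "ACGU"  # RNA alphabet; DNA-style T is mapped to U below


def all_kmers(k_min=1, k_max=6):
    """Enumerate every k-mer over ALPHABET for k_min <= k <= k_max."""
    return ["".join(p)
            for k in range(k_min, k_max + 1)
            for p in product(ALPHABET, repeat=k)]


def kmer_frequencies(seq, k_min=1, k_max=6):
    """Map each k-mer to its relative frequency in seq.

    Overlapping windows are counted. Windows containing letters outside
    ALPHABET (for example N) are skipped. Frequencies are normalised
    separately for each k (an assumed convention).
    """
    seq = seq.upper().replace("T", "U")
    features = dict.fromkeys(all_kmers(k_min, k_max), 0.0)
    for k in range(k_min, k_max + 1):
        windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        valid = [w for w in windows if w in features]
        for w in valid:
            features[w] += 1.0 / len(valid)
    return features
```

Calling `kmer_frequencies` on each sequence gives one fixed-length row per riboswitch, which a feature selector can then reduce to a smaller subset.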
Author Summary

Machine learning has been applied in many ways in bioinformatics and computational biology. Its use in riboswitch classification is still limited, and existing attempts have faced challenges due to imbalanced datasets. When algorithms classify a dataset containing majority and minority groups, they tend to ignore the minority group and emphasize the majority class, which returns a skewed classification. We used a new pipeline that includes SMOTE for balancing datasets; it classified riboswitches better and improved the performance of the selected algorithms. A statistically significant difference was observed between the balanced and imbalanced datasets in sensitivity, specificity, accuracy, and F-score, indicating that a balanced dataset is better for riboswitch classification. Biological function and motif searches of k-mers in riboswitch families revealed their presence in interior loops, terminal loops, and helices; some of the k-mers have been reported to be riboswitch motifs of aptamer domains that are critical for metabolite binding. The pipeline can be used in machine learning and deep learning studies in other domains of bioinformatics and computational biology that suffer from imbalanced datasets. Finally, the scientific community can use the Python source code, the work done, and the workflow to develop packages.
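SMOTE (synthetic minority over-sampling technique) balances a dataset by creating new minority-class points on the line segments between existing minority samples and their nearest minority neighbours. The sketch below is a simplified, standard-library-only illustration of that interpolation step. It is not the implementation used in this work, which is not specified here; in practice a maintained implementation such as `imblearn.over_sampling.SMOTE` from imbalanced-learn would normally be preferred. The function name, parameters, and defaults are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than peers
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to base.
        others = [x for j, x in enumerate(minority) if j != i]
        others.sort(key=lambda x: math.dist(base, x))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

For a family with 39 sequences, `n_new` would be chosen to bring its size up to that of the larger families. Oversampling is usually applied to the training split only, so that the test set contains no synthetic records.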