To better understand the molecular basis of respiratory diseases of viral origin, high-throughput gene-expression data are frequently collected using DNA microarray or RNA-seq technology. Such data can also be used to classify infected individuals by molecular signatures in the form of machine-learning models with genes as predictor variables. Early diagnosis of patients by molecular signatures could also contribute to better treatments. An approach that has rarely been considered for machine-learning models in the context of transcriptomics is data augmentation. For other data types it has been shown that augmentation can improve classification accuracy and prevent overfitting. Here, we compare three strategies for data augmentation of DNA microarray and RNA-seq data from two selected studies on respiratory diseases of viral origin. The first study involves samples from patients whose respiratory disease is of either viral or bacterial origin; the second study involves patients whose disease is caused by either SARS-CoV-2 or another respiratory virus. Specifically, we reanalyze these public datasets to study whether patient classification by transcriptomic signatures can be improved by adding artificial data when training the machine-learning models. Our comparison reveals that augmentation of transcriptomic data can improve the classification accuracy and that fewer genes are necessary as explanatory variables in the final models. We also report genes from our signatures that overlap with signatures presented in the original publications of our example data. Because these genes passed strict selection criteria, their molecular role in the context of respiratory infectious diseases is underlined.
Background: Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step for comparative metagenomics. For that purpose, sequencing reads are usually classified by mapping them against a database of known viral reference genomes. This fails, however, to classify reads from novel viruses and quasispecies whose reference sequences are not yet available in public databases. Methods: In order to circumvent the problem of a mapping approach with unknown viruses, the feasibility and performance of neural networks to classify sequencing reads to taxonomic classes is studied. For that purpose, taxonomy and genome data from the NCBI database are used to sample artificial reads from known viruses with known taxonomic attribution. Based on these training data, artificial neural networks are fitted and applied to classify single viral read sequences to different taxa. Model building includes different input features derived from artificial read sequences as possible predictors which are chosen by a feature selection method. Training, validation and test data are computed from these input features. To summarise classification results, a generalised confusion matrix is proposed which lists all possible misclassification combination frequencies. Two new formulas to statistically estimate taxa frequencies are introduced for studying the overall viral composition. Results: We found that the best taxonomic level supported by the NCBI database is that of viral orders. Prediction accuracy of the fitted models is evaluated on test data and classification results are summarised in a confusion matrix, from which diagnostic measures such as sensitivity and specificity as well as positive and negative predictive values are calculated. The prediction accuracy of the artificial neural net is considerably higher than for random classification and posterior estimation of taxa frequencies is closer to the true distribution in the training data than simple classification or mapping results. Conclusions: Neural networks are helpful to classify sequencing reads into viral orders and can be used to complement the results of mapping approaches. The machine learning approach is not limited to already known viruses. In addition, statistical estimations of taxa frequencies can be used for subsequent comparative metagenomics.
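The diagnostic measures named in this abstract can be read off a multi-class confusion matrix one class at a time (one-vs-rest). The following is a minimal sketch of that standard calculation, not the authors' implementation; the function name `diagnostic_measures`, the class labels, and the counts are all hypothetical and chosen only for illustration.

```python
def diagnostic_measures(confusion, labels):
    """Per-class sensitivity, specificity, PPV and NPV (one-vs-rest).

    confusion[i][j] is the number of reads whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in confusion)
    nan = float("nan")
    measures = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]                        # true class k, predicted k
        fn = sum(confusion[k]) - tp                 # true class k, predicted other
        fp = sum(row[k] for row in confusion) - tp  # other class, predicted k
        tn = total - tp - fn - fp                   # everything else
        measures[label] = {
            "sensitivity": tp / (tp + fn) if tp + fn else nan,
            "specificity": tn / (tn + fp) if tn + fp else nan,
            "ppv": tp / (tp + fp) if tp + fp else nan,
            "npv": tn / (tn + fn) if tn + fn else nan,
        }
    return measures


# Hypothetical labels and made-up counts, for illustration only.
labels = ["OrderA", "OrderB", "OrderC"]
confusion = [
    [80, 15, 5],
    [10, 70, 20],
    [10, 5, 85],
]
result = diagnostic_measures(confusion, labels)
for label in labels:
    print(label, {name: round(value, 3) for name, value in result[label].items()})
```

Each measure is computed against the pooled remaining classes, so a class with few reads can show high specificity while its sensitivity stays low.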
Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step in comparative metagenomics. Mapping sequencing reads against a database of known viral reference genomes, however, fails to classify reads from novel viruses whose reference sequences are not yet available in public databases. Instead of a mapping approach, and in order to classify sequencing reads at least to a taxonomic level, the performance of artificial neural networks and other machine learning models was studied. Taxonomic and genomic data from the NCBI database were used to sample labelled sequencing reads as training data. The fitted neural network was applied to classify unlabelled reads of simulated and real-world test sets. Additional auxiliary test sets of labelled reads were used to estimate the conditional class probabilities, and to correct the prior estimation of the taxonomic distribution in the actual test set. Among the taxonomic levels, the biological order of viruses provided the most comprehensive database to generate training data. The prediction accuracy of the artificial neural network to classify test reads to their viral order was considerably higher than that of a random classification. Posterior estimation of taxa frequencies could correct the primary classification results.
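The abstract does not give the correction formula, so the following is only a sketch of one common way such a correction can work, assuming the conditional probabilities P(predicted class | true class) are estimated from a labelled auxiliary set and the observed predicted frequencies are then inverted through them. The function names, the two-class setup, and all counts are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("conditional probability matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def corrected_frequencies(aux_confusion, predicted_counts):
    """Correct predicted class frequencies using an auxiliary labelled set.

    aux_confusion[i][j]: auxiliary reads of true class i predicted as class j.
    predicted_counts[j]: reads of the actual test set predicted as class j.
    """
    # Estimate P(predicted j | true i) row by row from the auxiliary set.
    cond = [[count / sum(row) for count in row] for row in aux_confusion]
    total = sum(predicted_counts)
    observed = [count / total for count in predicted_counts]
    k = len(observed)
    # Model: observed[j] = sum_i true[i] * cond[i][j]; solve for true[i].
    transposed = [[cond[i][j] for i in range(k)] for j in range(k)]
    estimate = solve_linear(transposed, observed)
    # Sampling noise can push estimates below zero; clip and renormalise.
    clipped = [max(0.0, p) for p in estimate]
    norm = sum(clipped)
    return [p / norm for p in clipped]


# Made-up numbers: the classifier labels 41 % of test reads as class 0,
# although the auxiliary set shows it over-predicts that class.
aux_confusion = [[90, 10], [20, 80]]
predicted_counts = [410, 590]
corrected = corrected_frequencies(aux_confusion, predicted_counts)
print([round(p, 3) for p in corrected])
```

The correction is only as good as the auxiliary estimate: if the auxiliary reads are not representative of the actual test set, the inverted frequencies inherit that bias.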
High‐dimensional gene expression data are regularly studied for their ability to separate different groups of samples by means of machine learning (ML) models. A large number of such data sets are now publicly available. Several approaches for meta‐analysis on independent sets of gene expression data have been proposed, mainly focusing on the step of feature selection, a typical step in fitting an ML model. Here, we compare different strategies of merging the information of such independent data sets to train a classifier model. Specifically, we compare the strategy of merging data sets directly (strategy A), and the strategy of merging the classification results (strategy B). We use simulations with purely artificial data as well as evaluations based on independent gene expression data from lung fibrosis studies to compare the two merging approaches. In the simulations, the number of studies, the strength of batch effects, and the separability are varied. The comparison incorporates five standard ML techniques typically used for high‐dimensional data, namely discriminant analysis, support vector machines, least absolute shrinkage and selection operator, random forest, and artificial neural networks. Using cross‐study validations, we found that direct data merging yields higher accuracies when training data from three or four studies are available, whereas merging of classification results performed better when only two training studies are available. In the evaluation with the lung fibrosis data, both strategies showed a similar performance.
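The difference between the two merging strategies can be sketched with a toy example. This is an illustration under stated assumptions, not the authors' pipeline: a nearest-centroid classifier stands in for the five ML techniques, majority voting is assumed as the rule for merging classification results, and the study data, function names, and class labels are invented.

```python
from collections import Counter


def fit_centroids(samples, labels):
    """Return {label: mean expression vector} for one training set."""
    sums, counts = {}, Counter()
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Assign x to the class with the nearest centroid (squared distance)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sqdist(centroids[y]))


def strategy_a(studies, x):
    """Strategy A: merge the data sets directly, then train one classifier."""
    samples = [s for study_samples, _ in studies for s in study_samples]
    labels = [l for _, study_labels in studies for l in study_labels]
    return predict(fit_centroids(samples, labels), x)


def strategy_b(studies, x):
    """Strategy B: train one classifier per study, then merge the predictions
    (majority vote is an assumption made for this sketch)."""
    votes = Counter(predict(fit_centroids(s, l), x) for s, l in studies)
    return votes.most_common(1)[0][0]


# Three invented studies with two genes each; studies 2 and 3 carry an
# additive batch offset of +1 and -1 relative to study 1.
classes = ["control", "control", "fibrosis", "fibrosis"]
studies = [
    ([[1, 1], [2, 1], [4, 5], [5, 4]], classes),
    ([[2, 2], [3, 2], [5, 6], [6, 5]], classes),
    ([[0, 0], [1, 0], [3, 4], [4, 3]], classes),
]
query = [4.5, 4.5]
print(strategy_a(studies, query), strategy_b(studies, query))
```

In this toy setting both strategies agree because the batch offsets are small relative to the class separation; the abstract's simulations vary exactly those two quantities, along with the number of studies.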