To confront the global threat of coronavirus disease 2019, a massive number of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome sequences have been decoded, with the results promptly released through the GISAID database. Based on variant types, eight clades have already been defined in GISAID, but the diversity can be far greater. Owing to the explosive increase in available sequences, it is important to develop new technologies that can easily grasp the whole picture of the big-sequence data and support efficient knowledge discovery. An ability to efficiently clarify the detailed time-series changes in genome-wide mutation patterns will enable us to promptly identify and characterize dangerous variants that rapidly increase their population frequency. Here, we collectively analyzed over 150,000 SARS-CoV-2 genomes to understand their overall features and time-dependent changes using a batch-learning self-organizing map (BLSOM) for oligonucleotide composition, which is an unsupervised machine learning method. BLSOM can separate clades defined by GISAID with high precision, and each clade is subdivided into clusters, which shows a differential increase/decrease pattern based on geographic region and time. This allowed us to identify prevalent strains in each region and to show the commonality and diversity of the prevalent strains. Comprehensive characterization of the oligonucleotide composition of SARS-CoV-2 and elucidation of time-series trends of the population frequency of variants can clarify the viral adaptation processes after invasion into the human population and the time-dependent trend of prevalent epidemic strains across various regions, such as continents.
To confront the global threat of coronavirus disease 2019, a massive number of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome sequences have been decoded, with the results promptly released through the GISAID database. Based on variant types, eight clades have already been defined in GISAID, but the diversity can be far greater. Owing to the explosive increase in available sequences, it is important to develop new technologies that can easily grasp the whole picture of the big-sequence data and support efficient knowledge discovery. An ability to efficiently clarify the detailed time-series changes in genome-wide mutation patterns will enable us to promptly identify and characterize dangerous variants that rapidly increase their population frequency. Here, we collectively analyzed over 150,000 SARS-CoV-2 genomes to understand their overall features and time-dependent changes using a batch-learning self-organizing map (BLSOM) for oligonucleotide composition, which is an unsupervised machine learning method. BLSOM can separate clades defined by GISAID with high precision, and each clade is subdivided into clusters, which shows a differential increase/decrease pattern based on geographic region and time. This allowed us to identify prevalent strains in each region and to show the commonality and diversity of the prevalent strains. Comprehensive characterization of the oligonucleotide composition of SARS-CoV-2 and elucidation of time-series trends of the population frequency of variants can clarify the viral adaptation processes after invasion into the human population and the time-dependent trend of prevalent epidemic strains across various regions, such as continents.
There are currently no abstract classifiers, which can be used for new diagnostic test accuracy (DTA) systematic reviews to select primary DTA study abstracts from database searches. Our goal was to develop machine‐learning‐based abstract classifiers for new DTA systematic reviews through an open competition. We prepared a dataset of abstracts obtained through database searches from 11 reviews in different clinical areas. As the reference standard, we used the abstract lists that required manual full‐text review. We randomly splitted the datasets into a train set, a public test set, and a private test set. Competition participants used the training set to develop classifiers and validated their classifiers using the public test set. The classifiers were refined based on the performance of the public test set. They could submit as many times as they wanted during the competition. Finally, we used the private test set to rank the submitted classifiers. To reduce false exclusions, we used the Fbeta measure with a beta set to seven for evaluating classifiers. After the competition, we conducted the external validation using a dataset from a cardiology DTA review. We received 13,774 submissions from 1429 teams or persons over 4 months. The top‐honored classifier achieved a Fbeta score of 0.4036 and a recall of 0.2352 in the external validation. In conclusion, we were unable to develop an abstract classifier with sufficient recall for immediate application to new DTA systematic reviews. Further studies are needed to update and validate classifiers with datasets from other clinical areas.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.