2021
DOI: 10.3389/fcimb.2021.610348

Learning From Limited Data: Towards Best Practice Techniques for Antimicrobial Resistance Prediction From Whole Genome Sequencing Data

Abstract: Antimicrobial resistance prediction from whole genome sequencing data (WGS) is an emerging application of machine learning, promising to improve antimicrobial resistance surveillance and outbreak monitoring. Despite significant reductions in sequencing cost, the availability and sampling diversity of WGS data with matched antimicrobial susceptibility testing (AST) profiles required for training of WGS-AST prediction models remains limited. Best practice machine learning techniques are required to ensure traine…

Citations: cited by 24 publications (18 citation statements)
References: 52 publications
“…This problem is described by the fact that it is generally more cost and time effective to screen for a large number of variants within an individual than it is to screen large numbers of individuals [ 62 ] as is common in the fields of, e.g., neuroimaging, genomics, motion tracking, eye tracking, and many other technology-based data collection methods that have led to a torrent of high-dimensional datasets. This is a well-known area where classical machine learning algorithms do not perform well [ 63 , 64 ]. However, despite small sample sizes being common and the fact that limited data are problematic for pattern recognition, only a limited number of papers have systematically investigated how the machine learning validation process should be designed to help avoid optimistic performance estimates.…”
Section: Discussion (mentioning)
confidence: 99%
“…Organism–compound datasets with fewer than 100 susceptible and 100 resistant isolates were excluded. Filtered datasets were partitioned into training and test sets (80%:20%) using a genome-distance-based method [ 17 ]. This dataset partitioning method is designed to reduce similarity between the training and the test dataset.…”
Section: Methods (mentioning)
confidence: 99%
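
The filtering and partitioning steps quoted above can be illustrated with a minimal sketch. It does not reproduce the genome-distance-based method of reference [17], whose details are not given in the statement. It assumes a precomputed pairwise distance matrix, groups isolates by single linkage under a hypothetical `threshold`, and assigns whole groups to the test set until roughly 20% is reached. The function names and the "S"/"R" label encoding are illustrative assumptions.

```python
from collections import Counter


def passes_filter(labels, min_per_class=100):
    """Keep a dataset only if it has at least min_per_class susceptible ("S")
    and at least min_per_class resistant ("R") isolates."""
    counts = Counter(labels)
    return counts.get("S", 0) >= min_per_class and counts.get("R", 0) >= min_per_class


def cluster_by_distance(dist, threshold):
    """Single-linkage grouping with union-find: isolates whose pairwise
    distance is below threshold end up in the same cluster."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def distance_aware_split(dist, threshold, test_fraction=0.2):
    """Assign whole clusters to the test set (smallest first) while the test
    set stays within test_fraction of all isolates; the rest go to training.
    Close relatives therefore never straddle the train/test boundary."""
    clusters = sorted(cluster_by_distance(dist, threshold), key=len)
    target = test_fraction * len(dist)
    train, test = [], []
    for members in clusters:
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)
```

Because clusters are moved as units, the realized test share can fall below the 20% target when clusters are large; a real pipeline would also need to check class balance in each partition.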
“…ML-based WGS-AST typically uses nucleotide k-mer representations of either input genome assemblies or raw sequencing reads [ 14 , 15 , 16 , 17 , 18 ]. K-mer sets have been successfully used for various bioinformatics analyses, ranging from species identification [ 19 ] to genome assembly [ 20 ], as they offer advantages in computing efficiency and speed.…”
Section: Introduction (mentioning)
confidence: 99%
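
As a rough illustration of the nucleotide k-mer representation mentioned in the statement above, the following sketch counts overlapping k-mers in a sequence and maps them onto a fixed vocabulary to form a feature vector. It is not taken from any of the cited works: the function names are hypothetical, windows containing ambiguous bases are simply skipped, and reverse-complement (canonical) k-mers are not merged, which production tools commonly do.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in the sequence, skipping windows
    that contain anything other than A, C, G or T (for example N)."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts


def feature_vector(sequence, k, vocabulary):
    """Project the k-mer counts onto a fixed, ordered vocabulary so that
    every genome yields a vector of the same length."""
    counts = kmer_counts(sequence, k)
    return [counts.get(kmer, 0) for kmer in vocabulary]
```

The number of possible k-mers grows as 4^k, so even moderate k gives far more features than isolates, which is the high-dimensional, small-sample setting the citation statements describe.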
“…Only a subset of ML algorithms is capable of effectively making use of high-dimensional data while minimizing overfitting [ 30 ]. Likewise, rigorous validation on independently sampled datasets is required for robust estimation of model performance in the general case [ 45 , 71 ]. While the increasing availability of datasets with both NGS and AST data will help in improving performance and generalizability, more research is required to establish guidelines for sampling and validation of pAST ML models that can support clinical applications.…”
Section: Current Limitations and Perspectives (mentioning)
confidence: 99%