On evaluation metrics for medical applications of artificial intelligence

Hicks, Steven A.; Strümke, Inga; Thambawita, Vajira; Hammou, Malek; Riegler, Michael; Halvorsen, Pål; Parasa, Sravanthi

doi:10.1101/2021.04.07.21254975

Cited by 37 publications

(34 citation statements)

References 19 publications


“…This is consistent with another surprising finding that ABNN prospective prediction sensitivity (recall) and ABNN prospective prediction precision are strongly and positively correlated across validation subgroups (Exact Pearson r = 0.808; Fig. 3E), in stark contrast to the typical inverse relationship between precision and recall for binary classifications (20). This finding reveals that ABNN performance is primarily driven by improved disease understanding that unbiasedly reduces both false positives and false negatives, rather than by biased cutoff tuning that trades off false positives against false negatives.…”

Section: Results

supporting

confidence: 91%
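The "typical inverse relationship between precision and recall" mentioned in the statement above can be illustrated with a minimal sketch. The scores, labels, and cutoffs below are hypothetical values invented for illustration; they are not taken from the citing paper. Raising the decision cutoff on a fixed set of scores tends to raise precision while lowering recall.

```python
# Hypothetical classifier scores and ground-truth labels (illustration only).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1, 1]


def precision_recall(scores, labels, cutoff):
    """Return (precision, recall) when scores >= cutoff are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Sweep three arbitrary cutoffs from lenient to strict.
cutoffs = [0.25, 0.55, 0.85]
results = [precision_recall(scores, labels, c) for c in cutoffs]
for cutoff, (p, r) in zip(cutoffs, results):
    print(f"cutoff={cutoff:.2f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data, precision rises and recall falls as the cutoff increases, which is the trade-off that cutoff tuning alone produces.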

“…Accordingly, we chose to report prospective prediction sensitivity (recall) in the context of F1 score, and prospective prediction precision against the expert benchmark as the most appropriate performance metrics for PROTOCOLS validation test (19) to reflect the descending priorities (maximize true positives > minimize false negatives > minimize false positives > maximize true negatives) in actual clinical settings to deliver maximal clinical benefit to patients at minimal clinical cost to sponsors (Methods section ‘Metrics definition’ and ‘Model training’) (20, 21). We also report other standard measures (accuracy, specificity, etc.)…”

mentioning

confidence: 99%


Prospectively validated augmented intelligence for disease-agnostic predictions of clinical success for novel therapeutics

Lovetrue¹,

Lovetrue²

2022

Preprint


Despite numerous superhuman achievements in complex challenges, standalone AI does not free life science from the long-term bottleneck of linearly extracting new knowledge from exponentially growing new data, severely limiting the success rate of drug discovery. Inspired by the state-of-the-art AI training methods, we trained a human-centric hybrid augmented intelligence (HAI) to learn a foundation model that extracts all-encompassing knowledge of human physiology and diseases. To evaluate the quality of HAI's extracted knowledge, we designed the public, prospective prediction of pivotal ongoing clinical trial outcomes at large scale (PROTOCOLS) challenge to benchmark HAI's real-world performance of predicting drug clinical efficacy without access to human data. HAI achieved a 10.5-fold improvement from the baseline with 99% confidence in the PROTOCOLS validation, readily increasing the average clinical success rate of investigational new drugs from 7.9% to 90% for almost any human diseases. The validated HAI confirms that exponentially extracted knowledge alone is sufficient for accurately predicting drug clinical efficacy, effecting a total reversal of Eroom's law. HAI is also the world's first clinically validated model of human aging that could substantially speed up the discovery of preventive medicine for all age-related diseases. Our results demonstrate that disruptive breakthroughs necessitate the smallest team size to attain the largest HAI for optimal knowledge extraction from high-dimensional low-quality data space, thus establishing the first prospective proof of the previous discovery that small teams disrupt. The global adoption of training HAI provides a well-beaten economic path to mass-produce scientific and technological breakthroughs via exponential knowledge extraction and better designs of data labels for training better performing AI.


“…In minority-event detection, metrics such as sensitivity and specificity (i.e. true positive and true negative rates) are often preferred depending on the relative importance of Type I and Type II errors in the given medical context [38]. To simplify presentation of results, we report the Balanced Accuracy, an average of specificity and sensitivity.…”

Section: B Ontology Of Methods

mentioning

confidence: 99%
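As a minimal sketch of the Balanced Accuracy described in the statement above (the average of sensitivity and specificity), the following computes it from confusion-matrix counts. The function name and the example counts are hypothetical and chosen only to illustrate a minority-event setting; they do not come from the citing paper.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Average of sensitivity (true positive rate) and specificity (true negative rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical minority-event data: 10 positives, 990 negatives.
# A classifier that always predicts "negative" finds no positives.
tp, fn, tn, fp = 0, 10, 990, 0

plain_accuracy = (tp + tn) / (tp + fn + tn + fp)
balanced = balanced_accuracy(tp, fn, tn, fp)

print(f"accuracy          = {plain_accuracy:.2f}")  # high despite missing every positive
print(f"balanced accuracy = {balanced:.2f}")        # no better than chance
```

Plain accuracy rewards the always-negative classifier for the majority class, while Balanced Accuracy weights both classes equally and exposes that the positives are never detected.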

Continual learning of longitudinal health records

Armstrong¹,

Clifton²

2021

Preprint


Continual learning denotes machine learning methods which can adapt to new environments while retaining and reusing knowledge gained from past experiences. Such methods address two issues encountered by models in non-stationary environments: ungeneralisability to new data, and the catastrophic forgetting of previous knowledge when retrained. This is a pervasive problem in clinical settings where patient data exhibits covariate shift not only between populations, but also continuously over time. However, while continual learning methods have seen nascent success in the imaging domain, they have been little applied to the multi-variate sequential data characteristic of critical care patient recordings. Here we evaluate a variety of continual learning methods on longitudinal ICU data in a series of representative healthcare scenarios. We find that while several methods mitigate short-term forgetting, domain shift remains a challenging problem over large series of tasks, with only replay based methods achieving stable long-term performance. Code for reproducing all experiments can be found at https://github.com/iacobo/continual

Keywords: Continual learning • domain adaptation • time series • clinical machine learning • EHR


“…48, reporting of a single metric such as Sensitivity, PPV or Specificity can be highly misleading because, for example, non-informative classifiers can achieve high values on imbalanced classes. The F1 Score (also known as DSC in the context of segmentation), overcomes this issue by representing the harmonic mean of PPV and Sensitivity and therefore penalizing extreme values of either metric [38], while being relatively robust against imbalanced data sets [89]. The F1 Score is a specification of the F𝛽 score, which adds a weighting between PPV and Sensitivity.…”

Section: G1 Image-level Classification

mentioning

confidence: 99%
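The relationship between the F1 Score and the F𝛽 score described in the statement above can be sketched as follows, using the standard definition F𝛽 = (1 + 𝛽²) · PPV · Sensitivity / (𝛽² · PPV + Sensitivity). The function name and the example counts are hypothetical and serve only as an illustration of a non-informative classifier on imbalanced classes.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from confusion counts; beta=1 gives the F1 Score
    (harmonic mean of PPV and Sensitivity)."""
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    if ppv == 0.0 and sensitivity == 0.0:
        return 0.0  # convention used here when both components are zero
    b2 = beta ** 2
    return (1 + b2) * ppv * sensitivity / (b2 * ppv + sensitivity)


# Hypothetical imbalanced data: 10 positives, 990 negatives.
# A non-informative classifier that labels everything "positive"
# reaches perfect Sensitivity but very low PPV.
tp, fp, fn = 10, 990, 0

sensitivity = tp / (tp + fn)
f1 = f_beta(tp, fp, fn)            # beta = 1
f2 = f_beta(tp, fp, fn, beta=2.0)  # beta > 1 weights Sensitivity more heavily

print(f"Sensitivity = {sensitivity:.3f}")
print(f"F1          = {f1:.4f}")
print(f"F2          = {f2:.4f}")
```

Sensitivity alone looks perfect here, whereas the harmonic mean is pulled toward the low PPV, which is the penalizing behaviour the statement refers to.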

Metrics reloaded: Pitfalls and recommendations for image analysis validation

Maier-Hein¹,

Reinke²,

Godau³

et al. 2022

Preprint

Self Cite


The field of automatic biomedical image analysis crucially depends on robust and meaningful performance metrics for algorithm validation. Current metric usage, however, is often ill-informed and does not reflect the underlying domain interest. Here, we present a comprehensive framework that guides researchers towards choosing performance metrics in a problem-aware manner. Specifically, we focus on biomedical image analysis problems that can be interpreted as a classification task at image, object or pixel level. The framework first compiles domain interest-, target structure-, data set- and algorithm output-related properties of a given problem into a problem fingerprint, while also mapping it to the appropriate problem category, namely image-level classification, semantic segmentation, instance segmentation, or object detection. It then guides users through the process of selecting and applying a set of appropriate validation metrics while making them aware of potential pitfalls related to individual choices. In this paper, we describe the current status of the Metrics Reloaded recommendation framework, with the goal of obtaining constructive feedback from the image analysis community. The current version has been developed within an international consortium of more than 60 image analysis experts and will be made openly available as a user-friendly toolkit after community-driven optimization.
