Establishing key research questions for the implementation of artificial intelligence in colonoscopy: a modified Delphi method

Ahmad, Omer F.; Mori, Yuichi; Misawa, Masashi; Kudo, Toyoki; Anderson, John T.; Bernal, Jorge; Berzin, Tyler M.; Bisschops, Raf; Byrne, Michael F.; Chen, Peng‐Jen; East, James E.; Eelbode, Tom; Elson, Daniel S.; Gurudu, Suryakanth R.; Histace, Aymeric; Karnes, William E.; Repici, Alessandro; Singh, Rajvinder; Valdastri, Pietro; Wallace, Michael B.; Wang, Pu; Stoyanov, Danail; Lovat, Laurence

doi:10.1055/a-1306-7590

Cited by 46 publications

(39 citation statements)

References 35 publications


“…These potential advances are mainly expected from artificial neural networks, specifically deep learning-based methods [4]. Safe and efficient adoption of ML tools in clinical gastroenterology requires a thorough understanding of the performance metrics of the resulting models and confirmation of their clinical utility [5].…”

mentioning

confidence: 99%

On evaluation metrics for medical applications of artificial intelligence

Hicks, Strümke, Thambawita, et al. 2021

Preprint


Clinicians and model developers need to understand how proposed machine learning (ML) models could improve patient care. In fact, no single metric captures all the desirable properties of a model and several metrics are typically reported to summarize a model's performance. Unfortunately, these measures are not easily understandable by many clinicians. Moreover, comparison of models across studies in an objective manner is challenging, and no tool exists to compare models using the same performance metrics. This paper looks at previous ML studies done in gastroenterology, provides an explanation of what different metrics mean in the context of the presented studies, and gives a thorough explanation of how different metrics should be interpreted. We also release an open source web-based tool that may be used to aid in calculating the most relevant metrics presented in this paper so that other researchers and clinicians may easily incorporate them into their research.
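The abstract's point that no single metric captures a model's behaviour can be illustrated with a minimal sketch. The function name `confusion_metrics` and the counts used below are hypothetical and chosen only for illustration; they are not taken from the cited paper or its web tool.

```python
import math


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common binary-classification metrics from confusion-matrix counts.

    Hypothetical helper for illustration; not the tool released by the cited paper.
    """

    def ratio(num: float, den: float) -> float:
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    # Denominator of the Matthews correlation coefficient (MCC).
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    return {
        "sensitivity": ratio(tp, tp + fn),   # recall, true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),     # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Made-up counts: 100 positive frames and 900 negative frames.
    for name, value in confusion_metrics(tp=90, fp=30, tn=870, fn=10).items():
        print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is 0.96 while precision is only 0.75, which is the kind of divergence that motivates reporting several metrics side by side.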


“…An example of a simple method that could be used to reduce FPs is re-training the CADe algorithms with scenarios that currently lead to FPs. Another approach could be the adoption of recurrent neural networks, which have memory and can process temporal sequences of frames in a way that is similar to the learning process of human brains [10]. Misawa et al reported that when they changed their old algorithm [17] to YoloV3 (You Only Look Once, Version 3), a state-of-the-art, real-time object detection algorithm, better specificity was achieved (increasing from 90.9% to 93.7%) [19].…”

Section: How to Address the Occurrence of FPs

mentioning

confidence: 99%

“…An accompanying limitation of the CADe is false positives (FPs), which occur when the algorithm identifies a “polyp” that the endoscopist would disagree with. FPs were ranked 3rd in importance among 59 future research questions related to CADe [10]. Therefore, we conducted this systemic review on the definitions, causes, and adverse effects of the CADe FPs.…”

Section: Introduction

mentioning

confidence: 99%

Computer-Aided Detection False Positives in Colonoscopy

et al. 2021


Randomized control trials and meta-analyses comparing colonoscopies with and without computer-aided detection (CADe) assistance showed significant increases in adenoma detection rates (ADRs) with CADe. A major limitation of CADe is its false positives (FPs), ranked 3rd in importance among 59 research questions in a modified Delphi consensus review. The definition of FPs varies. One commonly used definition defines an FP as an activation of the CADe system, irrespective of the number of frames or duration of time, not due to any polypoid or nonpolypoid lesions. Although only 0.07 to 0.2 FPs were observed per colonoscopy, video analysis studies using FPs as the primary outcome showed much higher numbers of 26 to 27 per colonoscopy. Most FPs were of short duration (91% < 0.5 s). A higher number of FPs was also associated with suboptimal bowel preparation. The appearance of FPs can lead to user fatigue. The polypectomy of FPs results in increased procedure time and added use of resources. Re-training the CADe algorithms is one way to reduce FPs but is not practical in the clinical setting during colonoscopy. Water exchange (WE) is an emerging method that the colonoscopist can use to provide salvage cleaning during insertion. We discuss the potential of WE for reducing FPs as well as the augmentation of ADRs through CADe.


“…However, it is difficult to compare different AI algorithms/settings/products due to the lack of established definition and measurement criteria for false-positive alarms in AI-assisted colonoscopy. This problem was also emphasized as one of the most crucial issues at a recent international expert meeting [4].…”

mentioning

confidence: 99%