2023
DOI: 10.1101/2023.11.27.23299056
Preprint
Unveiling the Clinical Incapabilities: A Benchmarking Study of GPT-4V(ision) for Ophthalmic Multimodal Image Analysis

Pusheng Xu,
Xiaolan Chen,
Ziwei Zhao
et al.

Abstract: Background: GPT-4V(ision) has generated great interest across various fields, but its performance on ocular multimodal images remains unknown. This study aims to evaluate the capabilities of a GPT-4V-based chatbot in addressing queries related to ocular multimodal images. Methods: A digital ophthalmologist app was built based on GPT-4V. The evaluation dataset comprised various ocular imaging modalities: slit-lamp, scanning laser ophthalmoscopy (SLO), fundus photography of the posterior pole (FPP), optical coher…


Cited by 7 publications (2 citation statements)
References 28 publications
“…Furthermore, although GPT-4V performs well in commonsense visual question answering, it is prone to hallucinations when world knowledge is required, such as about real-world objects [21], especially for objects from non-Western countries [22]. A similar pattern has been observed for medical images, where GPT-4V does not seem to possess the knowledge required for making accurate diagnoses or reports [23,24]. Guan et al. …”
Section: Introduction
confidence: 85%
“…29,31 A small proportion of misinformation and hallucination in responses existed among LLMs, with Bing Chat and GPT-3 (61.5%–77%) showing lower precision than GPT-4 (80.5%–93%; see Table 1). For image-based diagnosis, the performance of GPT-4V(ision) was poor (30.6%, 26.2%–67.5%) [36][37][38][39], with significantly lower median accuracy compared with humans (p = 0.01) (Table 2).…”
Section: Performance In Diagnosing Ophthalmic Diseases and Triage Acc…
confidence: 99%