Review of artificial intelligence‐based question‐answering systems in healthcare

Cilar, Leona; Gosak, Lucija; Štiglic, Gregor

doi:10.1002/widm.1487

Cited by 27 publications

(9 citation statements)

References 75 publications


“…Considering the rising prevalence of psychiatric disorders and concomitant challenges in providing care, it seemed likely that nonprofessionals would also turn to the chatbot for mental health issues at the time of GPT-3.5's release [8,49,50]. Hence, it is conceivable that GPT-3.5's training data set includes not only a substantial and reliable portion of psychiatric data, but also its developers might have first fine-tuned ChatGPT specifically in this domain in anticipation of its high demand [51][52][53]. Thus, the developers might have also fine-tuned GPT-4 specifically in internal medicine and surgery, possibly reacting to a high demand in this area from users of its predecessor.…”

Section: Principal Findings (mentioning)

confidence: 99%

Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study

Meyer, Riese, Streichert. 2024. JMIR Med Educ


Background: The potential of artificial intelligence (AI)–based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance. Objective: This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination. Methods: To assess GPT-3.5’s and GPT-4’s medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022. Results: GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research. Conclusions: The study results highlight ChatGPT’s remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While GPT-4’s predecessor (GPT-3.5) was imprecise and inconsistent, GPT-4 demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population.



“…A specific subset of AI in healthcare is patient-facing conversational AI agents and chatbots, which directly interact with patients to perform tasks ranging from symptom self-diagnosis and treatment recommendations to medication management [6]. These include various modalities such as text-based chatbots [7], voice assistants [8], and wearable devices [9].…”

Section: Introduction (mentioning)

confidence: 99%

Achieving health equity through conversational AI: A roadmap for design and implementation of inclusive chatbots in healthcare

Nadarzynski, Knights, Husbands et al. 2024. PLOS Digit Health


Background: The rapid evolution of conversational and generative artificial intelligence (AI) has led to the increased deployment of AI tools in healthcare settings. While these conversational AI tools promise efficiency and expanded access to healthcare services, there are growing ethical, practical and inclusivity concerns. This study aimed to identify activities which reduce bias in conversational AI and make their designs and implementation more equitable. Methods: A qualitative research approach was employed to develop an analytical framework based on the content analysis of 17 guidelines about AI use in clinical settings. A stakeholder consultation was subsequently conducted with a total of 33 ethnically diverse community members, AI designers, industry experts and relevant health professionals to further develop a roadmap for equitable design and implementation of conversational AI in healthcare. Framework analysis was conducted on the interview data. Results: A 10-stage roadmap was developed to outline activities relevant to equitable conversational AI design and implementation phases: 1) Conception and planning, 2) Diversity and collaboration, 3) Preliminary research, 4) Co-production, 5) Safety measures, 6) Preliminary testing, 7) Healthcare integration, 8) Service evaluation and auditing, 9) Maintenance, and 10) Termination. Discussion: We have made specific recommendations to increase conversational AI’s equity as part of healthcare services. These emphasise the importance of a collaborative approach and the involvement of patient groups in navigating the rapid evolution of conversational AI technologies. Further research must assess the impact of recommended activities on chatbots’ fairness and their ability to reduce health inequalities.


“…To bridge this gap, question-answering (QA) systems have emerged as a tool to enhance knowledge and understanding on numerous topics by providing short and precise answers to questions posed in natural language [24]. This is achieved through natural language processing (NLP), a branch of artificial intelligence (AI) with rapid developments and vast applications using large language models (LLMs) for QA. Question-answering systems possess an abundance of domain knowledge, where biomedical QA systems can be trained on evidence-based medical information to increase the accessibility of expert opinions.…”

Section: Introduction (mentioning)

confidence: 99%

“…This mimics direct access to an expert by providing timely and accurate responses to users' queries, allowing them to access evidence-based information in real-time. These QA systems have been applied for use in clinical decision support [25, 26], medical examinations [27, 28], consumer health questions [29] and to improve numerous health outcomes [24, 25], including sleep outcomes in university settings [30]. However, despite the QA system's abundance of knowledge, providing information to patients alone in this form is unlikely to be sufficient to promote behavioural change.…”

Section: Introduction (mentioning)

confidence: 99%

A pilot randomised controlled trial exploring the feasibility and efficacy of a human-AI sleep coaching model for improving sleep among university students

Liu, Ito, Ngo et al. 2024. DIGITAL HEALTH


Objective: Sleep quality is a crucial concern, particularly among youth. The integration of health coaching with question-answering (QA) systems presents the potential to foster behavioural changes and enhance health outcomes. This study proposes a novel human-AI sleep coaching model, combining health coaching by peers and a QA system, and assesses its feasibility and efficacy in improving university students’ sleep quality. Methods: In a four-week unblinded pilot randomised controlled trial, 59 university students (mean age: 21.9; 64% males) were randomly assigned to the intervention (health coaching and QA system; n = 30) or the control conditions (QA system; n = 29). Outcomes included efficacy of the intervention on sleep quality (Pittsburgh Sleep Quality Index; PSQI), objective and self-reported sleep measures (obtained from Fitbit and sleep diaries) and feasibility of the study procedures and the intervention. Results: Analysis revealed no significant differences in sleep quality (PSQI) between intervention and control groups (adjusted mean difference = −0.51, 95% CI: [−1.55, 0.77], p = 0.40). The intervention group demonstrated significant improvements in Fitbit measures of total sleep time (adjusted mean difference = 32.5, 95% CI: [5.9, 59.1], p = 0.02) and time in bed (adjusted mean difference = 32.3, 95% CI: [2.7, 61.9], p = 0.03) compared to the control group, although differences in other sleep measures were not significant. Adherence was high, with the majority of the intervention group attending all health coaching sessions. Most participants completed baseline and post-intervention self-report measures, all diary entries, and consistently wore Fitbits during sleep. Conclusions: The proposed model showed improvements in specific sleep measures for university students and the feasibility of the study procedures and intervention. Future research may extend the intervention period to see substantive sleep quality improvements.

