Advancing health coaching: A comparative study of large language model and health coaches.

AI health coaching; Human evaluation; Q/A system; Retrieval-augmented generation; Sleep

Journal

Artificial intelligence in medicine
ISSN: 1873-2860
Abbreviated title: Artif Intell Med
Country: Netherlands
NLM ID: 8915031

Publication information

Publication date:
19 Oct 2024
History:
received: 02 04 2024
revised: 26 09 2024
accepted: 16 10 2024
medline: 26 10 2024
pubmed: 26 10 2024
entrez: 25 10 2024
Status: aheadofprint

Abstract

Recent advances in large language models (LLMs) offer opportunities to automate health coaching. With their zero-shot learning ability, LLMs could revolutionize health coaching by improving accessibility, scalability, and customization. This study compares the quality of responses to clients' sleep-related questions provided by health coaches and by an LLM. From a de-identified dataset of coaching conversations from a pilot randomized controlled trial, we extracted 100 question-answer pairs comprising client questions and the corresponding health coach responses. The questions were entered into a retrieval-augmented generation (RAG)-enabled open-source LLM (LLaMa-2-7b-chat) to generate LLM responses. Of the 100 question-answer pairs, 90 were assigned to three groups of evaluators: experts, lay-users, and GPT-4. Each group conducted two evaluation tasks: (Task 1) a single-response quality assessment on five criteria (accuracy, readability, helpfulness, empathy, and likelihood of harm), each rated on a five-point Likert scale, and (Task 2) a pairwise comparison to choose the superior response in each pair. Inferential statistical methods, including paired and independent-sample t-tests, Pearson correlation, and chi-square tests, were used to address the study objective. Recognizing potential biases in human judgment, we used the remaining 10 question-answer pairs to assess inter-evaluator reliability among the human evaluators, quantified using the interclass correlation coefficient and percentage agreement. After exclusion of incomplete data, the analysis included 178 single-response evaluations (Task 1) and 83 pairwise comparisons (Task 2). Expert and GPT-4 assessments revealed no discernible differences between health coach and LLM responses across the five criteria. In contrast, lay-users rated LLM responses as significantly more helpful than those of human coaches (p < 0.05).
LLM responses were preferred in the majority (62.25%, n = 155) of the aggregate 249 assessments, with all three evaluator groups favoring LLM over health coach responses. While GPT-4 rated both health coach and LLM responses significantly higher than experts did in terms of readability, helpfulness, and empathy, its ratings of accuracy and likelihood of harm aligned with those of experts. Response length correlated positively with accuracy and empathy scores but negatively with readability across all evaluator groups. Expert and lay-user evaluators demonstrated moderate to high inter-evaluator reliability. Our findings are encouraging: the RAG-enabled LLM performed comparably to health coaches in the domain tested. As an initial step towards more sophisticated, adaptive, round-the-clock automated health coaching systems, this study calls for more extensive evaluation to support further model development and, eventually, potential clinical implementation.
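
The preference figure reported in the abstract (155 of 249 pairwise assessments) can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' analysis code: the function names are hypothetical, the chi-square test against an even 50/50 split is an assumed choice of test, and it treats the 249 assessments as independent even though they come from three evaluator groups rating the same pairs. The rating lists passed to `percent_agreement` are made-up placeholder data.

```python
import math


def preference_share(preferred, total):
    """Percentage of assessments in which a given response type was preferred."""
    return 100.0 * preferred / total


def chi_square_even_split(count_a, count_b):
    """Chi-square goodness-of-fit test of two counts against a 50/50 split.

    Returns (statistic, p_value). With one degree of freedom the upper-tail
    probability is erfc(sqrt(statistic / 2)), so no external library is needed.
    """
    expected = (count_a + count_b) / 2
    stat = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def percent_agreement(ratings_a, ratings_b):
    """Percentage of items on which two evaluators gave the same rating."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


if __name__ == "__main__":
    llm_preferred, total = 155, 249          # counts taken from the abstract
    coach_preferred = total - llm_preferred  # 94, derived by subtraction

    print(f"LLM preferred: {preference_share(llm_preferred, total):.2f}%")

    stat, p = chi_square_even_split(llm_preferred, coach_preferred)
    print(f"chi-square = {stat:.2f}, p = {p:.5f}")

    # Placeholder Likert ratings from two hypothetical evaluators.
    print(percent_agreement([5, 4, 3, 2], [5, 4, 2, 2]))
```

Percentage agreement is the simplest of the reliability measures named in the abstract; it does not correct for chance agreement, which is one reason it is usually reported alongside a correlation-based coefficient.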

Identifiers

pubmed: 39454500
pii: S0933-3657(24)00246-X
doi: 10.1016/j.artmed.2024.103004

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

103004

Copyright information

Copyright © 2024 Elsevier B.V. All rights reserved.

Conflict of interest statement

Declaration of competing interest: Frederick Sundram is on the Clinical Advisory Board for Clearhead, a digital ecosystem for promoting mental wellbeing. All other authors declared no known competing financial interests or personal relationships with individuals or organizations that could inappropriately influence the reported work.

Authors

Qi Chwen Ong (QC)

Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Rd, 308232, Singapore; School of Public Health, Imperial College London, 90 Wood Ln, London W12 0BZ, United Kingdom. Electronic address: qichwen.ong@ntu.edu.sg.

Chin-Siang Ang (CS)

Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Rd, 308232, Singapore.

Davidson Zun Yin Chee (DZY)

Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Rd, 308232, Singapore.

Ashwini Lawate (A)

Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Rd, 308232, Singapore.

Frederick Sundram (F)

Department of Psychological Medicine, Faculty of Medical and Health Sciences, University of Auckland, Auckland 1023, New Zealand.

Mayank Dalakoti (M)

Department of Cardiology, National University Heart Centre, 5 Lower Kent Ridge Rd, 119074, Singapore; Cardiovascular Metabolic Disease Translational Research Program, National University of Singapore, Singapore.

Leonardo Pasalic (L)

Haematology, Sydney Centres for Thrombosis and Haemostasis, Institute of Clinical Pathology and Medical Research (ICPMR), NSW Health Pathology, Westmead Hospital, Westmead, NSW, Australia; Westmead Clinical School, University of Sydney, Westmead, NSW, Australia.

Daniel To (D)

Department of Medicine, University of Wisconsin-Madison, Madison, WI, United States.

Tatiana Erlikh Fox (T)

Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Rd, 308232, Singapore; Onze Lieve Vrouwen Gasthuis, Jan Tooropstraat 164, 1061 AE Amsterdam, Netherlands.

Iva Bojic (I)

Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Rd, 308232, Singapore.

Josip Car (J)

Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Rd, 308232, Singapore; School of Life Course & Population Sciences, King's College London, Strand WC2R 2LS, United Kingdom.

MeSH classifications