Methodological insights into ChatGPT's screening performance in systematic reviews.

Keywords: AI; Article screening; ChatGPT; GPT; Large language model; Radiology; Systematic review

Journal

BMC Medical Research Methodology

ISSN: 1471-2288

Abbreviated title: BMC Med Res Methodol

Country: England

NLM ID: 100968545

Publication information

Publication date:
27 Mar 2024

History:

received: 5 Sep 2023

accepted: 18 Mar 2024

medline: 28 Mar 2024

pubmed: 28 Mar 2024

entrez: 28 Mar 2024

Status: epublish

Abstract

BACKGROUND

The screening process for systematic reviews and meta-analyses in medical research is a labor-intensive and time-consuming task. While machine learning and deep learning have been applied to facilitate this process, these methods often require training data and user annotation. This study aims to assess the efficacy of ChatGPT, a large language model based on the Generative Pretrained Transformers (GPT) architecture, in automating the screening process for systematic reviews in radiology without the need for training data.

METHODS

A prospective simulation study was conducted between May 2nd and 24th, 2023, comparing ChatGPT's performance in screening abstracts against that of general physicians (GPs). A total of 1198 abstracts across three subfields of radiology were evaluated. Metrics such as sensitivity, specificity, positive and negative predictive values (PPV and NPV), workload saving, and others were employed. Statistical analyses included the Kappa coefficient for inter-rater agreement, ROC curve plotting, AUC calculation, and bootstrapping for p-values and confidence intervals.

RESULTS

ChatGPT completed the screening process within an hour, while GPs took an average of 7-10 days. The AI model achieved a sensitivity of 95% and an NPV of 99%, slightly outperforming the GPs' sensitive consensus (i.e., including records if at least one person includes them). It also exhibited remarkably low false negative counts and high workload savings, ranging from 40 to 83%. However, ChatGPT had lower specificity and PPV compared to human raters. The average Kappa agreement between ChatGPT and other raters was 0.27.

CONCLUSIONS

ChatGPT shows promise in automating the article screening phase of systematic reviews, achieving high sensitivity and workload savings. While not entirely replacing human expertise, it could serve as an efficient first-line screening tool, particularly in reducing the burden on human resources. Further studies are needed to fine-tune its capabilities and validate its utility across different medical subfields.
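The screening metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy labels, and in particular the workload-saving formula (the share of records the screener excludes, so that humans need not read them) are assumptions made for illustration, since the abstract does not define them. Labels use 1 for "include" and 0 for "exclude".

```python
from typing import Dict, List, Sequence


def sensitive_consensus(rater_votes: Sequence[Sequence[int]]) -> List[int]:
    """Include a record if at least one rater includes it (1 = include)."""
    return [int(any(votes)) for votes in zip(*rater_votes)]


def screening_metrics(truth: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix metrics for binary include/exclude decisions.

    Assumes every denominator is non-zero (both classes present in
    `truth` and in `pred`); real data would need guards for that.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(truth, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # ASSUMED definition: fraction of all records the screener excludes.
        "workload_saving": (tn + fn) / len(truth),
    }


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters giving binary decisions."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Chance agreement: both say "include" or both say "exclude".
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    gp_a = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
    gp_b = [0, 1, 0, 0, 0, 0, 0, 0, 1, 1]
    model = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
    reference = sensitive_consensus([gp_a, gp_b])
    print(screening_metrics(reference, model))
    print(cohen_kappa(model, gp_a))
```

In this sketch the sensitive consensus of the two hypothetical raters serves as the reference standard purely for convenience; the study's actual reference standard is not stated in the abstract.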

Identifiers

DOI: 10.1186/s12874-024-02203-8

PMID: 38539117

PII: 10.1186/s12874-024-02203-8

Publication types

Journal Article

Languages

eng

Authors

Mahbod Issaiy (M)

Hossein Ghanaati (H)

Shahriar Kolahi (S)

Madjid Shakiba (M)

Amir Hossein Jalali (AH)

Diana Zarei (D)

Sina Kazemian (S)

Mahsa Alborzi Avanaki (MA)

Kavous Firouznia (K)

MeSH classifications