Methodological insights into ChatGPT's screening performance in systematic reviews.

AI Article screening ChatGPT GPT Large language model Radiology Systematic review

Journal

BMC medical research methodology
ISSN: 1471-2288
Titre abrégé: BMC Med Res Methodol
Pays: England
ID NLM: 100968545

Informations de publication

Date de publication:
27 Mar 2024
Historique:
received: 05 09 2023
accepted: 18 03 2024
medline: 28 3 2024
pubmed: 28 3 2024
entrez: 28 3 2024
Statut: epublish

Résumé

The screening process for systematic reviews and meta-analyses in medical research is a labor-intensive and time-consuming task. While machine learning and deep learning have been applied to facilitate this process, these methods often require training data and user annotation. This study aims to assess the efficacy of ChatGPT, a large language model based on the Generative Pretrained Transformers (GPT) architecture, in automating the screening process for systematic reviews in radiology without the need for training data. A prospective simulation study was conducted between May 2nd and 24th, 2023, comparing ChatGPT's performance in screening abstracts against that of general physicians (GPs). A total of 1198 abstracts across three subfields of radiology were evaluated. Metrics such as sensitivity, specificity, positive and negative predictive values (PPV and NPV), workload saving, and others were employed. Statistical analyses included the Kappa coefficient for inter-rater agreement, ROC curve plotting, AUC calculation, and bootstrapping for p-values and confidence intervals. ChatGPT completed the screening process within an hour, while GPs took an average of 7-10 days. The AI model achieved a sensitivity of 95% and an NPV of 99%, slightly outperforming the GPs' sensitive consensus (i.e., including records if at least one person includes them). It also exhibited remarkably low false negative counts and high workload savings, ranging from 40 to 83%. However, ChatGPT had lower specificity and PPV compared to human raters. The average Kappa agreement between ChatGPT and other raters was 0.27. ChatGPT shows promise in automating the article screening phase of systematic reviews, achieving high sensitivity and workload savings. While not entirely replacing human expertise, it could serve as an efficient first-line screening tool, particularly in reducing the burden on human resources. Further studies are needed to fine-tune its capabilities and validate its utility across different medical subfields.

Sections du résumé

BACKGROUND BACKGROUND
The screening process for systematic reviews and meta-analyses in medical research is a labor-intensive and time-consuming task. While machine learning and deep learning have been applied to facilitate this process, these methods often require training data and user annotation. This study aims to assess the efficacy of ChatGPT, a large language model based on the Generative Pretrained Transformers (GPT) architecture, in automating the screening process for systematic reviews in radiology without the need for training data.
METHODS METHODS
A prospective simulation study was conducted between May 2nd and 24th, 2023, comparing ChatGPT's performance in screening abstracts against that of general physicians (GPs). A total of 1198 abstracts across three subfields of radiology were evaluated. Metrics such as sensitivity, specificity, positive and negative predictive values (PPV and NPV), workload saving, and others were employed. Statistical analyses included the Kappa coefficient for inter-rater agreement, ROC curve plotting, AUC calculation, and bootstrapping for p-values and confidence intervals.
RESULTS RESULTS
ChatGPT completed the screening process within an hour, while GPs took an average of 7-10 days. The AI model achieved a sensitivity of 95% and an NPV of 99%, slightly outperforming the GPs' sensitive consensus (i.e., including records if at least one person includes them). It also exhibited remarkably low false negative counts and high workload savings, ranging from 40 to 83%. However, ChatGPT had lower specificity and PPV compared to human raters. The average Kappa agreement between ChatGPT and other raters was 0.27.
CONCLUSIONS CONCLUSIONS
ChatGPT shows promise in automating the article screening phase of systematic reviews, achieving high sensitivity and workload savings. While not entirely replacing human expertise, it could serve as an efficient first-line screening tool, particularly in reducing the burden on human resources. Further studies are needed to fine-tune its capabilities and validate its utility across different medical subfields.

Identifiants

pubmed: 38539117
doi: 10.1186/s12874-024-02203-8
pii: 10.1186/s12874-024-02203-8
doi:

Types de publication

Journal Article

Langues

eng

Sous-ensembles de citation

IM

Pagination

78

Informations de copyright

© 2024. The Author(s).

Références

LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.
doi: 10.1038/nature14539 pubmed: 26017442
Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv pre-print server. 2016.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. 2017.
Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805. 2018.
Alec R, Karthik N. Improving language understanding by generative pre-training. 2018.
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21(1).
Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language Models are few-shot learners. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. 2020. p. 1877–901.
Alec R, Jeff W, Rewon C, David L, Dario A, Ilya S. Language models are unsupervised multitask learners. 2019.
Bubeck S, Chandrasekaran V, Eldan R, Gehrke J, Horvitz E, Kamar E, et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. 2023.
Alessandro L, Douglas GA, Jennifer Marie T, Cynthia DM, Peter Christian G, John PAI, et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. PLoS Med. 2009;6.
Byron CW, Kevin S, Carla EB, Joseph L, Thomas AT. Deploying an interactive machine learning system in an evidence-based practice center: abstrackr. 2012.
Ana Helena Salles dos R, Ana Luiza Miranda de O, Carolina F, James Z, Paulo F, Janaine Cunha P. Usefulness of machine learning softwares to screen titles of systematic reviews: a methodological study. Syst Rev. 2023;12.
Kevin EKC, Robin LJL, Daniel FG, Leo N. Research Screener: a machine learning tool to semi-automate abstract screening for systematic reviews. Syst Rev. 2021;10.
Amir V, Mana M, Amin N-A, Seyed Hossein Hosseini A, Mehrnush Saghab T, Reyhaneh A, et al. Abstract screening using the automated tool Rayyan: results of effectiveness in three diagnostic test accuracy systematic reviews. BMC Med Res Methodol. 2022;22.
The EndNote Team. EndNote. EndNote X9 ed. Philadelphia, PA: Clarivate; 2013.
McKinney W. Data Structures for Statistical Computing in Python2010. 56–61 p.
Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585(7825):357–62.
doi: 10.1038/s41586-020-2649-2 pubmed: 32939066 pmcid: 7759461
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in P ython. J Mach Learn Res. 2011;12:2825–30.
Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9:90–5.
doi: 10.1109/MCSE.2007.55
Waskom M. seaborn: statistical data visualization. J Open Source Softw. 2021;6:3021.
doi: 10.21105/joss.03021
Viera AJ, Garrett JM. Understanding interobserver agreement: the kappa statistic. Fam Med. 2005;37(5):360–3.
pubmed: 15883903
Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27(8):861–74.
doi: 10.1016/j.patrec.2005.10.010
Schisterman EF, Perkins NJ, Liu A, Bondell H. Optimal cut-point and its corresponding Youden Index to discriminate individuals using pooled blood samples. Epidemiology. 2005;16(1):73–81.
doi: 10.1097/01.ede.0000147512.81966.ba pubmed: 15613948
Jaccard index: Wikipedia; 2023. updated 2023, May 21. Available from: https://en.wikipedia.org/wiki/Jaccard_index .
Efron B, Tibshirani RJ. An Introduction to the Bootstrap. 1st ed. New York: Chapman and Hall/CRC; 1994.
doi: 10.1201/9780429246593
Rayyan—a web and mobile app for systematic reviews. Syst Rev. 2016;5(1):210.
Wallace BC, Small K, Brodley CE, Lau J, Trikalinos TA, editors. Deploying an interactive machine learning system in an evidence-based practice center: abstrackr. International Health Informatics Symposium; 2012.
Kahili-Heede MK, Hillgren KJ. Colandr. J Med Library Assoc. 2021;109:523–5.
dos Reis AHS, de Oliveira ALM, Fritsch C, Zouch J, Ferreira P, Polese JC. Usefulness of machine learning softwares to screen titles of systematic reviews: a methodological study. Syst Rev. 2023;12.
Bannach-Brown A, Przybyła P, Thomas J, Rice ASC, Ananiadou S, Liao J, Macleod MR. Machine learning algorithms for systematic review: reducing workload in a preclinical review of animal studies and reducing human screening error. Syst Rev. 2019;8(1):23.
doi: 10.1186/s13643-019-0942-7 pubmed: 30646959 pmcid: 6334440
Gates A, Johnson C, Hartling L. Technology-assisted title and abstract screening for systematic reviews: a retrospective evaluation of the Abstrackr machine learning tool. Syst Rev. 2018;7.
Rathbone J, Hoffmann TC, Glasziou PP. Faster title and abstract screening? Evaluating Abstrackr, a semi-automated online screening program for systematic reviewers. Syst Rev. 2015;4.
Gates A, Guitard S, Pillay J, Elliott SA, Dyson MP, Newton AS, Hartling L. Performance and usability of machine learning for screening in systematic reviews: a comparative evaluation of three tools. Syst Rev. 2019;8(1):278.
doi: 10.1186/s13643-019-1222-2 pubmed: 31727150 pmcid: 6857345
Methley A, Campbell SM, Chew‐Graham CA, McNally R, Cheraghi-Sohi S. PICO, PICOS and SPIDER: a comparison study of specificity and sensitivity in three search tools for qualitative systematic reviews. BMC Health Serv Res. 2014;14.
Booth A. Clear and present questions: formulating questions for evidence based practice. Library Hi Tech. 2006;24(3):355–68.
doi: 10.1108/07378830610692127
Wildridge V, Bell L. How CLIP became ECLIPSE: a mnemonic to assist in searching for health policy/management information. Health Info Libr J. 2002;19(2):113–5.
doi: 10.1046/j.1471-1842.2002.00378.x pubmed: 12389609
Wang B, Deng X, Sun H, editors. Iteratively Prompt Pre-trained Language Models for Chain of Thought2022 December; Abu Dhabi, United Arab Emirates: Association for Computational Linguistics.

Auteurs

Mahbod Issaiy (M)

Advanced Diagnostic and Interventional Radiology Research Center (ADIR), Tehran University of Medical Science, Tehran, Iran.

Hossein Ghanaati (H)

Advanced Diagnostic and Interventional Radiology Research Center (ADIR), Tehran University of Medical Science, Tehran, Iran.

Shahriar Kolahi (S)

Advanced Diagnostic and Interventional Radiology Research Center (ADIR), Tehran University of Medical Science, Tehran, Iran.

Madjid Shakiba (M)

Advanced Diagnostic and Interventional Radiology Research Center (ADIR), Tehran University of Medical Science, Tehran, Iran.

Amir Hossein Jalali (AH)

Advanced Diagnostic and Interventional Radiology Research Center (ADIR), Tehran University of Medical Science, Tehran, Iran.

Diana Zarei (D)

Advanced Diagnostic and Interventional Radiology Research Center (ADIR), Tehran University of Medical Science, Tehran, Iran.

Sina Kazemian (S)

Cardiac Primary Prevention Research Center, Cardiovascular Diseases Research Institute, Tehran University of Medical Sciences, Tehran, Iran.

Mahsa Alborzi Avanaki (MA)

Advanced Diagnostic and Interventional Radiology Research Center (ADIR), Tehran University of Medical Science, Tehran, Iran.

Kavous Firouznia (K)

Advanced Diagnostic and Interventional Radiology Research Center (ADIR), Tehran University of Medical Science, Tehran, Iran. k_firouznia@yahoo.com.

Classifications MeSH