Sensitivity and Specificity of Using GPT-3.5 Turbo Models for Title and Abstract Screening in Systematic Reviews and Meta-analyses.


Journal

Annals of Internal Medicine
ISSN: 1539-3704
Abbreviated title: Ann Intern Med
Country: United States
NLM ID: 0372351

Publication information

Publication date:
21 May 2024
History:
medline: 20 May 2024
pubmed: 20 May 2024
entrez: 20 May 2024
Status: aheadofprint

Abstract

BACKGROUND
Systematic reviews are performed manually despite the exponential growth of scientific literature.
OBJECTIVE
To investigate the sensitivity and specificity of GPT-3.5 Turbo, from OpenAI, as a single reviewer, for title and abstract screening in systematic reviews.
DESIGN
Diagnostic test accuracy study.
SETTING
Unannotated bibliographic databases from 5 systematic reviews representing 22 665 citations.
PARTICIPANTS
None.
MEASUREMENTS
A generic prompt framework to instruct GPT to perform title and abstract screening was designed. The output of the model was compared with decisions from authors under 2 rules. The first rule balanced sensitivity and specificity, for example, to act as a second reviewer. The second rule optimized sensitivity, for example, to reduce the number of citations to be manually screened.
RESULTS
Under the balanced rule, sensitivities ranged from 81.1% to 96.5% and specificities ranged from 25.8% to 80.4%. Across all reviews, GPT identified 7 of 708 citations (1%) missed by humans that should have been included after full-text screening, at the cost of 10 279 of 22 665 false-positive recommendations (45.3%) that would require reconciliation during the screening process. Under the sensitive rule, sensitivities ranged from 94.6% to 99.8% and specificities ranged from 2.2% to 46.6%. Limiting manual screening to citations not ruled out by GPT could reduce the number of citations to screen by between 127 of 6334 (2%) and 1851 of 4077 (45.4%), at the cost of missing from 0 to 1 of 26 citations (3.8%) at the full-text level.
LIMITATIONS
Time needed to fine-tune the prompt, retrospective nature of the study, convenience sample of 5 systematic reviews, and GPT performance sensitive to prompt development and time.
CONCLUSION
The GPT-3.5 Turbo model may be used as a second reviewer for title and abstract screening, at the cost of additional work to reconcile added false positives. It also showed potential to reduce the number of citations before screening by humans, at the cost of missing some citations at the full-text level.
PRIMARY FUNDING SOURCE
None.
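
The abstract reports sensitivity and specificity of the model's screening decisions against the review authors' decisions. As an illustration only, the sketch below shows how such figures can be computed from paired include/exclude decisions, and how a proportion such as "1851 of 4077 (45.4%)" is derived. The function names and the toy decision lists are hypothetical and are not taken from the study; only the counts passed to `percent` in the usage note come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int = 0  # model includes, reference includes
    fp: int = 0  # model includes, reference excludes
    fn: int = 0  # model excludes, reference includes
    tn: int = 0  # model excludes, reference excludes


def tally(model_includes, reference_includes):
    """Count agreements and disagreements between two lists of booleans."""
    if len(model_includes) != len(reference_includes):
        raise ValueError("decision lists must have the same length")
    counts = ConfusionCounts()
    for model, reference in zip(model_includes, reference_includes):
        if model and reference:
            counts.tp += 1
        elif model and not reference:
            counts.fp += 1
        elif not model and reference:
            counts.fn += 1
        else:
            counts.tn += 1
    return counts


def sensitivity(c):
    """Share of reference-included citations that the model also included."""
    return c.tp / (c.tp + c.fn)


def specificity(c):
    """Share of reference-excluded citations that the model also excluded."""
    return c.tn / (c.tn + c.fp)


def percent(part, whole):
    """Proportion expressed as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Hypothetical toy data: True means "include", False means "exclude".
model_decisions = [True, True, False, True, False, False]
author_decisions = [True, False, False, True, True, False]
counts = tally(model_decisions, author_decisions)
```

With the toy data, `sensitivity(counts)` and `specificity(counts)` are both 2/3. Applied to counts quoted in the abstract, `percent(1851, 4077)` gives 45.4 and `percent(1, 26)` gives 3.8.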

Identifiers

pubmed: 38768452
doi: 10.7326/M23-3389

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Authors

Viet-Thi Tran (VT)

Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAe, Centre for Research in Epidemiology and Statistics (CRESS), Paris; and Centre d'Epidémiologie Clinique, Hôpital Hôtel-Dieu, AP-HP, Paris, France (V.-T.T.).

Gerald Gartlehner (G)

Department for Evidence-based Medicine and Evaluation, University for Continuing Education Krems, Krems, Austria; and Center for Public Health Methods, RTI International, Research Triangle Park, North Carolina (G.G.).

Sally Yaacoub (S)

Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAe, Centre for Research in Epidemiology and Statistics (CRESS), Paris, France (S.Y., F.A.).

Isabelle Boutron (I)

Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAe, Centre for Research in Epidemiology and Statistics (CRESS), Paris, France; and Centre d'Epidémiologie Clinique, Hôpital Hôtel-Dieu, AP-HP, Paris, France (I.B.).

Lukas Schwingshackl (L)

Institute for Evidence in Medicine, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany (L.S., J.S., J.M.).

Julia Stadelmaier (J)

Institute for Evidence in Medicine, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany (L.S., J.S., J.M.).

Isolde Sommer (I)

Department for Evidence-based Medicine and Evaluation, University for Continuing Education Krems, Krems, Austria (I.S.).

Farzaneh Alebouyeh (F)

Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAe, Centre for Research in Epidemiology and Statistics (CRESS), Paris, France (S.Y., F.A.).

Sivem Afach (S)

Epidemiology in Dermatology and Evaluation of Therapeutics (EpiDermE)-EA 7379, University Paris Est Créteil Val de Marne, Créteil, France (S.A.).

Joerg Meerpohl (J)

Institute for Evidence in Medicine, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany (L.S., J.S., J.M.).

Philippe Ravaud (P)

Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAe, Centre for Research in Epidemiology and Statistics (CRESS), Paris, France; Centre d'Epidémiologie Clinique, Hôpital Hôtel-Dieu, AP-HP, Paris, France; and Department of Epidemiology, Columbia University Mailman School of Public Health, New York, New York (P.R.).

MeSH classifications