Large language models encode clinical knowledge.


Journal

Nature
ISSN: 1476-4687
Abbreviated title: Nature
Country: England
NLM ID: 0410462

Publication information

Publication date:
Aug 2023
History:
received: 25 01 2023
accepted: 05 06 2023
medline: 4 8 2023
pubmed: 13 7 2023
entrez: 12 7 2023
Status: ppublish

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model

Identifiers

pubmed: 37438534
doi: 10.1038/s41586-023-06291-2
pii: 10.1038/s41586-023-06291-2
pmc: PMC10396962

Publication types

Comparative Study
Journal Article

Languages

eng

Citation subsets

IM

Pagination

172-180

Comments and corrections

Type: ErratumIn

Copyright information

© 2023. The Author(s).

Authors

Karan Singhal (K)

Google Research, Mountain View, CA, USA. karansinghal@google.com.

Shekoofeh Azizi (S)

Google Research, Mountain View, CA, USA. shekazizi@google.com.

Tao Tu (T)

Google Research, Mountain View, CA, USA.

S Sara Mahdavi (SS)

Google Research, Mountain View, CA, USA.

Jason Wei (J)

Google Research, Mountain View, CA, USA.

Hyung Won Chung (HW)

Google Research, Mountain View, CA, USA.

Nathan Scales (N)

Google Research, Mountain View, CA, USA.

Ajay Tanwani (A)

Google Research, Mountain View, CA, USA.

Heather Cole-Lewis (H)

Google Research, Mountain View, CA, USA.

Stephen Pfohl (S)

Google Research, Mountain View, CA, USA.

Perry Payne (P)

Google Research, Mountain View, CA, USA.

Martin Seneviratne (M)

Google Research, Mountain View, CA, USA.

Paul Gamble (P)

Google Research, Mountain View, CA, USA.

Chris Kelly (C)

Google Research, Mountain View, CA, USA.

Abubakr Babiker (A)

Google Research, Mountain View, CA, USA.

Nathanael Schärli (N)

Google Research, Mountain View, CA, USA.

Aakanksha Chowdhery (A)

Google Research, Mountain View, CA, USA.

Philip Mansfield (P)

Google Research, Mountain View, CA, USA.

Dina Demner-Fushman (D)

National Library of Medicine, Bethesda, MD, USA.

Blaise Agüera Y Arcas (B)

Google Research, Mountain View, CA, USA.

Dale Webster (D)

Google Research, Mountain View, CA, USA.

Greg S Corrado (GS)

Google Research, Mountain View, CA, USA.

Yossi Matias (Y)

Google Research, Mountain View, CA, USA.

Katherine Chou (K)

Google Research, Mountain View, CA, USA.

Juraj Gottweis (J)

Google Research, Mountain View, CA, USA.

Yun Liu (Y)

Google Research, Mountain View, CA, USA.

Alvin Rajkomar (A)

Google Research, Mountain View, CA, USA.

Joelle Barral (J)

Google Research, Mountain View, CA, USA.

Christopher Semturs (C)

Google Research, Mountain View, CA, USA.

Alan Karthikesalingam (A)

Google Research, Mountain View, CA, USA. alankarthi@google.com.

Vivek Natarajan (V)

Google Research, Mountain View, CA, USA. natviv@google.com.
