Large language models encode clinical knowledge.

Benchmarking Bias Clinical Competence Comprehension Computer Simulation Datasets as Topic Knowledge Licensure Medicine / methods Natural Language Processing Patient Safety Physicians

Journal

Nature

ISSN: 1476-4687

Titre abrégé: Nature

Pays: England

ID NLM: 0410462

Informations de publication

Date de publication:
Aug 2023

Historique:

received: 25 01 2023

accepted: 05 06 2023

medline: 4 8 2023

pubmed: 13 7 2023

entrez: 12 7 2023

Statut: ppublish

Résumé

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model

Identifiants

DOI: 10.1038/s41586-023-06291-2 PMID: 37438534 PMC: PMC10396962

pubmed: 37438534

doi: 10.1038/s41586-023-06291-2

pii: 10.1038/s41586-023-06291-2

pmc: PMC10396962

doi:

Types de publication

Comparative Study Journal Article

Langues

eng

Sous-ensembles de citation

Pagination

172-180

Commentaires et corrections

Type : ErratumIn

Informations de copyright

Références

Chowdhery, A. et al. PaLM: scaling language modeling with pathways. Preprint at https://doi.org/10.48550/arXiv.2204.02311 (2022).

Chung, H. W. et al. Scaling instruction-finetuned language models. Preprint at https://doi.org/10.48550/arXiv.2210.11416 (2022).

Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).

doi: 10.3390/app11146421

Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning 248–260 (Proceedings of Machine Learning Research, 2022).

Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. Preprint at https://doi.org/10.48550/arXiv.1909.06146 (2019).

Hendrycks, D. et al. Measuring massive multitask language understanding. Preprint at https://doi.org/10.48550/arXiv.2009.03300 (2020).

Esteva, A. et al. Deep learning-enabled medical computer vision. NPJ Digit. Med. 4, 5 (2021).

doi: 10.1038/s41746-020-00376-2 pubmed: 33420381 pmcid: 7794558

Tomašev, N. et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nat. Protoc. 16, 2765–2787 (2021).

doi: 10.1038/s41596-021-00513-5 pubmed: 33953393

Yim, J. et al. Predicting conversion to wet age-related macular degeneration using deep learning. Nat. Med. 26, 892–899 (2020).

doi: 10.1038/s41591-020-0867-7 pubmed: 32424211

Lakkaraju, H., Slack, D., Chen, Y., Tan, C. & Singh, S. Rethinking explainability as a dialogue: a practitioner’s perspective. Preprint at https://doi.org/10.48550/arXiv.2202.01875 (2022).

Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at https://doi.org/10.48550/arXiv.2108.07258 (2021).

Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting of the Association for Computational Linguistics 311–318 (Association of Computational Machinery, 2002).

Ben Abacha, A., Agichtein, E., Pinter, Y. & Demner-Fushman, D. Overview of the medical question answering task at TREC 2017 LiveQA. TREC https://trec.nist.gov/pubs/trec26/papers/Overview-QA.pdf?ref=https://githubhelp.com (2017).

Abacha, A. B. et al. in Studies in Health Technology and Informatics (eds Ohno-Machado, L. & Séroussi, B.) 25–29 (IOS Press, 2019).

Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).

Wei, J. et al. Chain of thought prompting elicits reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2201.11903 (2022).

Wang, X. et al. Self-consistency improves chain of thought reasoning in language models. Preprint at https://doi.org/10.48550/arXiv.2203.11171 (2022).

Yasunaga, M. et al. Deep bidirectional language-knowledge graph pretraining. Preprint at https://doi.org/10.48550/arXiv.2210.09338 (2022).

Bolton, E. et al. Stanford CRFM introduces PubMedGPT 2.7B. Stanford University https://hai.stanford.edu/news/stanford-crfm-introduces-pubmedgpt-27b (2022).

Taylor, R. et al. Galactica: a large language model for science. Preprint at https://doi.org/10.48550/arXiv.2211.09085 (2022).

Luo, R. et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinformatics 23, bbac49 (2022).

doi: 10.1093/bib/bbac409

Lin, S., Hilton, J. & Evans, O. Teaching models to express their uncertainty in words. Preprint at https://doi.org/10.48550/arXiv.2205.14334 (2022).

Kadavath, S. et al. Language models (mostly) know what they know. Preprint at https://doi.org/10.48550/arXiv.2207.05221 (2022).

Tran, D. et al. Plex: towards reliability using pretrained large model extensions. Preprint at https://doi.org/10.48550/arXiv.2207.07411 (2022).

Feng, S. Y., Khetan, V., Sacaleanu, B., Gershman, A. & Hovy, E. CHARD: clinical health-aware reasoning across dimensions for text generation models. Preprint at https://doi.org/10.48550/arXiv.2210.04191 (2022).

Williams, T., Szekendi, M., Pavkovic, S., Clevenger, W. & Cerese, J. The reliability of ahrq common format harm scales in rating patient safety events. J. Patient Saf. 11, 52–59 (2015).

doi: 10.1097/PTS.0b013e3182948ef9 pubmed: 24080718

Walsh, K. E. et al. Measuring harm in healthcare: optimizing adverse event review. Med. Care 55, 436 (2017).

doi: 10.1097/MLR.0000000000000679 pubmed: 27906769 pmcid: 5352561

Wei, J. et al. Emergent abilities of large language models. Preprint at https://doi.org/10.48550/arXiv.2206.07682 (2022).

Kington, R. S. et al. Identifying credible sources of health information in social media: principles and attributes. NAM Perspectives https://doi.org/10.31478%2F202107a (2021).

Mandavilli, A. Medical journals blind to racism as health crisis, critics say. The New York Times https://www.nytimes.com/2021/06/02/health/jama-racism-bauchner.html (2021).

Shoemaker, S. J., Wolf, M. S. & Brach, C. Development of the patient education materials assessment tool (pemat): a new measure of understandability and actionability for print and audiovisual patient information. Patient Educ. Couns. 96, 395–403 (2014).

doi: 10.1016/j.pec.2014.05.027 pubmed: 24973195 pmcid: 5085258

Boateng, G. O., Neilands, T. B., Frongillo, E. A., Melgar-Quiñonez, H. R. & Young, S. L. Best practices for developing and validating scales for health, social, and behavioral research: a primer. Front. Public Health 6, 149 (2018).

doi: 10.3389/fpubh.2018.00149 pubmed: 29942800 pmcid: 6004510

Hooker, S. Moving beyond “algorithmic bias is a data problem”. Patterns 2, 100241 (2021).

doi: 10.1016/j.patter.2021.100241 pubmed: 33982031 pmcid: 8085589

Chen, I. Y. et al. Ethical machine learning in healthcare. Annu. Rev. Biomed. Data Sci. 4, 123–144 (2021).

doi: 10.1146/annurev-biodatasci-092820-114757 pubmed: 34396058 pmcid: 8362902

Eneanya, N. D. et al. Health inequities and the inappropriate use of race in nephrology. Nat. Rev. Nephrol. 18, 84–94 (2022).

doi: 10.1038/s41581-021-00501-8 pubmed: 34750551

Vyas, L. G., Eisenstein, L. G. & Jones, D. S. Hidden in plain sight-reconsidering the use of race correction in clinical algorithms. N. Engl. J. Med. 383, 874–882 (2020).

doi: 10.1056/NEJMms2004740 pubmed: 32853499

Weidinger, L. et al. Ethical and social risks of harm from language models. Preprint at https://doi.org/10.48550/arXiv.2112.04359 (2021).

Liang, P. et al. Holistic evaluation of language models. Preprint at https://doi.org/10.48550/arXiv.2211.09110 (2022).

Liu, X. et al. The medical algorithmic audit. Lancet Digit. Health 4, e384–e397 (2022).

doi: 10.1016/S2589-7500(22)00003-6 pubmed: 35396183

Raji, I. D. et al. Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proc. 2020 Conference on Fairness, Accountability, and Transparency 33–44 (Association for Computing Machinery, 2020).

Rostamzadeh, N. et al. Healthsheet: development of a transparency artifact for health datasets. Preprint at https://doi.org/10.48550/arXiv.2202.13028 (2022).

Gebru, T. et al. Datasheets for datasets. Commun. ACM 64, 86–92 (2021).

doi: 10.1145/3458723

Mitchell, M. et al. Model cards for model reporting. In Proc. conference on Fairness, Accountability, and Transparency 220–229 (Association for Computing Machinery, 2019).

Garg, S. et al. Counterfactual fairness in text classification through robustness. In Proc. 2019 AAAI/ACM Conference on AI, Ethics, and Society 219–226 (Association for Computing Machinery, 2019).

Prabhakaran, V., Hutchinson, B. & Mitchell, M. Perturbation sensitivity analysis to detect unintended model biases. Preprint at https://doi.org/10.48550/arXiv.1910.04210 (2019).

Zhang, H., Lu, A. X., Abdalla, M., McDermott, M. & Ghassemi, M. Hurtful words: quantifying biases in clinical contextual word embeddings. In Proc. ACM Conference on Health, Inference, and Learning 110–120 (Association for Computing Machinery, 2020).

Matheny, M., Israni, S. T., Ahmed, M. & Whicher, D. eds. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril (National Academy of Medicine, 2022).

The White House Office of Science and Technology Policy. Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf (The White House, 2022).

Ethics and Governance of Artificial Intelligence for Health. WHO Guidance (World Health Organization, 2021).

Bommasani, R., Liang, P. & Lee, T. Language models are changing AI: the need for holistic evaluation. Stanford University https://crfm.stanford.edu/2022/11/17/helm.html (2022).

Pampari, A., Raghavan, P., Liang, J. & Peng, J. emrQA: a large corpus for question answering on electronic medical records. Preprint at https://doi.org/10.48550/arXiv.1809.00732 (2018).

Tsatsaronis, G. et al. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 138 (2015).

doi: 10.1186/s12859-015-0564-6 pubmed: 25925131 pmcid: 4450488

Morgado, F. F., Meireles, J. F., Neves, C., Amaral, A. & Ferreira, M. E. Scale development: ten main limitations and recommendations to improve future research practices. Psic. Reflex. Crit. 30, 5 (2017).

doi: 10.1186/s41155-017-0059-7

Barham, P. et al. Pathways: asynchronous distributed dataflow for ML. Proc. Mach. Learn. Syst. 4, 430–449 (2022).

Thoppilan, R. et al. Lamda: language models for dialog applications. Preprint at https://doi.org/10.48550/arXiv.2201.08239 (2022).

Du, N. et al. Glam: efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning 5547–5569 (PMLR, 2022).

Srivastava, A. et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Preprint at https://doi.org/10.48550/arXiv.2206.04615 (2022).

Clark, J. H. et al. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages. Trans. Assoc. Comput. Linguist. 8, 454–470 (2020).

doi: 10.1162/tacl_a_00317

Lester, B., Al-Rfou, R. & Constant, N. The power of scale for parameter-efficient prompt tuning. Preprint at https://doi.org/10.48550/arXiv.2104.08691 (2021).

Nye, M. et al. Show your work: scratchpads for intermediate computation with language models. Preprint at https://doi.org/10.48550/arXiv.2112.00114 (2021).

Zhou, D. et al. Least-to-most prompting enables complex reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2205.10625 (2022).

Cobbe, K. et al. Training verifiers to solve math word problems. Preprint at https://doi.org/10.48550/arXiv.2110.14168 (2021).

Lewkowycz, A. et al. Solving quantitative reasoning problems with language models. Preprint at https://doi.org/10.48550/arXiv.2206.14858 (2022).

Ackley, D. H., Hinton, G. E. & Sejnowski, T. J. A learning algorithm for boltzmann machines. Cogn. Sci. 9, 147–169 (1985).

doi: 10.1207/s15516709cog0901_7

Ficler, J. & Goldberg, Y. Controlling linguistic style aspects in neural language generation. Preprint at https://doi.org/10.48550/arXiv.1707.02633 (2017).

Li, X. L. & Liang, P. Prefix-tuning: optimizing continuous prompts for generation. Preprint at https://doi.org/10.48550/arXiv.2101.00190 (2021).

Wei, J. et al. Finetuned language models are zero-shot learners. Preprint at https://doi.org/10.48550/arXiv.2109.01652 (2021).

Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. Preprint at https://doi.org/10.48550/arXiv.2107.13586 (2021).

Liu, X. et al. GPT understands, too. Preprint at https://doi.org/10.48550/arXiv.2103.10385 (2021).

Han, X., Zhao, W., Ding, N., Liu, Z. & Sun, M. PTR: prompt tuning with rules for text classification. AI Open 3, 182–192 (2022).

doi: 10.1016/j.aiopen.2022.11.003

Gu, Y., Han, X., Liu, Z. & Huang, M. PPT: Pre-trained prompt tuning for few-shot learning. Preprint at https://doi.org/10.48550/arXiv.2109.04332 (2021).

Ye, S., Jang, J., Kim, D., Jo, Y. & Seo, M. Retrieval of soft prompt enhances zero-shot task generalization. Preprint at https://doi.org/10.48550/arXiv.2210.03029 (2022).

Hoffmann, J. et al. Training compute-optimal large language models. Preprint at https://doi.org/10.48550/arXiv.2203.15556 (2022).

Scao, T. L. et al. BLOOM: a 176B-parameter open-access multilingual language model. Preprint at https://doi.org/10.48550/arXiv.2211.05100 (2022).

Rae, J. W. et al. Scaling language models: methods, analysis & insights from training Gopher. Preprint at https://doi.org/10.48550/arXiv.2112.11446 (2021).

Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1–67 (2020).

Zhang, S. et al. OPT: open pre-trained transformer language models. Preprint at https://doi.org/10.48550/arXiv.2205.01068 (2022).

Vaswani, A. et al. Attention is all you need. In 31st Conference on Neural Information Processing Systems (Association of Computational Machinery, 2017).

Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://doi.org/10.48550/arXiv.2001.08361 (2020).

Lampinen, A. K. et al. Can language models learn from explanations in context? Preprint at https://doi.org/10.48550/arXiv.2204.02329 (2022).

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners. Preprint at https://doi.org/10.48550/arXiv.2205.11916 (2022).

Joshi, M., Choi, E., Weld, D. S. & Zettlemoyer, L. TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. Preprint at https://doi.org/10.48550/arXiv.1705.03551 (2017).

Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. Preprint at https://doi.org/10.48550/arXiv.1903.10676 (2019).

Lewis, P., Ott, M., Du, J. & Stoyanov, V. Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art. In Proc. 3rd Clinical Natural Language Processing Workshop (eds Roberts, K., Bethard, S. & Naumann, T.) 146–157 (Association for Computational Linguistics, 2020).

Shin, H.-C. et al. BioMegatron: larger biomedical domain language model. Preprint at https://doi.org/10.48550/arXiv.2010.06060 (2020).

Lee, J. et al. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).

doi: 10.1093/bioinformatics/btz682 pubmed: 31501885

Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 2 (2021).

Papanikolaou, Y. & Pierleoni, A. DARE: data augmented relation extraction with GPT-2. Preprint at https://doi.org/10.48550/arXiv.2004.13845 (2020).

Hong, Z. et al. The diminishing returns of masked language models to science. Preprint at https://doi.org/10.48550/arXiv.2205.11342 (2023).

Korngiebel, D. M. & Mooney, S. D. Considering the possibilities and pitfalls of generative pre-trained transformer 3 (GPT-3) in healthcare delivery. NPJ Digit. Med. 4, 93 (2021).

doi: 10.1038/s41746-021-00464-x pubmed: 34083689 pmcid: 8175735

Sezgin, E., Sirrianni, J. & Linwood, S. L. et al. Operationalizing and implementing pretrained, large artificial intelligence linguistic models in the us health care system: outlook of generative pretrained transformer 3 (GPT-3) as a service model. JMIR Med. Informatics 10, e32875 (2022).

doi: 10.2196/32875

Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are zero-shot clinical information extractors. Preprint at https://doi.org/10.48550/arXiv.2205.12689 (2022).

Liévin, V., Hother, C. E. & Winther, O. Can large language models reason about medical questions? Preprint at https://doi.org/10.48550/arXiv.2207.08143 (2022).

Ouyang, L. et al. Training language models to follow instructions with human feedback. Preprint at https://doi.org/10.48550/arXiv.2203.02155 (2022).