Large language models encode clinical knowledge.
Journal
Nature
ISSN: 1476-4687
Titre abrégé: Nature
Pays: England
ID NLM: 0410462
Informations de publication
Date de publication:
Aug 2023
Aug 2023
Historique:
received:
25
01
2023
accepted:
05
06
2023
medline:
4
8
2023
pubmed:
13
7
2023
entrez:
12
7
2023
Statut:
ppublish
Résumé
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model
Identifiants
pubmed: 37438534
doi: 10.1038/s41586-023-06291-2
pii: 10.1038/s41586-023-06291-2
pmc: PMC10396962
doi:
Types de publication
Comparative Study
Journal Article
Langues
eng
Sous-ensembles de citation
IM
Pagination
172-180Commentaires et corrections
Type : ErratumIn
Informations de copyright
© 2023. The Author(s).
Références
Chowdhery, A. et al. PaLM: scaling language modeling with pathways. Preprint at https://doi.org/10.48550/arXiv.2204.02311 (2022).
Chung, H. W. et al. Scaling instruction-finetuned language models. Preprint at https://doi.org/10.48550/arXiv.2210.11416 (2022).
Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
doi: 10.3390/app11146421
Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning 248–260 (Proceedings of Machine Learning Research, 2022).
Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. Preprint at https://doi.org/10.48550/arXiv.1909.06146 (2019).
Hendrycks, D. et al. Measuring massive multitask language understanding. Preprint at https://doi.org/10.48550/arXiv.2009.03300 (2020).
Esteva, A. et al. Deep learning-enabled medical computer vision. NPJ Digit. Med. 4, 5 (2021).
doi: 10.1038/s41746-020-00376-2
pubmed: 33420381
pmcid: 7794558
Tomašev, N. et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nat. Protoc. 16, 2765–2787 (2021).
doi: 10.1038/s41596-021-00513-5
pubmed: 33953393
Yim, J. et al. Predicting conversion to wet age-related macular degeneration using deep learning. Nat. Med. 26, 892–899 (2020).
doi: 10.1038/s41591-020-0867-7
pubmed: 32424211
Lakkaraju, H., Slack, D., Chen, Y., Tan, C. & Singh, S. Rethinking explainability as a dialogue: a practitioner’s perspective. Preprint at https://doi.org/10.48550/arXiv.2202.01875 (2022).
Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at https://doi.org/10.48550/arXiv.2108.07258 (2021).
Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting of the Association for Computational Linguistics 311–318 (Association of Computational Machinery, 2002).
Ben Abacha, A., Agichtein, E., Pinter, Y. & Demner-Fushman, D. Overview of the medical question answering task at TREC 2017 LiveQA. TREC https://trec.nist.gov/pubs/trec26/papers/Overview-QA.pdf?ref=https://githubhelp.com (2017).
Abacha, A. B. et al. in Studies in Health Technology and Informatics (eds Ohno-Machado, L. & Séroussi, B.) 25–29 (IOS Press, 2019).
Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Wei, J. et al. Chain of thought prompting elicits reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2201.11903 (2022).
Wang, X. et al. Self-consistency improves chain of thought reasoning in language models. Preprint at https://doi.org/10.48550/arXiv.2203.11171 (2022).
Yasunaga, M. et al. Deep bidirectional language-knowledge graph pretraining. Preprint at https://doi.org/10.48550/arXiv.2210.09338 (2022).
Bolton, E. et al. Stanford CRFM introduces PubMedGPT 2.7B. Stanford University https://hai.stanford.edu/news/stanford-crfm-introduces-pubmedgpt-27b (2022).
Taylor, R. et al. Galactica: a large language model for science. Preprint at https://doi.org/10.48550/arXiv.2211.09085 (2022).
Luo, R. et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinformatics 23, bbac49 (2022).
doi: 10.1093/bib/bbac409
Lin, S., Hilton, J. & Evans, O. Teaching models to express their uncertainty in words. Preprint at https://doi.org/10.48550/arXiv.2205.14334 (2022).
Kadavath, S. et al. Language models (mostly) know what they know. Preprint at https://doi.org/10.48550/arXiv.2207.05221 (2022).
Tran, D. et al. Plex: towards reliability using pretrained large model extensions. Preprint at https://doi.org/10.48550/arXiv.2207.07411 (2022).
Feng, S. Y., Khetan, V., Sacaleanu, B., Gershman, A. & Hovy, E. CHARD: clinical health-aware reasoning across dimensions for text generation models. Preprint at https://doi.org/10.48550/arXiv.2210.04191 (2022).
Williams, T., Szekendi, M., Pavkovic, S., Clevenger, W. & Cerese, J. The reliability of ahrq common format harm scales in rating patient safety events. J. Patient Saf. 11, 52–59 (2015).
doi: 10.1097/PTS.0b013e3182948ef9
pubmed: 24080718
Walsh, K. E. et al. Measuring harm in healthcare: optimizing adverse event review. Med. Care 55, 436 (2017).
doi: 10.1097/MLR.0000000000000679
pubmed: 27906769
pmcid: 5352561
Wei, J. et al. Emergent abilities of large language models. Preprint at https://doi.org/10.48550/arXiv.2206.07682 (2022).
Kington, R. S. et al. Identifying credible sources of health information in social media: principles and attributes. NAM Perspectives https://doi.org/10.31478%2F202107a (2021).
Mandavilli, A. Medical journals blind to racism as health crisis, critics say. The New York Times https://www.nytimes.com/2021/06/02/health/jama-racism-bauchner.html (2021).
Shoemaker, S. J., Wolf, M. S. & Brach, C. Development of the patient education materials assessment tool (pemat): a new measure of understandability and actionability for print and audiovisual patient information. Patient Educ. Couns. 96, 395–403 (2014).
doi: 10.1016/j.pec.2014.05.027
pubmed: 24973195
pmcid: 5085258
Boateng, G. O., Neilands, T. B., Frongillo, E. A., Melgar-Quiñonez, H. R. & Young, S. L. Best practices for developing and validating scales for health, social, and behavioral research: a primer. Front. Public Health 6, 149 (2018).
doi: 10.3389/fpubh.2018.00149
pubmed: 29942800
pmcid: 6004510
Hooker, S. Moving beyond “algorithmic bias is a data problem”. Patterns 2, 100241 (2021).
doi: 10.1016/j.patter.2021.100241
pubmed: 33982031
pmcid: 8085589
Chen, I. Y. et al. Ethical machine learning in healthcare. Annu. Rev. Biomed. Data Sci. 4, 123–144 (2021).
doi: 10.1146/annurev-biodatasci-092820-114757
pubmed: 34396058
pmcid: 8362902
Eneanya, N. D. et al. Health inequities and the inappropriate use of race in nephrology. Nat. Rev. Nephrol. 18, 84–94 (2022).
doi: 10.1038/s41581-021-00501-8
pubmed: 34750551
Vyas, L. G., Eisenstein, L. G. & Jones, D. S. Hidden in plain sight-reconsidering the use of race correction in clinical algorithms. N. Engl. J. Med. 383, 874–882 (2020).
doi: 10.1056/NEJMms2004740
pubmed: 32853499
Weidinger, L. et al. Ethical and social risks of harm from language models. Preprint at https://doi.org/10.48550/arXiv.2112.04359 (2021).
Liang, P. et al. Holistic evaluation of language models. Preprint at https://doi.org/10.48550/arXiv.2211.09110 (2022).
Liu, X. et al. The medical algorithmic audit. Lancet Digit. Health 4, e384–e397 (2022).
doi: 10.1016/S2589-7500(22)00003-6
pubmed: 35396183
Raji, I. D. et al. Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proc. 2020 Conference on Fairness, Accountability, and Transparency 33–44 (Association for Computing Machinery, 2020).
Rostamzadeh, N. et al. Healthsheet: development of a transparency artifact for health datasets. Preprint at https://doi.org/10.48550/arXiv.2202.13028 (2022).
Gebru, T. et al. Datasheets for datasets. Commun. ACM 64, 86–92 (2021).
doi: 10.1145/3458723
Mitchell, M. et al. Model cards for model reporting. In Proc. conference on Fairness, Accountability, and Transparency 220–229 (Association for Computing Machinery, 2019).
Garg, S. et al. Counterfactual fairness in text classification through robustness. In Proc. 2019 AAAI/ACM Conference on AI, Ethics, and Society 219–226 (Association for Computing Machinery, 2019).
Prabhakaran, V., Hutchinson, B. & Mitchell, M. Perturbation sensitivity analysis to detect unintended model biases. Preprint at https://doi.org/10.48550/arXiv.1910.04210 (2019).
Zhang, H., Lu, A. X., Abdalla, M., McDermott, M. & Ghassemi, M. Hurtful words: quantifying biases in clinical contextual word embeddings. In Proc. ACM Conference on Health, Inference, and Learning 110–120 (Association for Computing Machinery, 2020).
Matheny, M., Israni, S. T., Ahmed, M. & Whicher, D. eds. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril (National Academy of Medicine, 2022).
The White House Office of Science and Technology Policy. Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf (The White House, 2022).
Ethics and Governance of Artificial Intelligence for Health. WHO Guidance (World Health Organization, 2021).
Bommasani, R., Liang, P. & Lee, T. Language models are changing AI: the need for holistic evaluation. Stanford University https://crfm.stanford.edu/2022/11/17/helm.html (2022).
Pampari, A., Raghavan, P., Liang, J. & Peng, J. emrQA: a large corpus for question answering on electronic medical records. Preprint at https://doi.org/10.48550/arXiv.1809.00732 (2018).
Tsatsaronis, G. et al. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 138 (2015).
doi: 10.1186/s12859-015-0564-6
pubmed: 25925131
pmcid: 4450488
Morgado, F. F., Meireles, J. F., Neves, C., Amaral, A. & Ferreira, M. E. Scale development: ten main limitations and recommendations to improve future research practices. Psic. Reflex. Crit. 30, 5 (2017).
doi: 10.1186/s41155-017-0059-7
Barham, P. et al. Pathways: asynchronous distributed dataflow for ML. Proc. Mach. Learn. Syst. 4, 430–449 (2022).
Thoppilan, R. et al. Lamda: language models for dialog applications. Preprint at https://doi.org/10.48550/arXiv.2201.08239 (2022).
Du, N. et al. Glam: efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning 5547–5569 (PMLR, 2022).
Srivastava, A. et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Preprint at https://doi.org/10.48550/arXiv.2206.04615 (2022).
Clark, J. H. et al. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages. Trans. Assoc. Comput. Linguist. 8, 454–470 (2020).
doi: 10.1162/tacl_a_00317
Lester, B., Al-Rfou, R. & Constant, N. The power of scale for parameter-efficient prompt tuning. Preprint at https://doi.org/10.48550/arXiv.2104.08691 (2021).
Nye, M. et al. Show your work: scratchpads for intermediate computation with language models. Preprint at https://doi.org/10.48550/arXiv.2112.00114 (2021).
Zhou, D. et al. Least-to-most prompting enables complex reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2205.10625 (2022).
Cobbe, K. et al. Training verifiers to solve math word problems. Preprint at https://doi.org/10.48550/arXiv.2110.14168 (2021).
Lewkowycz, A. et al. Solving quantitative reasoning problems with language models. Preprint at https://doi.org/10.48550/arXiv.2206.14858 (2022).
Ackley, D. H., Hinton, G. E. & Sejnowski, T. J. A learning algorithm for boltzmann machines. Cogn. Sci. 9, 147–169 (1985).
doi: 10.1207/s15516709cog0901_7
Ficler, J. & Goldberg, Y. Controlling linguistic style aspects in neural language generation. Preprint at https://doi.org/10.48550/arXiv.1707.02633 (2017).
Li, X. L. & Liang, P. Prefix-tuning: optimizing continuous prompts for generation. Preprint at https://doi.org/10.48550/arXiv.2101.00190 (2021).
Wei, J. et al. Finetuned language models are zero-shot learners. Preprint at https://doi.org/10.48550/arXiv.2109.01652 (2021).
Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. Preprint at https://doi.org/10.48550/arXiv.2107.13586 (2021).
Liu, X. et al. GPT understands, too. Preprint at https://doi.org/10.48550/arXiv.2103.10385 (2021).
Han, X., Zhao, W., Ding, N., Liu, Z. & Sun, M. PTR: prompt tuning with rules for text classification. AI Open 3, 182–192 (2022).
doi: 10.1016/j.aiopen.2022.11.003
Gu, Y., Han, X., Liu, Z. & Huang, M. PPT: Pre-trained prompt tuning for few-shot learning. Preprint at https://doi.org/10.48550/arXiv.2109.04332 (2021).
Ye, S., Jang, J., Kim, D., Jo, Y. & Seo, M. Retrieval of soft prompt enhances zero-shot task generalization. Preprint at https://doi.org/10.48550/arXiv.2210.03029 (2022).
Hoffmann, J. et al. Training compute-optimal large language models. Preprint at https://doi.org/10.48550/arXiv.2203.15556 (2022).
Scao, T. L. et al. BLOOM: a 176B-parameter open-access multilingual language model. Preprint at https://doi.org/10.48550/arXiv.2211.05100 (2022).
Rae, J. W. et al. Scaling language models: methods, analysis & insights from training Gopher. Preprint at https://doi.org/10.48550/arXiv.2112.11446 (2021).
Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1–67 (2020).
Zhang, S. et al. OPT: open pre-trained transformer language models. Preprint at https://doi.org/10.48550/arXiv.2205.01068 (2022).
Vaswani, A. et al. Attention is all you need. In 31st Conference on Neural Information Processing Systems (Association of Computational Machinery, 2017).
Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://doi.org/10.48550/arXiv.2001.08361 (2020).
Lampinen, A. K. et al. Can language models learn from explanations in context? Preprint at https://doi.org/10.48550/arXiv.2204.02329 (2022).
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners. Preprint at https://doi.org/10.48550/arXiv.2205.11916 (2022).
Joshi, M., Choi, E., Weld, D. S. & Zettlemoyer, L. TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. Preprint at https://doi.org/10.48550/arXiv.1705.03551 (2017).
Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. Preprint at https://doi.org/10.48550/arXiv.1903.10676 (2019).
Lewis, P., Ott, M., Du, J. & Stoyanov, V. Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art. In Proc. 3rd Clinical Natural Language Processing Workshop (eds Roberts, K., Bethard, S. & Naumann, T.) 146–157 (Association for Computational Linguistics, 2020).
Shin, H.-C. et al. BioMegatron: larger biomedical domain language model. Preprint at https://doi.org/10.48550/arXiv.2010.06060 (2020).
Lee, J. et al. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).
doi: 10.1093/bioinformatics/btz682
pubmed: 31501885
Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 2 (2021).
Papanikolaou, Y. & Pierleoni, A. DARE: data augmented relation extraction with GPT-2. Preprint at https://doi.org/10.48550/arXiv.2004.13845 (2020).
Hong, Z. et al. The diminishing returns of masked language models to science. Preprint at https://doi.org/10.48550/arXiv.2205.11342 (2023).
Korngiebel, D. M. & Mooney, S. D. Considering the possibilities and pitfalls of generative pre-trained transformer 3 (GPT-3) in healthcare delivery. NPJ Digit. Med. 4, 93 (2021).
doi: 10.1038/s41746-021-00464-x
pubmed: 34083689
pmcid: 8175735
Sezgin, E., Sirrianni, J. & Linwood, S. L. et al. Operationalizing and implementing pretrained, large artificial intelligence linguistic models in the us health care system: outlook of generative pretrained transformer 3 (GPT-3) as a service model. JMIR Med. Informatics 10, e32875 (2022).
doi: 10.2196/32875
Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are zero-shot clinical information extractors. Preprint at https://doi.org/10.48550/arXiv.2205.12689 (2022).
Liévin, V., Hother, C. E. & Winther, O. Can large language models reason about medical questions? Preprint at https://doi.org/10.48550/arXiv.2207.08143 (2022).
Ouyang, L. et al. Training language models to follow instructions with human feedback. Preprint at https://doi.org/10.48550/arXiv.2203.02155 (2022).