A toolbox for surfacing health equity harms and biases in large language models.


Journal

Nature medicine
ISSN: 1546-170X
Abbreviated title: Nat Med
Country: United States
NLM ID: 9502015

Publication information

Publication date:
23 Sep 2024
History:
received: 26 03 2024
accepted: 20 08 2024
medline: 24 9 2024
pubmed: 24 9 2024
entrez: 23 9 2024
Status: aheadofprint

Abstract

Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.

Identifiers

pubmed: 39313595
doi: 10.1038/s41591-024-03258-2
pii: 10.1038/s41591-024-03258-2

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Copyright information

© 2024. The Author(s).

Authors

Stephen R Pfohl (SR)

Google Research, Mountain View, CA, USA. spfohl@google.com.

Heather Cole-Lewis (H)

Google Research, Mountain View, CA, USA. hcolelewis@google.com.

Rory Sayres (R)

Google Research, Mountain View, CA, USA.

Darlene Neal (D)

Google Research, Mountain View, CA, USA.

Mercy Asiedu (M)

Google Research, Mountain View, CA, USA.

Awa Dieng (A)

Google DeepMind, Mountain View, CA, USA.

Nenad Tomasev (N)

Google DeepMind, Mountain View, CA, USA.

Qazi Mamunur Rashid (QM)

Google Research, Mountain View, CA, USA.

Shekoofeh Azizi (S)

Google DeepMind, Mountain View, CA, USA.

Negar Rostamzadeh (N)

Google Research, Mountain View, CA, USA.

Liam G McCoy (LG)

University of Alberta, Edmonton, Alberta, Canada.

Leo Anthony Celi (LA)

Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA.
Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA.
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.

Yun Liu (Y)

Google Research, Mountain View, CA, USA.

Mike Schaekermann (M)

Google Research, Mountain View, CA, USA.

Alanna Walton (A)

Google DeepMind, Mountain View, CA, USA.

Alicia Parrish (A)

Google DeepMind, Mountain View, CA, USA.

Chirag Nagpal (C)

Google Research, Mountain View, CA, USA.

Preeti Singh (P)

Google Research, Mountain View, CA, USA.

Akeiylah Dewitt (A)

Google Research, Mountain View, CA, USA.

Philip Mansfield (P)

Google DeepMind, Mountain View, CA, USA.

Sushant Prakash (S)

Google Research, Mountain View, CA, USA.

Katherine Heller (K)

Google Research, Mountain View, CA, USA.

Alan Karthikesalingam (A)

Google Research, Mountain View, CA, USA.

Christopher Semturs (C)

Google Research, Mountain View, CA, USA.

Joelle Barral (J)

Google DeepMind, Mountain View, CA, USA.

Greg Corrado (G)

Google Research, Mountain View, CA, USA.

Yossi Matias (Y)

Google Research, Mountain View, CA, USA.

Jamila Smith-Loud (J)

Google Research, Mountain View, CA, USA.

Ivor Horn (I)

Google Research, Mountain View, CA, USA.

Karan Singhal (K)

Google Research, Mountain View, CA, USA.

MeSH classifications