A toolbox for surfacing health equity harms and biases in large language models.
Journal
Nature medicine
ISSN: 1546-170X
Abbreviated title: Nat Med
Country: United States
NLM ID: 9502015
Publication information
Publication date:
23 Sep 2024
History:
received: 26 Mar 2024
accepted: 20 Aug 2024
medline: 24 Sep 2024
pubmed: 24 Sep 2024
entrez: 23 Sep 2024
Status:
aheadofprint
Abstract
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.
Identifiers
pubmed: 39313595
doi: 10.1038/s41591-024-03258-2
pii: 10.1038/s41591-024-03258-2
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Copyright information
© 2024. The Author(s).