An Approach to Identifying and Quantifying Bias in Biomedical Data.


Journal

Pacific Symposium on Biocomputing
ISSN: 2335-6936
Abbreviated title: Pac Symp Biocomput
Country: United States
NLM ID: 9711271

Publication information

Publication date:
2023
History:
entrez: 21 12 2022
pubmed: 22 12 2022
medline: 23 12 2022
Status: ppublish

Abstract

Data biases are a known impediment to the development of trustworthy machine learning models and to their application to many biomedical problems. When biased data are suspected, the assumption that the labeled data are representative of the population must be relaxed, and methods that exploit unlabeled data, which are typically representative, must be developed. To mitigate the adverse effects of unrepresentative data, we consider a binary semi-supervised setting and focus on identifying whether the labeled data are biased and, if so, to what extent. We assume that the class-conditional distributions were generated by a family of component distributions that are represented at different proportions in the labeled and unlabeled data. We also assume that the training data can be transformed to, and subsequently modeled by, a nested mixture of multivariate Gaussian distributions. We then develop a multi-sample expectation-maximization algorithm that learns all individual and shared parameters of the model from the combined data. Using these parameters, we develop a statistical test for the presence of the general form of bias in labeled data, and we estimate the level of this bias by computing the distance between corresponding class-conditional distributions in labeled and unlabeled data. We first study the new methods on synthetic data to understand their behavior and then apply them to real-world biomedical data to provide evidence that the bias estimation procedure is both possible and effective.
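
The idea of shared components that appear at different proportions in labeled and unlabeled data can be illustrated with a small sketch. This is not the authors' algorithm: it fits a flat (not nested) Gaussian mixture, ignores class labels, and replaces the paper's statistical test and distance with a simple total variation distance between the per-sample mixing proportions. The function names `multi_sample_em` and `weight_distance`, the synthetic data, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def log_gauss(X, mean, cov):
    """Log-density of N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def multi_sample_em(samples, k, n_iter=50, reg=1e-6, seed=0):
    """EM for k Gaussian components shared by all samples,
    with a separate vector of mixing proportions per sample."""
    rng = np.random.default_rng(seed)
    X = np.vstack(samples)
    n, d = X.shape
    sample_id = np.concatenate(
        [np.full(len(s), i) for i, s in enumerate(samples)])
    # Farthest-point initialization of the shared means (an assumption).
    means = [X[rng.integers(n)]]
    for _ in range(1, k):
        dist = np.min([np.sum((X - m) ** 2, axis=1) for m in means], axis=0)
        means.append(X[np.argmax(dist)])
    means = np.array(means)
    covs = np.stack([np.cov(X.T).reshape(d, d) + reg * np.eye(d)] * k)
    weights = np.full((len(samples), k), 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities use the weights of each point's own sample.
        log_p = np.stack(
            [log_gauss(X, means[j], covs[j]) for j in range(k)], axis=1)
        log_r = log_p + np.log(np.clip(weights[sample_id], 1e-12, None))
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step, shared parameters: pooled over all samples.
        nk = resp.sum(axis=0) + 1e-12
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] \
                + reg * np.eye(d)
        # M-step, individual parameters: one weight vector per sample.
        for i in range(len(samples)):
            weights[i] = resp[sample_id == i].mean(axis=0)
    return weights, means, covs

def weight_distance(w_a, w_b):
    """Total variation distance between two mixing-proportion vectors."""
    return 0.5 * np.abs(w_a - w_b).sum()

# Synthetic example: two well-separated components, mixed 80/20 in the
# "labeled" sample and 50/50 in the "unlabeled" sample.
rng = np.random.default_rng(1)

def draw(n_a, n_b):
    return np.vstack([rng.normal(0.0, 1.0, (n_a, 2)),
                      rng.normal(8.0, 1.0, (n_b, 2))])

labeled = draw(480, 120)
unlabeled = draw(1000, 1000)
weights, means, covs = multi_sample_em([labeled, unlabeled], k=2)
dist = weight_distance(weights[0], weights[1])
print("estimated proportions:", weights)
print("distance between proportions:", dist)
```

In this toy setup the true proportions differ by 0.3 in total variation, so the estimate should land close to that value when the components are as well separated as they are here. A distance near zero would suggest that the labeled sample mixes the components in roughly the same way as the unlabeled one.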

Identifiers

pubmed: 36540987
pii: 9789811270611_0029
pmc: PMC9782737
mid: NIHMS1852996

Publication types

Journal Article; Research Support, N.I.H., Extramural

Languages

eng

Citation subsets

IM

Pagination

311-322

Grants

Agency: NICHD NIH HHS
ID: R01 HD101246
Country: United States
Agency: NHGRI NIH HHS
ID: U01 HG012022
Country: United States

