Machine Learning Strategies for Improved Phenotype Prediction in Underrepresented Populations.

Keywords: Bioinformatics; Genetics; Machine Learning; Phenotype Prediction; Precision Medicine

Journal

bioRxiv : the preprint server for biology
Abbreviated title: bioRxiv
Country: United States
NLM ID: 101680187

Publication information

Publication date:
17 Oct 2023
History:
pubmed: 31 Oct 2023
medline: 31 Oct 2023
entrez: 31 Oct 2023
Status: epublish

Abstract

Precision medicine models often perform better for populations of European ancestry due to the over-representation of this group in the genomic datasets and large-scale biobanks from which the models are constructed. As a result, prediction models may misrepresent or provide less accurate treatment recommendations for underrepresented populations, contributing to health disparities. This study introduces an adaptable machine learning toolkit that integrates multiple existing methodologies and novel techniques to enhance the prediction accuracy for underrepresented populations in genomic datasets. By leveraging machine learning techniques, including gradient boosting and automated methods, coupled with novel population-conditional re-sampling techniques, our method significantly improves the phenotypic prediction from single nucleotide polymorphism (SNP) data for diverse populations. We evaluate our approach using the UK Biobank, which is composed primarily of British individuals with European ancestry, and a minority representation of groups with Asian and African ancestry. Performance metrics demonstrate substantial improvements in phenotype prediction for underrepresented groups, achieving prediction accuracy comparable to that of the majority group. This approach represents a significant step towards improving prediction accuracy amidst current dataset diversity challenges. By integrating a tailored pipeline, our approach fosters more equitable validity and utility of statistical genetics methods, paving the way for more inclusive models and outcomes.
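
The abstract names population-conditional re-sampling but does not specify the procedure. As a rough illustration only, the sketch below shows one simple way such re-sampling could work: oversampling with replacement within each ancestry label until every group matches the size of the largest one. The function name `balance_by_population`, the toy genotype matrix, and the ancestry labels are hypothetical and are not taken from the preprint; the authors' actual technique may differ.

```python
import random
from collections import Counter, defaultdict


def balance_by_population(pop_labels, seed=0):
    """Return row indices in which every population label is equally represented.

    Hypothetical helper: keeps every original row and oversamples smaller
    groups with replacement until each group matches the largest group.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, pop in enumerate(pop_labels):
        groups[pop].append(idx)

    target = max(len(rows) for rows in groups.values())
    chosen = []
    for pop in sorted(groups):
        rows = groups[pop]
        chosen.extend(rows)  # keep all original rows
        # draw the shortfall with replacement from the same population
        chosen.extend(rng.choice(rows) for _ in range(target - len(rows)))
    return chosen


# Toy data (assumed): SNP genotypes coded 0/1/2, one row per individual.
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 0], [0, 0, 1], [1, 2, 2], [2, 2, 1]]
phenotypes = [1.2, 0.7, 0.9, 1.1, 1.5, 0.4]
ancestry = ["EUR", "EUR", "EUR", "EUR", "AFR", "EAS"]

idx = balance_by_population(ancestry)
balanced_x = [genotypes[i] for i in idx]
balanced_y = [phenotypes[i] for i in idx]
balanced_counts = Counter(ancestry[i] for i in idx)
print(balanced_counts)  # each ancestry label now has 4 rows
```

The balanced index list would then be used to build the training set for a downstream model such as a gradient-boosted regressor. Re-sampling should be applied to the training split only, so that held-out evaluation per population is not contaminated by duplicated rows.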

Identifiers

pubmed: 37904983
doi: 10.1101/2023.10.12.561949
pmc: PMC10614800

Publication types

Preprint

Languages

eng

Authors

David Bonet (D)

Stanford University, Stanford, CA, US.
Universitat Politècnica de Catalunya, Barcelona, Spain.

May Levin (M)

Stanford University, Stanford, CA, US.

Daniel Mas Montserrat (DM)

Stanford University, Stanford, CA, US.

Alexander G Ioannidis (AG)

Stanford University, Stanford, CA, US.
University of California Santa Cruz, Santa Cruz, CA, US.
