Prioritization of disease genes from GWAS using ensemble-based positive-unlabeled learning.


Journal

European journal of human genetics : EJHG
ISSN: 1476-5438
Abbreviated title: Eur J Hum Genet
Country: England
NLM ID: 9302235

Publication information

Publication date:
10 2021
History:
received: 19 08 2020
accepted: 21 06 2021
revised: 23 05 2021
pubmed: 20 7 2021
medline: 24 3 2022
entrez: 19 7 2021
Status: ppublish

Abstract

A primary challenge in understanding disease biology from genome-wide association studies (GWAS) arises from the inability to directly implicate causal genes from association data. Integration of multi-omics data sources potentially provides important functional links between associated variants and candidate genes. Machine learning is well positioned to take advantage of a variety of such data and provide a solution for the prioritization of disease genes. Yet classical positive-negative classifiers impose strong limitations on the gene prioritization procedure, such as a lack of reliable non-causal genes for training. Here, we developed a novel gene prioritization tool, Gene Prioritizer (GPrior). It is an ensemble of five positive-unlabeled bagging classifiers (Logistic Regression, Support Vector Machine, Random Forest, Decision Tree, Adaptive Boosting) that treats all genes of unknown relevance as an unlabeled set. GPrior selects an optimal composition of algorithms to tune the model for each specific phenotype. Altogether, GPrior fills an important niche of methods for GWAS data post-processing, significantly improving the ability to pinpoint disease genes compared to existing solutions.
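
The positive-unlabeled bagging idea named in the abstract can be illustrated with a minimal sketch. This is not GPrior's implementation: a simple nearest-centroid scorer stands in for the five base learners, the algorithm-selection step is omitted, and the function names, parameters, and input layout (`pu_bagging_scores`, `features`, `positive_idx`, `n_rounds`, `seed`) are hypothetical. In each round a bootstrap sample of unlabeled genes is treated as pseudo-negatives, and only the unlabeled genes left out of that sample are scored.

```python
import random
from statistics import fmean


def centroid(rows):
    """Column-wise mean of equal-length feature vectors."""
    return [fmean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(features, positive_idx, n_rounds=50, seed=0):
    """Score unlabeled items with positive-unlabeled bagging (illustrative only).

    features:     list of numeric feature vectors, one per gene (assumed layout)
    positive_idx: indices of genes known to be disease-relevant
    Returns {unlabeled index: mean out-of-bag score}; higher means more
    similar to the known positives than to the sampled pseudo-negatives.
    """
    positives = sorted(set(positive_idx))
    pos_set = set(positives)
    unlabeled = [i for i in range(len(features)) if i not in pos_set]
    if not positives or len(unlabeled) < 2:
        raise ValueError("need at least 1 positive and 2 unlabeled items")

    rng = random.Random(seed)
    pos_centroid = centroid([features[i] for i in positives])
    # Bootstrap size: match the positive set, but always leave at least one
    # unlabeled item out of the bag so something can be scored each round.
    k = min(len(positives), len(unlabeled) - 1)

    totals = dict.fromkeys(unlabeled, 0.0)
    counts = dict.fromkeys(unlabeled, 0)
    for _ in range(n_rounds):
        # Draw unlabeled items with replacement and treat them as negatives.
        sample = rng.choices(unlabeled, k=k)
        in_bag = set(sample)
        neg_centroid = centroid([features[i] for i in sample])
        # Stand-in base learner: compare distance to each class centroid.
        for i in unlabeled:
            if i in in_bag:
                continue  # score only out-of-bag items
            d_neg = sq_dist(features[i], neg_centroid)
            d_pos = sq_dist(features[i], pos_centroid)
            totals[i] += d_neg - d_pos
            counts[i] += 1

    # Average over the rounds in which each item was out of the bag.
    return {i: totals[i] / counts[i] for i in unlabeled if counts[i]}
```

Sorting the returned scores in descending order gives a ranked candidate list. In a fuller version, each base learner would be any classifier trained on positives versus the sampled pseudo-negatives, with its out-of-bag predictions averaged in the same way.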

Identifiers

pubmed: 34276057
doi: 10.1038/s41431-021-00930-w
pii: 10.1038/s41431-021-00930-w
pmc: PMC8484264

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

1527-1535

Copyright information

© 2021. The Author(s), under exclusive licence to European Society of Human Genetics.

Authors

Nikita Kolosov (N)

ITMO University, St. Petersburg, Russia.
Almazov National Medical Research Center, St. Petersburg, Russia.
Broad Institute, Cambridge, MA, USA.

Mark J Daly (MJ)

Broad Institute, Cambridge, MA, USA. mjdaly@atgu.mgh.harvard.edu.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA. mjdaly@atgu.mgh.harvard.edu.
Institute for Molecular Medicine Finland (FIMM), Helsinki, Finland. mjdaly@atgu.mgh.harvard.edu.

Mykyta Artomov (M)

ITMO University, St. Petersburg, Russia. artomov@broadinstitute.org.
Almazov National Medical Research Center, St. Petersburg, Russia. artomov@broadinstitute.org.
Broad Institute, Cambridge, MA, USA. artomov@broadinstitute.org.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA. artomov@broadinstitute.org.
Institute for Molecular Medicine Finland (FIMM), Helsinki, Finland. artomov@broadinstitute.org.
