Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci.


Journal

Genome Research
ISSN: 1549-5469
Abbreviated title: Genome Res
Country: United States
NLM ID: 9518021

Publication information

Publication date:
12 2019
History:
received: 15 11 2018
accepted: 09 09 2019
pubmed: 21 9 2019
medline: 16 4 2020
entrez: 21 9 2019
Status: ppublish

Abstract

The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization.

Identifiers

pubmed: 31537640
pii: gr.246462.118
doi: 10.1101/gr.246462.118
pmc: PMC6886504

Publication types

Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

2073-2087

Grants

Agency: Wellcome Trust
ID: WT108749/Z/15/Z
Country: United Kingdom
Agency: Wellcome Trust
ID: 208349/Z/17/Z
Country: United Kingdom
Agency: NHGRI NIH HHS
ID: R01 HG004037
Country: United States
Agency: NHGRI NIH HHS
ID: U41 HG007234
Country: United States
Agency: NHGRI NIH HHS
ID: U24 HG003345
Country: United States
Agency: Wellcome Trust
Country: United Kingdom

Copyright information

© 2019 Mudge et al.; Published by Cold Spring Harbor Laboratory Press.

Authors

Jonathan M Mudge (JM)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

Irwin Jungreis (I)

MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA.
Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.

Toby Hunt (T)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

Jose Manuel Gonzalez (JM)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

James C Wright (JC)

Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, London SW7 3RP, United Kingdom.

Mike Kay (M)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

Claire Davidson (C)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

Stephen Fitzgerald (S)

Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom.

Ruth Seal (R)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.
Department of Haematology, University of Cambridge, Cambridge CB2 0PT, United Kingdom.

Susan Tweedie (S)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

Liang He (L)

MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA.
Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.

Robert M Waterhouse (RM)

Department of Ecology and Evolution, University of Lausanne, Lausanne 1015, Switzerland.
Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland.

Yue Li (Y)

MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA.
Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.

Elspeth Bruford (E)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.
Department of Haematology, University of Cambridge, Cambridge CB2 0PT, United Kingdom.

Jyoti S Choudhary (JS)

Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, London SW7 3RP, United Kingdom.

Adam Frankish (A)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

Manolis Kellis (M)

MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA.
Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
