The status of the human gene catalogue.
Journal
Nature
ISSN: 1476-4687
Abbreviated title: Nature
Country: England
NLM ID: 0410462
Publication information
Publication date:
Oct 2023
History:
received: 9 March 2023
accepted: 27 July 2023
pmc-release: 4 April 2024
medline: 6 October 2023
pubmed: 5 October 2023
entrez: 4 October 2023
Status:
ppublish
Abstract
Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Besides the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.
Identifiers
pubmed: 37794265
doi: 10.1038/s41586-023-06490-x
pii: 10.1038/s41586-023-06490-x
pmc: PMC10575709
mid: NIHMS1935580
Chemical substances
Protein Isoforms
0
RNA
63231-63-0
Publication types
Journal Article
Review
Languages
eng
Citation subsets
IM
Pagination
41-47
Grants
Agency: NHGRI NIH HHS
ID: R01 HG006677
Country: United States
Agency: NIGMS NIH HHS
ID: R35 GM130151
Country: United States
Agency: Intramural NIH HHS
ID: Z99 LM999999
Country: United States
Agency: NHGRI NIH HHS
ID: U41 HG007234
Country: United States
Agency: NHGRI NIH HHS
ID: U24 HG007234
Country: United States
Agency: Wellcome Trust
Country: United Kingdom
Agency: NIMH NIH HHS
ID: R01 MH123567
Country: United States
Comments and corrections
Type: UpdateOf
Copyright information
© 2023. Springer Nature Limited.
References
Understanding our Genetic Inheritance: The US Human Genome Project, The First Five Years 1991-1995 (US Department of Health and Human Services, US Department of Energy, 1990).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022). Describes the first complete gap-free assembly and annotation of a human genome, which added 140 protein-coding genes and several thousand additional non-coding genes to the human gene catalogue.
pubmed: 35357919
pmcid: 9186530
doi: 10.1126/science.abj6987
The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
pmcid: 3439153
doi: 10.1038/nature11247
Kawaji, H., Kasukawa, T., Forrest, A., Carninci, P. & Hayashizaki, Y. The FANTOM5 collection, a data series underpinning mammalian transcriptome atlases in diverse cell types. Sci. Data 4, 170113 (2017).
pubmed: 28850107
pmcid: 5574373
doi: 10.1038/sdata.2017.113
Fields, C., Adams, M. D., White, O. & Venter, J. C. How many genes in the human genome? Nat. Genet. 7, 345–346 (1994).
pubmed: 7920649
doi: 10.1038/ng0794-345
Clamp, M. et al. Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl Acad. Sci. USA 104, 19428–19433 (2007).
pubmed: 18040051
pmcid: 2148306
doi: 10.1073/pnas.0709013104
Carninci, P. et al. The transcriptional landscape of the mammalian genome. Science 309, 1559–1563 (2005). Demonstrated that transcription is far more complex than previously thought, including large numbers of isoforms and more lncRNAs than protein-coding genes.
Katayama, S. et al. Antisense transcription in the mammalian transcriptome. Science 309, 1564–1566 (2005).
Salzberg, S. L. Next-generation genome annotation: we still struggle to get it right. Genome Biol. 20, 92 (2019).
pubmed: 31097009
pmcid: 6521345
doi: 10.1186/s13059-019-1715-2
Frankish, A. et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 51, D942–D949 (2023).
pubmed: 36420896
doi: 10.1093/nar/gkac1071
O'Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–745 (2016).
pubmed: 26553804
doi: 10.1093/nar/gkv1189
Pertea, M. et al. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 19, 208 (2018). Presents an enhanced and comprehensive catalogue of human genes and transcripts based on very deep RNA-seq across a broad sample of human tissues.
UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480–D489 (2021).
doi: 10.1093/nar/gkaa1100
Pockrandt, C., Steinegger, M. & Salzberg, S. L. PhyloCSF++: a fast and user-friendly implementation of PhyloCSF with annotation tools. Bioinformatics https://doi.org/10.1093/bioinformatics/btab756 (2021).
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
pubmed: 16024819
pmcid: 1182216
doi: 10.1101/gr.3715005
Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).
pubmed: 19858363
pmcid: 2798823
doi: 10.1101/gr.097857.109
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
doi: 10.1038/35057062
Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
pubmed: 11181995
doi: 10.1126/science.1058040
International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
doi: 10.1038/nature03001
Pertea, M. & Salzberg, S. L. Between a chicken and a grape: estimating the number of human genes. Genome Biol. 11, 206 (2010). Reviews the history of efforts to estimate the human gene count and highlights different computational methods that were used to help with the human gene annotation.
pubmed: 20441615
pmcid: 2898077
doi: 10.1186/gb-2010-11-5-206
Pruitt, K. D. et al. The consensus coding sequence (CCDS) project: identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 19, 1316–1323 (2009). Describes a joint effort among three genome annotation centres to converge on coding regions for the annotation of the human and mouse reference genomes.
pubmed: 19498102
pmcid: 2704439
doi: 10.1101/gr.080531.108
Morales, J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604, 310–315 (2022). Describes a project to create uniform transcript annotations for every protein-coding gene, therefore enhancing the precision of genomic medicine through the accurate identification of genomic variations.
pubmed: 35388217
pmcid: 9007741
doi: 10.1038/s41586-022-04558-8
Alioto, T. S. U12DB: a database of orthologous U12-type spliceosomal introns. Nucleic Acids Res. 35, D110–115 (2007).
pubmed: 17082203
doi: 10.1093/nar/gkl796
Mudge, J. M. et al. Standardized annotation of translated open reading frames. Nat. Biotechnol. 40, 994–999 (2022). Outlines a community-led effort to produce a standardized catalogue of human ORFs identified through ribosome profiling.
pubmed: 35831657
pmcid: 9757701
doi: 10.1038/s41587-022-01369-0
The GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Troskie, R. L. et al. Long-read cDNA sequencing identifies functional pseudogenes in the human transcriptome. Genome Biol. 22, 146 (2021).
pubmed: 33971925
pmcid: 8108447
doi: 10.1186/s13059-021-02369-0
Sun, M. et al. Systematic functional interrogation of human pseudogenes using CRISPRi. Genome Biol. 22, 240 (2021).
pubmed: 34425866
pmcid: 8381491
doi: 10.1186/s13059-021-02464-2
Xu, J. & Zhang, J. Are human translated pseudogenes functional? Mol. Biol. Evol. 33, 755–760 (2016).
pubmed: 26589994
doi: 10.1093/molbev/msv268
Ramilowski, J. A. et al. Functional annotation of human long noncoding RNAs via molecular phenotyping. Genome Res. 30, 1060–1072 (2020).
pubmed: 32718982
pmcid: 7397864
doi: 10.1101/gr.254219.119
Cech, T. R. & Steitz, J. A. The noncoding RNA revolution—trashing old rules to forge new ones. Cell 157, 77–94 (2014).
pubmed: 24679528
doi: 10.1016/j.cell.2014.03.008
Mattick, J. S. et al. Long non-coding RNAs: definitions, functions, challenges and recommendations. Nat. Rev. Mol. Cell Biol. https://doi.org/10.1038/s41580-022-00566-8 (2023).
Michelini, F. et al. Damage-induced lncRNAs control the DNA damage response through interaction with DDRNAs at individual double-strand breaks. Nat. Cell Biol. 19, 1400–1411 (2017).
pubmed: 29180822
pmcid: 5714282
doi: 10.1038/ncb3643
Lagarde, J. et al. High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing. Nat. Genet. 49, 1731–1740 (2017). Describes a large-scale application of capturing rare RNA species with antisense probes and sequencing them with long-read technology, which revealed a large number of isoforms that were not otherwise detectable.
pubmed: 29106417
pmcid: 5709232
doi: 10.1038/ng.3988
Uszczynska-Ratajczak, B., Lagarde, J., Frankish, A., Guigó, R. & Johnson, R. Towards a complete map of the human long non-coding RNA transcriptome. Nat. Rev. Genet. 19, 535–548 (2018).
pubmed: 29795125
pmcid: 6451964
doi: 10.1038/s41576-018-0017-y
The RNAcentral Consortium. RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Res. 49, D212–220 (2021).
doi: 10.1093/nar/gkaa921
Liu, Y. et al. High-plex protein and whole transcriptome co-mapping at cellular resolution with spatial CITE-seq. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01676-0 (2023).
Derrien, T. et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 22, 1775–1789 (2012).
pubmed: 22955988
pmcid: 3431493
doi: 10.1101/gr.132159.111
Stokes, T. et al. Transcriptomics for clinical and experimental biology research: hang on a seq. Adv. Genet. 4, 2200024 (2023).
Deveson, I. W. et al. Universal alternative splicing of noncoding exons. Cell Syst. 6, 245–255 (2018). Describes widespread alternative splicing in non-coding exons, suggesting that non-coding exons are functionally modular and produce a seemingly limitless variety of isoforms.
pubmed: 29396323
doi: 10.1016/j.cels.2017.12.005
Mudge, J. M. et al. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Res. 29, 2073–2087 (2019).
pubmed: 31537640
pmcid: 6886504
doi: 10.1101/gr.246462.118
Lewandowski, J. P. et al. The Tug1 lncRNA locus is essential for male fertility. Genome Biol. 21, 237 (2020).
pubmed: 32894169
pmcid: 7487648
doi: 10.1186/s13059-020-02081-5
Broadwell, L. J. et al. Myosin 7b is a regulatory long noncoding RNA (lncMYH7b) in the human heart. J. Biol. Chem. 296, 100694 (2021).
pubmed: 33895132
pmcid: 8141895
doi: 10.1016/j.jbc.2021.100694
He, Y. et al. Transcriptional-readthrough RNAs reflect the phenomenon of “a gene contains gene(s)” or “gene(s) within a gene” in the human genome, and thus are not chimeric RNAs. Genes 9, 40 (2018).
Wang, Y. et al. Identification of the cross-strand chimeric RNAs generated by fusions of bi-directional transcripts. Nat. Commun. 12, 4645 (2021).
pubmed: 34330918
pmcid: 8324879
doi: 10.1038/s41467-021-24910-2
de Hoon, M., Shin, J. W. & Carninci, P. Paradigm shifts in genomics through the FANTOM projects. Mamm. Genome 26, 391–402 (2015).
pubmed: 26253466
pmcid: 4602071
doi: 10.1007/s00335-015-9593-8
Yip, C. W. et al. Antisense-oligonucleotide-mediated perturbation of long non-coding RNA reveals functional features in stem cells and across cell types. Cell Rep. 41, 111893 (2022).
pubmed: 36577377
doi: 10.1016/j.celrep.2022.111893
Seal, R. L. et al. A guide to naming human non-coding RNA genes. EMBO J. 39, e103777 (2020).
pubmed: 32090359
pmcid: 7073466
doi: 10.15252/embj.2019103777
Amberger, J. S., Bocchini, C. A., Scott, A. F. & Hamosh, A. OMIM.org: leveraging knowledge across phenotype-gene relationships. Nucleic Acids Res. 47, D1038–D1043 (2019).
pubmed: 30445645
doi: 10.1093/nar/gky1151
Cline, M. S. et al. BRCA challenge: BRCA exchange as a global resource for variants in BRCA1 and BRCA2. PLoS Genet. 14, e1007752 (2018).
pubmed: 30586411
pmcid: 6324924
doi: 10.1371/journal.pgen.1007752
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
pubmed: 20601685
pmcid: 2938201
doi: 10.1093/nar/gkq603
Hunt, S. E. et al. Annotating and prioritizing genomic variants using the Ensembl Variant Effect Predictor-A tutorial. Hum. Mutat. 43, 986–997 (2022).
pubmed: 34816521
doi: 10.1002/humu.24298
Schoch, K. et al. Alternative transcripts in variant interpretation: the potential for missed diagnoses and misdiagnoses. Genet. Med. 22, 1269–1275 (2020). A potent example of the considerable impact that precise gene model annotation has on genetic diagnostics, demonstrating how inaccuracies can yield false negatives or positives and potentially compromise the diagnosis of rare disease patients.
pubmed: 32366967
pmcid: 7335342
doi: 10.1038/s41436-020-0781-x
Steward, C. A. et al. Re-annotation of 191 developmental and epileptic encephalopathy-associated genes unmasks de novo variants in SCN1A. NPJ Genom. Med. 4, 31 (2019).
pubmed: 31814998
pmcid: 6889285
doi: 10.1038/s41525-019-0106-7
Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
Bartonicek, N. et al. Intergenic disease-associated regions are abundant in novel transcripts. Genome Biol. 18, 241 (2017).
pubmed: 29284497
pmcid: 5747244
doi: 10.1186/s13059-017-1363-3
Aznaourova, M., Schmerer, N., Schmeck, B. & Schulte, L. N. Disease-causing mutations and rearrangements in long non-coding RNA gene loci. Front. Genet. 11, 527484 (2020).
pubmed: 33329688
pmcid: 7735109
doi: 10.3389/fgene.2020.527484
den Dunnen, J. T. et al. HGVS recommendations for the description of sequence variants: 2016 update. Hum. Mutat. 37, 564–569 (2016).
doi: 10.1002/humu.22981
Shumate, A. et al. Assembly and annotation of an Ashkenazi human reference genome. Genome Biol. 21, 129 (2020).
pubmed: 32487205
pmcid: 7265644
doi: 10.1186/s13059-020-02047-7
Zimin, A. V. et al. A reference-quality, fully annotated genome from a Puerto Rican individual. Genetics 220, iyab227 (2022).
Chao, K. H., Zimin, A. V., Pertea, M. & Salzberg, S. L. The first gapless, reference-quality, fully annotated genome from a Southern Han Chinese individual. G3: Genes, Genomes, Genetics 13, jkac321 (2023).
Liao, W. W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
pubmed: 37165242
pmcid: 10172123
doi: 10.1038/s41586-023-05896-x
The FANTOM Consortium and the RIKEN PMI and CLST (DGT). A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014).
doi: 10.1038/nature13182
Gonzàlez-Porta, M., Frankish, A., Rung, J., Harrow, J. & Brazma, A. Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol. 14, R70 (2013).
pubmed: 23815980
pmcid: 4053754
doi: 10.1186/gb-2013-14-7-r70
Okazaki, Y. et al. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563–573 (2002).
pubmed: 12466851
doi: 10.1038/nature01266
Babarinde, I. A. & Hutchins, A. P. The effects of sequencing depth on the assembly of coding and noncoding transcripts in the human genome. BMC Genom. 23, 487 (2022).
doi: 10.1186/s12864-022-08717-z
Weatheritt, R. J., Sterne-Weiler, T. & Blencowe, B. J. The ribosome-engaged landscape of alternative splicing. Nat. Struct. Mol. Biol. 23, 1117–1123 (2016).
pubmed: 27820807
pmcid: 5295628
doi: 10.1038/nsmb.3317
van Heesch, S. et al. The translational landscape of the human heart. Cell 178, 242–260 (2019). Shows that combining ribosome profiling with deep proteomic analysis can detect peptide products translated from a large number of 5′-UTRs and annotated lncRNAs.
pubmed: 31155234
doi: 10.1016/j.cell.2019.05.010
Duffy, E. E. et al. Developmental dynamics of RNA translation in the human brain. Nat. Neurosci. 25, 1353–1365 (2022).
pubmed: 36171426
pmcid: 10198132
doi: 10.1038/s41593-022-01164-9
Workman, R. E. et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods 16, 1297–1305 (2019).
pubmed: 31740818
pmcid: 7768885
doi: 10.1038/s41592-019-0617-2
Mulroney, L. et al. Identification of high-confidence human poly(A) RNA isoform scaffolds using nanopore sequencing. RNA 28, 162–176 (2022).
pubmed: 34728536
pmcid: 8906549
doi: 10.1261/rna.078703.121
Grapotte, M. et al. Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network. Nat. Commun. 12, 3297 (2021).
pubmed: 34078885
pmcid: 8172540
doi: 10.1038/s41467-021-23143-7
Sinitcyn, P. et al. Global detection of human variants and isoforms by deep proteome sequencing. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01714-x (2023). Establishes a valuable resource for the identification of isoforms at the proteome level, and provides direct evidence that most frame-preserving alternatively spliced isoforms are translated.
Glinos, D. A. et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature 608, 353–359 (2022).
pubmed: 35922509
pmcid: 10337767
doi: 10.1038/s41586-022-05035-y
Mercer, T. R. et al. Targeted sequencing for gene discovery and quantification using RNA CaptureSeq. Nat. Protoc. 9, 989–1009 (2014).
pubmed: 24705597
doi: 10.1038/nprot.2014.058
Curion, F. et al. Targeted RNA sequencing enhances gene expression profiling of ultra-low input samples. RNA Biol. 17, 1741–1753 (2020).
pubmed: 32597303
pmcid: 7746246
doi: 10.1080/15476286.2020.1777768
Zhao, L. et al. NONCODEV6: an updated database dedicated to long non-coding RNA annotation in both animals and plants. Nucleic Acids Res. 49, D165–D171 (2021).
pubmed: 33196801
doi: 10.1093/nar/gkaa1046
Hon, C.-C. et al. An atlas of human long non-coding RNAs with accurate 5′ ends. Nature 543, 199–204 (2017).
pubmed: 28241135
pmcid: 6857182
doi: 10.1038/nature21374
Volders, P.-J. et al. LNCipedia 5: towards a reference set of human long non-coding RNAs. Nucleic Acids Res. 47, D135–139 (2019).
pubmed: 30371849
doi: 10.1093/nar/gky1031
Iyer, M. K. et al. The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 47, 199–208 (2015).
pubmed: 25599403
pmcid: 4417758
doi: 10.1038/ng.3192
Ma, L. et al. LncBook: a curated knowledgebase of human long non-coding RNAs. Nucleic Acids Res. 47, 2699–2699 (2019).
pubmed: 30715521
pmcid: 6412109
doi: 10.1093/nar/gkz073