The status of the human gene catalogue.

Humans; Genome, Human / genetics; Molecular Sequence Annotation / standards; Protein Isoforms / genetics; Human Genome Project; Genes; Pseudogenes; RNA / genetics

Journal

Nature

ISSN: 1476-4687

Abbreviated title: Nature

Country: England

NLM ID: 0410462

Publication information

Publication date:
Oct 2023

History:

received: 9 March 2023

accepted: 27 July 2023

pmc-release: 4 April 2024

medline: 6 October 2023

pubmed: 5 October 2023

entrez: 4 October 2023

Status: ppublish

Abstract

Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Beside the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.

Identifiers

DOI: 10.1038/s41586-023-06490-x

PMID: 37794265

PMC: PMC10575709

pii: 10.1038/s41586-023-06490-x

mid: NIHMS1935580

Chemical substances

Protein Isoforms (registry number 0)

RNA (registry number 63231-63-0)

Publication types

Journal Article; Review

Languages

eng

Pagination

41-47

Grants

Agency: NHGRI NIH HHS

ID: R01 HG006677

Country: United States

Agency: NIGMS NIH HHS

ID: R35 GM130151

Country: United States

Agency: Intramural NIH HHS

ID: Z99 LM999999

Country: United States

Agency: NHGRI NIH HHS

ID: U41 HG007234

Country: United States

Agency: NHGRI NIH HHS

ID: U24 HG007234

Country: United States

Agency: Wellcome Trust

Country: United Kingdom

Agency: NIMH NIH HHS

ID: R01 MH123567

Country: United States

Comments and corrections

Type: UpdateOf

References

Understanding our Genetic Inheritance: The US Human Genome Project, The First Five Years 1991-1995 (US Department of Health and Human Services, US Department of Energy, 1990).

Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022). Describes the first complete gap-free assembly and annotation of a human genome, which added 140 protein-coding genes and several thousand additional non-coding genes to the human gene catalogue.

pubmed: 35357919 pmcid: 9186530 doi: 10.1126/science.abj6987

The Encode Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

pmcid: 3439153 doi: 10.1038/nature11247

Kawaji, H., Kasukawa, T., Forrest, A., Carninci, P. & Hayashizaki, Y. The FANTOM5 collection, a data series underpinning mammalian transcriptome atlases in diverse cell types. Sci. Data 4, 170113 (2017).

pubmed: 28850107 pmcid: 5574373 doi: 10.1038/sdata.2017.113

Fields, C., Adams, M. D., White, O. & Venter, J. C. How many genes in the human genome? Nat. Genet. 7, 345–346 (1994).

pubmed: 7920649 doi: 10.1038/ng0794-345

Clamp, M. et al. Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl Acad. Sci. USA 104, 19428–19433 (2007).

pubmed: 18040051 pmcid: 2148306 doi: 10.1073/pnas.0709013104

Carninci, P. et al. The transcriptional landscape of the mammalian genome. Science 309, 1559–1563 (2005). Demonstrated that transcription is far more complex than previously thought, including large numbers of isoforms and more lncRNAs than protein-coding genes.

Katayama, S. et al. Antisense transcription in the mammalian transcriptome. Science 309, 1564–1566 (2005).

Salzberg, S. L. Next-generation genome annotation: we still struggle to get it right. Genome Biol. 20, 92 (2019).

pubmed: 31097009 pmcid: 6521345 doi: 10.1186/s13059-019-1715-2

Frankish, A. et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 51, D942–D949 (2023).

pubmed: 36420896 doi: 10.1093/nar/gkac1071

O'Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–745 (2016).

pubmed: 26553804 doi: 10.1093/nar/gkv1189

Pertea, M. et al. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 19, 208 (2018). Presents an enhanced and comprehensive catalogue of human genes and transcripts based on very deep RNA-seq across a broad sample of human tissues.

UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480–D489 (2021).

doi: 10.1093/nar/gkaa1100

Pockrandt, C., Steinegger, M. & Salzberg, S. L. PhyloCSF++: a fast and user-friendly implementation of PhyloCSF with annotation tools. Bioinformatics https://doi.org/10.1093/bioinformatics/btab756 (2021).

Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).

pubmed: 16024819 pmcid: 1182216 doi: 10.1101/gr.3715005

Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).

pubmed: 19858363 pmcid: 2798823 doi: 10.1101/gr.097857.109

International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

doi: 10.1038/35057062

Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).

pubmed: 11181995 doi: 10.1126/science.1058040

International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).

doi: 10.1038/nature03001

Pertea, M. & Salzberg, S. L. Between a chicken and a grape: estimating the number of human genes. Genome Biol. 11, 206 (2010). Reviews the history of efforts to estimate the human gene count and highlights different computational methods that were used to help with the human gene annotation.

pubmed: 20441615 pmcid: 2898077 doi: 10.1186/gb-2010-11-5-206

Pruitt, K. D. et al. The consensus coding sequence (CCDS) project: identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 19, 1316–1323 (2009). Describes a joint effort among three genome annotation centres to converge on coding regions for the annotation of the human and mouse reference genomes.

pubmed: 19498102 pmcid: 2704439 doi: 10.1101/gr.080531.108

Morales, J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604, 310–315 (2022). Describes a project to create uniform transcript annotations for every protein-coding gene, therefore enhancing the precision of genomic medicine through the accurate identification of genomic variations.

pubmed: 35388217 pmcid: 9007741 doi: 10.1038/s41586-022-04558-8

Alioto, T. S. U12DB: a database of orthologous U12-type spliceosomal introns. Nucleic Acids Res. 35, D110–115 (2007).

pubmed: 17082203 doi: 10.1093/nar/gkl796

Mudge, J. M. et al. Standardized annotation of translated open reading frames. Nat. Biotechnol. 40, 994–999 (2022). Outlines a community-led effort to produce a standardized catalogue of human ORFs identified through ribosome profiling.

pubmed: 35831657 pmcid: 9757701 doi: 10.1038/s41587-022-01369-0

The GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).

Troskie, R. L. et al. Long-read cDNA sequencing identifies functional pseudogenes in the human transcriptome. Genome Biol. 22, 146 (2021).

pubmed: 33971925 pmcid: 8108447 doi: 10.1186/s13059-021-02369-0

Sun, M. et al. Systematic functional interrogation of human pseudogenes using CRISPRi. Genome Biol. 22, 240 (2021).

pubmed: 34425866 pmcid: 8381491 doi: 10.1186/s13059-021-02464-2

Xu, J. & Zhang, J. Are human translated pseudogenes functional? Mol. Biol. Evol. 33, 755–760 (2016).

pubmed: 26589994 doi: 10.1093/molbev/msv268

Ramilowski, J. A. et al. Functional annotation of human long noncoding RNAs via molecular phenotyping. Genome Res. 30, 1060–1072 (2020).

pubmed: 32718982 pmcid: 7397864 doi: 10.1101/gr.254219.119

Cech, T. R. & Steitz, J. A. The noncoding RNA revolution—trashing old rules to forge new ones. Cell 157, 77–94 (2014).

pubmed: 24679528 doi: 10.1016/j.cell.2014.03.008

Mattick, J. S. et al. Long non-coding RNAs: definitions, functions, challenges and recommendations. Nat. Rev. Mol. Cell Biol. https://doi.org/10.1038/s41580-022-00566-8 (2023).

Michelini, F. et al. Damage-induced lncRNAs control the DNA damage response through interaction with DDRNAs at individual double-strand breaks. Nat. Cell Biol. 19, 1400–1411 (2017).

pubmed: 29180822 pmcid: 5714282 doi: 10.1038/ncb3643

Lagarde, J. et al. High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing. Nat. Genet. 49, 1731–1740 (2017). Describes a large-scale application of capturing rare RNA species with antisense probes and sequencing them with long-read technology, which revealed a large number of isoforms that were not otherwise detectable.

pubmed: 29106417 pmcid: 5709232 doi: 10.1038/ng.3988

Uszczynska-Ratajczak, B., Lagarde, J., Frankish, A., Guigó, R. & Johnson, R. Towards a complete map of the human long non-coding RNA transcriptome. Nat. Rev. Genet. 19, 535–548 (2018).

pubmed: 29795125 pmcid: 6451964 doi: 10.1038/s41576-018-0017-y

The RNAcentral Consortium. RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Res. 49, D212–220 (2021).

doi: 10.1093/nar/gkaa921

Liu, Y. et al. High-plex protein and whole transcriptome co-mapping at cellular resolution with spatial CITE-seq. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01676-0 (2023).

Derrien, T. et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 22, 1775–1789 (2012).

pubmed: 22955988 pmcid: 3431493 doi: 10.1101/gr.132159.111

Stokes, T. et al. Transcriptomics for clinical and experimental biology research: hang on a seq. Adv. Genet. 4, 2200024 (2023).

Deveson, I. W. et al. Universal alternative splicing of noncoding exons. Cell Syst. 6, 245–255 (2018). Describes widespread alternative splicing in non-coding exons, suggesting that non-coding exons are functionally modular and produce a seemingly limitless variety of isoforms.

pubmed: 29396323 doi: 10.1016/j.cels.2017.12.005

Mudge, J. M. et al. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Res. 29, 2073–2087 (2019).

pubmed: 31537640 pmcid: 6886504 doi: 10.1101/gr.246462.118

Lewandowski, J. P. et al. The Tug1 lncRNA locus is essential for male fertility. Genome Biol. 21, 237 (2020).

pubmed: 32894169 pmcid: 7487648 doi: 10.1186/s13059-020-02081-5

Broadwell, L. J. et al. Myosin 7b is a regulatory long noncoding RNA (lncMYH7b) in the human heart. J. Biol. Chem. 296, 100694 (2021).

pubmed: 33895132 pmcid: 8141895 doi: 10.1016/j.jbc.2021.100694

He, Y. et al. Transcriptional-readthrough RNAs reflect the phenomenon of “a gene contains gene(s)” or “gene(s) within a gene” in the human genome, and thus are not chimeric RNAs. Genes 9, 40 (2018).

Wang, Y. et al. Identification of the cross-strand chimeric RNAs generated by fusions of bi-directional transcripts. Nat. Commun. 12, 4645 (2021).

pubmed: 34330918 pmcid: 8324879 doi: 10.1038/s41467-021-24910-2

de Hoon, M., Shin, J. W. & Carninci, P. Paradigm shifts in genomics through the FANTOM projects. Mamm. Genome 26, 391–402 (2015).

pubmed: 26253466 pmcid: 4602071 doi: 10.1007/s00335-015-9593-8

Yip, C. W. et al. Antisense-oligonucleotide-mediated perturbation of long non-coding RNA reveals functional features in stem cells and across cell types. Cell Rep. 41, 111893 (2022).

pubmed: 36577377 doi: 10.1016/j.celrep.2022.111893

Seal, R. L. et al. A guide to naming human non-coding RNA genes. EMBO J. 39, e103777 (2020).

pubmed: 32090359 pmcid: 7073466 doi: 10.15252/embj.2019103777

Amberger, J. S., Bocchini, C. A., Scott, A. F. & Hamosh, A. OMIM.org: leveraging knowledge across phenotype-gene relationships. Nucleic Acids Res. 47, D1038–D1043 (2019).

pubmed: 30445645 doi: 10.1093/nar/gky1151

Cline, M. S. et al. BRCA challenge: BRCA exchange as a global resource for variants in BRCA1 and BRCA2. PLoS Genet. 14, e1007752 (2018).

pubmed: 30586411 pmcid: 6324924 doi: 10.1371/journal.pgen.1007752

Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).

pubmed: 20601685 pmcid: 2938201 doi: 10.1093/nar/gkq603

Hunt, S. E. et al. Annotating and prioritizing genomic variants using the Ensembl Variant Effect Predictor-A tutorial. Hum. Mutat. 43, 986–997 (2022).

pubmed: 34816521 doi: 10.1002/humu.24298

Schoch, K. et al. Alternative transcripts in variant interpretation: the potential for missed diagnoses and misdiagnoses. Genet. Med. 22, 1269–1275 (2020). A potent example of the considerable impact that precise gene model annotation has on genetic diagnostics, demonstrating how inaccuracies can yield false negatives or positives and potentially compromising the diagnosis of rare disease patients.

pubmed: 32366967 pmcid: 7335342 doi: 10.1038/s41436-020-0781-x

Steward, C. A. et al. Re-annotation of 191 developmental and epileptic encephalopathy-associated genes unmasks de novo variants in SCN1A. NPJ Genom. Med. 4, 31 (2019).

pubmed: 31814998 pmcid: 6889285 doi: 10.1038/s41525-019-0106-7

Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).

Bartonicek, N. et al. Intergenic disease-associated regions are abundant in novel transcripts. Genome Biol. 18, 241 (2017).

pubmed: 29284497 pmcid: 5747244 doi: 10.1186/s13059-017-1363-3

Aznaourova, M., Schmerer, N., Schmeck, B. & Schulte, L. N. Disease-causing mutations and rearrangements in long non-coding RNA gene loci. Front. Genet. 11, 527484 (2020).

pubmed: 33329688 pmcid: 7735109 doi: 10.3389/fgene.2020.527484

den Dunnen, J. T. et al. HGVS recommendations for the description of sequence variants: 2016 update. Hum. Mutat. 37, 564–569 (2016).

doi: 10.1002/humu.22981

Shumate, A. et al. Assembly and annotation of an Ashkenazi human reference genome. Genome Biol. 21, 129 (2020).

pubmed: 32487205 pmcid: 7265644 doi: 10.1186/s13059-020-02047-7

Zimin, A. V. et al. A reference-quality, fully annotated genome from a Puerto Rican individual. Genetics 220, iyab227 (2022).

Chao, K. H., Zimin, A. V., Pertea, M. & Salzberg, S. L. The first gapless, reference-quality, fully annotated genome from a Southern Han Chinese individual. G3: Genes, Genomes, Genetics 13, jkac321 (2023).

Liao, W. W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).

pubmed: 37165242 pmcid: 10172123 doi: 10.1038/s41586-023-05896-x

The FANTOM Consortium and the RIKEN PMI and CLST (DGT). A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014).

doi: 10.1038/nature13182

Gonzàlez-Porta, M., Frankish, A., Rung, J., Harrow, J. & Brazma, A. Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol. 14, R70 (2013).

pubmed: 23815980 pmcid: 4053754 doi: 10.1186/gb-2013-14-7-r70

Okazaki, Y. et al. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563–573 (2002).

pubmed: 12466851 doi: 10.1038/nature01266

Babarinde, I. A. & Hutchins, A. P. The effects of sequencing depth on the assembly of coding and noncoding transcripts in the human genome. BMC Genom. 23, 487 (2022).

doi: 10.1186/s12864-022-08717-z

Weatheritt, R. J., Sterne-Weiler, T. & Blencowe, B. J. The ribosome-engaged landscape of alternative splicing. Nat. Struct. Mol. Biol. 23, 1117–1123 (2016).

pubmed: 27820807 pmcid: 5295628 doi: 10.1038/nsmb.3317

van Heesch, S. et al. The translational landscape of the human heart. Cell 178, 242–260 (2019). Shows that combining ribosome profiling with deep proteomic analysis can detect peptide products translated from a large number of 5′-UTRs and annotated lncRNAs.

pubmed: 31155234 doi: 10.1016/j.cell.2019.05.010

Duffy, E. E. et al. Developmental dynamics of RNA translation in the human brain. Nat. Neurosci. 25, 1353–1365 (2022).

pubmed: 36171426 pmcid: 10198132 doi: 10.1038/s41593-022-01164-9

Workman, R. E. et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods 16, 1297–1305 (2019).

pubmed: 31740818 pmcid: 7768885 doi: 10.1038/s41592-019-0617-2

Mulroney, L. et al. Identification of high-confidence human poly(A) RNA isoform scaffolds using nanopore sequencing. RNA 28, 162–176 (2022).

pubmed: 34728536 pmcid: 8906549 doi: 10.1261/rna.078703.121

Grapotte, M. et al. Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network. Nat. Commun. 12, 3297 (2021).

pubmed: 34078885 pmcid: 8172540 doi: 10.1038/s41467-021-23143-7

Sinitcyn, P. et al. Global detection of human variants and isoforms by deep proteome sequencing. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01714-x (2023). Establishes a valuable resource for the identification of isoforms at the proteome level, and provides direct evidence that most frame-preserving alternatively spliced isoforms are translated.

Glinos, D. A. et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature 608, 353–359 (2022).

pubmed: 35922509 pmcid: 10337767 doi: 10.1038/s41586-022-05035-y

Mercer, T. R. et al. Targeted sequencing for gene discovery and quantification using RNA CaptureSeq. Nat. Protoc. 9, 989–1009 (2014).

pubmed: 24705597 doi: 10.1038/nprot.2014.058

Curion, F. et al. Targeted RNA sequencing enhances gene expression profiling of ultra-low input samples. RNA Biol. 17, 1741–1753 (2020).

pubmed: 32597303 pmcid: 7746246 doi: 10.1080/15476286.2020.1777768

Zhao, L. et al. NONCODEV6: an updated database dedicated to long non-coding RNA annotation in both animals and plants. Nucleic Acids Res. 49, D165–D171 (2021).

pubmed: 33196801 doi: 10.1093/nar/gkaa1046

Hon, C.-C. et al. An atlas of human long non-coding RNAs with accurate 5′ ends. Nature 543, 199–204 (2017).

pubmed: 28241135 pmcid: 6857182 doi: 10.1038/nature21374

Volders, P.-J. et al. LNCipedia 5: towards a reference set of human long non-coding RNAs. Nucleic Acids Res. 47, D135–139 (2019).

pubmed: 30371849 doi: 10.1093/nar/gky1031

Iyer, M. K. et al. The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 47, 199–208 (2015).

pubmed: 25599403 pmcid: 4417758 doi: 10.1038/ng.3192

Ma, L. et al. LncBook: a curated knowledgebase of human long non-coding RNAs. Nucleic Acids Res. 47, 2699–2699 (2019).

pubmed: 30715521 pmcid: 6412109 doi: 10.1093/nar/gkz073
