Assembly of a pan-genome from deep sequencing of 910 humans of African descent.

Black People / genetics; Genome, Human / genetics; High-Throughput Nucleotide Sequencing / methods; Humans; Sequence Analysis, DNA / methods

Journal

Nature genetics

ISSN: 1546-1718

Abbreviated title: Nat Genet

Country: United States

NLM ID: 9216904

Publication information

Publication date:
January 2019

History:

received: 20 November 2017

accepted: 8 October 2018

pubmed: 21 November 2018

medline: 25 April 2019

entrez: 21 November 2018

Status: ppublish

Abstract

We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.
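The abstract's step of collecting all reads that failed to align can be illustrated with a minimal Python sketch. It assumes the alignments are available as SAM-format text lines, where bit 0x4 of the FLAG field marks a read as unmapped. The function name and the toy records are hypothetical, and this is not the authors' pipeline, whose tools are not named in the abstract.

```python
# Bit 0x4 of the SAM FLAG field marks a read as unmapped (SAM specification).
UNMAPPED_FLAG = 0x4


def unaligned_read_names(sam_lines):
    """Yield the names of reads whose SAM FLAG has the unmapped bit set.

    Hypothetical helper for illustration; header lines (starting with '@')
    and blank lines are skipped.
    """
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED_FLAG:
            yield name


# Toy records invented for this example (not data from the study).
example_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCATGA\tIIIIIIIIII",
    "read3\t77\t*\t0\t0\t*\t*\t0\t0\tGGCATTACGA\tIIIIIIIIII",
]

unaligned = list(unaligned_read_names(example_sam))
print(unaligned)
```

In this toy input, read2 (FLAG 4) and read3 (FLAG 77, which includes bit 0x4) are reported as unaligned, while read1 (FLAG 0) is aligned and skipped. In a workflow like the one the abstract describes, reads selected this way would then be passed to an assembler to build contigs.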

Identifiers

DOI: 10.1038/s41588-018-0273-y PMID: 30455414 PMC: PMC6309586

pubmed: 30455414

doi: 10.1038/s41588-018-0273-y

pii: 10.1038/s41588-018-0273-y

pmc: PMC6309586

mid: NIHMS1509230

Publication types

Journal Article; Research Support, N.I.H., Extramural

Languages

eng

Pagination

30-35

Grants

Agency: NHGRI NIH HHS

ID: R01 HG006677

Country: United States

Agency: NHLBI NIH HHS

ID: R01 HL129239

Country: United States

Agency: NIGMS NIH HHS

ID: U54 GM115428

Country: United States

Agency: NHLBI NIH HHS

ID: R01 HL104608

Country: United States

Agency: NIAID NIH HHS

ID: R01 AI132476

Country: United States

Comments and corrections

Type: ErratumIn

Type: CommentIn

References

International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

doi: 10.1038/35057062

Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).

doi: 10.1126/science.1058040

Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).

doi: 10.1101/gr.213611.116

Green, R. E. et al. A draft sequence of the Neandertal genome. Science 328, 710–722 (2010).

doi: 10.1126/science.1188021

E pluribus unum. Nat Methods 7, 331 (2010).

Need, A. C. & Goldstein, D. B. Next generation disparities in human genomics: concerns and remedies. Trends Genet. 25, 489–494 (2009).

doi: 10.1016/j.tig.2009.09.012

Popejoy, A. B. & Fullerton, S. M. Genomics is failing on diversity. Nature 538, 161–164 (2016).

doi: 10.1038/538161a

Church, D. M. et al. Extending reference assembly models. Genome Biol. 16, 13 (2015).

doi: 10.1186/s13059-015-0587-3

Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).

doi: 10.1093/nar/29.1.308

The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).

doi: 10.1038/nature15393

Li, R. et al. Building the sequence map of the human pan-genome. Nat. Biotechnol. 28, 57–63 (2010).

doi: 10.1038/nbt.1596

Seo, J. S. et al. De novo assembly and phasing of a Korean human genome. Nature 538, 243–247 (2016).

doi: 10.1038/nature20098

Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).

doi: 10.1038/sdata.2016.25

Shi, L. et al. Long-read sequencing and de novo assembly of a Chinese genome. Nat. Commun. 7, 12065 (2016).

doi: 10.1038/ncomms12065

Cho, Y. S. et al. An ethnically relevant consensus Korean reference genome is a step towards personal reference genomes. Nat. Commun. 7, 13637 (2016).

doi: 10.1038/ncomms13637

Kehr, B., Melsted, P. & Halldorsson, B. V. PopIns: population-scale detection of novel sequence insertions. Bioinformatics 32, 961–967 (2016).

doi: 10.1093/bioinformatics/btv273

Maretty, L. et al. Sequencing and de novo assembly of 150 genomes from Denmark as a population reference. Nature 548, 87–91 (2017).

doi: 10.1038/nature23264

Hehir-Kwa, J. Y. et al. A high-quality human reference panel reveals the complexity and distribution of genomic structural variants. Nat. Commun. 7, 12989 (2016).

doi: 10.1038/ncomms12989

Kehr, B. et al. Diversity in non-repetitive human sequences not found in the reference genome. Nat. Genet. 49, 588–593 (2017).

doi: 10.1038/ng.3801

Gordienko, E. N., Kazanov, M. D. & Gelfand, M. S. Evolution of pan-genomes of Escherichia coli, Shigella spp., and Salmonella enterica. J. Bacteriol. 195, 2786–2792 (2013).

doi: 10.1128/JB.02285-12

Tettelin, H. et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc. Natl Acad. Sci. USA 102, 13950–13955 (2005).

doi: 10.1073/pnas.0506758102

Vernikos, G., Medini, D., Riley, D. R. & Tettelin, H. Ten years of pan-genome analyses. Curr. Opin. Microbiol. 23, 148–154 (2015).

doi: 10.1016/j.mib.2014.11.016

Telenti, A. et al. Deep sequencing of 10,000 human genomes. Proc. Natl Acad. Sci. USA 113, 11901–11906 (2016).

doi: 10.1073/pnas.1613365113

Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).

doi: 10.1101/gr.214007.116

Mathias, R. A. et al. A continuum of admixture in the Western Hemisphere revealed by the African Diaspora genome. Nat. Commun. 7, 12522 (2016).

doi: 10.1038/ncomms12522

Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).

doi: 10.1186/1471-2105-10-421

Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).

doi: 10.1038/nature18964

Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

doi: 10.1093/bioinformatics/btp324

Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

doi: 10.1038/nmeth.1923

Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

doi: 10.1093/bioinformatics/btp352

Zimin, A. V. et al. The MaSuRCA genome assembler. Bioinformatics 29, 2669–2677 (2013).

doi: 10.1093/bioinformatics/btt476

Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).

doi: 10.1101/gr.210641.116

Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics Chapter 4, Unit 4.10 (2009).

Delcher, A. L., Salzberg, S. L. & Phillippy, A. M. Using MUMmer to identify similar regions in large sequence sets. Curr. Protoc. Bioinformatics Chapter 10, Unit 10.13 (2003).

Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

doi: 10.1093/bioinformatics/btq033

Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).

doi: 10.1186/gb-2014-15-3-r46
