Assembly of a pan-genome from deep sequencing of 910 humans of African descent.
Journal
Nature genetics
ISSN: 1546-1718
Abbreviated title: Nat Genet
Country: United States
NLM ID: 9216904
Publication information
Publication date:
January 2019
History:
received: 20 November 2017
accepted: 8 October 2018
pubmed: 21 November 2018
medline: 25 April 2019
entrez: 21 November 2018
Status:
ppublish
Abstract
We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.
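The "collect all reads that failed to align" step can be illustrated with a minimal sketch. It assumes the alignments are available as SAM-format text, which the abstract does not state, and uses the standard SAM FLAG bit 0x4 ("segment unmapped") to pick out unaligned reads. The function name and the toy records are hypothetical and are not data or code from the study.

```python
from typing import Iterable, Iterator

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unaligned_read_names(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield the names of reads whose SAM FLAG marks them as unmapped."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED_FLAG:
            yield name


# Toy records (hypothetical read names, not data from the study).
example_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",   # aligned, forward strand
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",          # unmapped
    "read3\t16\tchr2\t250\t60\t4M\t*\t0\t0\tGGCA\tIIII",  # aligned, reverse strand
    "read4\t77\t*\t0\t0\t*\t*\t0\t0\tCCAT\tIIII",         # paired, read and mate unmapped
]

unaligned = list(unaligned_read_names(example_sam))
```

In a pipeline like the one described, reads selected this way would then be passed to an assembler to build contigs; that step is not sketched here.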
Identifiers
pubmed: 30455414
doi: 10.1038/s41588-018-0273-y
pii: 10.1038/s41588-018-0273-y
pmc: PMC6309586
mid: NIHMS1509230
Publication types
Journal Article
Research Support, N.I.H., Extramural
Languages
eng
Citation subsets
IM
Pagination
30-35
Grants
Agency: NHGRI NIH HHS
ID: R01 HG006677
Country: United States
Agency: NHLBI NIH HHS
ID: R01 HL129239
Country: United States
Agency: NIGMS NIH HHS
ID: U54 GM115428
Country: United States
Agency: NHLBI NIH HHS
ID: R01 HL104608
Country: United States
Agency: NIAID NIH HHS
ID: R01 AI132476
Country: United States
Comments and corrections
Type: ErratumIn
Type: CommentIn