Assembly and annotation of an Ashkenazi human reference genome.
Journal
Genome Biology
ISSN: 1474-760X
Abbreviated title: Genome Biol
Country: England
NLM ID: 100960660
Publication information
Publication date: 2 June 2020
History:
received: 6 April 2020
accepted: 15 May 2020
entrez: 4 June 2020
pubmed: 4 June 2020
medline: 2 April 2021
Status:
epublish
Abstract
Thousands of experiments and studies use the human reference genome as a resource each year. This single reference genome, GRCh38, is a mosaic created from a small number of individuals, representing a very small sample of the human population. There is a need for reference genomes from multiple human populations to avoid potential biases. Here, we describe the assembly and annotation of the genome of an Ashkenazi individual and the creation of a new, population-specific human reference genome. This genome is more contiguous and more complete than GRCh38, the latest version of the human reference genome, and is annotated with highly similar gene content. The Ashkenazi reference genome, Ash1, contains 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. Annotation identified 20,157 protein-coding genes, of which 19,563 are >99% identical to their counterparts on GRCh38. Most of the remaining genes have small differences. Forty of the protein-coding genes in GRCh38 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies. Eleven genes appear on different chromosomes from their homologs in GRCh38. Alignment of DNA sequences from an unrelated Ashkenazi individual to Ash1 identified ~1 million fewer homozygous SNPs than alignment of those same sequences to the more-distant GRCh38 genome, illustrating one of the benefits of population-specific reference genomes. The Ash1 genome is presented as a reference for any genetic studies involving Ashkenazi Jewish individuals.
Identifiers
pubmed: 32487205
doi: 10.1186/s13059-020-02047-7
pii: 10.1186/s13059-020-02047-7
pmc: PMC7265644
Publication types
Comparative Study
Journal Article
Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.
Languages
eng
Citation subsets
IM
Pagination
129
Grants
Agency: NHGRI NIH HHS
ID: R01 HG006677
Country: United States
Agency: NIGMS NIH HHS
ID: R35 GM130151
Country: United States