CSA: A high-throughput chromosome-scale assembly pipeline for vertebrate genomes.
chromosomes
comparative genomics
genome assembly
genome evolution
genome scaffolding
long-read
vertebrates
Journal
GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872
Publication information
Publication date: 1 May 2020
History:
received: 31 Oct 2019
revised: 29 Jan 2020
accepted: 24 Mar 2020
entrez: 26 May 2020
pubmed: 26 May 2020
medline: 5 Oct 2021
Status: ppublish
Abstract
BACKGROUND
Easy-to-use and fast bioinformatics pipelines for long-read assembly that go beyond the contig level to generate highly continuous chromosome-scale genomes from raw data remain scarce.
RESULTS
Chromosome-Scale Assembler (CSA) is a novel, computationally efficient bioinformatics pipeline that fills this gap. CSA integrates information from scaffolded assemblies (e.g., Hi-C or 10X Genomics) or even from diverged reference genomes into the assembly process. Because CSA automates the assembly of chromosome-sized scaffolds, we benchmark its performance against state-of-the-art reference genomes, i.e., genomes built conventionally and laboriously with multiple separate assembly tools and manual curation. CSA increases contig lengths through scaffolding, local re-assembly, and gap closing. On certain datasets, the initial contig N50 can be increased up to 4.5-fold. For smaller vertebrate genomes, chromosome-scale assemblies can be achieved within 12 h on low-cost, high-end desktop computers. Mammalian genomes can be processed within 16 h on compute servers. Using diverged reference genomes for fish, birds, and mammals, we demonstrate that CSA computes chromosome-scale assemblies from long-read data and genome comparisons alone. Even contig-level draft assemblies of diverged genomes help to reconstruct chromosome-scale sequences. CSA can also assemble ultra-long reads.
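The contig N50 mentioned above is the length L such that contigs of length L or longer together cover at least half of the total assembly. The following is a minimal sketch of that standard definition and of what a 4.5-fold increase means. It is not part of CSA; the function name `n50` and all contig lengths are hypothetical toy values (in Mb), chosen only so that the ratio works out to 4.5.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in Mb), invented for illustration only.
contigs_before = [2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1]

# The same 20 Mb after an imagined round of merging: two groups of four
# 2-Mb contigs plus one 1-Mb contig each become 9 Mb; one 2-Mb contig remains.
contigs_after = [9, 9, 2]

n50_before = n50(contigs_before)
n50_after = n50(contigs_after)
fold_increase = n50_after / n50_before

print(f"N50 before: {n50_before} Mb")
print(f"N50 after:  {n50_after} Mb")
print(f"Fold increase: {fold_increase}")
```

In this toy case the total assembly length is unchanged (20 Mb), so the higher N50 reflects only that the same sequence sits in fewer, longer pieces.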
CONCLUSIONS
CSA can speed up and simplify chromosome-level assembly and significantly lower costs of large-scale family-level vertebrate genome projects.
Identifiers
pubmed: 32449778
pii: 5843736
doi: 10.1093/gigascience/giaa034
pmc: PMC7247394
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Copyright information
© The Author(s) 2020. Published by Oxford University Press.