CSA: A high-throughput chromosome-scale assembly pipeline for vertebrate genomes.


Journal

GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872

Publication information

Publication date:
2020-05-01
History:
received: 2019-10-31
revised: 2020-01-29
accepted: 2020-03-24
entrez: 2020-05-26
pubmed: 2020-05-26
medline: 2021-10-05
Status: ppublish

Abstract

Easy-to-use and fast bioinformatics pipelines for long-read assembly that go beyond the contig level to generate highly continuous chromosome-scale genomes from raw data remain scarce. Chromosome-Scale Assembler (CSA) is a novel computationally highly efficient bioinformatics pipeline that fills this gap. CSA integrates information from scaffolded assemblies (e.g., Hi-C or 10X Genomics) or even from diverged reference genomes into the assembly process. As CSA performs automated assembly of chromosome-sized scaffolds, we benchmark its performance against state-of-the-art reference genomes, i.e., conventionally built in a laborious fashion using multiple separate assembly tools and manual curation. CSA increases the contig lengths using scaffolding, local re-assembly, and gap closing. On certain datasets, initial contig N50 may be increased up to 4.5-fold. For smaller vertebrate genomes, chromosome-scale assemblies can be achieved within 12 h using low-cost, high-end desktop computers. Mammalian genomes can be processed within 16 h on compute-servers. Using diverged reference genomes for fish, birds, and mammals, we demonstrate that CSA calculates chromosome-scale assemblies from long-read data and genome comparisons alone. Even contig-level draft assemblies of diverged genomes are helpful for reconstructing chromosome-scale sequences. CSA is also capable of assembling ultra-long reads. CSA can speed up and simplify chromosome-level assembly and significantly lower costs of large-scale family-level vertebrate genome projects.
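
The contig N50 figure quoted in the abstract is a standard contiguity metric: the largest length L such that sequences of length L or longer together cover at least half of the total assembly. The sketch below shows one common way to compute it and to express an improvement as a fold change. The function name `n50` and the contig lengths are hypothetical illustrations, not values or code from the CSA pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths before and after scaffolding and gap closing.
# In the "after" set, four of the original contigs have been joined into one.
contigs_before = [100, 200, 300, 400, 500]
contigs_after = [300, 1200]

n50_before = n50(contigs_before)
n50_after = n50(contigs_after)
fold_change = n50_after / n50_before

print(f"N50 before: {n50_before} bp, after: {n50_after} bp, "
      f"fold change: {fold_change:.1f}")
```

Because N50 depends only on the length distribution, joining contigs raises it without adding any sequence, which is why it is usually reported alongside correctness measures such as comparison to a reference genome.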

Identifiers

pubmed: 32449778
pii: 5843736
doi: 10.1093/gigascience/giaa034
pmc: PMC7247394

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2020. Published by Oxford University Press.

Authors

Heiner Kuhl (H)

Department of Ecophysiology and Aquaculture, Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB), Müggelseedamm 310, 12587 Berlin, Germany.

Ling Li (L)

Department of Ecophysiology and Aquaculture, Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB), Müggelseedamm 310, 12587 Berlin, Germany.
College of Fisheries, Chinese Perch Research Center, Huazhong Agricultural University; Innovation Base for Chinese Perch Breeding, Key Lab of Freshwater Animal Breeding, Ministry of Agriculture, No.1 Shizishan Street, Hongshan District, 430070 Wuhan, Hubei Province, P.R. China.

Sven Wuertz (S)

Department of Ecophysiology and Aquaculture, Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB), Müggelseedamm 310, 12587 Berlin, Germany.

Matthias Stöck (M)

Department of Ecophysiology and Aquaculture, Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB), Müggelseedamm 310, 12587 Berlin, Germany.

Xu-Fang Liang (XF)

College of Fisheries, Chinese Perch Research Center, Huazhong Agricultural University; Innovation Base for Chinese Perch Breeding, Key Lab of Freshwater Animal Breeding, Ministry of Agriculture, No.1 Shizishan Street, Hongshan District, 430070 Wuhan, Hubei Province, P.R. China.

Christophe Klopp (C)

Sigenae, Bioinfo Genotoul, Mathématiques et Informatique Appliquées de Toulouse, INRAe, 24 Chemin de Borde Rouge, 31320 Auzeville-Tolosane, Castanet Tolosan, France.
