A bioinformatic platform to integrate target capture and whole genome sequences of various read depths for phylogenomics.
secapr
de novo assembly
loci extraction
low-coverage whole genome sequencing
target sequence capture
Journal
Molecular ecology
ISSN: 1365-294X
Abbreviated title: Mol Ecol
Country: England
NLM ID: 9214478
Publication information
Publication date:
12 2021
History:
revised: 24 09 2021
received: 30 11 2020
accepted: 16 10 2021
pubmed: 22 10 2021
medline: 29 1 2022
entrez: 21 10 2021
Status:
ppublish
Abstract
The increasing availability of short-read whole genome sequencing (WGS) provides unprecedented opportunities to study ecological and evolutionary processes. Although loci of interest can be extracted from WGS data and combined with target sequence data, this requires suitable bioinformatic workflows. Here, we test different assembly and locus extraction strategies and implement them in secapr, a pipeline that processes short-read data into multilocus alignments for phylogenetics and molecular ecology analyses. We integrate the processing of data from low-coverage WGS (<30×) and target sequence capture into a flexible framework, while optimizing de novo contig assembly and loci extraction. Specifically, we test different assembly strategies by contrasting their ability to recover loci from targeted butterfly protein-coding genes, using four data sets: a WGS data set across different average coverages (10×, 5× and 2×) and a data set for which these loci were enriched prior to sequencing via target sequence capture. Using the resulting de novo contigs, we account for potential errors within contigs and infer phylogenetic trees to evaluate the ability of each assembly strategy to recover species relationships. We demonstrate that choosing multiple sizes of kmer simultaneously for assembly results in the highest yield of extracted loci from de novo assembled contigs, while data sets derived from sequencing read depths as low as 5× recover the expected species relationships in phylogenetic trees. By making the tested assembly approaches available in the secapr pipeline, we hope to inspire future studies to incorporate complementary data and make an informed choice on the optimal assembly strategy.
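The average coverages named in the abstract (10×, 5× and 2×) can be related to read counts with the standard depth formula: total sequenced bases divided by genome size. The abstract does not say how the three coverage levels were produced, so the sketch below only assumes, for illustration, that a deeper read set is subsampled down to each target. The function names and all numeric values are hypothetical and are not taken from the paper or from secapr.

```python
def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average read depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def subsample_fraction(current_cov: float, target_cov: float) -> float:
    """Fraction of reads to keep to reduce current_cov to target_cov."""
    if target_cov > current_cov:
        raise ValueError("target coverage exceeds available coverage")
    return target_cov / current_cov


# Hypothetical values, chosen only for illustration (not from the paper).
genome_size = 400_000_000   # assumed genome size in bases
read_length = 150           # assumed read length in bases
n_reads = 40_000_000        # assumed number of reads

cov = mean_coverage(n_reads, read_length, genome_size)

# Fraction of reads to retain for each target coverage in the abstract.
fractions = {target: subsample_fraction(cov, target) for target in (10, 5, 2)}

for target, frac in fractions.items():
    print(f"{target}x: keep {frac:.3f} of reads")
```

The resulting fractions could be passed to any read subsampling tool; which tool is used, and whether subsampling was the approach taken in the study, is left open here.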
Identifiers
pubmed: 34674330
doi: 10.1111/mec.16240
pmc: PMC9298010
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
6021-6035
Grants
Agency: Swedish Research Council
ID: 2017-04980
Agency: Swedish Research Council
ID: 2019-04739
Agency: Swedish Foundation for Strategic Research
Agency: Royal Botanic Gardens, Kew
Agency: Grant Agency of the Czech Republic
ID: GJ20-18566Y
Agency: Marie Skłodowska-Curie Fellowship of the European Commission
ID: MARIPOSAS-704035
Copyright information
© 2021 The Authors. Molecular Ecology published by John Wiley & Sons Ltd.