K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes.


Journal

The Plant Genome
ISSN: 1940-3372
Abbreviated title: Plant Genome
Country: United States
NLM ID: 101273919

Publication information

Publication date:
11 2021
History:
received: 07 04 2021
accepted: 06 07 2021
pubmed: 26 9 2021
medline: 29 3 2022
entrez: 25 9 2021
Status: ppublish

Abstract

The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
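The two-step idea described above (call repeats by k-mer counting, then label them by comparison to a library) can be illustrated with a toy sketch. This is not how Red or the Ensembl plant-scripts pipeline is implemented; the function names `count_kmers`, `soft_mask`, and `classify`, the tiny `k`, the `min_count` threshold, and the example library entries are all hypothetical and chosen only to make the principle visible on a short string. The classification step here substitutes simple k-mer sharing for the sequence-similarity search used in the paper.

```python
from collections import Counter
from typing import Dict, Optional


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def soft_mask(seq: str, k: int = 4, min_count: int = 3) -> str:
    """Step 1 (toy): lowercase bases covered by a k-mer seen >= min_count times."""
    seq = seq.upper()
    counts = count_kmers(seq, k)
    masked = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(b.lower() if m else b for b, m in zip(seq, masked))


def classify(region: str, library: Dict[str, str], k: int = 4) -> Optional[str]:
    """Step 2 (toy): name of the library entry sharing the most k-mers, or None."""
    region_kmers = set(count_kmers(region, k))
    best_name, best_shared = None, 0
    for name, ref in library.items():
        shared = len(region_kmers & set(count_kmers(ref, k)))
        if shared > best_shared:
            best_name, best_shared = name, shared
    return best_name


# Hypothetical example data: a short tandem repeat followed by unique sequence.
genome = "ACGTACGTACGTGGATCC"
library = {"toy_LTR": "TTACGTACGTAA", "toy_DNA": "GGGCCCGGGCCC"}

masked = soft_mask(genome)          # repeated part becomes lowercase
label = classify(genome[:12], library)  # label the masked region
```

The sketch counts k-mers in a single pass and never aligns the genome against a library during masking, which is the reason k-mer strategies can be much faster than homology-based masking; the library lookup is applied only to the regions already called as repeats.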

Identifiers

pubmed: 34562304
doi: 10.1002/tpg2.20143
pmc: PMC7614178
mid: EMS164607

Publication types

Journal Article
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

Languages

eng

Citation subsets

IM

Pagination

e20143

Grants

Agency: Wellcome Trust
ID: 108749
Country: United Kingdom
Agency: Biotechnology and Biological Sciences Research Council
ID: BB/P016855/1
Country: United Kingdom
Agency: Biotechnology and Biological Sciences Research Council
ID: BB/P027849/1
Country: United Kingdom

Copyright information

© 2021 The Authors. The Plant Genome published by Wiley Periodicals LLC on behalf of Crop Science Society of America.

Authors

Bruno Contreras-Moreira (B)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

Carla V Filippi (CV)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
Instituto de Biotecnología, Centro de Investigaciones en Ciencias Veterinarias y Agronómicas (CICVyA), Instituto Nacional de Tecnología Agropecuaria (INTA); Instituto de Agrobiotecnología y Biología Molecular (IABIMO), INTA-Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Nicolas Repetto y Los Reseros s/n (1686), Hurlingham, Buenos Aires, Argentina.
CONICET, Av Rivadavia 1917, C1033AAJ Ciudad de Buenos Aires, Argentina.

Guy Naamati (G)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

Carlos García Girón (C)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

James E Allen (JE)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

Paul Flicek (P)

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
