RegEl corpus: identifying DNA regulatory elements in the scientific literature.
Journal
Database : the journal of biological databases and curation
ISSN: 1758-0463
Abbreviated title: Database (Oxford)
Country: England
NLM ID: 101517697
Publication information
Publication date: 2022-06-27
History:
received: 2022-02-04
revised: 2022-05-25
accepted: 2022-06-02
entrez: 2022-06-27
pubmed: 2022-06-28
medline: 2022-06-30
Status: ppublish
Abstract
High-throughput technologies have led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however, depends on the availability of annotated corpora. Therefore, we introduce RegEl (Regulatory Elements), the first freely available corpus annotated with regulatory DNA elements, comprising 305 PubMed abstracts with a total of 2690 sentences. We focus on enhancers, promoters and transcription factor binding sites. Three annotators worked in two stages, achieving an overall inter-annotator agreement of 0.73 F1, and 0.46 for regulatory elements. Depending on the entity type, IE baselines reach F1-scores of 0.48-0.91 for entity detection and 0.71-0.88 for entity normalization. Next, we apply our entity detection models to the entire PubMed collection and extract co-occurrences of genes or diseases with regulatory elements. This generates large collections of regulatory elements associated with 137 870 unique genes and 7420 diseases, which we make openly available.
Database URL: https://zenodo.org/record/6418451#.YqcLHvexVqg
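The co-occurrence step mentioned in the abstract can be illustrated with a minimal sketch. It assumes sentence-level co-occurrence and a hypothetical input format (lists of `(entity_type, text)` tuples per sentence); the abstract does not specify the granularity or the data format used by the authors, and the function name `count_cooccurrences` and the type labels are illustrative only.

```python
from collections import Counter
from itertools import product

# Hypothetical type labels; the corpus may use different names.
REGULATORY_TYPES = {"enhancer", "promoter", "tfbs"}
PARTNER_TYPES = {"gene", "disease"}


def count_cooccurrences(sentences):
    """Count (regulatory element, gene/disease) pairs that share a sentence.

    `sentences` is an iterable of lists of (entity_type, text) tuples.
    This input format is an assumption, not the RegEl release format.
    """
    counts = Counter()
    for entities in sentences:
        # Use sets so a pair is counted at most once per sentence.
        regulatory = {e for e in entities if e[0] in REGULATORY_TYPES}
        partners = {e for e in entities if e[0] in PARTNER_TYPES}
        for reg, partner in product(regulatory, partners):
            counts[(reg, partner)] += 1
    return counts


# Toy input: three sentences with already-detected entities.
example = [
    [("promoter", "TERT promoter"), ("gene", "TERT"), ("disease", "melanoma")],
    [("enhancer", "MYC enhancer"), ("gene", "MYC")],
    [("promoter", "TERT promoter"), ("gene", "TERT")],
]
cooc = count_cooccurrences(example)
```

In this sketch, a pair's count is the number of sentences in which both entities appear; a real pipeline would likely also normalize entity mentions to identifiers before counting.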
Identifiers
pubmed: 35758881
pii: 6618549
doi: 10.1093/database/baac043
pmc: PMC9235371
Chemical substances
DNA
9007-49-2
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Copyright information
© The Author(s) 2022. Published by Oxford University Press.