Improving reusability along the data life cycle: a regulatory circuits case study.
Bioinformatics
Dataset architecture
Linked Open Data
RDF named graphs
Reusability
SPARQL
Journal
Journal of biomedical semantics
ISSN: 2041-1480
Abbreviated title: J Biomed Semantics
Country: England
NLM ID: 101531992
Publication information
Publication date: 28 March 2022
History:
received: 2 July 2021
accepted: 7 March 2022
entrez: 29 March 2022
pubmed: 30 March 2022
medline: 5 April 2022
Status:
epublish
Abstract
BACKGROUND
In the life sciences, there has been a long-standing effort to standardize and integrate reference datasets and databases. Despite these efforts, data from many studies are still provided in specific, non-standard formats. This hampers the capacity to reuse study data in other pipelines, to reuse pipeline results in other studies, and to enrich the data with additional information. The Regulatory Circuits project is one of the largest efforts to integrate human cell genomics data in order to predict tissue-specific transcription factor-gene interaction networks. In spite of its success, it exhibits the usual shortcomings that limit its update, its reuse (as a whole or partially), and its extension with new data samples. To address these limitations, the resource has previously been integrated into an RDF triplestore so that TF-gene interaction networks could be generated with two SPARQL queries. However, this triplestore did not store the computed networks and did not integrate metadata about tissues and samples, which limits the reuse of this dataset. In particular, it does not allow reusing only a portion of Regulatory Circuits when a study focuses on a subset of the tissues, nor combining the samples described in the datasets with samples from other studies. Overall, these limitations call for the design of a complete, flexible and reusable representation of the Regulatory Circuits dataset based on Semantic Web technologies.
RESULTS
We provide a modular RDF representation of Regulatory Circuits, called Linked Extended Regulatory Circuits (LERC). It consists of (i) descriptions of the biological and experimental context mapped to the reference databases, (ii) annotations about TF-gene interactions at the sample level for 808 samples, (iii) annotations about TF-gene interactions at the tissue level for 394 tissues, and (iv) metadata connecting the knowledge graphs cited above. LERC is organised in a modular way into 1,205 RDF named graphs that represent the biological data, the sample-specific and tissue-specific networks, and the corresponding metadata. In total it contains 3,910,794,050 triples and is available as a SPARQL endpoint.
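The named-graph organisation described above is what makes partial reuse possible: a query can be scoped to a chosen subset of graphs. The following is a minimal sketch of that idea, not the authors' queries. The function name `build_subset_query` and the `http://example.org/...` graph IRIs are hypothetical placeholders, and the generic `?s ?p ?o` pattern stands in for the actual LERC vocabulary, which is not given in this record.

```python
def build_subset_query(graph_iris, limit=10):
    """Build a SPARQL 1.1 query restricted to the given named graphs.

    graph_iris: iterable of graph IRIs (hypothetical placeholders here).
    The triple pattern is deliberately generic because the real LERC
    predicates are not known from this record.
    """
    graph_iris = list(graph_iris)
    if not graph_iris:
        raise ValueError("at least one named graph IRI is required")
    for iri in graph_iris:
        # Reject characters that are not allowed inside <...> in SPARQL.
        if any(ch in iri for ch in '<>"{}|^`\\ \t\n'):
            raise ValueError(f"invalid IRI: {iri!r}")
    values = " ".join(f"<{iri}>" for iri in graph_iris)
    return (
        "SELECT ?g ?s ?p ?o\n"
        "WHERE {\n"
        f"  VALUES ?g {{ {values} }}\n"
        "  GRAPH ?g { ?s ?p ?o }\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Hypothetical IRIs for two tissue-level graphs.
    print(build_subset_query(
        ["http://example.org/graph/tissue/1",
         "http://example.org/graph/tissue/2"],
        limit=5,
    ))
```

Binding `?g` with `VALUES` keeps the subset explicit in the query text, so a study that focuses on a few tissues only touches the corresponding graphs.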
CONCLUSION
The flexible and modular architecture of LERC supports biologically relevant SPARQL queries. It allows easy and fast querying of the resources related to the initial Regulatory Circuits datasets and facilitates their reuse in other studies. ASSOCIATED WEBSITE: https://regulatorycircuits-lod.genouest.org.
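As an illustration of how a SPARQL endpoint is typically queried over HTTP, the sketch below prepares a SPARQL 1.1 Protocol GET request using only the Python standard library. It builds the request without sending it. The helper name `make_sparql_request` and the `https://example.org/sparql` address are assumptions for illustration; the exact endpoint path of LERC is not stated in this record and should be taken from the associated website.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request


def make_sparql_request(endpoint_url, query):
    """Prepare (but do not send) a SPARQL 1.1 Protocol query via GET.

    endpoint_url: address of a SPARQL endpoint without a query string
                  (a placeholder is used in the example below).
    query:        SPARQL query text, passed URL-encoded in the `query`
                  parameter as the protocol specifies.
    """
    url = f"{endpoint_url}?{urlencode({'query': query})}"
    # Ask for the standard JSON serialisation of SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


if __name__ == "__main__":
    req = make_sparql_request(
        "https://example.org/sparql",  # hypothetical endpoint address
        "SELECT * WHERE { ?s ?p ?o } LIMIT 1",
    )
    print(req.get_method(), req.full_url)
    # Sending would be: urllib.request.urlopen(req), then json.load(...)
```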
Identifiers
pubmed: 35346379
doi: 10.1186/s13326-022-00266-4
pii: 10.1186/s13326-022-00266-4
pmc: PMC8962212
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
11
Copyright information
© 2022. The Author(s).