Improving reusability along the data life cycle: a regulatory circuits case study.

Bioinformatics; Dataset architecture; Linked Open Data; RDF named graphs; Reusability; SPARQL

Journal

Journal of biomedical semantics
ISSN: 2041-1480
Abbreviated title: J Biomed Semantics
Country: England
NLM ID: 101531992

Publication information

Publication date:
28 March 2022
History:
received: 2 July 2021
accepted: 7 March 2022
entrez: 29 March 2022
pubmed: 30 March 2022
medline: 5 April 2022
Status: epublish

Abstract

BACKGROUND
In life sciences, there has been a long-standing effort to standardize and integrate reference datasets and databases. Despite these efforts, data from many studies are provided in specific, non-standard formats. This hampers the reuse of study data in other pipelines, the reuse of pipeline results in other studies, and the enrichment of the data with additional information. The Regulatory Circuits project is one of the largest efforts to integrate human cell genomics data in order to predict tissue-specific transcription factor (TF)-gene interaction networks. In spite of its success, it exhibits the usual shortcomings that limit its update, its reuse (in whole or in part), and its extension with new data samples. To address these limitations, the resource was previously integrated into an RDF triplestore so that TF-gene interaction networks could be generated with two SPARQL queries. However, this triplestore did not store the computed networks and did not integrate metadata about tissues and samples, which limited the reuse of the dataset. In particular, it does not allow reusing only a portion of Regulatory Circuits when a study focuses on a subset of the tissues, nor combining the samples described in the dataset with samples from other studies. Overall, these limitations call for a complete, flexible and reusable representation of the Regulatory Circuits dataset based on Semantic Web technologies.
RESULTS
We provide a modular RDF representation of Regulatory Circuits, called Linked Extended Regulatory Circuits (LERC). It consists of (i) descriptions of the biological and experimental context mapped to reference databases, (ii) annotations of TF-gene interactions at the sample level for 808 samples, (iii) annotations of TF-gene interactions at the tissue level for 394 tissues, and (iv) metadata connecting the knowledge graphs listed above. LERC is organised into 1,205 RDF named graphs representing the biological data, the sample-specific and tissue-specific networks, and the corresponding metadata. In total, it contains 3,910,794,050 triples and is available as a SPARQL endpoint.
CONCLUSION
The flexible and modular architecture of LERC supports biologically relevant SPARQL queries. It allows easy and fast querying of the resources related to the initial Regulatory Circuits datasets and facilitates their reuse in other studies. Associated website: https://regulatorycircuits-lod.genouest.org.
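
Because the abstract describes LERC as a set of RDF named graphs behind a SPARQL endpoint, a query can in principle be restricted to a single graph (for example, one tissue-specific network) with the standard SPARQL GRAPH keyword. The following is a minimal sketch of that idea, not the authors' queries: the endpoint path and the graph IRI are placeholders assumed for illustration (the abstract gives only the website URL), and the triple pattern is deliberately generic because the LERC vocabulary is not described here.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint path; only the website URL is given in the abstract.
ENDPOINT = "https://regulatorycircuits-lod.genouest.org/sparql"
# Assumption: hypothetical named-graph IRI, not an actual LERC graph name.
EXAMPLE_GRAPH = "http://example.org/lerc/graph/tissue/example"

# Characters that must not appear inside an IRI written between angle brackets.
_FORBIDDEN_IRI_CHARS = '<>"{}|\\^` '


def build_graph_query(graph_iri: str, limit: int = 10) -> str:
    """Return a SPARQL SELECT query restricted to one named graph."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    if any(char in graph_iri for char in _FORBIDDEN_IRI_CHARS):
        raise ValueError("graph IRI contains a character not allowed in an IRI")
    return (
        "SELECT ?s ?p ?o\n"
        f"WHERE {{ GRAPH <{graph_iri}> {{ ?s ?p ?o }} }}\n"
        f"LIMIT {limit}"
    )


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as a GET request URL using the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    query = build_graph_query(EXAMPLE_GRAPH, limit=5)
    print(query)
    print(build_request_url(ENDPOINT, query))
```

The sketch only builds the query text and the request URL; it does not contact the endpoint. Scoping a query to one named graph is what would let a study reuse a subset of the tissues or samples without touching the rest of the dataset.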

Identifiers

pubmed: 35346379
doi: 10.1186/s13326-022-00266-4
pii: 10.1186/s13326-022-00266-4
pmc: PMC8962212

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

11

Copyright information

© 2022. The Author(s).

Authors

Marine Louarn (M)

Univ Rennes, CNRS, Inria, IRISA, UMR 6074, Rennes, F-35000, France. marine.louarn@inria.fr.
UMR_S1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, 35000, France. marine.louarn@inria.fr.

Fabrice Chatonnet (F)

UMR_S1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, 35000, France.
Laboratoire d'Hématologie, Pôle de Biologie, Centre Hospitalier Universitaire de Rennes, Rennes, 35033, France.

Xavier Garnier (X)

Univ Rennes, CNRS, Inria, IRISA, UMR 6074, Rennes, F-35000, France.

Thierry Fest (T)

UMR_S1236, Université Rennes 1, INSERM, Etablissement Français du Sang, Rennes, 35000, France.
Laboratoire d'Hématologie, Pôle de Biologie, Centre Hospitalier Universitaire de Rennes, Rennes, 35033, France.

Anne Siegel (A)

Univ Rennes, CNRS, Inria, IRISA, UMR 6074, Rennes, F-35000, France.

Catherine Faron (C)

Université Côte d'Azur, Inria, CNRS, I3S, Sophia-Antipolis, France.

Olivier Dameron (O)

Univ Rennes, CNRS, Inria, IRISA, UMR 6074, Rennes, F-35000, France. olivier.dameron@univ-rennes1.fr.
