A workflow reproducibility scale for automatic validation of biological interpretation results.
Keywords: provenance; reproducibility; workflow
Journal
GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872
Publication information
Publication date:
28 Dec 2022
History:
received: 2 Nov 2022
revised: 26 Jan 2023
accepted: 28 Apr 2023
medline: 9 May 2023
pubmed: 8 May 2023
entrez: 7 May 2023
Status:
ppublish
Abstract
BACKGROUND
Reproducibility of data analysis workflows is a key issue in the field of bioinformatics. Recent computing technologies, such as virtualization, have made it possible to reproduce workflow execution with ease. However, the reproducibility of results is not well discussed; that is, there is no standard way to verify whether the biological interpretation of reproduced results is the same. Therefore, it remains a challenge to automatically evaluate the reproducibility of results.
RESULTS
We propose a new metric, a reproducibility scale of workflow execution results, to evaluate the reproducibility of results. This metric is based on the idea of evaluating the reproducibility of results using biological feature values (e.g., number of reads, mapping rate, and variant frequency) that represent their biological interpretation. We also implemented a prototype system that automatically evaluates the reproducibility of results using the proposed metric. To demonstrate our approach, we conducted an experiment using workflows used by researchers in real research projects and use cases that are frequently encountered in the field of bioinformatics.
CONCLUSIONS
Our approach enables automatic evaluation of the reproducibility of results on a fine-grained scale. By introducing our approach, it is possible to evolve from a binary view of whether the results are superficially identical to a more graduated view. We believe that our approach will contribute to more informed discussion on reproducibility in bioinformatics.
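The idea of grading reproducibility by comparing biological feature values rather than checking byte-identical outputs can be sketched as follows. This is a hypothetical illustration, not the authors' prototype: the feature names (`num_reads`, `mapping_rate`, `variant_freq`), the relative tolerance, and the three-level scale are all assumptions made for the example.

```python
# Hypothetical sketch: grade the reproducibility of two workflow runs by
# comparing biological feature values within per-feature tolerances,
# instead of requiring superficially identical output files.

def grade_feature(original, reproduced, rel_tol=0.05):
    """Classify one feature-value pair on a coarse scale."""
    if original == reproduced:
        return "identical"
    if original == 0:
        return "different"
    if abs(original - reproduced) / abs(original) <= rel_tol:
        return "acceptable"  # within the assumed relative tolerance
    return "different"

def reproducibility_scale(run_a, run_b, rel_tol=0.05):
    """Aggregate per-feature grades into an overall verdict."""
    grades = {k: grade_feature(run_a[k], run_b[k], rel_tol)
              for k in run_a.keys() & run_b.keys()}
    if all(g == "identical" for g in grades.values()):
        overall = "fully reproduced"
    elif all(g in ("identical", "acceptable") for g in grades.values()):
        overall = "acceptably reproduced"
    else:
        overall = "not reproduced"
    return overall, grades

# Two runs that differ slightly in mapping rate and variant frequency:
run1 = {"num_reads": 1_000_000, "mapping_rate": 0.97, "variant_freq": 0.31}
run2 = {"num_reads": 1_000_000, "mapping_rate": 0.96, "variant_freq": 0.30}
overall, per_feature = reproducibility_scale(run1, run2)
```

With these example values, the small differences fall inside the tolerance, so the runs are judged "acceptably reproduced" rather than failing a strict equality check, which is the graduated view the abstract argues for.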
Identifiers
pubmed: 37150537
pii: 7150394
doi: 10.1093/gigascience/giad031
pmc: PMC10164546
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Copyright information
© The Author(s) 2023. Published by Oxford University Press on behalf of GigaScience.
References
PLoS Biol. 2015 Jul 07;13(7):e1002195
pubmed: 26151137
Bioinformatics. 2017 Aug 15;33(16):2580-2582
pubmed: 28379341
Gigascience. 2021 Feb 16;10(2):
pubmed: 33590861
Gigascience. 2022 Dec 28;12:
pubmed: 36810800
Nat Genet. 2011 May;43(5):491-8
pubmed: 21478889
Bioinformatics. 2016 Oct 1;32(19):3047-8
pubmed: 27312411
Bioinformatics. 2011 Aug 1;27(15):2156-8
pubmed: 21653522
Bioinformatics. 2013 Jan 1;29(1):15-21
pubmed: 23104886
Nat Biotechnol. 2017 Apr 11;35(4):316-319
pubmed: 28398311
Gigascience. 2019 Nov 1;8(11):
pubmed: 31675414
F1000Res. 2021 Mar 30;10:253
pubmed: 34367614
Nat Biotechnol. 2015 Apr;33(4):319
pubmed: 25850037
Nature. 2019 Sep;573(7772):149-150
pubmed: 31477884
Cell Genom. 2021 Nov 10;1(2):
pubmed: 35072136
Nucleic Acids Res. 2015 Jan;43(Database issue):D18-22
pubmed: 25477381
Front Genet. 2014 Jul 02;5:199
pubmed: 25071829
Nat Biotechnol. 2015 Jul;33(7):686-7
pubmed: 26154002
Nat Methods. 2021 Oct;18(10):1161-1168
pubmed: 34556866
Genome Res. 2010 Sep;20(9):1297-303
pubmed: 20644199
Nat Methods. 2014 Mar;11(3):211
pubmed: 24724161
Bioinformatics. 2013 May 15;29(10):1325-32
pubmed: 23479348
Bioinformatics. 2012 Oct 1;28(19):2520-2
pubmed: 22908215
Nat Biotechnol. 2020 Mar;38(3):276-278
pubmed: 32055031
Nature. 2016 May 25;533(7604):452-4
pubmed: 27225100
F1000Res. 2017 Jan 18;6:52
pubmed: 28344774
Nat Biotechnol. 2015 Mar;33(3):290-5
pubmed: 25690850
Genome Biol. 2010;11(5):207
pubmed: 20441614
Nat Rev Genet. 2016 May 17;17(6):333-51
pubmed: 27184599