A comparative analysis of ENCODE and Cistrome in the context of TF binding signal.
Cistrome
Database
ENCODE
SignalValue
Transcription Factors
Journal
BMC genomics
ISSN: 1471-2164
Abbreviated title: BMC Genomics
Country: England
NLM ID: 100965258
Publication information
Publication date: 30 Aug 2024
History:
received: 02 Nov 2022
accepted: 25 Jul 2024
medline: 31 Aug 2024
pubmed: 31 Aug 2024
entrez: 29 Aug 2024
Status: epublish
Abstract
BACKGROUND
With the rise of publicly available genomic data repositories, it is now common for scientists to rely on computational models and preprocessed data, either as controls or to discover new knowledge. However, different repositories adhere to different principles and guidelines, and data processing plays a significant role in the quality of the resulting datasets. Two popular repositories for transcription factor binding site data, ENCODE and Cistrome, process the same biological samples in alternative ways, and their results are not always consistent. Moreover, the output format of the processing (BED narrowPeak) exposes a feature, the signalValue, which is seldom used in consistency checks but can offer valuable insight into the quality of the data.
RESULTS
We provide evidence that data points with a high signalValue (the top 25% of values) are more likely to be consistent between ENCODE and Cistrome in the human cell lines K562, GM12878, and HepG2. In addition, we show that filtering on these high values improves the quality of predictions for a machine learning algorithm that detects transcription factor interactions based only on positional information. Finally, we provide a set of practices and guidelines, based on the signalValue feature, for scientists who wish to compare and merge narrowPeaks from ENCODE and Cistrome.
CONCLUSIONS
The signalValue is an informative feature that can be used effectively to highlight consistent areas of overlap between different sources of TF binding sites that expose it. Its applicability extends downstream to positional machine learning algorithms, making it a powerful tool for performance tuning and data aggregation.
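The filtering idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the names `Peak`, `parse_narrowpeak`, `top_quartile`, and `fraction_overlapping` are hypothetical, the top 25% is taken by rank (ties are not treated specially), and overlap is a simple all-pairs check on half-open BED intervals. The only format detail relied on is that signalValue is the seventh column of a narrowPeak (BED6+4) line.

```python
import math
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive
    signal_value: float


def parse_narrowpeak(text: str) -> list[Peak]:
    """Parse narrowPeak text; signalValue is the 7th whitespace-separated column."""
    peaks = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank, comment, header, and malformed lines.
        if len(fields) < 7 or fields[0].startswith("#") or fields[0] in ("track", "browser"):
            continue
        peaks.append(Peak(fields[0], int(fields[1]), int(fields[2]), float(fields[6])))
    return peaks


def top_quartile(peaks: list[Peak]) -> list[Peak]:
    """Keep the ceil(n / 4) peaks with the highest signalValue (assumed cutoff rule)."""
    if not peaks:
        return []
    keep = math.ceil(len(peaks) / 4)
    ranked = sorted(peaks, key=lambda p: p.signal_value, reverse=True)
    return ranked[:keep]


def overlaps(a: Peak, b: Peak) -> bool:
    """True if two half-open intervals on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def fraction_overlapping(query: list[Peak], reference: list[Peak]) -> float:
    """Fraction of query peaks that overlap at least one reference peak."""
    if not query:
        return 0.0
    hits = sum(any(overlaps(q, r) for r in reference) for q in query)
    return hits / len(query)
```

Under these assumptions, comparing `fraction_overlapping(top_quartile(a), b)` with `fraction_overlapping(a, b)` for two peak sets of the same sample gives a rough view of whether high-signalValue peaks agree more often across sources. The all-pairs comparison is quadratic, so an interval index or a dedicated tool would be preferable for genome-scale files.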
Identifiers
pubmed: 39210256
doi: 10.1186/s12864-024-10668-6
pii: 10.1186/s12864-024-10668-6
pmc: PMC11363379
Chemical substances
Transcription Factors
0
Publication types
Journal Article
Comparative Study
Languages
eng
Citation subsets
IM
Pagination
817
Grants
Agency: National Research Foundation Singapore
ID: SBP-P3
Agency: Ministry of Education Singapore
ID: MOE T1 251RES1725
Agency: European Research Council
ID: AdG 693174
Country: International
Copyright information
© 2024. The Author(s).