Anomaly detection in genomic catalogues using unsupervised multi-view autoencoders.

Anomaly detection; ChIP-seq peak quality; Cis regulatory element; Convolutional autoencoder; Genomic assay; Model interpretability; Unsupervised curation

Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
25 Sep 2021
History:
received: 15 12 2020
accepted: 09 08 2021
revised: 04 06 2021
entrez: 26 9 2021
pubmed: 27 9 2021
medline: 29 9 2021
Status: epublish

Abstract

Accurate identification of Transcriptional Regulator binding locations is essential for analysis of genomic regions, including Cis Regulatory Elements (CREs). The customary NGS approaches, predominantly ChIP-seq, can be obscured by data anomalies and biases which are difficult to detect without supervision. Here, we develop a method that leverages the usual combinations between many experimental series to mark such atypical peaks. We use deep learning to perform a lossy compression of the genomic regions' representations with multiview convolutions. Using artificial data, we show that our method correctly identifies groups of correlating series and evaluates CREs according to group completeness. It is then applied to the ReMap database's large volume of curated ChIP-seq data. We show that peaks lacking known biological correlators are singled out and are less often confirmed in real data. We propose normalization approaches useful in interpreting black-box models. Our approach detects peaks that are less corroborated than average. It can be extended to other similar problems, and can be interpreted to identify correlation groups. It is implemented in an open-source tool called atyPeak.
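
The scoring idea in the abstract above (compress the peak data lossily, then flag peaks that the compressed representation fails to recover) can be illustrated with a small sketch. This is not atyPeak's implementation: a truncated SVD, which is the optimal linear lossy compression, stands in for the multiview convolutional autoencoder, and the function name `atypicality_scores`, the toy matrix, and the choice of `rank=2` are all assumptions made for illustration.

```python
import numpy as np

def atypicality_scores(presence, rank):
    """Score each observed peak by how well a rank-limited
    reconstruction of the regions-by-series matrix recovers it.
    (Hypothetical helper; a linear stand-in for an autoencoder.)"""
    presence = np.asarray(presence, dtype=float)
    # Truncated SVD: best rank-`rank` linear compression of the matrix.
    u, s, vt = np.linalg.svd(presence, full_matrices=False)
    rebuilt = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    # Report scores only where a peak was observed; elsewhere NaN.
    return np.where(presence > 0, np.clip(rebuilt, 0.0, 1.0), np.nan)

# Toy data (assumed): rows are genomic regions, columns are
# experimental series. Series 0-2 usually co-occur, as do series 3-4.
group_a = [1, 1, 1, 0, 0]
group_b = [0, 0, 0, 1, 1]
lone = [1, 0, 0, 0, 0]  # a peak missing its usual correlators

presence = np.array([group_a] * 4 + [group_b] * 4 + [lone])
scores = atypicality_scores(presence, rank=2)
print(np.round(scores, 2))
```

In this toy setting the lone peak in the last region receives a clearly lower score than the peaks found with their complete group, which mirrors the notion of a peak being less corroborated than average. A rank of 2 is used only because the toy data contains two correlation groups.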


Identifiers

pubmed: 34563116
doi: 10.1186/s12859-021-04359-2
pii: 10.1186/s12859-021-04359-2
pmc: PMC8467021

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

460

Copyright information

© 2021. The Author(s).


Authors

Quentin Ferré (Q)

INSERM, TAGC, Aix Marseille University, Marseille, France.
Université de Toulon, CNRS, LIS, Aix Marseille University, Marseille, France.

Jeanne Chèneby (J)

INSERM, TAGC, Aix Marseille University, Marseille, France.

Denis Puthier (D)

INSERM, TAGC, Aix Marseille University, Marseille, France.

Cécile Capponi (C)

Université de Toulon, CNRS, LIS, Aix Marseille University, Marseille, France. Cecile.Capponi@lis-lab.fr.

Benoît Ballester (B)

INSERM, TAGC, Aix Marseille University, Marseille, France. benoit.ballester@univ-amu.fr.
