HEMDAG: a family of modular and scalable hierarchical ensemble methods to improve Gene Ontology term prediction.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
07 12 2021
History:
received: 29 03 2021
revised: 15 06 2021
accepted: 04 07 2021
medline: 13 4 2023
pubmed: 10 7 2021
entrez: 9 7 2021
Status: ppublish

Abstract

Automated protein function prediction is a complex multi-class, multi-label, structured classification problem in which protein functions are organized in a controlled vocabulary, according to the Gene Ontology (GO). 'Hierarchy-unaware' classifiers, also known as 'flat' methods, predict GO terms without exploiting the inherent structure of the ontology, potentially violating the True-Path-Rule (TPR) that governs the GO, while 'hierarchy-aware' approaches, even if they obey the TPR, do not always show clear improvements with respect to flat methods, or do not scale well when applied to the full GO. To overcome these limitations, we propose Hierarchical Ensemble Methods for Directed Acyclic Graphs (HEMDAG), a family of highly modular hierarchical ensembles of classifiers, able to build upon any flat method and to provide 'TPR-safe' predictions, by leveraging a combination of isotonic regression and TPR learning strategies. Extensive experiments on synthetic and real data across several organisms firstly show that HEMDAG can be used as a general tool to improve the predictions of flat classifiers, and secondly that HEMDAG is competitive versus state-of-the-art hierarchy-aware learning methods proposed in the last CAFA international challenges. Fully tested R code freely available at https://anaconda.org/bioconda/r-hemdag. Tutorial and documentation at https://hemdag.readthedocs.io. Supplementary data are available at Bioinformatics online.
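To make the idea of 'TPR-safe' predictions concrete, the following is a minimal Python sketch, not HEMDAG's R implementation or its API. It assumes the TPR requirement can be read as "a term never scores higher than any of its parents" and enforces it with a simple top-down correction that caps each flat score at the minimum corrected score of its parents. The function name `tpr_safe_scores`, the toy DAG, the term names and the scores are all hypothetical and chosen only for illustration; the isotonic regression and TPR learning strategies mentioned in the abstract are not reproduced here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def tpr_safe_scores(parents, flat_scores):
    """Cap each term's flat score at the lowest corrected score of its parents.

    parents: dict mapping every term to an iterable of its parent terms
             (root terms map to an empty iterable).
    flat_scores: dict mapping every term to a score in [0, 1] produced by
                 any hierarchy-unaware ('flat') classifier.
    """
    # TopologicalSorter takes node -> predecessors, so parents come first.
    order = TopologicalSorter(parents).static_order()
    corrected = {}
    for term in order:
        # Roots have no parents, so their upper bound defaults to 1.0.
        bound = min((corrected[p] for p in parents.get(term, ())), default=1.0)
        corrected[term] = min(flat_scores[term], bound)
    return corrected


# Hypothetical toy DAG: term "C" has two parents, as is allowed in the GO.
toy_parents = {"root": [], "A": ["root"], "B": ["root"], "C": ["A", "B"]}
# Hypothetical flat scores; "C" scores higher than its parent "A".
toy_scores = {"root": 0.9, "A": 0.4, "B": 0.7, "C": 0.8}

corrected = tpr_safe_scores(toy_parents, toy_scores)
for term in ("root", "A", "B", "C"):
    print(term, toy_scores[term], "->", corrected[term])
```

In this toy case only the score of "C" changes, because it is the only term whose flat score exceeds that of one of its parents. The sketch works on top of any flat classifier, since it needs nothing but the scores and the graph.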

Identifiers

pubmed: 34240108
pii: 6317663
doi: 10.1093/bioinformatics/btab485

Chemical substances

Proteins 0

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

4526-4533

Copyright information

© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

Authors

Marco Notaro (M)

AnacletoLab - Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy.

Marco Frasca (M)

AnacletoLab - Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy.

Alessandro Petrini (A)

AnacletoLab - Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy.

Jessica Gliozzo (J)

AnacletoLab - Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy.

Elena Casiraghi (E)

AnacletoLab - Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy.

Peter N Robinson (PN)

The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA.

Giorgio Valentini (G)

AnacletoLab - Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy.
CINI, National Laboratory in Artificial Intelligence and Intelligent Systems-AIIS, Roma 00185, Italy.
Data Science Research Center, Università degli Studi di Milano, Milano 20133, Italy.
