PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
01 02 2020
History:
received: 03 04 2019
revised: 17 07 2019
accepted: 07 08 2019
pubmed: 9 8 2019
medline: 18 9 2020
entrez: 9 8 2019
Status: ppublish

Abstract

Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, identifying the features that select a protein for efficient secretion from these microorganisms has become an important task. Among secreted proteins, 'non-classical' secreted proteins are difficult to identify because they lack discernible signal peptide sequences and can use diverse secretion pathways. Several computational methods have been developed to facilitate the discovery of such proteins; however, the existing methods are based on either simulated or limited experimental datasets, and they often train their models on basic features in a simple, coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities to develop improved predictors of 'non-classical' secreted proteins from sequence data.

In this work, we first constructed a high-quality dataset of experimentally verified 'non-classical' secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer Light Gradient Boosting Machine (LightGBM) ensemble model that integrates several single feature-based models into an overall prediction framework. In the first layer, LightGBM, a gradient boosting machine, was used as the machine learning approach, and its parameters were optimized by a particle swarm optimization strategy. All single feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. The final ensemble model achieved an accuracy of 0.900, an F-value of 0.903, a Matthews correlation coefficient of 0.803 and an area under the curve of 0.963, outperforming previous state-of-the-art predictors on the independent test. Based on our proposed optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors.

The PeNGaRoo web server is available at http://pengaroo.erc.monash.edu/. Supplementary data are available at Bioinformatics online.
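
The threshold-based performance measures reported in the abstract (accuracy, F-value and Matthews correlation coefficient) can be computed from the four cells of a binary confusion matrix. The sketch below is illustrative only: it assumes the F-value is the usual F1 score (the harmonic mean of precision and recall), the function name `binary_metrics` is hypothetical, and the counts are made up rather than taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, F-value, MCC) from confusion-matrix counts.

    Assumption: 'F-value' is taken to be the standard F1 score.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    # Precision and recall, guarding against empty denominators.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)

    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f_value, mcc


# Hypothetical counts for illustration only; not the paper's results.
acc, f1, mcc = binary_metrics(tp=45, fp=5, tn=40, fn=10)
print(f"accuracy={acc:.3f}  F-value={f1:.3f}  MCC={mcc:.3f}")
```

Unlike these three measures, the area under the curve is computed from the ranked prediction scores rather than from a single thresholded confusion matrix, so it is not covered by this sketch.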

Identifiers

pubmed: 31393553
pii: 5545087
doi: 10.1093/bioinformatics/btz629

Chemical substances

Peptides 0
Proteins 0

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

704-712

Copyright information

© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

Authors

Yanju Zhang (Y)

Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China.

Sha Yu (S)

Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China.
Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia.

Ruopeng Xie (R)

Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China.
Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia.

Jiahui Li (J)

Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China.
Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia.

André Leier (A)

Department of Genetics, AL, USA.
Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA.

Tatiana T Marquez-Lago (TT)

Department of Genetics, AL, USA.
Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA.

Tatsuya Akutsu (T)

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan.

A Ian Smith (AI)

Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia.
ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, VIC 3800, Australia.

Zongyuan Ge (Z)

Monash e-Research Centre and Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia.

Jiawei Wang (J)

Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia.

Trevor Lithgow (T)

Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia.

Jiangning Song (J)

Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia.
ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, VIC 3800, Australia.


MeSH classifications