Statistical physics of interacting proteins: Impact of dataset size and quality assessed in synthetic sequences.


Journal

Physical review. E
ISSN: 2470-0053
Titre abrégé: Phys Rev E
Pays: United States
ID NLM: 101676019

Informations de publication

Date de publication:
Mar 2020
Historique:
received: 23 12 2019
accepted: 04 03 2020
entrez: 16 4 2020
pubmed: 16 4 2020
medline: 26 1 2021
Statut: ppublish

Résumé

Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g., direct coupling analysis (DCA), have allowed to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins and identifying pairs of residues which form contacts between interacting proteins. Here we address two underlying questions: How are the performances of both inference tasks related? How does performance depend on dataset size and the quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins and interblock couplings protein-protein interactions; controlled synthetic sequence data are generated by Monte Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets of known interaction partners are available and that an iterative pairing algorithm allows to make predictions even without a training set. Noise in the training data deteriorates performance. In both tasks we find a quadratic scaling relating dataset quality and size that is consistent with noise adding in square-root fashion and signal adding linearly when increasing the dataset. This implies that it is generally good to incorporate more data even if their quality are imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.

Identifiants

pubmed: 32290011
doi: 10.1103/PhysRevE.101.032413
doi:

Substances chimiques

Proteins 0

Types de publication

Journal Article

Langues

eng

Sous-ensembles de citation

IM

Pagination

032413

Auteurs

Carlos A Gandarilla-Pérez (CA)

Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), F-75005 Paris, France.
Facultad de Física, Universidad de la Habana, San Lázaro y L, Vedado, Habana 4, CP-10400, Cuba.

Pierre Mergny (P)

Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), F-75005 Paris, France.
Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire Jean Perrin (LJP, UMR 8237), F-75005 Paris, France.

Martin Weigt (M)

Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), F-75005 Paris, France.

Anne-Florence Bitbol (AF)

Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire Jean Perrin (LJP, UMR 8237), F-75005 Paris, France.
Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland.

Articles similaires

Databases, Protein Protein Domains Protein Folding Proteins Deep Learning

High-throughput Bronchus-on-a-Chip system for modeling the human bronchus.

Akina Mori, Marjolein Vermeer, Lenie J van den Broek et al.
1.00
Humans Bronchi Lab-On-A-Chip Devices Epithelial Cells Goblet Cells
Humans Colorectal Neoplasms Biomarkers, Tumor Prognosis Gene Expression Regulation, Neoplastic
1.00
Algorithms Computer Simulation Models, Biological Programming Languages Humans

Classifications MeSH