Sparse generative modeling via parameter reduction of Boltzmann machines: Application to protein-sequence families.


Journal

Physical review. E
ISSN: 2470-0053
Titre abrégé: Phys Rev E
Pays: United States
ID NLM: 101676019

Informations de publication

Date de publication:
Aug 2021
Historique:
received: 17 02 2021
accepted: 19 07 2021
entrez: 16 9 2021
pubmed: 17 9 2021
medline: 17 9 2021
Statut: ppublish

Résumé

Boltzmann machines (BMs) are widely used as generative models. For example, pairwise Potts models (PMs), which are instances of the BM class, provide accurate statistical models of families of evolutionarily related protein sequences. Their parameters are the local fields, which describe site-specific patterns of amino acid conservation, and the two-site couplings, which mirror the coevolution between pairs of sites. This coevolution reflects structural and functional constraints acting on protein sequences during evolution. The most conservative choice to describe the coevolution signal is to include all possible two-site couplings into the PM. This choice, typical of what is known as Direct Coupling Analysis, has been successful for predicting residue contacts in the three-dimensional structure, mutational effects, and generating new functional sequences. However, the resulting PM suffers from important overfitting effects: many couplings are small, noisy, and hardly interpretable; the PM is close to a critical point, meaning that it is highly sensitive to small parameter perturbations. In this work, we introduce a general parameter-reduction procedure for BMs, via a controlled iterative decimation of the less statistically significant couplings, identified by an information-based criterion that selects either weak or statistically unsupported couplings. For several protein families, our procedure allows one to remove more than 90% of the PM couplings, while preserving the predictive and generative properties of the original dense PM, and the resulting model is far away from criticality, hence more robust to noise.

Identifiants

pubmed: 34525554
doi: 10.1103/PhysRevE.104.024407
doi:

Types de publication

Journal Article

Langues

eng

Sous-ensembles de citation

IM

Pagination

024407

Auteurs

Pierre Barrat-Charlaix (P)

Biozentrum, Universität Basel, Switzerland, Swiss Institute of Bioinformatics, Basel 4056, Switzerland.

Anna Paola Muntoni (AP)

Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino 10129, Italy.
Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo (TO), Italy.
Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, F-75005 Paris, France.
Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France.

Kai Shimagaki (K)

Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, F-75005 Paris, France.

Martin Weigt (M)

Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, F-75005 Paris, France.

Francesco Zamponi (F)

Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France.

Classifications MeSH