Towards parsimonious generative modeling of RNA families.


Journal

Nucleic acids research
ISSN: 1362-4962
Abbreviated title: Nucleic Acids Res
Country: England
NLM ID: 0411011

Publication information

Publication date:
25 Apr 2024
History:
accepted: 05 04 2024
revised: 05 03 2024
received: 14 10 2023
medline: 25 4 2024
pubmed: 25 4 2024
entrez: 25 4 2024
Status: aheadofprint

Abstract

Generative probabilistic models emerge as a new paradigm in data-driven, evolution-informed design of biomolecular sequences. This paper introduces a novel approach, called Edge Activation Direct Coupling Analysis (eaDCA), tailored to the characteristics of RNA sequences, with a strong emphasis on simplicity, efficiency, and interpretability. eaDCA explicitly constructs sparse coevolutionary models for RNA families, achieving performance levels comparable to more complex methods while utilizing a significantly lower number of parameters. Our approach demonstrates efficiency in generating artificial RNA sequences that closely resemble their natural counterparts in both statistical analyses and SHAPE-MaP experiments, and in predicting the effect of mutations. Notably, eaDCA provides a unique feature: estimating the number of potential functional sequences within a given RNA family. For example, in the case of cyclic di-AMP riboswitches (RF00379), our analysis suggests the existence of approximately 10^39 functional nucleotide sequences. While huge compared to the known <4000 natural sequences, this number represents only a tiny fraction of the vast pool of nearly 10^82 possible nucleotide sequences of the same length (136 nucleotides). These results underscore the promise of sparse and interpretable generative models, such as eaDCA, in enhancing our understanding of the expansive RNA sequence space.
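
The sequence-space arithmetic quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only and is not the authors' eaDCA code: it assumes a four-letter nucleotide alphabet with no gap symbol, takes the length of 136 nucleotides from the abstract, and treats the estimate of roughly 10^39 functional sequences as a given input rather than computing it. The names `log10_sequence_space`, `ALPHABET_SIZE`, `SEQ_LENGTH`, and `LOG10_FUNCTIONAL` are hypothetical.

```python
import math

ALPHABET_SIZE = 4       # A, C, G, U; gaps are ignored (assumption of this sketch)
SEQ_LENGTH = 136        # sequence length quoted in the abstract for RF00379
LOG10_FUNCTIONAL = 39   # order of magnitude quoted in the abstract, not derived here


def log10_sequence_space(length: int, alphabet_size: int = ALPHABET_SIZE) -> float:
    """Return log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet_size)


# Total number of possible sequences, expressed as a power of ten.
log10_total = log10_sequence_space(SEQ_LENGTH)

# Fraction of sequence space estimated to be functional, as a power of ten.
log10_fraction = LOG10_FUNCTIONAL - log10_total

print(f"log10(total sequences)     = {log10_total:.1f}")
print(f"log10(functional fraction) = {log10_fraction:.1f}")
```

Under these assumptions the total comes out near 10^82, consistent with the abstract, and the functional fraction is on the order of 10^-43.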

Identifiers

pubmed: 38661206
pii: 7658050
doi: 10.1093/nar/gkae289

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Grants

Agency: H2020 European Research Council
ID: AbioEvo/101002075
Agency: H2020 Marie Sklodowska-Curie Actions
ID: InferNet/734439
Agency: Agence Nationale de la Recherche
ID: ANR-10-EQPX-34
Agency: Human Frontier Science Program
ID: RGY0077
Agency: H2020 Marie Sklodowska-Curie Actions
ID: AI4theSciences/945304

Copyright information

© The Author(s) 2024. Published by Oxford University Press on behalf of Nucleic Acids Research.

Authors

Francesco Calvanese (F)

Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative - LCQB, Paris, France.
Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France.

Camille N Lambert (CN)

Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France.

Philippe Nghe (P)

Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France.

Francesco Zamponi (F)

Dipartimento di Fisica, Sapienza Università di Roma, Rome, Italy.
Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, Paris, France.

Martin Weigt (M)

Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative - LCQB, Paris, France.

MeSH classifications