rstoolbox - a Python library for large-scale analysis of computational protein design data and structural bioinformatics.


Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
15 May 2019
History:
received: 23 01 2019
accepted: 08 04 2019
entrez: 17 5 2019
pubmed: 17 5 2019
medline: 21 6 2019
Status: epublish

Abstract

Large-scale datasets of protein structures and sequences are becoming ubiquitous in many domains of biological research. Experimental approaches and computational modelling methods are generating biological data at an unprecedented rate. The detailed analysis of structure-sequence relationships is critical to unveil the governing principles of protein folding, stability and function. Computational protein design (CPD) has emerged as an important structure-based approach to engineer proteins for novel functions. Generally, CPD workflows rely on the generation of large numbers of structural models to search for the optimal structure-sequence configurations. As such, an important step of the CPD process is the selection of a small subset of sequences to be experimentally characterized. Given the limitations of current CPD scoring functions, multi-step design protocols and elaborate analysis of the decoy populations have become essential for the selection of sequences for experimental characterization and for the success of CPD strategies. Here, we present rstoolbox, a Python library for the analysis of large-scale structural data tailored for CPD applications. rstoolbox is oriented towards both CPD software users and developers, and is easily integrated into analysis workflows. For users, it offers the ability to profile and select decoy sets, which may guide multi-step design protocols or follow-up experimental characterization. rstoolbox provides intuitive solutions for the visualization of large sequence/structure datasets (e.g. logo plots and heatmaps) and facilitates the analysis of experimental data obtained through traditional biochemical techniques (e.g. circular dichroism and surface plasmon resonance) and high-throughput sequencing. For CPD software developers, it provides a framework to easily benchmark and compare different CPD approaches. We showcase rstoolbox in both types of applications.
rstoolbox is a library for the evaluation of protein structure datasets, tailored for CPD data. It provides interactive access through seamless integration with IPython, while still being suitable for high-performance computing. In addition to its functionalities for data analysis and graphical representation, the inclusion of rstoolbox in protein design pipelines will make it easy to standardize the selection of design candidates and to improve the overall reproducibility and robustness of CPD selection processes.
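
The decoy selection step described in the abstract can be pictured with a minimal sketch. This is not the rstoolbox API: it is plain standard-library Python, and the `Decoy` class, its field names, the `select_decoys` helper, the thresholds and the toy values are all hypothetical, chosen only to illustrate filtering a decoy population on a structural criterion and then ranking it by score.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Decoy:
    """One structural model; all field names here are hypothetical."""
    name: str
    score: float  # design score; lower is assumed to be better
    rmsd: float   # deviation from a reference structure, in angstroms


def select_decoys(decoys: List[Decoy], max_rmsd: float, top_n: int) -> List[Decoy]:
    """Keep decoys within max_rmsd, then return the top_n best-scoring ones."""
    filtered = [d for d in decoys if d.rmsd <= max_rmsd]
    return sorted(filtered, key=lambda d: d.score)[:top_n]


# Toy values for illustration only; they do not come from the article.
decoys = [
    Decoy("design_001", -310.5, 1.2),
    Decoy("design_002", -325.0, 3.8),
    Decoy("design_003", -298.1, 0.9),
    Decoy("design_004", -318.7, 1.6),
]

selected = select_decoys(decoys, max_rmsd=2.0, top_n=2)
print([d.name for d in selected])
```

In a multi-step protocol, the same filter-then-rank step would presumably be repeated after each design round, with the selected subset feeding the next round or the experimental characterization.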

Identifiers

pubmed: 31092198
doi: 10.1186/s12859-019-2796-3
pii: 10.1186/s12859-019-2796-3
pmc: PMC6521408

Chemical substances

Proteins 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

240

Grants

Agency: European Research Council
ID: 716058
Country: International
Agency: Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung
ID: 310030_163139

Authors

Jaume Bonet (J)

Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015, Lausanne, Switzerland.
Swiss Institute of Bioinformatics (SIB), CH-1015, Lausanne, Switzerland.

Zander Harteveld (Z)

Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015, Lausanne, Switzerland.
Swiss Institute of Bioinformatics (SIB), CH-1015, Lausanne, Switzerland.

Fabian Sesterhenn (F)

Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015, Lausanne, Switzerland.
Swiss Institute of Bioinformatics (SIB), CH-1015, Lausanne, Switzerland.

Andreas Scheck (A)

Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015, Lausanne, Switzerland.
Swiss Institute of Bioinformatics (SIB), CH-1015, Lausanne, Switzerland.

Bruno E Correia (BE)

Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, CH-1015, Lausanne, Switzerland. bruno.correia@epfl.ch.
Swiss Institute of Bioinformatics (SIB), CH-1015, Lausanne, Switzerland. bruno.correia@epfl.ch.
