ProteinFlow: An advanced framework for feature engineering in protein data analysis.

Keywords: data preprocessing; feature engineering; multidimensional feature extraction; protein data analysis; proteins

Journal

Biotechnology and bioengineering

ISSN: 1097-0290

Abbreviated title: Biotechnol Bioeng

Country: United States

NLM ID: 7502021

Publication information

Publication date:
23 Jul 2024

History:

received: 12 Feb 2024

revised: 04 Jun 2024

accepted: 10 Jul 2024

medline: 24 Jul 2024

pubmed: 24 Jul 2024

entrez: 24 Jul 2024

Status: ahead of print

Abstract

In the burgeoning field of proteins, the effective analysis of intricate protein data remains a formidable challenge, necessitating advanced computational tools for data processing, feature extraction, and interpretation. This study introduces ProteinFlow, an innovative framework designed to revolutionize feature engineering in protein data analysis. ProteinFlow stands out by offering enhanced efficiency in data collection and preprocessing, along with advanced capabilities in feature extraction, directly addressing the complexities inherent in multidimensional protein data sets. Through a comparative analysis, ProteinFlow demonstrated a significant improvement over traditional methods, notably reducing data preprocessing time and expanding the scope of biologically significant features identified. The framework's parallel data processing strategy and advanced algorithms ensure not only rapid data handling but also the extraction of comprehensive, meaningful insights from protein sequences, structures, and interactions. Furthermore, ProteinFlow exhibits remarkable scalability, adeptly managing large-scale data sets without compromising performance, a crucial attribute in the era of big data.
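The abstract describes parallel processing and feature extraction only at a high level. As a rough illustration of that idea, and not ProteinFlow's actual implementation or API, the sketch below computes two simple sequence features (amino acid composition and the mean Kyte-Doolittle hydropathy, often called GRAVY) for a batch of sequences using a worker pool. The function names `extract_features` and `extract_all` and the choice of features are assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def extract_features(sequence):
    """Return length, GRAVY, and composition for one sequence (hypothetical helper)."""
    # Non-standard symbols (e.g. X, B, Z, gaps) are ignored in this sketch.
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    counts = Counter(residues)
    composition = {aa: counts.get(aa, 0) / n for aa in KYTE_DOOLITTLE}
    gravy = sum(KYTE_DOOLITTLE[aa] for aa in residues) / n
    return {"length": n, "gravy": gravy, "composition": composition}


def extract_all(sequences, max_workers=4):
    """Apply extract_features to every sequence with a worker pool.

    pool.map preserves input order, so results line up with `sequences`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_features, sequences))


if __name__ == "__main__":
    for features in extract_all(["AAAA", "ACDE"]):
        print(features["length"], round(features["gravy"], 3))
```

A thread pool keeps the example short and portable; for CPU-bound feature extraction in CPython, a process pool (`concurrent.futures.ProcessPoolExecutor`) would usually be needed to get real parallel speedup.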

Identifiers

DOI: 10.1002/bit.28812 PMID: 39044472

Publication types

Journal Article

Languages

eng

Grants

Agency: Science Foundation Ireland

ID: 18/CRT/6223

Country: Ireland

Authors

Yanlin Mi (Y)

Stefan-Bogdan Marcu (SB)

Venkata V B Yallapragada (VVB)

Sabin Tabirca (S)
