ProteinFlow: An advanced framework for feature engineering in protein data analysis.
data preprocessing
feature engineering
multidimensional feature extraction
protein data analysis
proteins
Journal
Biotechnology and bioengineering
ISSN: 1097-0290
Abbreviated title: Biotechnol Bioeng
Country: United States
NLM ID: 7502021
Publication information
Publication date:
23 Jul 2024
History:
revised: 04 Jun 2024
received: 12 Feb 2024
accepted: 10 Jul 2024
medline: 24 Jul 2024
pubmed: 24 Jul 2024
entrez: 24 Jul 2024
Status:
aheadofprint
Abstract
In the burgeoning field of proteins, the effective analysis of intricate protein data remains a formidable challenge, necessitating advanced computational tools for data processing, feature extraction, and interpretation. This study introduces ProteinFlow, an innovative framework designed to revolutionize feature engineering in protein data analysis. ProteinFlow stands out by offering enhanced efficiency in data collection and preprocessing, along with advanced capabilities in feature extraction, directly addressing the complexities inherent in multidimensional protein data sets. Through a comparative analysis, ProteinFlow demonstrated a significant improvement over traditional methods, notably reducing data preprocessing time and expanding the scope of biologically significant features identified. The framework's parallel data processing strategy and advanced algorithms ensure not only rapid data handling but also the extraction of comprehensive, meaningful insights from protein sequences, structures, and interactions. Furthermore, ProteinFlow exhibits remarkable scalability, adeptly managing large-scale data sets without compromising performance, a crucial attribute in the era of big data.
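The abstract names parallel data processing and sequence-level feature extraction but gives no implementation details. As a minimal sketch of that general idea, and not ProteinFlow's actual code, the example below computes a few common sequence features (length, amino acid composition, and the mean Kyte-Doolittle hydropathy, often called GRAVY) for several sequences concurrently. The function names `extract_features` and `extract_all`, the choice of features, and the use of a thread pool are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def extract_features(sequence):
    """Hypothetical helper: basic features for one protein sequence."""
    # Keep only the 20 standard residues; anything else is skipped.
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    n = len(residues)
    counts = Counter(residues)
    return {
        "length": n,
        # Mean hydropathy over all counted residues (GRAVY).
        "gravy": sum(KYTE_DOOLITTLE[aa] for aa in residues) / n,
        # Fraction of each residue type present in the sequence.
        "composition": {aa: counts[aa] / n for aa in sorted(counts)},
    }


def extract_all(sequences, max_workers=4):
    """Hypothetical helper: run extract_features over many sequences.

    pool.map preserves input order, so results line up with `sequences`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_features, sequences))


if __name__ == "__main__":
    for features in extract_all(["MKTAYIAKQR", "GGGSGGGS", "LLVVIIAA"]):
        print(features["length"], round(features["gravy"], 3))
```

A thread pool keeps the sketch short and portable. For CPU-bound workloads in Python, a process pool (`concurrent.futures.ProcessPoolExecutor`) would be the usual substitute, since threads do not run Python bytecode in parallel.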
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Grants
Agency: Science Foundation Ireland
ID: 18/CRT/6223
Country: Ireland
Copyright information
© 2024 The Author(s). Biotechnology and Bioengineering published by Wiley Periodicals LLC.