PortPred: Exploiting deep learning embeddings of amino acid sequences for the identification of transporter proteins and their substrates.
membrane proteins
pre-trained embeddings
protein sequence embeddings
substrates prediction
transporter proteins
Journal
Journal of cellular biochemistry
ISSN: 1097-4644
Abbreviated title: J Cell Biochem
Country: United States
NLM ID: 8205768
Publication information
Publication date:
Nov 2023
History:
revised: 2023-09-29
received: 2023-07-10
accepted: 2023-10-03
medline: 2023-11-27
pubmed: 2023-10-25
entrez: 2023-10-25
Status:
ppublish
Abstract
The physiology of every living cell is regulated at some level by transporter proteins, which constitute a substantial portion of membrane-bound proteins and are involved in the movement of ions, small molecules, and macromolecules across biological membranes. The prediction and study of previously unknown transporters can lead to the discovery of new biological pathways, drugs, and treatments. Here we present PortPred, a tool to accurately identify transporter proteins and their substrates starting from the protein amino acid sequence. PortPred combines pre-trained deep learning-based protein embeddings with machine learning classification approaches and outperforms other state-of-the-art methods. In addition, we present a comparison of the most promising protein sequence embeddings (UniRep, SeqVec, ProteinBERT, ESM-1b) and their performance on this specific task.
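The general idea in the abstract, mapping each sequence to a fixed-length vector and then training a classifier on those vectors, can be illustrated with a minimal sketch. This is not PortPred's implementation: `toy_embed` is a hypothetical stand-in (plain amino acid frequencies) for a pre-trained embedding such as those named above, the nearest-centroid rule stands in for whichever classifier is actually used, and the sequences and labels are made up for illustration.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(sequence):
    """Hypothetical stand-in for a pre-trained embedding.

    Returns a 20-dimensional vector of amino acid frequencies. A real
    pipeline would call a pre-trained model here instead.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(u, v):
    """Euclidean distance between two vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fit(sequences, labels):
    """Embed each training sequence and store one centroid per label."""
    by_label = {}
    for seq, label in zip(sequences, labels):
        by_label.setdefault(label, []).append(toy_embed(seq))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(model, sequence):
    """Assign the label whose centroid is closest to the embedded sequence."""
    vec = toy_embed(sequence)
    return min(model, key=lambda label: distance(model[label], vec))


# Made-up toy data: hydrophobic-rich vs. charged-rich sequences.
train_seqs = [
    "LLLVVVIIIAAALLFF",
    "VVLLIIAAFFLLVVII",
    "DDEEKKRRDDEEKKSS",
    "KKEEDDRRSSEEDDKK",
]
train_labels = ["transporter", "transporter", "non-transporter", "non-transporter"]

model = fit(train_seqs, train_labels)
print(predict(model, "LLVVIIAALLFFVVLL"))
```

The same two-stage structure would apply to substrate prediction by replacing the binary labels with substrate classes; only the embedding function and the classifier would need to be swapped for real components.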
Chemical substances
Membrane Transport Proteins
0
Membrane Proteins
0
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
1803-1824
Grants
Agency: HORIZON EUROPE Marie Sklodowska-Curie Actions
Agency: European Union's Horizon 2020 research and innovation program
ID: 812968
Copyright information
© 2023 The Authors. Journal of Cellular Biochemistry published by Wiley Periodicals LLC.