Using Language Representation Learning Approach to Efficiently Identify Protein Complex Categories in Electron Transport Chain.


Journal

Molecular informatics
ISSN: 1868-1751
Abbreviated title: Mol Inform
Country: Germany
NLM ID: 101529315

Publication information

Publication date:
10 2020
History:
received: 03 03 2020
accepted: 26 06 2020
pubmed: 1 7 2020
medline: 1 7 2021
entrez: 30 6 2020
Status: ppublish

Abstract

We propose a novel approach based on language representation learning to categorize electron complex proteins into 5 types. The idea stems from the shared characteristics of human language and protein sequence language; thus, advanced natural language processing techniques were used to extract useful features. Specifically, we employed transfer learning and word embedding techniques to analyze electron complex sequences and create efficient feature sets before using a support vector machine algorithm to classify them. During the 5-fold cross-validation process, seven types of sequence-based features were analyzed to find the optimal features. On average, our final classification models achieved an accuracy, specificity, sensitivity, and MCC of 96 %, 96.1 %, 95.3 %, and 0.86, respectively, on the cross-validation data. On the independent test data, the corresponding scores were 95.3 %, 92.6 %, 94 %, and 0.87. We conclude that, with features extracted by these representation learning methods, the prediction performance of a simple machine learning algorithm is on par with that of an existing deep neural network method on the task of categorizing electron complexes, while feature generation is much faster. Furthermore, the results show that combining features learned by the representation learning methods with sequence motif counts helps yield better performance.
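
The four scores reported in the abstract (accuracy, specificity, sensitivity, and MCC) can all be derived from confusion counts. The sketch below is illustrative only and is not the authors' code: it assumes a one-vs-rest treatment of each complex type, and the function names and the toy labels "I" to "V" are hypothetical.

```python
from math import sqrt


def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN when `positive` is treated as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred == positive:
            fp += 1
        elif truth != positive and pred != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity, and MCC from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; 0.0 is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy labels (hypothetical), standing in for the five complex types.
    y_true = ["I", "I", "II", "III", "II", "IV", "V"]
    y_pred = ["I", "II", "II", "III", "II", "IV", "V"]
    for label in ["I", "II", "III", "IV", "V"]:
        counts = one_vs_rest_counts(y_true, y_pred, label)
        print(label, binary_metrics(*counts))
```

Averaging the per-class dictionaries gives summary scores of the kind quoted above. Whether the paper averages over classes, folds, or both is not stated in the abstract.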

Identifiers

pubmed: 32598045
doi: 10.1002/minf.202000033

Chemical substances

Multiprotein Complexes 0

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

e2000033

Copyright information

© 2020 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.

Authors

Trinh-Trung-Duong Nguyen (TT)

Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan, 32003.

Nguyen-Quoc-Khanh Le (NQ)

Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City, 106, Taiwan.
Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City, 106, Taiwan.

Quang-Thai Ho (QT)

Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan, 32003.

Dinh-Van Phan (DV)

University of Economics, University of Danang, 41 Leduan St, Danang City, 550000, Vietnam.

Yu-Yen Ou (YY)

Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan, 32003.
