A normalized differential sequence feature encoding method based on amino acid sequences.
amino acid sequences
dimensionality reduction methods
integrated learning
protein interactions
sequence feature extraction
Journal
Mathematical biosciences and engineering : MBE
ISSN: 1551-0018
Abbreviated title: Math Biosci Eng
Country: United States
NLM ID: 101197794
Publication information
Publication date: 07 07 2023
History:
medline: 11 9 2023
pubmed: 8 9 2023
entrez: 7 9 2023
Status:
ppublish
Abstract
Protein interactions are the foundation of all metabolic activities of cells, such as apoptosis, the immune response, and metabolic pathways. To improve the performance of protein interaction prediction, an encoding method based on normalized difference sequence features (NDSF) of amino acid sequences is proposed. NDSF jointly encodes the positional relationships between amino acids within a sequence and the correlation characteristics between sequence pairs. Sequence features are then extracted from the encoded 174-dimensional human protein sequence vectors using two dimensionality reduction methods, principal component analysis (PCA) and locally linear embedding (LLE). The study compares the classification performance of four ensemble learning methods (AdaBoost, Extra Trees, LightGBM, XGBoost) applied to the PCA and LLE features, with cross-validation and grid search used to find the best combination of parameters. The results show that the accuracy of NDSF is generally higher than that of the sequence matrix-based coding method (MOS), and that the loss and coding time are greatly reduced. The bar chart of feature extraction results shows that classification accuracy is significantly higher with the linear dimensionality reduction method, PCA, than with the nonlinear method, LLE. With XGBoost as the classifier, model accuracy reaches 99.2%, the best performance among all models. This study suggests that NDSF combined with PCA and XGBoost may be an effective strategy for classifying different human protein interactions.
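The evaluation workflow described in the abstract (dimensionality reduction, an ensemble classifier, and a cross-validated grid search over parameters) can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the authors' implementation: the data are synthetic placeholders from `make_classification` (only the 174-column width mirrors the abstract), the NDSF encoding itself is not reproduced, and the parameter grid, fold count, and choice of Extra Trees as the classifier are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Placeholder data standing in for encoded sequence vectors (assumption:
# only the 174-feature width is taken from the abstract; values are synthetic).
X, y = make_classification(
    n_samples=300, n_features=174, n_informative=20, random_state=0
)

# Dimensionality reduction followed by an ensemble classifier.
pipeline = Pipeline([
    ("reduce", PCA(random_state=0)),
    ("classify", ExtraTreesClassifier(random_state=0)),
])

# Hypothetical grid; the values searched in the study are not given in the abstract.
param_grid = {
    "reduce__n_components": [10, 30],
    "classify__n_estimators": [50, 100],
}

# Grid search scored by accuracy under stratified 5-fold cross-validation.
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_accuracy = search.best_score_
print(best_params, round(best_accuracy, 3))
```

To compare against the nonlinear reduction, the `"reduce"` step could be replaced with `sklearn.manifold.LocallyLinearEmbedding`; the classifier step could likewise be replaced with `AdaBoostClassifier`, or with the scikit-learn-compatible classifiers shipped by the XGBoost and LightGBM packages if those are installed.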
Chemical substances
Amino Acids
0
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM