A normalized differential sequence feature encoding method based on amino acid sequences.

Humans; Amino Acid Sequence; Research Design; Amino Acids; Apoptosis; Computer Systems

amino acid sequences; dimensionality reduction methods; integrated learning; protein interactions; sequence feature extraction

Journal

Mathematical biosciences and engineering : MBE

ISSN: 1551-0018

Abbreviated title: Math Biosci Eng

Country: United States

NLM ID: 101197794

Publication information

Publication date:
07 07 2023

History:

medline: 11 9 2023

pubmed: 8 9 2023

entrez: 7 9 2023

Status: ppublish

Abstract

Protein interactions underlie all metabolic activities of cells, such as apoptosis, the immune response, and metabolic pathways. To improve the performance of protein interaction prediction, a coding method based on normalized difference sequence features (NDSF) of amino acid sequences is proposed. NDSF jointly encodes the positional relationships between amino acids within a sequence and the correlation characteristics between sequence pairs. Sequence features are then extracted from the encoded 174-dimensional human protein sequence vectors using two dimensionality reduction methods, principal component analysis (PCA) and locally linear embedding (LLE). The study compares the classification performance of four ensemble learning methods (AdaBoost, Extra Trees, LightGBM, XGBoost) applied to the PCA and LLE features, using cross-validation and grid search to find the best combination of parameters. The results show that the accuracy of NDSF is generally higher than that of the sequence matrix-based coding method (MOS), and that the loss and coding time are greatly reduced. The bar chart of feature extraction results shows that classification accuracy is significantly higher with the linear dimensionality reduction method, PCA, than with the nonlinear method, LLE. With XGBoost as the classifier, model accuracy reaches 99.2%, the best performance among all models. The study suggests that NDSF combined with PCA and XGBoost may be an effective strategy for classifying different human protein interactions.
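
The workflow summarized in the abstract (dimensionality reduction, an ensemble classifier, and a cross-validated grid search) can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the data are synthetic 174-dimensional vectors standing in for the encoded sequence pairs, Extra Trees from scikit-learn is used in place of XGBoost to keep the dependencies to one library, and the parameter grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for encoded 174-dimensional sequence-pair vectors.
# The real NDSF encoding is not reproduced here.
X, y = make_classification(
    n_samples=300, n_features=174, n_informative=20, random_state=0
)

# PCA for linear dimensionality reduction, followed by an ensemble classifier.
pipeline = Pipeline([
    ("pca", PCA(random_state=0)),
    ("clf", ExtraTreesClassifier(random_state=0)),
])

# Hypothetical grid; the abstract does not state which values were searched.
param_grid = {
    "pca__n_components": [10, 30],
    "clf__n_estimators": [50, 100],
}

# 5-fold cross-validated grid search over the whole pipeline, so PCA is
# refit on each training fold and never sees the held-out fold.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Placing PCA inside the pipeline, rather than reducing the data once up front, keeps the cross-validation estimate free of leakage from the held-out folds. Swapping in another estimator with a scikit-learn compatible interface only requires changing the `clf` step and its grid keys.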

Identifiers

DOI: 10.3934/mbe.2023659 PMID: 37679156

Chemical substances

Amino Acids 0

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Pagination

14734-14755

Authors

Xiaoman Zhao

Xue Wang

Zhou Jin

Rujing Wang
