ILMCNet: A Deep Neural Network Model That Uses PLM to Process Features and Employs CRF to Predict Protein Secondary Structure.
conditional random field
hybrid neural network architecture
protein secondary structure
unsupervised protein language model
Journal
Genes
ISSN: 2073-4425
Titre abrégé: Genes (Basel)
Pays: Switzerland
ID NLM: 101551097
Informations de publication
Date de publication:
21 Oct 2024
21 Oct 2024
Historique:
received:
07
09
2024
revised:
07
10
2024
accepted:
18
10
2024
medline:
26
10
2024
pubmed:
26
10
2024
entrez:
26
10
2024
Statut:
epublish
Résumé
Protein secondary structure prediction (PSSP) is a critical task in computational biology, pivotal for understanding protein function and advancing medical diagnostics. Recently, approaches that integrate multiple amino acid sequence features have gained significant attention in PSSP research. We aim to automatically extract additional features represented by evolutionary information from a large number of sequences while simultaneously incorporating positional information for more comprehensive sequence features. Additionally, we consider the interdependence between secondary structures during the prediction stage. To this end, we propose a deep neural network model, ILMCNet, which utilizes a language model and Conditional Random Field (CRF). Protein language models (PLMs) pre-trained on sequences from multiple large databases can provide sequence features that incorporate evolutionary information. ILMCNet uses positional encoding to ensure that the input features include positional information. To better utilize these features, we propose a hybrid network architecture that employs a Transformer Encoder to enhance features and integrates a feature extraction module combining a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory Network (BiLSTM). This design enables deep extraction of localized features while capturing global bidirectional information. In the prediction stage, ILMCNet employs CRF to capture the interdependencies between secondary structures. Experimental results on benchmark datasets such as CB513, TS115, NEW364, CASP11, and CASP12 demonstrate that the prediction performance of our method surpasses that of comparable approaches. This study proposes a new approach to PSSP research and is expected to play an important role in other protein-related research fields, such as protein tertiary structure prediction.
Sections du résumé
BACKGROUND
BACKGROUND
Protein secondary structure prediction (PSSP) is a critical task in computational biology, pivotal for understanding protein function and advancing medical diagnostics. Recently, approaches that integrate multiple amino acid sequence features have gained significant attention in PSSP research.
OBJECTIVES
OBJECTIVE
We aim to automatically extract additional features represented by evolutionary information from a large number of sequences while simultaneously incorporating positional information for more comprehensive sequence features. Additionally, we consider the interdependence between secondary structures during the prediction stage.
METHODS
METHODS
To this end, we propose a deep neural network model, ILMCNet, which utilizes a language model and Conditional Random Field (CRF). Protein language models (PLMs) pre-trained on sequences from multiple large databases can provide sequence features that incorporate evolutionary information. ILMCNet uses positional encoding to ensure that the input features include positional information. To better utilize these features, we propose a hybrid network architecture that employs a Transformer Encoder to enhance features and integrates a feature extraction module combining a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory Network (BiLSTM). This design enables deep extraction of localized features while capturing global bidirectional information. In the prediction stage, ILMCNet employs CRF to capture the interdependencies between secondary structures.
RESULTS
RESULTS
Experimental results on benchmark datasets such as CB513, TS115, NEW364, CASP11, and CASP12 demonstrate that the prediction performance of our method surpasses that of comparable approaches.
CONCLUSIONS
CONCLUSIONS
This study proposes a new approach to PSSP research and is expected to play an important role in other protein-related research fields, such as protein tertiary structure prediction.
Identifiants
pubmed: 39457474
pii: genes15101350
doi: 10.3390/genes15101350
pii:
doi:
Substances chimiques
Proteins
0
Types de publication
Journal Article
Langues
eng
Sous-ensembles de citation
IM
Subventions
Organisme : Key R&D Program of Heilongjiang Province
ID : 2022ZX01A29
Organisme : National Natural Science Foundation of China
ID : 62272095