Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein.


Journal

Nature Communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555

Publication information

Publication date:
07 Sep 2024
History:
received: 18 Dec 2023
accepted: 29 Aug 2024
medline: 8 Sep 2024
pubmed: 8 Sep 2024
entrez: 7 Sep 2024
Status: epublish

Abstract

DNA-protein interactions underpin many pivotal biological processes, such as DNA replication, transcription, and gene regulation. However, accurate and efficient computational methods for identifying these interactions are still lacking. In this study, we propose ESM-DBP, a method built by refining the DNA-binding protein sequence repertory and applying domain-adaptive pretraining to a general protein language model. Because general language models have not fully explored the domain-specific knowledge of DNA-binding proteins, we select 170,264 DNA-binding protein sequences to construct the domain-adaptive language model. Experimental results on four downstream tasks show that ESM-DBP provides a better feature representation of DNA-binding proteins than the original language model, resulting in improved prediction performance that surpasses state-of-the-art methods. Moreover, ESM-DBP still performs well on sequences with only a few homologous sequences. ChIP-seq on two predicted cases further supports the validity of the proposed method.
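
The abstract does not spell out the pretraining objective. As a minimal sketch, assuming the masked-language-modeling objective commonly used by ESM-style protein language models, the token-corruption step applied to each domain sequence during continued pretraining might look like the following. The function name `mask_sequence`, the 15% masking rate, and the 80/10/10 replacement split are illustrative assumptions, not details taken from the paper.

```python
import random

# The 20 standard amino acids; the real model vocabulary is larger (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"


def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Corrupt a protein sequence for masked-language-model training.

    Returns (tokens, labels). labels[i] holds the original residue at the
    positions the model must predict and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    tokens = list(seq)
    labels = [None] * len(tokens)
    for i, residue in enumerate(seq):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = residue
        roll = rng.random()
        if roll < 0.8:
            tokens[i] = MASK  # most selected positions become the mask token
        elif roll < 0.9:
            tokens[i] = rng.choice(AMINO_ACIDS)  # some become a random residue
        # the remaining selected positions keep the original residue
    return tokens, labels


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the paper's dataset.
    tokens, labels = mask_sequence("MKRLSAQDNAHHWYT", rng=random.Random(0))
    print(tokens)
    print(labels)
```

In domain-adaptive pretraining, a general model's weights would be updated further on such corrupted DNA-binding protein sequences, with the loss computed only at positions where `labels` is not `None`.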

Identifiers

pubmed: 39244557
doi: 10.1038/s41467-024-52293-7
pii: 10.1038/s41467-024-52293-7

Chemical substances

DNA-Binding Proteins 0
DNA 9007-49-2

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

7838

Copyright information

© 2024. The Author(s).

Authors

Wenwu Zeng (W)

College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China.

Yutao Dou (Y)

College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China.

Liangrui Pan (L)

College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China.

Liwen Xu (L)

College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China. xuliwen@hnu.edu.cn.

Shaoliang Peng (S)

College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China. slpeng@hnu.edu.cn.
