Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein.

DNA-Binding Proteins / metabolism; Computational Biology / methods; Protein Domains; Humans; DNA / metabolism; Protein Binding; Algorithms

Journal

Nature communications

ISSN: 2041-1723

Abbreviated title: Nat Commun

Country: England

NLM ID: 101528555

Publication information

Publication date:
07 Sep 2024

History:

received: 18 Dec 2023

accepted: 29 Aug 2024

medline: 8 Sep 2024

pubmed: 8 Sep 2024

entrez: 7 Sep 2024

Status: epublish

Abstract

DNA-protein interactions underpin many pivotal biological processes, such as DNA replication, transcription, and gene regulation. However, accurate and efficient computational methods for identifying these interactions are still lacking. In this study, we propose a method, ESM-DBP, built by refining the DNA-binding protein sequence repertory and applying domain-adaptive pretraining to a general protein language model. Because general language models have not fully explored the domain-specific knowledge of DNA-binding proteins, we select 170,264 DNA-binding protein sequences to construct the domain-adaptive language model. Experimental results on four downstream tasks show that ESM-DBP provides a better feature representation of DNA-binding proteins than the original language model, resulting in improved prediction performance that surpasses state-of-the-art methods. Moreover, ESM-DBP still performs well on sequences with only a few homologous sequences. ChIP-seq on two predicted cases further supports the validity of the proposed method.
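The abstract does not spell out the pretraining objective. As a rough illustration only, domain-adaptive pretraining of an ESM-style model usually means continuing masked-language-model training on the domain corpus (here, DNA-binding protein sequences). The sketch below shows how one training example could be corrupted for that objective. The function name `mask_sequence`, the 15% selection rate, and the 80/10/10 replacement split are assumptions borrowed from the common BERT convention, not details taken from this record.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
MASK = "<mask>"                        # hypothetical mask token string


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Build one masked-LM example from a protein sequence.

    Returns (tokens, targets): tokens is the corrupted sequence and
    targets[i] holds the original residue at each selected position
    (None elsewhere). Rates are assumed, not taken from the paper.
    """
    rng = rng or random.Random()
    tokens = list(seq)
    targets = [None] * len(tokens)
    # Select at least one position, but never more than the sequence has.
    n_select = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    for i in rng.sample(range(len(tokens)), n_select):
        targets[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            tokens[i] = MASK                     # most selections become the mask token
        elif roll < 0.9:
            tokens[i] = rng.choice(AMINO_ACIDS)  # some become a random residue
        # the remainder keep the original residue unchanged
    return tokens, targets
```

During continued pretraining, the model would be asked to recover `targets` from `tokens`, with the loss computed only at the selected positions.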

Identifiers

DOI: 10.1038/s41467-024-52293-7 PMID: 39244557

pubmed: 39244557

doi: 10.1038/s41467-024-52293-7

pii: 10.1038/s41467-024-52293-7

Chemical substances

DNA-Binding Proteins 0

DNA 9007-49-2

Publication types

Journal Article

Languages

eng

Pagination

7838


Authors

Wenwu Zeng (W)

Yutao Dou (Y)

Liangrui Pan (L)

Liwen Xu (L)

Shaoliang Peng (S)
