sRNAdeep: a novel tool for bacterial sRNA prediction based on DistilBERT encoding mode and deep learning algorithms.
Mycobacterium tuberculosis
Bacterial sRNA
Deep learning
Genome analysis
Journal
BMC genomics
ISSN: 1471-2164
Abbreviated title: BMC Genomics
Country: England
NLM ID: 100965258
Publication information
Publication date:
31 Oct 2024
History:
received: 25 Jul 2024
accepted: 24 Oct 2024
medline: 1 Nov 2024
pubmed: 1 Nov 2024
entrez: 1 Nov 2024
Status:
epublish
Abstract
BACKGROUND
Bacterial small regulatory RNAs (sRNAs) play a crucial role in cell metabolism and could serve as potential new drug targets in the treatment of pathogen-induced diseases. However, experimental methods for identifying sRNAs still require a large investment of human and material resources.
METHODS
In this study, we propose a novel sRNA prediction model called sRNAdeep, based on DistilBERT feature extraction and TextCNN. Bacterial sRNA and non-sRNA sequences were treated as sentences and fed into a composite deep learning model, whose classification performance was then evaluated.
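To make the "sequences as sentences" idea concrete, the following is a minimal sketch of one common way to turn a nucleotide sequence into a sentence-like string: splitting it into overlapping k-mer "words". The abstract does not state how sRNAdeep tokenizes its input, so the function name `kmer_sentence`, the choice of k = 3, and the stride of 1 are all assumptions made for illustration only.

```python
def kmer_sentence(seq: str, k: int = 3, stride: int = 1) -> str:
    """Split a nucleotide sequence into space-separated k-mer 'words'.

    Hypothetical helper: k and stride are illustrative defaults,
    not values reported for sRNAdeep.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive integers")
    seq = seq.upper()
    # Slide a window of width k along the sequence, stepping by `stride`.
    words = [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]
    return " ".join(words)


if __name__ == "__main__":
    # A toy sequence, not taken from any dataset.
    print(kmer_sentence("ACGTTGCA", k=3))
```

A string produced this way could then be passed to a text tokenizer and encoder in the same way as an ordinary sentence; sequences shorter than k yield an empty string and would need separate handling.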
RESULTS
By filtering sRNAs from the BSRD database, we obtained a validation dataset comprising 2438 positive and 4730 negative samples. Benchmark experiments showed that sRNAdeep performed better on the various evaluation indexes than previous sRNA prediction tools. By applying our tool to the Mycobacterium tuberculosis (MTB) genome, we identified 21 sRNAs within the intergenic and intron regions. A set of 272 target genes regulated by these sRNAs was also identified in MTB. The proteins encoded by two of these genes (lysX and icd1) are implicated in drug response, with significant active sites related to the drug resistance mechanisms of MTB.
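Because the validation set is imbalanced (2438 positive versus 4730 negative samples), a single index such as accuracy can be misleading. The sketch below computes several standard binary-classification indexes from confusion-matrix counts and applies them to a trivial classifier that labels every sample as non-sRNA. The abstract does not list which indexes were used, so the choice of metrics and the helper name `binary_metrics` are assumptions; the only figures taken from the record are the two class sizes.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common binary-classification indexes from confusion counts.

    Ratios with a zero denominator are reported as 0.0 by convention.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient, robust to class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Class sizes stated in the abstract.
POSITIVES, NEGATIVES = 2438, 4730

# Hypothetical baseline: predict "non-sRNA" for every sample.
baseline = binary_metrics(tp=0, fp=0, tn=NEGATIVES, fn=POSITIVES)

if __name__ == "__main__":
    for name, value in baseline.items():
        print(f"{name}: {value:.4f}")
```

This baseline reaches roughly 66% accuracy while its recall and Matthews correlation coefficient are both zero, which is why comparisons between sRNA predictors are usually reported across several indexes rather than accuracy alone.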
CONCLUSION
In conclusion, our newly developed sRNAdeep can help researchers identify bacterial sRNAs more precisely and is freely available at https://github.com/pyajagod/sRNAdeep.git.
Identifiers
pubmed: 39482572
doi: 10.1186/s12864-024-10951-6
pii: 10.1186/s12864-024-10951-6
Chemical substances
RNA, Bacterial
0
RNA, Small Untranslated
0
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
1021
Grants
Agency: National Natural Science Foundation of China
ID: 61903107
Copyright information
© 2024. The Author(s).