sRNAdeep: a novel tool for bacterial sRNA prediction based on DistilBERT encoding mode and deep learning algorithms.


Journal

BMC genomics
ISSN: 1471-2164
Abbreviated title: BMC Genomics
Country: England
NLM ID: 100965258

Publication information

Publication date:
31 Oct 2024
History:
received: 25 07 2024
accepted: 24 10 2024
medline: 1 11 2024
pubmed: 1 11 2024
entrez: 1 11 2024
Status: epublish

Abstract

Bacterial small regulatory RNAs (sRNAs) play a crucial role in cell metabolism and are potential new drug targets for treating pathogen-induced disease. However, experimental methods for identifying sRNAs still require a large investment of human and material resources. In this study, we propose a novel sRNA prediction model called sRNAdeep, based on DistilBERT feature extraction and TextCNN. Bacterial sRNA and non-sRNA sequences were treated as sentences and fed into a composite deep learning model to evaluate classification performance. By filtering sRNAs from the BSRD database, we obtained a validation dataset comprising 2438 positive and 4730 negative samples. Benchmark experiments showed that sRNAdeep outperformed previous sRNA prediction tools across the evaluated metrics. By applying our tool to the Mycobacterium tuberculosis (MTB) genome, we identified 21 sRNAs within the intergenic and intron regions. A set of 272 target genes regulated by these sRNAs was also identified in MTB. The proteins encoded by two of these genes (lysX and icd1) are implicated in drug response, with significant active sites related to drug resistance mechanisms of MTB. In conclusion, sRNAdeep can help researchers identify bacterial sRNAs more precisely and is freely available at https://github.com/pyajagod/sRNAdeep.git .
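
Illustrative sketch (not taken from the paper): the abstract says that sequences were "considered as sentences" but does not say how the words were formed. One common way to do this, assumed here, is to split each sequence into overlapping k-mers and join them with spaces. The function name kmer_sentence, the default k = 3, and the stride of 1 are hypothetical choices made for this example and are not sRNAdeep's documented settings.

```python
def kmer_sentence(seq: str, k: int = 3, stride: int = 1) -> str:
    """Turn a nucleotide sequence into a space-separated 'sentence' of k-mers.

    Assumption: overlapping k-mers act as the 'words' of the sentence.
    The values of k and stride are illustrative, not taken from the paper.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    # Normalise case and write the sequence in RNA notation (T -> U).
    seq = seq.strip().upper().replace("T", "U")
    # Slide a window of width k along the sequence; sequences shorter
    # than k yield no words and therefore an empty sentence.
    words = [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]
    return " ".join(words)


if __name__ == "__main__":
    # A toy sequence, not a real sRNA.
    print(kmer_sentence("ACGTAC", k=3))
```

A sentence built this way could then be passed to a text tokenizer and encoder in the same way as ordinary text. Whether sRNAdeep uses this exact scheme would have to be checked against the repository linked above.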

Identifiers

pubmed: 39482572
doi: 10.1186/s12864-024-10951-6
pii: 10.1186/s12864-024-10951-6

Chemical substances

RNA, Bacterial 0
RNA, Small Untranslated 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

1021

Grants

Agency: National Natural Science Foundation of China
ID: 61903107

Copyright information

© 2024. The Author(s).

Authors

Weiye Qian (W)

School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou, 310018, P.R. China.

Jiawei Sun (J)

School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou, 310018, P.R. China.

Tianyi Liu (T)

School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou, 310018, P.R. China.

Zhiyuan Yang (Z)

School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou, 310018, P.R. China. yangzhiyuan@link.cuhk.edu.hk.
School of Biomedical Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China. yangzhiyuan@link.cuhk.edu.hk.

Stephen Kwok-Wing Tsui (SK)

School of Biomedical Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China. kwtsui@cuhk.edu.hk.
Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Hong Kong SAR, China. kwtsui@cuhk.edu.hk.
