IDDLncLoc: Subcellular Localization of LncRNAs Based on a Framework for Imbalanced Data Distributions.
Ensemble model
Imbalanced learning
Sequence feature
Subcellular localization of lncRNA
Journal
Interdisciplinary sciences, computational life sciences
ISSN: 1867-1462
Abbreviated title: Interdiscip Sci
Country: Germany
NLM ID: 101515919
Publication information
Publication date:
Jun 2022
History:
received: 21 May 2021
accepted: 20 Dec 2021
revised: 16 Dec 2021
pubmed: 23 Feb 2022
medline: 25 May 2022
entrez: 22 Feb 2022
Status:
ppublish
Abstract
Long non-coding RNAs (lncRNAs) play a crucial role in many life processes of the cell, such as acting as genetic markers, RNA splicing, signaling, and protein regulation. Because identifying an lncRNA's localization in the cell through experimental methods is complicated, hard to reproduce, and expensive, we propose a novel method named IDDLncLoc, which adopts an ensemble model to predict subcellular localization. In the proposed model, dinucleotide-based auto-cross covariance features, k-mer nucleotide composition features, and composition, transition, and distribution features are used to encode a raw RNA sequence as a vector. To screen out reliable features, feature selection based on the binomial distribution and on recursive feature elimination is employed. Furthermore, mini-batch oversampling, random sampling, and stacking ensemble strategies are customized to overcome the data imbalance of the benchmark dataset. Compared with the latest methods, IDDLncLoc achieves an accuracy of 94.96% on the benchmark dataset, which is 2.59% higher than the best of those methods, and the results further demonstrate that IDDLncLoc performs well on the subcellular localization of lncRNAs. In addition, a user-friendly web server is available at http://lncloc.club.
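As an illustration of one of the sequence encodings named in the abstract, the following is a minimal sketch of k-mer nucleotide composition in Python. It is not the authors' implementation: the function name `kmer_composition`, the `ACGU` alphabet, the conversion of T to U, and the choice to skip windows containing ambiguous bases are all assumptions made for this example.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGU"):
    """Return normalized k-mer frequencies, ordered lexicographically by k-mer.

    Hypothetical helper for illustration only; not taken from IDDLncLoc.
    """
    # Assumption: DNA-style input is accepted and converted to RNA letters.
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of length k over the sequence and count each k-mer.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]
```

For a given k the vector has 4^k entries, so `kmer_composition(seq, k=2)` yields 16 features per sequence regardless of sequence length, which is what makes the encoding usable as fixed-size classifier input.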
Identifiers
pubmed: 35192174
doi: 10.1007/s12539-021-00497-6
pii: 10.1007/s12539-021-00497-6
Chemical substances
Nucleotides
0
Proteins
0
RNA, Long Noncoding
0
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
409-420
Grants
Agency: the National Natural Science Foundation of China
ID: No. 62072212
Agency: the National Natural Science Foundation of China
ID: 61902144
Agency: the Development Project of Jilin Province of China
ID: Nos. 20200401083GX
Agency: the Development Project of Jilin Province of China
ID: 2020C003
Agency: the Development Project of Jilin Province of China
ID: 20200403172SF
Agency: Chinese Postdoctoral Science Foundation
ID: No. 801212011421
Copyright information
© 2022. International Association of Scientists in the Interdisciplinary Areas.
References
Perkel JM (2013) Visiting “Noncodarnia.” Biotechniques 54:301–304. https://doi.org/10.2144/000114037
doi: 10.2144/000114037
pubmed: 23750541
Gong C, Maquat LE (2011) lncRNAs transactivate STAU1-mediated mRNA decay by duplexing with 3′UTRs via Alu elements. Nature 470:284. https://doi.org/10.1038/nature09701
doi: 10.1038/nature09701
pubmed: 21307942
pmcid: 3073508
Huarte M, Guttman M, Feldser D et al (2010) A large intergenic noncoding RNA induced by p53 mediates global gene repression in the p53 response. Cell 142:409–419. https://doi.org/10.1016/j.cell.2010.06.040
doi: 10.1016/j.cell.2010.06.040
pubmed: 20673990
pmcid: 2956184
Hung T, Wang Y, Lin MF et al (2011) Extensive and coordinated transcription of noncoding RNAs within cell-cycle promoters. Nat Genet 43:621-U196. https://doi.org/10.1038/ng.848
doi: 10.1038/ng.848
Kino T, Hurt DE, Ichijo T et al (2010) Noncoding RNA Gas5 Is a growth arrest- and starvation-associated repressor of the glucocorticoid receptor. Sci Signal. https://doi.org/10.1126/scisignal.2000568
doi: 10.1126/scisignal.2000568
pubmed: 20124551
pmcid: 2819218
Yu B, Shan G (2016) Functions of long noncoding RNAs in the nucleus. Nucleus 7:155–166. https://doi.org/10.1080/19491034.2016.1179408
doi: 10.1080/19491034.2016.1179408
pubmed: 27105038
pmcid: 4916869
Sun Q, Hao Q, Prasanth KV (2018) Nuclear long noncoding RNAs: key regulators of gene expression. Trends Genet 34:142–157. https://doi.org/10.1016/j.tig.2017.11.005
doi: 10.1016/j.tig.2017.11.005
pubmed: 29249332
pmcid: 6002860
Ahmad I, Valverde A, Ahmad F, Naqvi AR (2020) Long noncoding RNA in myeloid and lymphoid cell differentiation, polarization and function. Cells. https://doi.org/10.3390/cells9020269
doi: 10.3390/cells9020269
pubmed: 33352976
pmcid: 7767330
Schmitt AM, Chang HY (2016) Long noncoding RNAs in cancer pathways. Cancer Cell 29:452–463. https://doi.org/10.1016/j.ccell.2016.03.010
doi: 10.1016/j.ccell.2016.03.010
pubmed: 27070700
pmcid: 4831138
Tseng YY, Moriarity BS, Gong W et al (2014) PVT1 dependence in cancer with MYC copy-number increase. Nature 512:82–86. https://doi.org/10.1038/nature13311
doi: 10.1038/nature13311
pubmed: 25043044
pmcid: 4767149
Wang Y, Wang K, Zhang L et al (2020) Targeted overexpression of the long noncoding RNA ODSM can regulate osteoblast function in vitro and in vivo. Cell Death Dis. https://doi.org/10.1038/s41419-020-2325-3
doi: 10.1038/s41419-020-2325-3
pubmed: 33414409
pmcid: 7791068
Liu B, Sun L, Liu Q et al (2015) A cytoplasmic NF-κB interacting long noncoding RNA blocks IκB phosphorylation and suppresses breast cancer metastasis. Cancer Cell 27:370–381. https://doi.org/10.1016/j.ccell.2015.02.004
doi: 10.1016/j.ccell.2015.02.004
pubmed: 25759022
Hu Y-P, Jin Y-P, Wu X-S et al (2019) LncRNA-HGBC stabilized by HuR promotes gallbladder cancer progression by regulating miR-502-3p/SET/AKT axis. Mol Cancer 18:167. https://doi.org/10.1186/s12943-019-1097-9
doi: 10.1186/s12943-019-1097-9
pubmed: 31752906
pmcid: 6868746
Kang CM, Bai HL, Li XH et al (2019) The binding of lncRNA RP11-732M18.3 with 14–3-3 β/α accelerates p21 degradation and promotes glioma growth. EBioMedicine 45:58–69. https://doi.org/10.1016/j.ebiom.2019.06.002
doi: 10.1016/j.ebiom.2019.06.002
pubmed: 31202814
pmcid: 6642068
Dohm JC, Lottaz C, Borodina T, Himmelbauer H (2008) Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. https://doi.org/10.1093/nar/gkn425
doi: 10.1093/nar/gkn425
pubmed: 18660515
pmcid: 2532726
Saiki RK, Scharf S, Faloona F et al (1992) Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. 1985. Biotechnology 24:476–480. https://doi.org/10.1007/BF00985904
doi: 10.1007/BF00985904
pubmed: 1422056
Maclary E, Buttigieg E, Hinten M et al (2014) Differentiation-dependent requirement of Tsix long non-coding RNA in imprinted X-chromosome inactivation. Nat Commun 5:1–14. https://doi.org/10.1038/ncomms5209
doi: 10.1038/ncomms5209
Hacisuleyman E, Goff LA, Trapnell C et al (2014) Topological organization of multichromosomal regions by the long intergenic noncoding RNA Firre. Nat Struct Mol Biol 21:198–206. https://doi.org/10.1038/nsmb.2764
doi: 10.1038/nsmb.2764
pubmed: 24463464
pmcid: 3950333
Woźniak M, Połap D, Kośmider L, Cłapa T (2018) Automated fluorescence microscopy image analysis of Pseudomonas aeruginosa bacteria in alive and dead stadium. Eng Appl Artif Intell 67:100–110. https://doi.org/10.1016/j.engappai.2017.09.003
doi: 10.1016/j.engappai.2017.09.003
Feng P, Zhang J, Tang H et al (2017) Predicting the organelle location of noncoding RNAs using pseudo nucleotide compositions. Interdiscip Sci Comput Life Sci 9:540–544. https://doi.org/10.1007/s12539-016-0193-4
doi: 10.1007/s12539-016-0193-4
Cheng X, Xiao X, Chou KC (2018) pLoc-mEuk: predict subcellular localization of multi-label eukaryotic proteins by extracting the key GO information into general PseAAC. Genomics 110:50–58. https://doi.org/10.1016/j.ygeno.2017.08.005
doi: 10.1016/j.ygeno.2017.08.005
pubmed: 28818512
Cao Z, Pan X, Yang Y et al (2018) The lncLocator: a subcellular localization predictor for long non-coding RNAs based on a stacked ensemble classifier. Bioinformatics 34:2185–2194. https://doi.org/10.1093/bioinformatics/bty085
doi: 10.1093/bioinformatics/bty085
pubmed: 29462250
Su ZD, Huang Y, Zhang ZY et al (2018) ILoc-lncRNA: Predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC. Bioinformatics 34:4196–4204. https://doi.org/10.1093/bioinformatics/bty508
doi: 10.1093/bioinformatics/bty508
pubmed: 29931187
Chen Z, Zhao P, Li F et al (2020) iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 21:1047–1057. https://doi.org/10.1093/bib/bbz041
doi: 10.1093/bib/bbz041
pubmed: 31067315
Wei L, Zhou C, Chen H et al (2018) ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 34:4007–4016. https://doi.org/10.1093/bioinformatics/bty451
doi: 10.1093/bioinformatics/bty451
pubmed: 29868903
pmcid: 6247924
Granitto PM, Furlanello C, Biasioli F, Gasperi F (2006) Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemom Intell Lab Syst 83:83–90. https://doi.org/10.1016/j.chemolab.2006.01.007
doi: 10.1016/j.chemolab.2006.01.007
Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659. https://doi.org/10.1093/bioinformatics/btl158
doi: 10.1093/bioinformatics/btl158
pubmed: 16731699
Liu B, Liu F, Wang X et al (2015) Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res 43:W65–W71. https://doi.org/10.1093/nar/gkv458
doi: 10.1093/nar/gkv458
pubmed: 25958395
pmcid: 4489303
Dong Q, Zhou S, Guan J (2009) A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics 25:2655–2662. https://doi.org/10.1093/bioinformatics/btp500
doi: 10.1093/bioinformatics/btp500
pubmed: 19706744
Chen W, Zhang X, Brooker J et al (2015) PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics 31:119–120. https://doi.org/10.1093/bioinformatics/btu602
doi: 10.1093/bioinformatics/btu602
pubmed: 25231908
Chen W, Lei T-Y, Jin D-C et al (2014) PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. Anal Biochem 456:53–60. https://doi.org/10.1016/j.ab.2014.04.001
doi: 10.1016/j.ab.2014.04.001
pubmed: 24732113
Zhu L, Yang J, Song J-N et al (2010) Improving the accuracy of predicting disulfide connectivity by feature selection. J Comput Chem 31:1478–1485. https://doi.org/10.1002/jcc.21433
doi: 10.1002/jcc.21433
pubmed: 20127740
Chen W, Yang H, Feng P et al (2017) IDNA4mC: identifying DNA N 4 -methylcytosine sites based on nucleotide chemical properties. Bioinformatics 33:3518–3523. https://doi.org/10.1093/bioinformatics/btx479
doi: 10.1093/bioinformatics/btx479
pubmed: 28961687
Ding C, Yuan L-F, Guo S-H et al (2012) Identification of mycobacterial membrane proteins and their types using over-represented tripeptide compositions. J Proteom 77:321–328. https://doi.org/10.1016/j.jprot.2012.09.006
doi: 10.1016/j.jprot.2012.09.006
Feng P-M, Chen W, Lin H, Chou K-C (2013) iHSP-PseRAAAC: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal Biochem 442:118–125. https://doi.org/10.1016/j.ab.2013.05.024
doi: 10.1016/j.ab.2013.05.024
pubmed: 23756733
Tang H, Su Z-D, Wei H-H et al (2016) Prediction of cell-penetrating peptides with feature selection techniques. Biochem Biophys Res Commun 477:150–154. https://doi.org/10.1016/j.bbrc.2016.06.035
doi: 10.1016/j.bbrc.2016.06.035
pubmed: 27291150
Wang T, Yang J, Shen H-B, Chou K-C (2008) Predicting membrane protein types by the LLDA algorithm. Protein Pept Lett 15:915–921. https://doi.org/10.2174/092986608785849308
doi: 10.2174/092986608785849308
pubmed: 18991767
Yang H, Tang H, Chen X-X et al (2016) Identification of secretory proteins in mycobacterium tuberculosis using pseudo amino acid composition. Biomed Res Int. https://doi.org/10.1155/2016/5413903
doi: 10.1155/2016/5413903
pubmed: 28119926
pmcid: 5227121
Zhao Y-W, Lai H-Y, Tang H et al (2016) Prediction of phosphothreonine sites in human proteins by fusing different features. Sci Rep. https://doi.org/10.1038/srep34817
doi: 10.1038/srep34817
pubmed: 28009010
pmcid: 5180247
Zhao Y-W, Su Z-D, Yang W et al (2017) IonchanPred 2.0: a tool to predict ion channels and their types. Int J Mol Sci. https://doi.org/10.3390/ijms18091838
doi: 10.3390/ijms18091838
pubmed: 29286291
pmcid: 5796034
Lai H-Y, Chen X-X, Chen W et al (2017) Sequence-based predictive modeling to identify cancer lectins. Oncotarget 8:28169–28175. https://doi.org/10.18632/oncotarget.15963
doi: 10.18632/oncotarget.15963
pubmed: 28423655
pmcid: 5438640
Virtanen P, Gommers R, Oliphant TE et al (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17:261–272. https://doi.org/10.1038/s41592-019-0686-2
doi: 10.1038/s41592-019-0686-2
pubmed: 32015543
pmcid: 7056644
Lee J, Yoon W, Kim S et al (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36:1234–1240. https://doi.org/10.1093/bioinformatics/btz682
doi: 10.1093/bioinformatics/btz682
pmcid: 7703786
Liu L, Ouyang W, Wang X et al (2020) Deep learning for generic object detection: a survey. Int J Comput Vis 128:261–318. https://doi.org/10.1007/s11263-019-01247-4
doi: 10.1007/s11263-019-01247-4
Kang Q, Shi L, Zhou M et al (2018) A distance-based weighted undersampling scheme for support vector machines and its application to imbalanced classification. IEEE Trans Neural Netw Learn Syst 29:4152–4165. https://doi.org/10.1109/TNNLS.2017.2755595
doi: 10.1109/TNNLS.2017.2755595
pubmed: 29990027
Fan Y, Chen M, Zhu Q (2020) LncLocPred: predicting LncRNA subcellular localization using multiple sequence feature information. IEEE Access 8:124702–124711. https://doi.org/10.1109/ACCESS.2020.3007317
doi: 10.1109/ACCESS.2020.3007317
Ahmad A, Lin H, Shatabda S (2020) Locate-R: subcellular localization of long non-coding RNAs using nucleotide compositions. Genomics 112:2583–2589. https://doi.org/10.1016/j.ygeno.2020.02.011
doi: 10.1016/j.ygeno.2020.02.011
pubmed: 32068122