SpliceFinder: ab initio prediction of splice sites using convolutional neural network.
Canonical and non-canonical splice sites
Convolutional neural network
Splice site prediction
Journal
BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194
Publication information
Publication date:
27 Dec 2019
History:
entrez: 29 Dec 2019
pubmed: 29 Dec 2019
medline: 7 Mar 2020
Status:
epublish
Abstract
Abstract sections
BACKGROUND
Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, occur very frequently at splice sites, and many other patterns also occur at splice sites and have important biological functions. However, these dinucleotides also occur frequently in sequences without splice sites, which makes prediction prone to false positives. Most existing tools select all sequences containing the two dimers and then focus on distinguishing true splice sites from pseudo ones. Such an approach reduces false positives, but it misses non-canonical splice sites.
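The false-positive problem described above can be illustrated with a minimal sketch. It is not taken from SpliceFinder; the function name `find_dinucleotide_candidates` and the toy sequence are assumptions made for illustration only. The sketch lists every GT and AG dimer in a sequence, which is the candidate set a dinucleotide-first tool would then have to filter.

```python
def find_dinucleotide_candidates(sequence):
    """Return 0-based start positions of every GT and every AG dimer.

    Hypothetical helper for illustration; not part of SpliceFinder.
    """
    seq = sequence.upper()
    # GT is the canonical donor dimer, AG the canonical acceptor dimer.
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


# Toy sequence (made up): even 19 bases contain several candidate dimers,
# most of which could not all be real splice sites.
toy = "CAGGTAAGTCTAGGTTAGC"
donors, acceptors = find_dinucleotide_candidates(toy)
print(len(donors), "GT candidates;", len(acceptors), "AG candidates")
```

Because the scan only looks for GT and AG, any splice site with a different dimer is never proposed as a candidate, which is the limitation noted above for non-canonical sites.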
RESULTS
We have designed SpliceFinder, a tool based on a convolutional neural network (CNN), to predict splice sites. To achieve ab initio prediction, we trained the network on human genomic data. An iterative approach is used to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN achieves a classification accuracy of 90.25%, which is 10% higher than that of existing algorithms. The method outperforms other existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences using a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall above 0.8. SpliceFinder also captures non-canonical splice sites. In addition, SpliceFinder performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining.
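The sliding-window idea mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the window length, the one-hot encoding, the three labels (`donor`, `acceptor`, `none`), and the stand-in `toy_classifier` are all hypothetical. In a real setting a trained CNN would take the place of `toy_classifier`.

```python
# One-hot code per base; unknown bases (e.g. N) map to all zeros (assumption).
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}


def one_hot(window):
    """Encode a DNA string as a list of 4-element tuples."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in window.upper()]


def scan(sequence, classify, window=40, step=1):
    """Slide a fixed-length window over the sequence.

    Returns (centre_position, label) for every window that `classify`
    does not label "none". `classify` is any callable that maps an
    encoded window to a label; it is a placeholder for a trained model.
    """
    hits = []
    for start in range(0, len(sequence) - window + 1, step):
        label = classify(one_hot(sequence[start:start + window]))
        if label != "none":
            hits.append((start + window // 2, label))
    return hits


def toy_classifier(encoded):
    """Stand-in for a trained model: inspects only the two centre bases."""
    mid = len(encoded) // 2
    pair = (encoded[mid], encoded[mid + 1])
    if pair == (ONE_HOT["G"], ONE_HOT["T"]):
        return "donor"
    if pair == (ONE_HOT["A"], ONE_HOT["G"]):
        return "acceptor"
    return "none"


print(scan("CCCCGTCCCC", toy_classifier, window=4))
```

Reporting the window centre ties each prediction to a single coordinate, which is what allows a window-level classifier to give exact positions on a long sequence. A model that scores the whole window, rather than only the centre dimer as the toy stand-in does, is what would let such a scan report sites that lack GT or AG.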
CONCLUSIONS
Based on a CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.
Identifiers
pubmed: 31881982
doi: 10.1186/s12859-019-3306-3
pii: 10.1186/s12859-019-3306-3
pmc: PMC6933889
Chemical substances
RNA Splice Sites
0
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
652