SpliceFinder: ab initio prediction of splice sites using convolutional neural network.


Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
27 Dec 2019
History:
entrez: 29 Dec 2019
pubmed: 29 Dec 2019
medline: 7 Mar 2020
Status: epublish

Abstract sections

BACKGROUND
Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, and many other patterns with important biological functions also occur there. However, the same dinucleotides also occur frequently in sequences that contain no splice site, which makes prediction prone to false positives. Most existing tools select all sequences containing the two dimers and then focus on distinguishing true splice sites from pseudo ones. This approach reduces false positives, but it misses non-canonical splice sites.
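The candidate-selection step described above can be illustrated with a minimal sketch. The function name `canonical_candidates` and the toy sequence are assumptions made for illustration only; they are not taken from SpliceFinder or from any existing tool. The sketch lists every GT and AG dimer in a short sequence, which shows how many candidates a dimer filter produces even on a few bases, and why a site that uses any other dimer can never enter the candidate list.

```python
from typing import Dict, List


def canonical_candidates(seq: str) -> Dict[str, List[int]]:
    """Return 0-based start positions of every GT and AG dimer in seq.

    Hypothetical helper: it only mimics the "select all sequences with
    the two dimers" filter mentioned in the abstract.
    """
    seq = seq.upper()
    hits: Dict[str, List[int]] = {"GT": [], "AG": []}
    for i in range(len(seq) - 1):
        dimer = seq[i:i + 2]
        if dimer in hits:
            hits[dimer].append(i)
    return hits


# Toy sequence, invented for this example (not real genomic data).
toy = "CAGGTAAGTCTAGGTTAGCATGC"
hits = canonical_candidates(toy)
print(hits)
```

Most of these positions would be pseudo sites that a downstream classifier has to reject, and any true site without GT or AG is lost before classification starts.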
RESULTS
We designed SpliceFinder, a splice site predictor based on a convolutional neural network (CNN). To achieve ab initio prediction, we trained the network on human genomic data. An iterative approach is used to reconstruct the dataset, which addresses the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN achieves a classification accuracy of 90.25%, which is 10% higher than that of existing algorithms. It also outperforms existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can locate the exact position of splice sites on long genomic sequences using a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall above 0.8. SpliceFinder also captures non-canonical splice sites. In addition, it performs well on genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining.
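The sliding-window scan mentioned above can be sketched as follows. This is a sketch under stated assumptions, not the SpliceFinder implementation: the one-hot encoding, the window length of 6, the threshold, and the names `one_hot`, `scan`, and `toy_score` are all hypothetical, and `toy_score` is a trivial rule standing in for a trained CNN. The point is only the mechanics: every window position is scored, so no dimer-based pre-filter is applied to the input.

```python
from typing import Callable, List, Tuple

Encoded = List[Tuple[int, int, int, int]]

# Assumed encoding: one 4-dimensional indicator row per base.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot(window: str) -> Encoded:
    """Encode a DNA window; unknown bases such as N become all zeros."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in window.upper()]


def scan(seq: str, window: int, score: Callable[[Encoded], float],
         threshold: float = 0.5) -> List[Tuple[int, float]]:
    """Score every window of seq and report (center position, score)
    for windows whose score reaches the threshold."""
    calls = []
    for start in range(len(seq) - window + 1):
        s = score(one_hot(seq[start:start + window]))
        if s >= threshold:
            calls.append((start + window // 2, s))
    return calls


def toy_score(encoded: Encoded) -> float:
    """Placeholder for a trained classifier (NOT the SpliceFinder CNN):
    returns 1.0 when the two central bases are G then T."""
    mid = len(encoded) // 2
    is_gt = encoded[mid] == ONE_HOT["G"] and encoded[mid + 1] == ONE_HOT["T"]
    return 1.0 if is_gt else 0.0


calls = scan("CAGGTAAGTCTAGGTTAGCATGC", window=6, score=toy_score)
print(calls)
```

Swapping `toy_score` for a real model's prediction function would leave `scan` unchanged, which is the design reason for passing the scorer in as a parameter.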
CONCLUSIONS
Based on a CNN, we propose a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.
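The claim of fewer false positives at a recall above 0.8 can be read through the standard definitions of precision, recall, and F1. The sketch below uses invented counts purely for illustration; they are not results from the paper, and the function name `precision_recall_f1` is an assumption. It shows that halving the false positives at a fixed number of true positives and false negatives leaves recall unchanged while raising precision and F1.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions; returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up counts (NOT from the paper): same hits and misses, half the false positives.
before = precision_recall_f1(tp=80, fp=40, fn=20)
after = precision_recall_f1(tp=80, fp=20, fn=20)
print(before)
print(after)
```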

Identifiers

pubmed: 31881982
doi: 10.1186/s12859-019-3306-3
pii: 10.1186/s12859-019-3306-3
pmc: PMC6933889

Chemical substances

RNA Splice Sites 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

652

Authors

Ruohan Wang (R)

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China.

Zishuai Wang (Z)

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China.

Jianping Wang (J)

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China. jianwang@cityu.edu.hk.

Shuaicheng Li (S)

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China. shuaicli@cityu.edu.hk.

