AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models.

context mixing; coronavirus; lossless data compression; mixture of experts; neural networks; protein sequence compression

Journal

Entropy (Basel, Switzerland)
ISSN: 1099-4300
Abbreviated title: Entropy (Basel)
Country: Switzerland
NLM ID: 101243874

Publication information

Publication date:
26 Apr 2021
History:
received: 1 Feb 2021
revised: 19 Apr 2021
accepted: 22 Apr 2021
entrez: 30 Apr 2021
pubmed: 1 May 2021
medline: 1 May 2021
Status: epublish

Abstract

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, raising two increasingly important challenges: efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, few compressors in the literature are designed specifically for protein sequences, and these specialized compressors improve the compression ratio only marginally over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts through a stacked generalization approach, and individual cache-hash memory models for the highest context orders. Compared to the previous compressor (AC), we show gains of 2-9% and 6-7% in reference-free and reference-based modes, respectively. These gains come at the cost of computations that are three times slower. AC2 also uses about seven times less memory than AC, and its memory requirements are unaffected by the size of the input sequences. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license.
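
The idea of mixing several experts can be illustrated with a minimal sketch. It is not AC2's actual architecture: here a softmax-weighted mixture with an online gradient update stands in for the neural network, the experts are plain finite-context models of low order, and the cache-hash models are omitted. The names `ContextModel` and `estimate_bits`, the chosen orders, the smoothing parameter, and the learning rate are all assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


class ContextModel:
    """Hypothetical expert: order-k finite-context model with additive smoothing."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha
        self.counts = defaultdict(lambda: defaultdict(int))

    def _context(self, history):
        # history[-0:] would return the whole string, so order 0 is handled apart.
        return history[-self.order:] if self.order else ""

    def predict(self, history):
        """Return a probability for every symbol of ALPHABET."""
        seen = self.counts[self._context(history)]
        total = sum(seen.values()) + self.alpha * len(ALPHABET)
        return [(seen[s] + self.alpha) / total for s in ALPHABET]

    def update(self, history, symbol):
        self.counts[self._context(history)][symbol] += 1


def estimate_bits(sequence, orders=(1, 2, 3), lr=0.5):
    """Estimate the coded size (in bits) of a sequence under the mixture.

    Assumes every symbol is in ALPHABET; other symbols raise ValueError.
    """
    models = [ContextModel(k) for k in orders]
    logits = [0.0] * len(models)  # one mixing weight (pre-softmax) per expert
    max_order = max(orders)
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        history = sequence[max(0, i - max_order):i]
        idx = ALPHABET.index(symbol)
        # Softmax turns the logits into mixing weights that sum to 1.
        top = max(logits)
        exps = [math.exp(z - top) for z in logits]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        # Probability each expert assigned to the symbol that actually occurred.
        expert_p = [m.predict(history)[idx] for m in models]
        mixed = sum(w * p for w, p in zip(weights, expert_p))
        total_bits += -math.log2(mixed)
        # Gradient step on the log-loss: reward experts that beat the mixture.
        logits = [z + lr * w * (p - mixed) / mixed
                  for z, w, p in zip(logits, weights, expert_p)]
        for m in models:
            m.update(history, symbol)
    return total_bits


if __name__ == "__main__":
    seq = "ACDEFGHIK" * 20
    print(f"{estimate_bits(seq):.1f} bits vs {len(seq) * math.log2(20):.1f} bits uniform")
```

An arithmetic coder driven by the mixed probabilities would produce an output close to the estimated number of bits; the sketch only computes that estimate, which is enough to compare how well different model combinations fit a sequence.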

Identifiers

pubmed: 33925812
pii: e23050530
doi: 10.3390/e23050530
pmc: PMC8146440

Publication types

Journal Article

Languages

eng

Authors

Milton Silva (M)

IEETA-Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
Department of Electronics Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.

Diogo Pratas (D)

IEETA-Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
Department of Electronics Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.
Department of Virology, University of Helsinki, 00014 Helsinki, Finland.

Armando J Pinho (AJ)

IEETA-Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
Department of Electronics Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.

MeSH classifications