Genome-wide analysis of 10664 SARS-CoV-2 genomes to identify virus strains in 73 countries based on single nucleotide polymorphism.


Journal

Virus research
ISSN: 1872-7492
Abbreviated title: Virus Res
Country: Netherlands
NLM ID: 8410979

Publication information

Publication date:
06 2021
History:
received: 08 12 2020
revised: 23 02 2021
accepted: 16 03 2021
pubmed: 31 3 2021
medline: 30 4 2021
entrez: 30 3 2021
Status: ppublish

Abstract

Since the onslaught of SARS-CoV-2, the research community has been searching for a vaccine to fight this virus. During this period, however, the virus has mutated to adapt to different environmental conditions around the world, making the task of vaccine design more challenging. In this situation, the identification of virus strains is a timely and important task. We have performed a genome-wide analysis of 10664 SARS-CoV-2 genomes from 73 countries to identify and prepare a Single Nucleotide Polymorphism (SNP) dataset of SARS-CoV-2. Using this SNP data, hierarchical clustering is then applied, with Average Linkage and Complete Linkage each combined separately with the Jaccard and Hamming distance functions, in order to identify the virus strains as clusters present in the SNP data. The consensus of both clustering results is also considered, and the Silhouette index is used as a cluster validity index to measure the goodness of the clusters as well as to determine the number of clusters, or virus strains. As a result, we have identified five major clusters, or virus strains, present worldwide. Apart from quantitative measures, these clusters are also visualised using a Visual Assessment of Tendency (VAT) plot. The evolution of these clusters is also shown. Furthermore, the top 10 signature SNPs are identified in each cluster, and the non-synonymous signature SNPs are visualised in the respective protein structures. Sequence- and structural-homology-based predictions, along with the protein structural stability of these non-synonymous signature SNPs, are also reported in order to judge the characteristics of the identified clusters. As a consequence, T85I, Q57H and R203M in NSP2, ORF3a and Nucleocapsid respectively are found to be responsible for Cluster 1, as they are damaging and unstable non-synonymous signature SNPs. Similarly, F506L and S507C in Exon are responsible for both Clusters 3 and 4, while Clusters 2 and 5 do not exhibit such behaviour due to the absence of any non-synonymous signature SNPs. In addition, the code, the SNP dataset, the 10664 labelled SARS-CoV-2 strains and additional supplementary results are provided through our website for further use.
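
The clustering step described in the abstract can be illustrated with a small sketch. It is not the authors' code: the binary SNP matrix `snps` is made-up toy data (rows are genomes, columns are SNP positions, 1 meaning the SNP is present), all function names are hypothetical, and the naive pure-Python agglomeration is only practical for small inputs. It shows how Hamming and Jaccard distances can be paired with Average and Complete Linkage, and how the mean Silhouette index could be used to pick the number of clusters. The consensus step and the VAT plot are not covered.

```python
from itertools import combinations

def hamming(a, b):
    """Fraction of SNP positions at which two binary profiles differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jaccard(a, b):
    """1 - |shared SNPs| / |SNPs present in either profile|."""
    union = sum(1 for x, y in zip(a, b) if x or y)
    both = sum(1 for x, y in zip(a, b) if x and y)
    return 0.0 if union == 0 else 1.0 - both / union

def distance_matrix(rows, dist):
    """Symmetric pairwise distance matrix as a list of lists."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = dist(rows[i], rows[j])
    return d

def agglomerate(d, k, linkage="average"):
    """Naive agglomerative clustering down to k clusters; returns labels."""
    clusters = [[i] for i in range(len(d))]

    def link(c1, c2):
        pairs = [d[i][j] for i in c1 for j in c2]
        return max(pairs) if linkage == "complete" else sum(pairs) / len(pairs)

    while len(clusters) > k:
        # Merge the two clusters with the smallest linkage distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]  # safe: combinations guarantees a < b
    labels = [0] * len(d)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def silhouette(d, labels):
    """Mean silhouette width; needs >= 2 clusters, singletons score 0."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            continue
        a = sum(d[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(d[i][j] for j in other) / len(other)
                for lab2, other in groups.items() if lab2 != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(labels)

def best_k(rows, dist, linkage, k_range):
    """Pick the cluster count with the highest mean silhouette."""
    d = distance_matrix(rows, dist)
    scores = {k: silhouette(d, agglomerate(d, k, linkage)) for k in k_range}
    return max(scores, key=scores.get), scores

# Toy data (assumption): six genomes, six SNP positions, two obvious groups.
snps = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
]

for dist in (hamming, jaccard):
    for linkage in ("average", "complete"):
        k, scores = best_k(snps, dist, linkage, range(2, 5))
        print(dist.__name__, linkage, "best k =", k)
```

For a dataset of the size reported here, a library implementation such as SciPy's hierarchical clustering would be the more realistic choice; the loop above is cubic or worse in the number of genomes.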

Identifiers

pubmed: 33781798
pii: S0168-1702(21)00108-8
doi: 10.1016/j.virusres.2021.198401
pmc: PMC7997709

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

198401

Copyright information

Copyright © 2021 Elsevier B.V. All rights reserved.

Authors

Nimisha Ghosh (N)

Department of Computer Science and Information Technology, Institute of Technical Education and Research, Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India.

Indrajit Saha (I)

Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, West Bengal, India. Electronic address: indrajit@nitttrkol.ac.in.

Nikhil Sharma (N)

Department of Electronics and Communication Engineering, Jaypee Institute of Information Technology, Noida, Uttar Pradesh, India.

Suman Nandi (S)

Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, West Bengal, India.

Dariusz Plewczynski (D)

Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland; Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, Poland.
