CoVEffect: interactive system for mining the effects of SARS-CoV-2 mutations and variants based on deep learning.

Keywords: CORD-19 dataset; SARS-CoV-2; deep learning; language models; machine learning interpretability; viral mutations; viral variants; web interface

Journal

GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872

Publication information

Publication date:
2022-12-28
History:
received: 2022-12-05
revised: 2023-04-11
accepted: 2023-04-27
medline: 2023-05-26
pubmed: 2023-05-24
entrez: 2023-05-24
Status: ppublish

Abstract

Abstract sections

BACKGROUND
Literature about SARS-CoV-2 widely discusses the effects of variations that have spread in the past 3 years. Such information is dispersed across the texts of many research articles, which hinders its practical integration with related datasets (e.g., the millions of SARS-CoV-2 sequences available to the community). We aim to fill this gap by mining literature abstracts to extract, for each variant/mutation, its related effects (in epidemiological, immunological, clinical, or viral kinetics terms), labeled with higher/lower levels in relation to the nonmutated virus.
RESULTS
The proposed framework comprises (i) the provisioning of abstracts from a COVID-19-related big data corpus (CORD-19) and (ii) the identification of mutation/variant effects in abstracts using a GPT2-based prediction model. These techniques enable the prediction of mutations/variants with their effects and levels in two distinct scenarios: (i) the batch annotation of the most relevant CORD-19 abstracts and (ii) the on-demand annotation of any user-selected CORD-19 abstract through the CoVEffect web application (http://gmql.eu/coveffect), which assists expert users with semiautomated data labeling. On the interface, users can inspect the predictions and correct them; user inputs can then extend the training dataset used by the prediction model. Our prototype model was trained through a carefully designed process, using a minimal and highly diversified pool of samples.
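The inspect-and-correct loop described above can be illustrated with a minimal sketch. It is not the CoVEffect implementation: the `EffectAnnotation` record, the `apply_corrections` helper, the field names, and the sample values are all hypothetical, chosen only to show how predicted (entity, effect, level) triples could be curated by an expert and then appended to a training set.

```python
from dataclasses import dataclass, asdict
from typing import Literal, Optional

Level = Literal["higher", "lower"]


@dataclass(frozen=True)
class EffectAnnotation:
    """Hypothetical record for one (entity, effect, level) triple."""
    entity: str   # mutation or variant name (illustrative value below)
    effect: str   # effect label (illustrative, not the tool's vocabulary)
    level: Level  # change relative to the nonmutated virus


def apply_corrections(
    predicted: list[EffectAnnotation],
    corrections: dict[EffectAnnotation, Optional[EffectAnnotation]],
) -> list[EffectAnnotation]:
    """Return the curated annotations.

    A prediction mapped to another annotation is replaced by it,
    a prediction mapped to None is rejected, and a prediction with
    no entry in `corrections` is accepted unchanged.
    """
    curated = []
    for ann in predicted:
        if ann in corrections:
            fixed = corrections[ann]
            if fixed is not None:
                curated.append(fixed)
        else:
            curated.append(ann)
    return curated


# Assumed model output for one abstract (made-up example values).
predicted = [
    EffectAnnotation("D614G", "infectivity", "higher"),
    EffectAnnotation("D614G", "disease severity", "higher"),
]

# The expert rejects the second prediction and accepts the first.
corrections = {predicted[1]: None}
curated = apply_corrections(predicted, corrections)

# Curated triples become new training samples (plain dicts here).
training_set: list[dict] = []
training_set.extend(asdict(a) for a in curated)
```

Keeping the record immutable (`frozen=True`) makes each prediction hashable, so a correction can be keyed directly by the prediction it replaces.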
CONCLUSIONS
The CoVEffect interface supports the assisted annotation of abstracts and allows curated datasets to be downloaded for further use in data integration or analysis pipelines. The overall framework can be adapted to similar unstructured-to-structured text translation tasks, which are typical of biomedical domains.

Identifiers

pubmed: 37222749
pii: 7176211
doi: 10.1093/gigascience/giad036
pmc: PMC10205000

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2023. Published by Oxford University Press GigaScience.

Authors

Giuseppe Serna García (G)

Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano, Italy.

Ruba Al Khalaf (R)

Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano, Italy.

Francesco Invernici (F)

Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano, Italy.

Stefano Ceri (S)

Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano, Italy.

Anna Bernasconi (A)

Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano, Italy.
