Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study.

COVID-19; critical analysis; data; data science; data synthesis; database; decision making; infodemic; infrastructure; literature; methodology; misinformation; pipeline; research; structured data synthesis; web crawl data

Journal

Journal of Medical Internet Research
ISSN: 1438-8871
Abbreviated title: J Med Internet Res
Country: Canada
NLM ID: 100959882

Publication information

Publication date:
2021-05-06
History:
received: 2020-11-12
accepted: 2021-04-03
revised: 2020-12-30
pubmed: 2021-04-10
medline: 2021-05-25
entrez: 2021-04-09
Status: epublish

Abstract

BACKGROUND
The scale and quality of the global scientific response to the COVID-19 pandemic have unquestionably saved lives. However, the COVID-19 pandemic has also triggered an unprecedented "infodemic"; the velocity and volume of data production have overwhelmed many key stakeholders such as clinicians and policy makers, as they have been unable to process structured and unstructured data for evidence-based decision making. Solutions that aim to alleviate this data synthesis-related challenge are unable to capture heterogeneous web data in real time for the production of concomitant answers and are not based on the high-quality information in responses to a free-text query.
OBJECTIVE
The main objective of this project is to build a generic, real-time, continuously updating curation platform that can support the data synthesis and analysis of a scientific literature framework. Our secondary objective is to validate this platform and the curation methodology for COVID-19-related medical literature by expanding the COVID-19 Open Research Dataset via the addition of new, unstructured data.
METHODS
To create an infrastructure that addresses our objectives, the PanSurg Collaborative at Imperial College London has developed a unique data pipeline based on a web crawler extraction methodology. This data pipeline uses a novel curation methodology that adopts a human-in-the-loop approach for the characterization of quality, relevance, and key evidence across a range of scientific literature sources.
RESULTS
REDASA (Realtime Data Synthesis and Analysis) is now one of the world's largest and most up-to-date sources of COVID-19-related evidence; it consists of 104,000 documents. By capturing curators' critical appraisal methodologies through the discrete labeling and rating of information, REDASA rapidly developed a foundational, pooled, data science data set of over 1400 articles in under 2 weeks. These articles provide COVID-19-related information and represent around 10% of all papers about COVID-19.
CONCLUSIONS
This data set can act as ground truth for the future implementation of a live, automated systematic review. The three benefits of REDASA's design are as follows: (1) it adopts a user-friendly, human-in-the-loop methodology by embedding an efficient, user-friendly curation platform into a natural language processing search engine; (2) it provides a curated data set in the JavaScript Object Notation format for experienced academic reviewers' critical appraisal choices and decision-making methodologies; and (3) due to the wide scope and depth of its web crawling method, REDASA has already captured one of the world's largest COVID-19-related data corpora for searches and curation.
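
The abstract states that curators' labels and ratings are pooled into a curated data set in JavaScript Object Notation format, but it does not give the schema. As a rough illustration only, the sketch below shows one way such records might be represented and pooled. Every field name, the 1-5 rating scale, and the averaging step are assumptions made for this example and are not taken from REDASA.

```python
import json
from dataclasses import dataclass, asdict
from statistics import mean
from typing import List


@dataclass
class Appraisal:
    """One curator's appraisal of one document (hypothetical schema)."""
    document_id: str
    curator_id: str
    relevance: int           # assumed 1-5 rating scale
    quality: int             # assumed 1-5 rating scale
    key_evidence: List[str]  # text spans the curator marked as key evidence


def pool(appraisals):
    """Group appraisals by document and average each rating."""
    by_doc = {}
    for a in appraisals:
        by_doc.setdefault(a.document_id, []).append(a)
    return [
        {
            "document_id": doc_id,
            "n_curators": len(items),
            "mean_relevance": mean(a.relevance for a in items),
            "mean_quality": mean(a.quality for a in items),
            "appraisals": [asdict(a) for a in items],
        }
        for doc_id, items in by_doc.items()
    ]


# Illustrative, made-up appraisals; not data from the study.
appraisals = [
    Appraisal("doc-001", "curator-a", 4, 3, ["example span"]),
    Appraisal("doc-001", "curator-b", 5, 4, []),
    Appraisal("doc-002", "curator-a", 2, 2, []),
]
pooled = pool(appraisals)
print(json.dumps(pooled, indent=2))
```

Keeping each curator's individual appraisal alongside the pooled averages is a design choice of this sketch: it would let a later automated review step inspect disagreement between curators rather than only the mean.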

Identifiers

pubmed: 33835932
pii: v23i5e25714
doi: 10.2196/25714
pmc: PMC8104004

Publication types

Journal Article; Validation Study

Languages

English

Citation subsets

IM

Pagination

e25714

Investigators

Ademola Adeyeye (A)
Ahmed Ezzat (A)
Alberto Porcu (A)
Alexander Walmsley (A)
Ali Farsi (A)
Alison Faye O Chan (A)
Aminah Abdul Razzack (A)
Andee Dzulkarnaen Zakaria (A)
Andrew Yiu (A)
Antonios Soliman (A)
Ariana Axiaq (A)
Avinash Aujayeb (A)
Catherine Dominic (C)
Eduarda Sá-Marta (E)
Eunice F Nolasco (E)
Jessamine Edith S Ferrer (J)
Jonathan Anthony Kat (J)
Josephine Holt (J)
Kamal Awad (K)
Kirk Chalmers (K)
Mina Ragheb (M)
Muhammad Khawar Sana (M)
Niraj Sandeep Kumar (N)
Roland Amoah (R)
Semra Demirli Atici (S)
Shane Charles (S)
Sunnia Ahmed (S)
Teresa Perra (T)
Tricia Tay (T)
Ubaid Ullah (U)
Zara Ahmed (Z)
Zun Zheng Ong (Z)

Copyright information

©Uddhav Vaghela, Simon Rabinowicz, Paris Bratsos, Guy Martin, Epameinondas Fritzilas, Sheraz Markar, Sanjay Purkayastha, Karl Stringer, Harshdeep Singh, Charlie Llewellyn, Debabrata Dutta, Jonathan M Clarke, Matthew Howard, PanSurg REDASA Curators, Ovidiu Serban, James Kinross. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 06.05.2021.

Authors

Uddhav Vaghela (U)

PanSurg Collaborative, Department of Surgery and Cancer, Imperial College London, London, United Kingdom.

Simon Rabinowicz (S)

PanSurg Collaborative, Department of Surgery and Cancer, Imperial College London, London, United Kingdom.

Paris Bratsos (P)

PanSurg Collaborative, Department of Surgery and Cancer, Imperial College London, London, United Kingdom.

Guy Martin (G)

PanSurg Collaborative, Department of Surgery and Cancer, Imperial College London, London, United Kingdom.

Epameinondas Fritzilas (E)

Amazon Web Services UK Limited, London, United Kingdom.

Sheraz Markar (S)

PanSurg Collaborative, Department of Surgery and Cancer, Imperial College London, London, United Kingdom.

Sanjay Purkayastha (S)

PanSurg Collaborative, Department of Surgery and Cancer, Imperial College London, London, United Kingdom.

Karl Stringer (K)

MirrorWeb Limited, Manchester, United Kingdom.

Harshdeep Singh (H)

Cloudwick Technologies, Newark, CA, United States.

Charlie Llewellyn (C)

Amazon Web Services UK Limited, London, United Kingdom.

Debabrata Dutta (D)

Cloudwick Technologies, Newark, CA, United States.

Jonathan M Clarke (JM)

PanSurg Collaborative, Department of Surgery and Cancer, Imperial College London, London, United Kingdom.

Matthew Howard (M)

Amazon Web Services UK Limited, London, United Kingdom.

PanSurg REDASA Curators

PanSurg Collaborative, Department of Surgery and Cancer, Imperial College London, London, United Kingdom.
see Acknowledgments, London, United Kingdom.

Ovidiu Serban (O)

Data Science Institute, Imperial College London, London, United Kingdom.

James Kinross (J)

PanSurg Collaborative, Department of Surgery and Cancer, Imperial College London, London, United Kingdom.
