Standardized Description of the Feature Extraction Process to Transform Raw Data Into Meaningful Information for Enhancing Data Reuse: Consensus Study.
Observation Medical Outcomes Partnership
algorithm
data reuse
data warehouse
database
feature extraction
Journal
JMIR medical informatics
ISSN: 2291-9694
Abbreviated title: JMIR Med Inform
Country: Canada
NLM ID: 101645109
Publication information
Publication date:
17 Oct 2022
History:
received: 22 Apr 2022
revised: 19 Jul 2022
accepted: 11 Aug 2022
entrez: 17 Oct 2022
pubmed: 18 Oct 2022
medline: 18 Oct 2022
Status: epublish
Abstract
BACKGROUND
Despite the many opportunities that data reuse offers, its implementation presents many difficulties, and raw data cannot be reused directly. Information is not always directly available in the source database and must be computed afterward from the raw data by defining an algorithm.
OBJECTIVE
The main purpose of this article is to present a standardized description of the steps and transformations required during the feature extraction process when conducting retrospective observational studies. A secondary objective is to identify how the features could be stored in the schema of a data warehouse.
METHODS
This study involved the following 3 main steps: (1) the collection of relevant study cases related to feature extraction and based on the automatic and secondary use of data; (2) the standardized description of the raw data, steps, and transformations that were common to the study cases; and (3) the identification of an appropriate table to store the features in the Observational Medical Outcomes Partnership (OMOP) common data model (CDM).
RESULTS
We interviewed 10 researchers from 3 French university hospitals and a national institution, who were involved in 8 retrospective observational studies. Based on these studies, 2 states (track and feature) and 2 transformations (track definition and track aggregation) emerged. A "track" is a time-dependent signal or period of interest, defined by a statistical unit, a value, and 2 milestones (a start event and an end event). A "feature" is time-independent, high-level information with the same dimensionality as the statistical unit of the study, defined by a label and a value; its time dimension has become implicit in the value or name of the variable. We propose 2 tables, "TRACK" and "FEATURE," to store the variables obtained through feature extraction and to extend the OMOP CDM.
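The two states described in these results can be pictured as small record types. The following is only a minimal sketch: the field names (`unit_id`, `start`, `end`), the heart-rate example, and the `aggregate_max` helper are illustrative assumptions, not the column names or functions proposed by the authors for the TRACK and FEATURE tables.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Track:
    """Time-dependent state: a value attached to a statistical unit and two milestones."""
    unit_id: int      # statistical unit, e.g. a hospital stay (hypothetical field name)
    value: float      # value carried by the signal or period
    start: datetime   # start event
    end: datetime     # end event


@dataclass(frozen=True)
class Feature:
    """Time-independent state: one label and one value per statistical unit."""
    unit_id: int
    label: str
    value: float


def aggregate_max(tracks, unit_id, label):
    """Track aggregation (one possible choice): keep the highest value for a unit."""
    values = [t.value for t in tracks if t.unit_id == unit_id]
    if not values:
        raise ValueError(f"no track for unit {unit_id}")
    return Feature(unit_id=unit_id, label=label, value=max(values))


# Illustrative data: two heart-rate tracks for stay 1, one for stay 2.
tracks = [
    Track(1, 72.0, datetime(2022, 1, 1, 8, 0), datetime(2022, 1, 1, 8, 5)),
    Track(1, 110.0, datetime(2022, 1, 1, 8, 5), datetime(2022, 1, 1, 8, 10)),
    Track(2, 95.0, datetime(2022, 1, 2, 9, 0), datetime(2022, 1, 2, 9, 5)),
]
feature = aggregate_max(tracks, unit_id=1, label="max_heart_rate")
```

In this sketch the feature no longer has start or end fields: the time dimension survives only in the label ("max" over the stay), which mirrors the description above.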
CONCLUSIONS
We propose a standardized description of the feature extraction process, which combines 2 steps: track definition and track aggregation. Dividing feature extraction into these 2 steps confines the difficulty to track definition. Standardizing tracks requires great expertise with regard to the data but allows the application of an infinite number of complex transformations. In contrast, track aggregation is a very simple operation with a finite number of possibilities. A complete description of these steps could enhance the reproducibility of retrospective studies.
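The contrast between the 2 steps can be illustrated with a small sketch. Everything here is assumed for illustration: the blood pressure samples, the threshold of 65, the episode rule inside `define_tracks`, and the contents of `AGGREGATIONS` are hypothetical and are not taken from the study cases. The point is only that the data-specific logic lives in track definition, while aggregation is a choice from a short, fixed menu.

```python
from datetime import datetime, timedelta


def define_tracks(samples, threshold):
    """Track definition: group consecutive samples below the threshold into episodes.

    Returns a list of (start, end, lowest_value) tuples. The episode rule is an
    assumption made for this example; real definitions depend on the data.
    """
    tracks = []
    current = None  # [start, end, lowest_value] of the episode being built
    for ts, value in sorted(samples):
        if value < threshold:
            if current is None:
                current = [ts, ts, value]
            else:
                current[1] = ts
                current[2] = min(current[2], value)
        elif current is not None:
            tracks.append(tuple(current))
            current = None
    if current is not None:
        tracks.append(tuple(current))
    return tracks


# Track aggregation: a finite menu of simple operations (non-empty input assumed
# for "min_value").
AGGREGATIONS = {
    "count": lambda tracks: len(tracks),
    "total_minutes": lambda tracks: sum(
        (end - start).total_seconds() for start, end, _ in tracks
    ) / 60,
    "min_value": lambda tracks: min(value for _, _, value in tracks),
}

# Illustrative raw data: one pressure sample per minute for a single stay.
t0 = datetime(2022, 1, 1, 8, 0)
raw_values = [80, 60, 58, 70, 62, 61, 63, 75]
samples = [(t0 + timedelta(minutes=i), v) for i, v in enumerate(raw_values)]

tracks = define_tracks(samples, threshold=65)
features = {name: fn(tracks) for name, fn in AGGREGATIONS.items()}
```

With these samples, `define_tracks` yields 2 episodes, and each entry in `features` is a single time-independent value for the stay.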
Identifiers
pubmed: 36251369
pii: v10i10e38936
doi: 10.2196/38936
pmc: PMC9623460
Publication types
Journal Article
Languages
eng
Pagination
e38936
Copyright information
©Antoine Lamer, Mathilde Fruchart, Nicolas Paris, Benjamin Popoff, Anaïs Payen, Thibaut Balcaen, William Gacquer, Guillaume Bouzillé, Marc Cuggia, Matthieu Doutreligne, Emmanuel Chazard. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 17.10.2022.