Assessing the Performance of Clinical Natural Language Processing Systems: Development of an Evaluation Methodology.

clinical natural language processing; electronic health records; gold standard; natural language processing; reference standard; sample size

Journal

JMIR medical informatics
ISSN: 2291-9694
Abbreviated title: JMIR Med Inform
Country: Canada
NLM ID: 101645109

Publication information

Publication date:
23 Jul 2021
History:
received: 20 May 2020
revised: 31 Jul 2020
accepted: 17 Jun 2021
entrez: 23 Jul 2021
pubmed: 24 Jul 2021
medline: 24 Jul 2021
Status: epublish

Abstract sections

BACKGROUND
Clinical natural language processing (cNLP) systems are crucially important because of their increasing capability to extract clinically important information from free text contained in electronic health records (EHRs). Converting an unstructured representation of a patient's clinical history into a structured format enables medical doctors to generate clinical knowledge at a level that was not possible before. Moreover, interpreting the insights provided by cNLP systems has great potential to drive decisions about clinical practice. However, carrying out robust evaluations of those cNLP systems is a complex task that is hindered by a lack of standard guidance on how to approach them systematically.
OBJECTIVE
Our objective was to offer natural language processing (NLP) experts a methodology for the evaluation of cNLP systems to assist them in carrying out this task. By following the proposed phases, the robustness and representativeness of the performance metrics of their own cNLP systems can be assured.
METHODS
The proposed evaluation methodology comprised five phases: (1) the definition of the target population, (2) the statistical document collection, (3) the design of the annotation guidelines and annotation project, (4) the external annotations, and (5) the cNLP system performance evaluation. We presented the application of all phases to evaluate the performance of a cNLP system called "EHRead Technology" (developed by Savana, an international medical company), applied in a study on patients with asthma. As part of the evaluation methodology, we introduced the Sample Size Calculator for Evaluations (SLiCE), a software tool that calculates the number of documents needed to build a gold standard that is statistically useful and resource-efficient.
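The idea of sizing a gold standard from a target confidence interval can be illustrated with a minimal Python sketch. It is not SLiCE's actual interface or algorithm, which this record does not describe; the function names, the normal-approximation formula, and the example values (expected metric, CI half-width, frequency of the variable in the documents) are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def cases_for_proportion(expected: float, half_width: float,
                         confidence: float = 0.95) -> int:
    """Cases needed so a normal-approximation CI around `expected`
    is no wider than +/- `half_width` (hypothetical helper)."""
    if not 0.0 < expected < 1.0:
        raise ValueError("expected must lie strictly between 0 and 1")
    if half_width <= 0.0:
        raise ValueError("half_width must be positive")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z * z * expected * (1.0 - expected) / half_width ** 2)


def documents_needed(expected: float, half_width: float, frequency: float,
                     confidence: float = 0.95) -> int:
    """Scale the number of cases by how often the variable appears.

    `frequency` is the assumed fraction of documents that mention the
    variable; rarer variables need more documents to reach the same
    number of evaluable cases.
    """
    if not 0.0 < frequency <= 1.0:
        raise ValueError("frequency must lie in (0, 1]")
    cases = cases_for_proportion(expected, half_width, confidence)
    return math.ceil(cases / frequency)


if __name__ == "__main__":
    # Illustrative values only, not taken from the asthma study.
    print(documents_needed(expected=0.9, half_width=0.05, frequency=0.25))
```

A higher expected metric or a wider tolerated interval reduces the number of documents, which is how a calculator of this kind can avoid annotating more records than the evaluation needs.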
RESULTS
The application of the proposed evaluation methodology on a real use-case study of patients with asthma revealed the benefit of the different phases for cNLP system evaluations. By using SLiCE to adjust the number of documents needed, a meaningful and resource-efficient gold standard was created. In the presented use-case, using as few as 519 EHRs, it was possible to evaluate the performance of the cNLP system and obtain performance metrics for the primary variable within the expected CIs.
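To show what "performance metrics within the expected CIs" can look like in practice, the following Python sketch computes precision, recall, and F1 from counts against a gold standard and attaches a Wilson score interval to precision and recall. The choice of the Wilson interval, the function names, and the counts are assumptions for illustration; the record does not state which interval the authors used, and the numbers are not results from the study.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, trials: int,
                    confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    margin = z * math.sqrt(p * (1.0 - p) / trials
                           + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, centre - margin), min(1.0, centre + margin)


def evaluate(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from true positives, false positives,
    and false negatives counted against the gold standard."""
    return {
        "precision": tp / (tp + fp),
        "precision_ci": wilson_interval(tp, tp + fp),
        "recall": tp / (tp + fn),
        "recall_ci": wilson_interval(tp, tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the functions.
    for name, value in evaluate(tp=90, fp=10, fn=30).items():
        print(name, value)
```

Comparing the width of each interval with the half-width assumed when the sample was sized indicates whether the gold standard was large enough for the variable in question.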
CONCLUSIONS
We showed that our evaluation methodology can offer guidance to NLP experts on how to approach the evaluation of their cNLP systems. By following the five phases, NLP experts can assure the robustness of their evaluation and avoid unnecessary investment of human and financial resources. Besides the theoretical guidance, we offer SLiCE as an easy-to-use, open-source Python library.

Identifiers

pubmed: 34297002
pii: v9i7e20492
doi: 10.2196/20492
pmc: PMC8367121

Publication types

Journal Article

Languages

eng

Pagination

e20492

Copyright information

©Lea Canales, Sebastian Menke, Stephanie Marchesseau, Ariel D’Agostino, Carlos del Rio-Bermudez, Miren Taberna, Jorge Tello. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 23.07.2021.

Authors

Lea Canales (L)

Department of Software and Computing Systems, University of Alicante, Alicante, Spain.

Miren Taberna (M)

MedSavana SL, Madrid, Spain.

Jorge Tello (J)

MedSavana SL, Madrid, Spain.

MeSH classifications