High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP).
Journal
Nature protocols
ISSN: 1750-2799
Titre abrégé: Nat Protoc
Pays: England
ID NLM: 101284307
Informations de publication
Date de publication:
12 2019
12 2019
Historique:
received:
19
10
2018
accepted:
22
07
2019
pubmed:
22
11
2019
medline:
8
2
2020
entrez:
22
11
2019
Statut:
ppublish
Résumé
Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variation in the accuracy of codes, as well as the high level of manual input required to identify features for the algorithm and to obtain gold standard labels. To address these challenges, we developed PheCAP, a high-throughput semi-supervised phenotyping pipeline. PheCAP begins with data from the EMR, including structured data and information extracted from the narrative notes using natural language processing (NLP). The standardized steps integrate automated procedures, which reduce the level of manual input, and machine learning approaches for algorithm training. PheCAP itself can be executed in 1-2 d if all data are available; however, the timing is largely dependent on the chart review stage, which typically requires at least 2 weeks. The final products of PheCAP include a phenotype algorithm, the probability of the phenotype for all patients, and a phenotype classification (yes or no).
Identifiants
pubmed: 31748751
doi: 10.1038/s41596-019-0227-6
pii: 10.1038/s41596-019-0227-6
pmc: PMC7323894
mid: NIHMS1594913
doi:
Types de publication
Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
Langues
eng
Sous-ensembles de citation
IM
Pagination
3426-3444Subventions
Organisme : CSRD VA
ID : I01 CX001025
Pays : United States
Organisme : NIAMS NIH HHS
ID : P30 AR072577
Pays : United States
Organisme : NHGRI NIH HHS
ID : R01 HG009174
Pays : United States
Organisme : NINDS NIH HHS
ID : R01 NS098023
Pays : United States
Organisme : NIAMS NIH HHS
ID : T32 AR007530
Pays : United States
Organisme : NLM NIH HHS
ID : U54 LM008748
Pays : United States
Références
Brownstein, J. S. et al. Rapid identification of myocardial infarction risk associated with diabetes medications using electronic medical records. Diabetes Care 33, 526–531 (2010).
pubmed: 20009093
Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1110 (2013).
pubmed: 24270849
pmcid: 3969265
Kurreeman, F. et al. Genetic basis of autoantibody positive and negative rheumatoid arthritis risk in a multi-ethnic cohort derived from electronic health records. Am. J. Hum. Genet. 88, 57–69 (2011).
pubmed: 21211616
pmcid: 3014362
Liao, K. P. et al. Associations of autoantibodies, autoimmune risk alleles, and clinical diagnoses from the electronic medical records in rheumatoid arthritis cases and non-rheumatoid arthritis controls. Arthritis Rheumatol. 65, 571–581 (2013).
Canela-Xandri, O. et al. An atlas of genetic associations in UK Biobank. Nat. Genet. 50, 1593–1599 (2018).
pubmed: 30349118
pmcid: 6707814
Gaziano, J. M. et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70, 214–223 (2016).
pubmed: 26441289
Banda, J. M. et al. Electronic phenotyping with APHRODITE and the Observational Health Sciences and Informatics (OHDSI) data network. AMIA Jt. Summit. Transl. Sci. Proc. 2017, (48–57 (2017).
Kho, A. N. et al. Electronic medical records for genetic research: results of the eMERGE consortium. Sci. Transl. Med. 3, 79re71 (2011).
Kirby, J. C. et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J. Am. Med. Inform. Assoc. 23, 1046–1052 (2016).
pubmed: 27026615
pmcid: 5070514
O’Malley, K. J. et al. Measuring diagnoses: ICD code accuracy. Health Serv. Res. 40, 1620–1639 (2005).
pubmed: 16178999
pmcid: 1361216
Liao, K. P. et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care. Res. 62, 1120–1127 (2010).
Liao, K. P. et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ 350, h1885 (2015).
pubmed: 25911572
pmcid: 4707569
Yu, S. et al. Surrogate-assisted feature extraction for high-throughput phenotyping. J. Am. Med. Inform. Assoc. 24, e143–e149 (2017).
pubmed: 27632993
Yu, S. et al. Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. J. Am. Med. Inform. Assoc. 22, 993–1000 (2015).
pubmed: 25929596
pmcid: 4986664
Castro, V. M. et al. Validation of electronic health record phenotyping of bipolar disorder cases and controls. Am. J. Psychiatry 172, 363–372 (2015).
pubmed: 25827034
Murphy, S. N. et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J. Am. Med. Inform. Assoc. 17, 124–130 (2010).
pubmed: 20190053
pmcid: 3000779
Son, J. H. et al. Deep phenotyping on electronic health records facilitates genetic diagnosis by clinical exomes. Am. J. Hum. Genet. 103, 58–73 (2018).
pubmed: 29961570
pmcid: 6035281
Rasmussen, L. V. et al. Design patterns for the development of electronic health record-driven phenotype extraction algorithms. J. Biomed. Inform. 51, 280–286 (2014).
pubmed: 24960203
Basile, A. O. et al. Informatics and machine learning to define the phenotype. Expert. Rev. Mol. Diagn. 18, 219–226 (2018).
pubmed: 29431517
pmcid: 6080627
Ananthakrishnan, A. N. et al. Improving case definition of Crohn’s disease and ulcerative colitis in electronic medical records using natural language processing: a novel informatics approach. Inflamm. Bowel. Dis. 19, 1411–1420 (2013).
pubmed: 23567779
pmcid: 3665760
Carroll, R. J. et al. Portability of an algorithm to identify rheumatoid arthritis in electronic health records. J. Am. Med. Inform. Assoc. 19, e162–e169 (2012).
pubmed: 22374935
pmcid: 3392871
Xia, Z. et al. Modeling disease severity in multiple sclerosis using electronic health records. PLoS One 8, e78927 (2013).
pubmed: 24244385
pmcid: 3823928
Ananthakrishnan, A. N. et al. Association between reduced plasma 25-hydroxy vitamin D and increased risk of cancer in patients with inflammatory bowel diseases. Clin. Gastroenterol. Hepatol. 12, 821–827 (2014).
pubmed: 24161349
Cai, T. et al. The association between arthralgia and vedolizumab using natural language processing. Inflamm. Bowel. Dis. 24, 2242–2246 (2018).
pubmed: 29846617
pmcid: 6140445
Liao, K. P. et al. Association between low density lipoprotein and rheumatoid arthritis genetic factors with low density lipoprotein levels in rheumatoid arthritis and non-rheumatoid arthritis controls. Ann. Rheum. Dis. 73, 1170–1175 (2014).
pubmed: 23716066
Kurreeman, F. A. et al. Use of a multiethnic approach to identify rheumatoid- arthritis-susceptibility loci, 1p36 and 17q12. Am. J. Hum. Genet. 90, 524–532 (2012).
pubmed: 22365150
pmcid: 3309197
Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381 (2014).
pubmed: 24390342
Ananthakrishnan, A. N. et al. Common genetic variants influence circulating vitamin D levels in inflammatory bowel diseases. Inflamm. Bowel. Dis. 21, 2507–2514 (2015).
pubmed: 26241000
pmcid: 4615315
Sinnott, J. A. et al. Improving the power of genetic association tests with imperfect phenotype derived from electronic medical records. Hum. Genet. 133, 1369–1382 (2014).
pubmed: 25062868
pmcid: 4185241
Halpern, Y. et al. Electronic medical record phenotyping using the anchor and learn framework. J. Am. Med. Inform. Assoc. 23, 731–740 (2016).
pubmed: 27107443
pmcid: 4926745
Agarwal, V. et al. Learning statistical models of phenotypes using noisy labeled training data. J. Am. Med. Inform. Assoc. 23, 1166–1173 (2016).
pubmed: 27174893
pmcid: 5070523
Yu, S. et al. Enabling phenotypic big data with PheNorm. J. Am. Med. Inform. Assoc. 25, 54–60 (2018).
pubmed: 29126253
Lindberg, D. A. et al. The Unified Medical Language System. Methods Inf. Med. 32, 281–291 (1993).
pubmed: 8412823
pmcid: 6693515
Jupp, S., Burdett, T., Leroy, C. & Parkinson, H. A new ontology lookup service at EMBL-EBI. CEUR Workshop Proc. 1546, 118–119 (2015).
Savova, G. K. et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J. Am. Med. Inform. Assoc. 17, 507–513 (2010).
pubmed: 20819853
pmcid: 2995668
Goryachev, S. et al. A suite of natural language processing tools developed for the I2B2 project. AMIA Annu. Symp. Proc. 2006, 931 (2006).
pmcid: 1839726
Liu, H. D., Wagholikar, K., Jonnalagadda, S. & Sohn, S. Integrated cTAKES for concept mention detection and normalization. In CEUR Workshop Proceedings, Vol. 1179 (CEUR-WS, 2013).
Aronson, A. R. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc. AMIA Symp. 17-21 (2001).
Yu, S. et al. NILE: fast natural language processing for electronic health records. Preprint at https://arxiv.org/abs/1311.6063 (2013).
Manning, C. et al. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55-60 (Association for Computational Linguistics, 2014).
Chapman, W. W. et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J. Biomed. Inform. 34, 301–310 (2001).
pubmed: 12123149
Castro, V. M. et al. Large-scale identification of patients with cerebral aneurysms using natural language processing. Neurology 88, 164–168 (2017).
pubmed: 27927935
pmcid: 5224711
Castro, V. M. et al. Identification of subjects with polycystic ovary syndrome using electronic health records. Reprod. Biol. Endocrinol. 13, 116 (2015).
pubmed: 26510685
pmcid: 4625743
Jorge, A. et al. Identifying lupus patients in electronic health records: development and validation of machine learning algorithms and application of rule-based algorithms. Semin. Arthritis Rheum. 49, 84–90 (2019).
pubmed: 30665626
Perlis, R. H. et al. Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model. Psychol. Med. 42, 41–50 (2012).
pubmed: 21682950
Doss, J., Mo, H., Carroll, R. J., Crofford, L. J. & Denny, J. C. Phenome-wide association study of rheumatoid arthritis subgroups identifies association between seronegative disease and fibromyalgia. Arthritis Rheumatol. 69, 291–300 (2017).
pubmed: 27589350
pmcid: 5274573
Geva, A. et al. A computable phenotype improves cohort ascertainment in a pediatric pulmonary hypertension registry. J. Pediatr. 188, 224–231 (2017).
pubmed: 28625502
pmcid: 5572538