High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP).


Journal

Nature Protocols
ISSN: 1750-2799
Abbreviated title: Nat Protoc
Country: England
NLM ID: 101284307

Publication information

Publication date:
December 2019
History:
received: 19 October 2018
accepted: 22 July 2019
pubmed: 22 November 2019
medline: 8 February 2020
entrez: 22 November 2019
Status: ppublish

Abstract

Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variation in the accuracy of codes, as well as the high level of manual input required to identify features for the algorithm and to obtain gold standard labels. To address these challenges, we developed PheCAP, a high-throughput semi-supervised phenotyping pipeline. PheCAP begins with data from the EMR, including structured data and information extracted from the narrative notes using natural language processing (NLP). The standardized steps integrate automated procedures, which reduce the level of manual input, and machine learning approaches for algorithm training. PheCAP itself can be executed in 1-2 days if all data are available; however, the timing is largely dependent on the chart review stage, which typically requires at least 2 weeks. The final products of PheCAP include a phenotype algorithm, the probability of the phenotype for all patients, and a phenotype classification (yes or no).
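
The three final products named in the abstract (an algorithm, a per-patient probability, and a yes/no classification) can be illustrated with a minimal sketch. This is not the PheCAP implementation: the simulated patients, the two features (log-transformed counts of diagnosis codes and of NLP mentions), the plain L2-penalized logistic regression, the size of the chart-reviewed subset, and the 0.5 threshold are all assumptions chosen for illustration only.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, l2=0.01, lr=0.5, epochs=2000):
    """Fit an L2-penalized logistic regression by batch gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * (grad_w[j] / n + l2 * wj) for j, wj in enumerate(w)]
    return w, b


def predict_proba(X, w, b):
    # Probability of the phenotype for each patient.
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def simulate_patients(n, seed=0):
    """Hypothetical cohort: cases tend to have more codes and NLP mentions."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        case = 1 if rng.random() < 0.3 else 0
        n_codes = max(0, round(rng.gauss(6 if case else 1, 2)))
        n_mentions = max(0, round(rng.gauss(8 if case else 1, 3)))
        X.append([math.log1p(n_codes), math.log1p(n_mentions)])
        y.append(case)
    return X, y


# All patients have features; only a small subset has gold-standard labels
# (standing in for the chart review stage).
X_all, y_true = simulate_patients(500)
labeled = range(100)

# 1. Phenotype algorithm: a model trained on the labeled subset only.
w, b = fit_logistic([X_all[i] for i in labeled], [y_true[i] for i in labeled])

# 2. Probability of the phenotype for all patients.
probs = predict_proba(X_all, w, b)

# 3. Yes/no classification at an assumed threshold.
threshold = 0.5
labels = [int(p >= threshold) for p in probs]

accuracy = sum(int(l == t) for l, t in zip(labels, y_true)) / len(y_true)
print(f"accuracy on simulated cohort: {accuracy:.3f}")
```

In practice the threshold would be chosen to meet a target specificity or positive predictive value on held-out labeled data rather than fixed at 0.5.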

Identifiers

pubmed: 31748751
doi: 10.1038/s41596-019-0227-6
pii: 10.1038/s41596-019-0227-6
pmc: PMC7323894
mid: NIHMS1594913

Publication types

Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

Languages

eng

Citation subsets

IM

Pagination

3426-3444

Grants

Agency: CSRD VA
ID: I01 CX001025
Country: United States
Agency: NIAMS NIH HHS
ID: P30 AR072577
Country: United States
Agency: NHGRI NIH HHS
ID: R01 HG009174
Country: United States
Agency: NINDS NIH HHS
ID: R01 NS098023
Country: United States
Agency: NIAMS NIH HHS
ID: T32 AR007530
Country: United States
Agency: NLM NIH HHS
ID: U54 LM008748
Country: United States


Authors

Yichi Zhang (Y)

Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.

Tianrun Cai (T)

Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA.

Sheng Yu (S)

Center for Statistical Science, Tsinghua University, Beijing, China.
Department of Industrial Engineering, Tsinghua University, Beijing, China.

Kelly Cho (K)

Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA.
Division of Aging, Brigham and Women's Hospital, Boston, MA, USA.

Chuan Hong (C)

Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.

Jiehuan Sun (J)

Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.

Jie Huang (J)

Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA.

Yuk-Lam Ho (YL)

Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA.

Ashwin N Ananthakrishnan (AN)

Department of Gastroenterology, Massachusetts General Hospital, Boston, MA, USA.

Zongqi Xia (Z)

Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA.

Stanley Y Shaw (SY)

Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, MA, USA.

Vivian Gainer (V)

Research Information Science and Computing, Partners Healthcare, Boston, MA, USA.

Victor Castro (V)

Research Information Science and Computing, Partners Healthcare, Boston, MA, USA.

Nicholas Link (N)

Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA.

Jacqueline Honerlaw (J)

Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA.

Sicong Huang (S)

Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA.

David Gagnon (D)

Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA.
Department of Biostatistics, Boston University, Boston, MA, USA.

Elizabeth W Karlson (EW)

Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA.

Robert M Plenge (RM)

Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA.

Peter Szolovits (P)

Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA.

Guergana Savova (G)

Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA.

Susanne Churchill (S)

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Christopher O'Donnell (C)

Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA.
Division of Cardiology, VA Boston Healthcare System, Boston, MA, USA.

Shawn N Murphy (SN)

Research Information Science and Computing, Partners Healthcare, Boston, MA, USA.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Department of Neurology, Massachusetts General Hospital, Boston, MA, USA.

J Michael Gaziano (JM)

Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA.
Division of Aging, Brigham and Women's Hospital, Boston, MA, USA.

Isaac Kohane (I)

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Tianxi Cai (T)

Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Katherine P Liao (KP)

Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA. kliao@bwh.harvard.edu.
Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA. kliao@bwh.harvard.edu.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. kliao@bwh.harvard.edu.
