Beyond multidrug resistance: Leveraging rare variants with machine and statistical learning models in Mycobacterium tuberculosis resistance prediction.


Journal

EBioMedicine
ISSN: 2352-3964
Abbreviated title: EBioMedicine
Country: Netherlands
NLM ID: 101647039

Publication information

Publication date:
May 2019
History:
received: 09 Jan 2019
revised: 21 Feb 2019
accepted: 05 Apr 2019
pubmed: 03 May 2019
medline: 26 Nov 2019
entrez: 04 May 2019
Status: ppublish

Abstract

BACKGROUND
The diagnosis of multidrug resistant and extensively drug resistant tuberculosis is a global health priority. Whole genome sequencing of clinical Mycobacterium tuberculosis isolates promises to circumvent the long wait times and limited scope of conventional phenotypic antimicrobial susceptibility testing, but gaps remain in predicting phenotype accurately from genotypic data, especially for certain drugs. Our primary aim was to explore statistical learning algorithms and genetic predictor sets using a rich dataset to build a high performing and fast predicting model to detect anti-tuberculosis drug resistance.
METHODS
We collected targeted or whole genome sequencing and conventional drug resistance phenotyping data from 3601 Mycobacterium tuberculosis strains enriched for resistance to first- and second-line drugs, with 1228 multidrug resistant strains. We investigated the utility of (1) rare variants and variants known to be determinants of resistance for at least one drug and (2) machine and statistical learning architectures in predicting phenotypic drug resistance to 10 anti-tuberculosis drugs. Specifically, we investigated multitask and single task wide and deep neural networks, a multilayer perceptron, regularized logistic regression, and random forest classifiers.
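As a rough illustration of one of the model families named above, the following is a minimal sketch of L2-penalized logistic regression fitted on a binary variant presence/absence matrix. It is not the authors' implementation: the toy isolates, the three variant columns, the penalty strength `l2`, and the learning rate are all hypothetical values chosen only to make the example run.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_l2_logistic(X, y, l2=0.1, lr=0.5, epochs=2000):
    """Batch gradient descent on mean log loss plus (l2 / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The penalty shrinks the weights; the bias is left unpenalized.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical toy data: rows are isolates, columns are variants
# (1 = variant present). Column 0 stands in for a known resistance
# determinant, column 1 for a rare variant, column 2 for a neutral variant.
X = [
    [1, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1],
    [0, 0, 1], [0, 0, 0], [0, 0, 1], [0, 0, 0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = phenotypically resistant

w, b = fit_l2_logistic(X, y)
p_resistant = predict_proba(w, b, [1, 0, 0])    # carries the known determinant
p_susceptible = predict_proba(w, b, [0, 0, 0])  # carries no variants
```

In this toy setting the rare variant in column 1 receives a positive weight alongside the known determinant, which is the kind of signal a predictor set restricted to known determinants would not capture.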
FINDINGS
The highest performing machine and statistical learning methods included both rare variants and those known to be causal for resistance to at least one drug. Both simpler L2 penalized regression and complex machine learning models had high predictive performance. The average AUC for our highest performing model was 0.979 for first-line drugs and 0.936 for second-line drugs during repeated cross-validation. On an independent validation set, the highest performing model showed average AUCs, sensitivities, and specificities, respectively, of 0.937, 87.9%, and 92.7% for first-line drugs and 0.891, 82.0%, and 90.1% for second-line drugs. Our method outperforms existing approaches based on direct association, with an increase in the sum of sensitivity and specificity of 11.7% for first-line drugs and 3.2% for second-line drugs. Our method has higher predictive performance than previously reported machine learning models during cross-validation, with higher AUCs for 8 of 10 drugs.
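For readers less familiar with the metrics reported above, here is a minimal sketch of how AUC, sensitivity, specificity, and their sum can be computed from predicted resistance scores. The scores, labels, and the 0.5 decision threshold are hypothetical and are not taken from the study.

```python
def auc(scores, labels):
    """Fraction of (resistant, susceptible) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(scores, labels, threshold=0.5):
    """Call an isolate resistant when its score reaches the threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical predicted scores for eight isolates (1 = resistant).
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model_auc = auc(scores, labels)
sens, spec = sensitivity_specificity(scores, labels)
combined = sens + spec  # the summed metric used to compare methods
```

AUC is threshold-free, while sensitivity and specificity depend on the chosen cutoff, so comparing two methods on the sum of sensitivity and specificity also reflects how each method's threshold was set.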
INTERPRETATION
Statistical models, especially those that are trained using both frequent and less frequent variants, significantly improve the accuracy of resistance prediction and hold promise in bringing sequencing technologies closer to the bedside.

Identifiers

pubmed: 31047860
pii: S2352-3964(19)30250-6
doi: 10.1016/j.ebiom.2019.04.016
pmc: PMC6557804

Chemical substances

Antitubercular Agents 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

356-369

Grants

Agency: NIEHS NIH HHS
ID: K01 ES026835
Country: United States

Comments and corrections

Type: CommentIn

Copyright information

Copyright © 2019 The Authors. Published by Elsevier B.V. All rights reserved.

Authors

Michael L Chen (ML)

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States of America.

Akshith Doddi (A)

University of Virginia School of Medicine, Charlottesville, VA, United States of America.

Jimmy Royer (J)

Analysis Group Inc., United States of America.

Luca Freschi (L)

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States of America.

Marco Schito (M)

Critical Path Institute, 1730 E River Rd., Tucson, AZ, United States of America.

Matthew Ezewudo (M)

Critical Path Institute, 1730 E River Rd., Tucson, AZ, United States of America.

Isaac S Kohane (IS)

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States of America.

Andrew Beam (A)

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States of America; Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, United States of America.

Maha Farhat (M)

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States of America; Division of Pulmonary & Critical Care, Massachusetts General Hospital, Boston, MA, United States of America. Electronic address: Maha_Farhat@hms.harvard.edu.
