Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets.

Bootstrap aggregation (bagging); Chemical classification; Class distribution imbalance; Edited nearest neighbor (ENN); Ensemble learning; Molecular fingerprints; Random forest (RF); Random undersampling (RUS); Resampling; Structure–activity relationship (SAR); Synthetic minority over-sampling technique (SMOTE)

Journal

Journal of Cheminformatics
ISSN: 1758-2946
Abbreviated title: J Cheminform
Country: England
NLM ID: 101516718

Publication information

Publication date:
27 Oct 2020
History:
received: 13 Dec 2019
accepted: 13 Oct 2020
entrez: 29 Dec 2020
pubmed: 30 Dec 2020
medline: 30 Dec 2020
Status: epublish

Abstract

The specificity of toxicant-target biomolecule interactions contributes to the highly imbalanced nature of many toxicity datasets, which causes poor performance in Structure-Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such imbalance. However, removing inactive chemical compound instances from the majority class by undersampling can result in information loss, whereas increasing active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlap and a higher false prediction rate. In this study, to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples and then clean the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consists of 12 in vitro bioassays for > 10,000 chemicals that are distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods: RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). The performance of the four learning methods was compared using nine evaluation metrics, among which F
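
The SMOTE-then-ENN resampling idea described in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the helper names (`nearest`, `smote`, `enn`), the toy two-dimensional data, the choice of k = 3, and the majority-vote form of ENN are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def nearest(x, pool, k):
    """Return the k items of pool closest to x; each item is (features, label)."""
    return sorted(pool, key=lambda item: math.dist(x, item[0]))[:k]


def smote(minority, n_new, k, rng):
    """Create n_new synthetic minority points by interpolating between neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbour = rng.choice(nearest(base[0], others, k))
        gap = rng.random()  # position along the segment base -> neighbour
        point = tuple(b + gap * (n - b) for b, n in zip(base[0], neighbour[0]))
        synthetic.append((point, base[1]))
    return synthetic


def enn(data, k):
    """Keep only samples whose label matches the majority label of their k neighbours."""
    kept = []
    for i, (x, label) in enumerate(data):
        others = data[:i] + data[i + 1:]
        votes = Counter(lbl for _, lbl in nearest(x, others, k))
        if votes.most_common(1)[0][0] == label:
            kept.append((x, label))
    return kept


# Hypothetical toy data: 40 inactive (label 0) and 5 active (label 1) points.
rng = random.Random(0)
inactive = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(40)]
active = [((rng.gauss(3, 1), rng.gauss(3, 1)), 1) for _ in range(5)]

synthetic = smote(active, n_new=35, k=3, rng=rng)  # oversample the minority class
balanced = inactive + active + synthetic
cleaned = enn(balanced, k=3)                       # remove points in overlapping regions
```

In this sketch, `cleaned` would be the training set passed to a classifier such as a random forest. Because ENN is applied to both classes here, it can remove original as well as synthetic points that fall in the overlap between classes.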

Identifiers

pubmed: 33372637
doi: 10.1186/s13321-020-00468-x
pii: 10.1186/s13321-020-00468-x
pmc: PMC7592558

Publication types

Journal Article

Languages

eng

Pagination

66

Authors

Gabriel Idakwo (G)

School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA.

Sundar Thangapandian (S)

Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.

Joseph Luttrell (J)

School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA.

Yan Li (Y)

Bennett Aerospace Inc, Cary, NC, 27518, USA.

Nan Wang (N)

Department of Computer Science, New Jersey City University, Jersey City, NJ, 07305, USA.

Zhaoxian Zhou (Z)

School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA.

Huixiao Hong (H)

Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA.

Bei Yang (B)

School of Information & Engineering, Zhengzhou University, Zhengzhou, 450000, China.

Chaoyang Zhang (C)

School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA. chaoyang.zhang@usm.edu.

Ping Gong (P)

Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA. Ping.Gong@usace.army.mil.

MeSH classifications