A comprehensive comparison of molecular feature representations for use in predictive modeling.

Cheminformatics; Molecular feature representation; Predictive modeling; Virtual screening

Journal

Computers in biology and medicine
ISSN: 1879-0534
Abbreviated title: Comput Biol Med
Country: United States
NLM ID: 1250250

Publication information

Publication date:
March 2021
History:
received: 18 Aug 2020
revised: 21 Dec 2020
accepted: 21 Dec 2020
pubmed: 12 Jan 2021
medline: 3 Jul 2021
entrez: 11 Jan 2021
Status: ppublish

Abstract

Machine learning methods are commonly used for predicting molecular properties to accelerate material and drug design. An important part of this process is deciding how to represent the molecules. Typically, machine learning methods expect examples represented by vectors of values, and many methods for calculating molecular feature representations have been proposed. In this paper, we perform a comprehensive comparison of different molecular features, including traditional methods such as fingerprints and molecular descriptors, and recently proposed learnable representations based on neural networks. Feature representations are evaluated on 11 benchmark datasets, used for predicting properties and measures such as mutagenicity, melting points, activity, solubility, and IC50. Our experiments show that several molecular features work similarly well over all benchmark datasets. The main exception is Spectrophores, which perform significantly worse than other features on most datasets. Molecular descriptors from the PaDEL library seem very well suited for predicting physical properties of molecules. Despite their simplicity, MACCS fingerprints performed very well overall. The results show that learnable representations achieve competitive performance compared to expert-based representations. However, task-specific representations (graph convolutions and Weave methods) rarely offer any benefits, even though they are computationally more demanding. Lastly, combining different molecular feature representations typically does not give a noticeable improvement in performance compared to individual feature representations.

Identifiers

pubmed: 33429140
pii: S0010-4825(20)30528-X
doi: 10.1016/j.compbiomed.2020.104197

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

104197

Copyright information

Copyright © 2020 Elsevier Ltd. All rights reserved.

Authors

Tomaž Stepišnik (T)

Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Ljubljana, Slovenia. Electronic address: tomaz.stepisnik@ijs.si.

Blaž Škrlj (B)

Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Ljubljana, Slovenia. Electronic address: blaz.skrlj@ijs.si.

Jörg Wicker (J)

The University of Auckland, School of Computer Science, Auckland, New Zealand. Electronic address: j.wicker@auckland.ac.nz.

Dragi Kocev (D)

Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia. Electronic address: dragi.kocev@ijs.si.
