Data quantity governance for machine learning in materials science.
data governance
data quantity
machine learning
materials science
Journal
National Science Review
ISSN: 2053-714X
Abbreviated title: Natl Sci Rev
Country: China
NLM ID: 101633095
Publication information
Publication date: Jul 2023
History:
received: 27 Feb 2023
revised: 14 Apr 2023
accepted: 26 Apr 2023
medline: 16 Jun 2023
pubmed: 16 Jun 2023
entrez: 16 Jun 2023
Status: epublish
Abstract
Data-driven machine learning (ML) is widely employed in the analysis of materials structure-activity relationships, performance optimization and materials design because of its superior ability to reveal latent data patterns and make accurate predictions. However, because materials data acquisition is laborious, ML models face a mismatch between a high-dimensional feature space and a small sample size (for traditional ML models) or between the number of model parameters and the sample size (for deep-learning models), which usually results in poor performance. Here, we review efforts to tackle this issue via feature reduction, sample augmentation and specific ML approaches, and show that the balance between the number of samples and the number of features or model parameters deserves close attention during data quantity governance. Following this, we propose a synergistic data quantity governance flow that incorporates materials domain knowledge. After summarizing the approaches to incorporating materials domain knowledge into the ML process, we provide examples of incorporating domain knowledge into governance schemes to demonstrate the advantages and applications of the approach. This work paves the way for obtaining the high-quality data required to accelerate ML-based materials design and discovery.
Identifiers
pubmed: 37323811
doi: 10.1093/nsr/nwad125
pii: nwad125
pmc: PMC10265966
Publication types
Journal Article
Languages
eng
Pagination
nwad125
Copyright information
© The Author(s) 2023. Published by Oxford University Press on behalf of China Science Publishing & Media Ltd.