DREAMER: a computational framework to evaluate readiness of datasets for machine learning.
Data quality measure
Data readiness
Feature engineering
Machine learning
Journal
BMC Medical Informatics and Decision Making
ISSN: 1472-6947
Abbreviated title: BMC Med Inform Decis Mak
Country: England
NLM ID: 101088682
Publication information
Publication date: 04 Jun 2024
History:
received: 13 Aug 2023
accepted: 20 May 2024
medline: 4 Jun 2024
pubmed: 4 Jun 2024
entrez: 3 Jun 2024
Status: epublish
Abstract
BACKGROUND
Machine learning (ML) has emerged as the predominant computational paradigm for analyzing large-scale datasets across diverse domains. The assessment of dataset quality stands as a pivotal precursor to the successful deployment of ML models. In this study, we introduce DREAMER (Data REAdiness for MachinE learning Research), an algorithmic framework leveraging supervised and unsupervised machine learning techniques to autonomously evaluate the suitability of tabular datasets for ML model development. DREAMER is openly accessible as a tool on GitHub and Docker, facilitating its adoption and further refinement within the research community.
RESULTS
The proposed model in this study was applied to three distinct tabular datasets, resulting in notable enhancements in their quality with respect to readiness for ML tasks, as assessed through established data quality metrics. Our findings demonstrate the efficacy of the framework in substantially augmenting the original dataset quality, achieved through the elimination of extraneous features and rows. This refinement yielded improved accuracy across both supervised and unsupervised learning methodologies.
CONCLUSIONS
Our software presents an automated framework for data readiness, aimed at enhancing the integrity of raw datasets to facilitate robust utilization within ML pipelines. Through our proposed framework, we streamline the original dataset, resulting in enhanced accuracy and efficiency within the associated ML algorithms.
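The core cleaning step the abstract describes, eliminating extraneous features (columns) and rows from a tabular dataset before model training, can be illustrated with a minimal, generic sketch. This is an assumption-laden illustration, not the DREAMER algorithm itself (which applies supervised and unsupervised ML); the `prune_table` helper and its missingness threshold are hypothetical:

```python
# Illustrative data-readiness step: prune sparse columns, then incomplete rows.
# Hypothetical helper; NOT the actual DREAMER method described in the article.

def prune_table(rows, col_names, max_missing_frac=0.5):
    """Drop columns whose fraction of missing (None) values exceeds the
    threshold, then drop rows that still contain missing values."""
    n = len(rows)
    keep = [
        j for j, _ in enumerate(col_names)
        if sum(r[j] is None for r in rows) / n <= max_missing_frac
    ]
    kept_names = [col_names[j] for j in keep]
    kept_rows = [
        [r[j] for j in keep] for r in rows
        if all(r[j] is not None for j in keep)
    ]
    return kept_names, kept_rows

names, data = prune_table(
    [[1, None, 3], [2, None, None], [4, None, 6]],
    ["a", "b", "c"],
)
# column "b" is entirely missing and is dropped; the second row is then
# dropped because its value in "c" is still missing
```

Pruning columns before rows matters here: a row is discarded only for missingness in the columns that survive, so sparse columns do not drag otherwise-complete rows out of the dataset.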
Identifiers
pubmed: 38831432
doi: 10.1186/s12911-024-02544-w
pii: 10.1186/s12911-024-02544-w
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination: 152
Grants
Agency: NIH HHS
ID: RF1-AG062109
Country: United States
Agency: NIH HHS
ID: U19-AG068753
Country: United States
Copyright information
© 2024. The Author(s).
References
Sarker IH. Machine learning: algorithms, real-world applications and research directions. SN Comput Sci. 2021;2:160.
doi: 10.1007/s42979-021-00592-x
pubmed: 33778771
pmcid: 7983091
Lawrence ND. Data readiness levels. arXiv preprint arXiv:170502245. 2017.
Dakka MA, Nguyen TV, Hall JMM, Diakiw SM, VerMilyea M, Linke R, et al. Automated detection of poor-quality data: case studies in healthcare. Sci Rep. 2021;11:18005.
doi: 10.1038/s41598-021-97341-0
pubmed: 34504205
pmcid: 8429593
Austin CC. A path to big data readiness. In: 2018 IEEE International Conference on Big Data (Big Data). IEEE; 2018. pp. 4844–53.
Barham H, Daim T. The use of readiness assessment for big data projects. Sustain Cities Soc. 2020;60:102233.
doi: 10.1016/j.scs.2020.102233
de Hond AAH, Leeuwenberg AM, Hooft L, Kant IMJ, Nijman SWJ, van Os HJA, et al. Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. NPJ Digit Med. 2022;5:2.
doi: 10.1038/s41746-021-00549-7
pubmed: 35013569
pmcid: 8748878
Castelijns LA, Maas Y, Vanschoren J. The abc of data: A classifying framework for data readiness. In: Machine Learning and Knowledge Discovery in Databases: International Workshops of ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I. Springer; 2020. pp. 3–16.
Feurer M, Klein A, Eggensperger K, Springenberg J, Blum M, Hutter F. Efficient and Robust Automated Machine Learning. In Advances in neural information processing systems. 2015;28:2962–2970.
Gebru T, Morgenstern J, Vecchione B, Vaughan JW, Wallach H, Iii HD, et al. Datasheets for datasets. Commun ACM. 2021;64:86–92.
doi: 10.1145/3458723
Bender EM, Friedman B. Data statements for natural language processing: toward mitigating system bias and enabling better science. Trans Assoc Comput Linguist. 2018;6:587–604.
doi: 10.1162/tacl_a_00041
Arnold M, Bellamy RKE, Hind M, Houde S, Mehta S, Mojsilović A, et al. FactSheets: increasing trust in AI services through supplier’s declarations of conformity. IBM J Res Dev. 2019;63(4/5):1–6.
doi: 10.1147/JRD.2019.2942288
Holland S, Hosny A, Newman S, Joseph J, Chmielinski K. The dataset nutrition label: A framework to drive higher data quality standards. arXiv preprint arXiv:180503677. Hart Publishing. 2020;12(12):1.
Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, et al. Model cards for model reporting. In: Proceedings of the conference on fairness, accountability, and transparency. 2019. pp. 220–9.
Petersen AH, Ekstrøm CT. dataMaid: your assistant for documenting supervised data quality screening in R. J Stat Softw. 2019;90:1–38.
doi: 10.18637/jss.v090.i06
Arslan RC. How to automatically document data with the codebook package to facilitate data reuse. Adv Methods Pract Psychol Sci. 2019;2:169–87.
doi: 10.1177/2515245919838783
Gupta N, Patel H, Afzal S, Panwar N, Mittal RS, Guttula S, et al. Data Quality Toolkit: automatic assessment of data quality and remediation for machine learning datasets. arXiv Preprint arXiv:210805935. 2021.
Afzal S, Rajmohan C, Kesarwani M, Mehta S, Patel H. Data Readiness Report. In: 2021 IEEE International Conference on Smart Data Services (SMDS). IEEE; 2021. pp. 42–51.
Lavin A, Gilligan-Lee CM, Visnjic A, Ganju S, Newman D, Ganguly S, et al. Technology readiness levels for machine learning systems. Nat Commun. 2022;13:6039.
doi: 10.1038/s41467-022-33128-9
pubmed: 36266298
pmcid: 9585100
Zhang A, Xing L, Zou J, Wu JC. Shifting machine learning for healthcare from development to deployment and from models to data. Nat Biomed Eng. London: Nature Publishing Group; 2022;6(12):1330–45.
ADNI Dataset. http://adni.loni.usc.edu. Accessed 28 May 2024.
FHS Dataset. https://www.framinghamheartstudy.org. Accessed 28 May 2024.
Street WN, Wolberg WH, Mangasarian OL. Nuclear feature extraction for breast tumor diagnosis. Biomedical image processing and biomedical visualization. SPIE; 1993. pp. 861–70.
doi: 10.1117/12.148698
Xu X-Y, Huang X-L, Li Z-M, Gao J, Jiao Z-Q, Wang Y, et al. A scalable photonic computer solving the subset sum problem. Sci Adv. 2020;6:eaay5853.
doi: 10.1126/sciadv.aay5853
pubmed: 32064352
pmcid: 6994215
Oh S. A new dataset evaluation method based on category overlap. Comput Biol Med. 2011;41:115–22.
doi: 10.1016/j.compbiomed.2010.12.006
pubmed: 21216397