DREAMER: a computational framework to evaluate readiness of datasets for machine learning.
Data quality measure
Data readiness
Feature engineering
Machine learning
Journal
BMC Medical Informatics and Decision Making
ISSN: 1472-6947
Abbreviated title: BMC Med Inform Decis Mak
Country: England
NLM ID: 101088682
Publication information
Publication date: 04 Jun 2024
History:
received: 13 Aug 2023
accepted: 20 May 2024
medline: 4 Jun 2024
pubmed: 4 Jun 2024
entrez: 3 Jun 2024
Status: epublish
Abstract
BACKGROUND
Machine learning (ML) has emerged as the predominant computational paradigm for analyzing large-scale datasets across diverse domains. The assessment of dataset quality stands as a pivotal precursor to the successful deployment of ML models. In this study, we introduce DREAMER (Data REAdiness for MachinE learning Research), an algorithmic framework leveraging supervised and unsupervised machine learning techniques to autonomously evaluate the suitability of tabular datasets for ML model development. DREAMER is openly accessible as a tool on GitHub and Docker, facilitating its adoption and further refinement within the research community.
RESULTS
The proposed model in this study was applied to three distinct tabular datasets, resulting in notable enhancements in their quality with respect to readiness for ML tasks, as assessed through established data quality metrics. Our findings demonstrate the efficacy of the framework in substantially augmenting the original dataset quality, achieved through the elimination of extraneous features and rows. This refinement yielded improved accuracy across both supervised and unsupervised learning methodologies.
CONCLUSIONS
Our software presents an automated framework for data readiness, aimed at enhancing the integrity of raw datasets to facilitate robust utilization within ML pipelines. Through our proposed framework, we streamline the original dataset, resulting in enhanced accuracy and efficiency within the associated ML algorithms.
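The core cleaning step the abstract describes, eliminating extraneous features (columns) and rows from a tabular dataset before model training, can be illustrated with a minimal, generic sketch. This is an assumption-laden illustration, not the DREAMER algorithm itself (which applies supervised and unsupervised ML); the `prune_table` helper and its missingness threshold are hypothetical:

```python
# Illustrative data-readiness step: prune sparse columns, then incomplete rows.
# Hypothetical helper; NOT the actual DREAMER method described in the article.

def prune_table(rows, col_names, max_missing_frac=0.5):
    """Drop columns whose fraction of missing (None) values exceeds the
    threshold, then drop rows that still contain missing values."""
    n = len(rows)
    keep = [
        j for j, _ in enumerate(col_names)
        if sum(r[j] is None for r in rows) / n <= max_missing_frac
    ]
    kept_names = [col_names[j] for j in keep]
    kept_rows = [
        [r[j] for j in keep] for r in rows
        if all(r[j] is not None for j in keep)
    ]
    return kept_names, kept_rows

names, data = prune_table(
    [[1, None, 3], [2, None, None], [4, None, 6]],
    ["a", "b", "c"],
)
# column "b" is entirely missing and is dropped; the second row is then
# dropped because its value in "c" is still missing
```

Pruning columns before rows matters here: a row is discarded only for missingness in the columns that survive, so sparse columns do not drag otherwise-complete rows out of the dataset.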
Identifiers
pubmed: 38831432
doi: 10.1186/s12911-024-02544-w
pii: 10.1186/s12911-024-02544-w
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination: 152
Grants
Agency: NIH HHS
ID: RF1-AG062109
Country: United States
Agency: NIH HHS
ID: U19-AG068753
Country: United States
Copyright information
© 2024. The Author(s).
References
Sarker IH. Machine learning: algorithms, real-world applications and research directions. SN Comput Sci. 2021;2:160.
doi: 10.1007/s42979-021-00592-x
pubmed: 33778771
pmcid: 7983091
Lawrence ND. Data readiness levels. arXiv preprint arXiv:170502245. 2017.
Dakka MA, Nguyen TV, Hall JMM, Diakiw SM, VerMilyea M, Linke R, et al. Automated detection of poor-quality data: case studies in healthcare. Sci Rep. 2021;11:18005.
doi: 10.1038/s41598-021-97341-0
pubmed: 34504205
pmcid: 8429593
Austin CC. A path to big data readiness. In: 2018 IEEE International Conference on Big Data (Big Data). IEEE; 2018. pp. 4844–53.
Barham H, Daim T. The use of readiness assessment for big data projects. Sustain Cities Soc. 2020;60:102233.
doi: 10.1016/j.scs.2020.102233
de Hond AAH, Leeuwenberg AM, Hooft L, Kant IMJ, Nijman SWJ, van Os HJA, et al. Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. NPJ Digit Med. 2022;5:2.
doi: 10.1038/s41746-021-00549-7
pubmed: 35013569
pmcid: 8748878
Castelijns LA, Maas Y, Vanschoren J. The abc of data: A classifying framework for data readiness. In: Machine Learning and Knowledge Discovery in Databases: International Workshops of ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I. Springer; 2020. pp. 3–16.
Feurer M, Klein A, Eggensperger K, Springenberg J, Blum M, Hutter F. Efficient and Robust Automated Machine Learning. In Advances in neural information processing systems. 2015;28:2962–2970.
Gebru T, Morgenstern J, Vecchione B, Vaughan JW, Wallach H, Iii HD, et al. Datasheets for datasets. Commun ACM. 2021;64:86–92.
doi: 10.1145/3458723
Bender EM, Friedman B. Data statements for natural language processing: toward mitigating system bias and enabling better science. Trans Assoc Comput Linguist. 2018;6:587–604.
doi: 10.1162/tacl_a_00041
Arnold M, Bellamy RKE, Hind M, Houde S, Mehta S, Mojsilović A, et al. FactSheets: increasing trust in AI services through supplier’s declarations of conformity. IBM J Res Dev. 2019;63(4/5):1–6.
doi: 10.1147/JRD.2019.2942288
Holland S, Hosny A, Newman S, Joseph J, Chmielinski K. The dataset nutrition label: A framework to drive higher data quality standards. arXiv preprint arXiv:180503677. Hart Publishing. 2020;12(12):1.
Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, et al. Model cards for model reporting. In: Proceedings of the conference on fairness, accountability, and transparency. 2019. pp. 220–9.
Petersen AH, Ekstrøm CT. dataMaid: your assistant for documenting supervised data quality screening in R. J Stat Softw. 2019;90:1–38.
doi: 10.18637/jss.v090.i06
Arslan RC. How to automatically document data with the codebook package to facilitate data reuse. Adv Methods Pract Psychol Sci. 2019;2:169–87.
doi: 10.1177/2515245919838783
Gupta N, Patel H, Afzal S, Panwar N, Mittal RS, Guttula S, et al. Data Quality Toolkit: automatic assessment of data quality and remediation for machine learning datasets. arXiv Preprint arXiv:210805935. 2021.
Afzal S, Rajmohan C, Kesarwani M, Mehta S, Patel H. Data Readiness Report. In: 2021 IEEE International Conference on Smart Data Services (SMDS). IEEE; 2021. pp. 42–51.
Lavin A, Gilligan-Lee CM, Visnjic A, Ganju S, Newman D, Ganguly S, et al. Technology readiness levels for machine learning systems. Nat Commun. 2022;13:6039.
doi: 10.1038/s41467-022-33128-9
pubmed: 36266298
pmcid: 9585100
Zhang A, Xing L, Zou J, Wu JC. Shifting machine learning for healthcare from development to deployment and from models to data. Nat Biomed Eng. London: Nature Publishing Group; 2022;6(12):1330–45.
ADNI Dataset. http://adni.loni.usc.edu. Accessed 28 May 2024.
FHS Dataset. https://www.framinghamheartstudy.org. Accessed 28 May 2024.
Street WN, Wolberg WH, Mangasarian OL. Nuclear feature extraction for breast tumor diagnosis. Biomedical image processing and biomedical visualization. SPIE; 1993. pp. 861–70.
doi: 10.1117/12.148698
Xu X-Y, Huang X-L, Li Z-M, Gao J, Jiao Z-Q, Wang Y, et al. A scalable photonic computer solving the subset sum problem. Sci Adv. 2020;6:eaay5853.
doi: 10.1126/sciadv.aay5853
pubmed: 32064352
pmcid: 6994215
Oh S. A new dataset evaluation method based on category overlap. Comput Biol Med. 2011;41:115–22.
doi: 10.1016/j.compbiomed.2010.12.006
pubmed: 21216397