DREAMER: a computational framework to evaluate readiness of datasets for machine learning.


Journal

BMC medical informatics and decision making
ISSN: 1472-6947
Abbreviated title: BMC Med Inform Decis Mak
Country: England
NLM ID: 101088682

Publication information

Publication date:
04 Jun 2024
History:
received: 13 08 2023
accepted: 20 05 2024
medline: 4 6 2024
pubmed: 4 6 2024
entrez: 3 6 2024
Status: epublish

Abstract

BACKGROUND
Machine learning (ML) has emerged as the predominant computational paradigm for analyzing large-scale datasets across diverse domains. The assessment of dataset quality stands as a pivotal precursor to the successful deployment of ML models. In this study, we introduce DREAMER (Data REAdiness for MachinE learning Research), an algorithmic framework leveraging supervised and unsupervised machine learning techniques to autonomously evaluate the suitability of tabular datasets for ML model development. DREAMER is openly accessible as a tool on GitHub and Docker, facilitating its adoption and further refinement within the research community.
RESULTS
The proposed model in this study was applied to three distinct tabular datasets, resulting in notable enhancements in their quality with respect to readiness for ML tasks, as assessed through established data quality metrics. Our findings demonstrate the efficacy of the framework in substantially augmenting the original dataset quality, achieved through the elimination of extraneous features and rows. This refinement yielded improved accuracy across both supervised and unsupervised learning methodologies.
CONCLUSIONS
Our software presents an automated framework for data readiness, aimed at enhancing the integrity of raw datasets to facilitate robust utilization within ML pipelines. Through our proposed framework, we streamline the original dataset, resulting in enhanced accuracy and efficiency within the associated ML algorithms.
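
The abstract describes improving a tabular dataset by eliminating extraneous rows and features and then checking that learning accuracy improves. The sketch below illustrates that general idea only; it is not the DREAMER algorithm, whose details are not given in this record. The toy data, the function names (loo_1nn_accuracy, drop_incomplete_rows, greedy_feature_elimination), and the choice of leave-one-out 1-nearest-neighbour accuracy as the quality score are all assumptions made for illustration.

```python
import math

def loo_1nn_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the given feature indices (illustrative quality score)."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = math.inf, None
        for k, other in enumerate(rows):
            if k == i:
                continue  # hold out the row being classified
            dist = sum((row[j] - other[j]) ** 2 for j in features)
            if dist < best_dist:
                best_dist, best_label = dist, labels[k]
        correct += best_label == labels[i]
    return correct / len(rows)

def drop_incomplete_rows(rows, labels):
    """Remove rows that contain a missing value (None)."""
    kept = [(r, y) for r, y in zip(rows, labels) if None not in r]
    return [r for r, _ in kept], [y for _, y in kept]

def greedy_feature_elimination(rows, labels):
    """Drop one feature at a time while doing so strictly raises the score."""
    features = list(range(len(rows[0])))
    score = loo_1nn_accuracy(rows, labels, features)
    improved = True
    while improved and len(features) > 1:
        improved = False
        for j in list(features):
            candidate = [f for f in features if f != j]
            cand_score = loo_1nn_accuracy(rows, labels, candidate)
            if cand_score > score:
                features, score, improved = candidate, cand_score, True
                break
    return features, score

# Hypothetical toy data: column 0 separates the classes,
# column 1 is large-scale noise, and the last row has a missing value.
raw_rows = [
    [0.0, 10.0], [0.5, -40.0], [1.0, 30.0],
    [5.0, 12.0], [5.5, -38.0], [6.0, 28.0],
    [None, 3.0],
]
raw_labels = [0, 0, 0, 1, 1, 1, 1]

clean_rows, clean_labels = drop_incomplete_rows(raw_rows, raw_labels)
baseline = loo_1nn_accuracy(clean_rows, clean_labels, [0, 1])
selected, refined = greedy_feature_elimination(clean_rows, clean_labels)
print(f"baseline={baseline:.2f} refined={refined:.2f} kept features={selected}")
```

On this toy data the noisy column dominates the distance computation, so removing it raises the score. A real readiness assessment would also need an unsupervised criterion and a held-out evaluation, both omitted here for brevity.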

Identifiers

pubmed: 38831432
doi: 10.1186/s12911-024-02544-w
pii: 10.1186/s12911-024-02544-w

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

152

Grants

Agency: NIH HHS
ID: RF1-AG062109
Country: United States
Agency: NIH HHS
ID: U19-AG068753
Country: United States

Copyright information

© 2024. The Author(s).

Authors

Meysam Ahangaran (M)

Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.

Hanzhi Zhu (H)

Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.

Ruihui Li (R)

Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.

Lingkai Yin (L)

Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.

Joseph Jang (J)

Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.

Arnav P Chaudhry (AP)

Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.

Lindsay A Farrer (LA)

Department of Neurology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Department of Ophthalmology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Department of Epidemiology, Boston University School of Public Health, Boston, MA, USA.
Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA.
Boston University Alzheimer's Disease Research Center, Boston, MA, USA.
The Framingham Heart Study, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.

Rhoda Au (R)

Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Department of Neurology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Department of Epidemiology, Boston University School of Public Health, Boston, MA, USA.
Boston University Alzheimer's Disease Research Center, Boston, MA, USA.
The Framingham Heart Study, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.

Vijaya B Kolachalama (VB)

Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA. vkola@bu.edu.
Boston University Alzheimer's Disease Research Center, Boston, MA, USA. vkola@bu.edu.
Department of Computer Science, Boston University, Boston, MA, USA. vkola@bu.edu.
Faculty of Computing & Data Sciences, Boston University, Boston, MA, 02215, USA. vkola@bu.edu.
