Zgli: A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis.

Keywords: CompLearn; Kolmogorov complexity; Zgli; clustering by compression; clustering techniques; normalized compression distance

Journal

Sensors (Basel, Switzerland)
ISSN: 1424-8220
Abbreviated title: Sensors (Basel)
Country: Switzerland
NLM ID: 101204366

Publication information

Publication date:
20 Jan 2023
History:
received: 19 Nov 2022
revised: 13 Jan 2023
accepted: 17 Jan 2023
entrez: 11 Feb 2023
pubmed: 12 Feb 2023
medline: 15 Feb 2023
Status: epublish

Abstract

The normalized compression distance (NCD) is a compression-based similarity measure between a pair of finite objects. Clustering methods usually rely on distances (e.g., Euclidean distance, Manhattan distance) to measure the similarity between objects. The NCD is another such distance, with particular characteristics, that can be used to build the starting distance matrix for methods such as hierarchical clustering or K-medoids. In this work, we propose Zgli, a novel Python module that enables the user to compute the NCD between files inside a given folder. Inspired by the CompLearn Linux command line tool, this module builds on it by providing new text file compressors, a new compression-by-column option for tabular data, such as CSV files, and an encoder for small files made up of categorical data. Our results demonstrate that compression by column can yield better results than previous methods in the literature when clustering tabular data. Additionally, the results show that the categorical encoder can augment categorical data, allowing the use of the NCD for new data types. One advantage of this new feature is that it does not require knowledge or context of the data. Furthermore, because the proposed module is written in Python, one of the most popular programming languages for machine learning, developers can readily use it to tackle problems with a new approach based on compression. This pipeline was tested on clinical data and proved to be a promising computational strategy, providing patient stratification via clusters to aid precision medicine.
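To make the idea of an NCD-based starting distance matrix concrete, the following is a minimal sketch using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C denotes compressed length. It is not Zgli's API: the function names `ncd` and `ncd_matrix` are hypothetical, and `zlib` from the Python standard library is assumed here as the compressor purely for illustration.

```python
import zlib
from typing import Callable, List, Sequence


def ncd(x: bytes, y: bytes,
        compress: Callable[[bytes], bytes] = zlib.compress) -> float:
    """Normalized compression distance between two byte strings.

    Hypothetical helper (not part of Zgli); zlib is an assumed compressor.
    """
    cx = len(compress(x))        # C(x): compressed size of x alone
    cy = len(compress(y))        # C(y): compressed size of y alone
    cxy = len(compress(x + y))   # C(xy): compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(objects: Sequence[bytes],
               compress: Callable[[bytes], bytes] = zlib.compress
               ) -> List[List[float]]:
    """Pairwise NCD matrix, usable as input to distance-based clustering.

    Illustrative conventions assumed here: the diagonal is set to 0.0 and
    the value computed for (i, j) with i < j is mirrored to (j, i), although
    a real compressor may give slightly different sizes for xy and yx.
    """
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = ncd(objects[i], objects[j], compress)
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    # Toy inputs; in a folder-based workflow these would be file contents.
    docs = [
        b"patient,age,status\n1,34,active\n2,51,remission\n" * 10,
        b"patient,age,status\n3,29,active\n4,47,remission\n" * 10,
        bytes(range(256)),
    ]
    for row in ncd_matrix(docs):
        print(["%.3f" % value for value in row])
```

The resulting matrix can be passed to any method that accepts precomputed distances, such as hierarchical clustering or K-medoids. Because practical compressors only approximate the ideal, values may fall slightly outside the interval [0, 1].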

Identifiers

pubmed: 36772258
pii: s23031219
doi: 10.3390/s23031219
pmc: PMC9920187

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Grants

Agency: Fundação para a Ciência e Tecnologia
ID: UIDB/00408/2020
Agency: Fundação para a Ciência e Tecnologia
ID: UIDP/00408/2020
Agency: Fundação para a Ciência e Tecnologia
ID: UIDB/50008/2020
Agency: Fundação para a Ciência e Tecnologia
ID: UIDP/50008/2020
Agency: Fundação para a Ciência e Tecnologia
ID: PREDICT (PTDC/CCI-CIF/29877/2017)
Agency: Fundação para a Ciência e Tecnologia
ID: DSAIPA/DS/0026/2019, PTDC/CCI-BIO/4180/2020, PTDC/CTM-REF/2679/2020

Authors

Diogo Azevedo (D)

LASIGE, Departamento de Informática da Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal.

Ana Maria Rodrigues (AM)

EpiDoC Unit, The Chronic Diseases Research Centre, NOVA Medical School, NOVA University of Lisbon, 1169-056 Lisboa, Portugal.
Comprehensive Health Research Center, NOVA Medical School, NOVA University of Lisbon, 1150-082 Lisboa, Portugal.

Helena Canhão (H)

EpiDoC Unit, The Chronic Diseases Research Centre, NOVA Medical School, NOVA University of Lisbon, 1169-056 Lisboa, Portugal.
Comprehensive Health Research Center, NOVA Medical School, NOVA University of Lisbon, 1150-082 Lisboa, Portugal.

Alexandra M Carvalho (AM)

Instituto de Telecomunicações, 1049-001 Lisboa, Portugal.
Department of Electrical and Computer Engineering, Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisboa, Portugal.
Lisbon Unit for Learning and Intelligent Systems, 1049-001 Lisboa, Portugal.

André Souto (A)

LASIGE, Departamento de Informática da Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal.
Instituto de Telecomunicações, 1049-001 Lisboa, Portugal.
