Boolean logic algebra driven similarity measure for text based applications.

Empirical study; Information retrieval; Similarity measure; Text classification; Text clustering

Journal

PeerJ. Computer science
ISSN: 2376-5992
Abbreviated title: PeerJ Comput Sci
Country: United States
NLM ID: 101660598

Publication information

Publication date:
2021
History:
received: 2020-12-18
accepted: 2021-06-22
entrez: 2021-08-17
pubmed: 2021-08-18
medline: 2021-08-18
Status: epublish

Abstract

In Information Retrieval (IR), Data Mining (DM), and Machine Learning (ML), similarity measures are widely used for text clustering and classification. The similarity measure is the cornerstone on which the performance of most DM and ML algorithms depends. Nevertheless, the search in the literature for a similarity measure that is both effective and efficient remains immature. Some recently proposed similarity measures are effective but have complex designs and suffer from inefficiencies. This work therefore develops an effective and efficient similarity measure with a simple design for text-based applications. The measure, driven by Boolean logic algebra basics (BLAB-SM), aims to reach the desired accuracy with the fastest run time compared with recently developed state-of-the-art measures. A comprehensive evaluation is presented using the term frequency-inverse document frequency (TF-IDF) schema, the K-nearest neighbor (KNN) classifier, and the K-means clustering algorithm. BLAB-SM is evaluated experimentally against seven similarity measures on two popular datasets, Reuters-21 and Web-KB. The experimental results show that BLAB-SM is not only more efficient but also significantly more effective than state-of-the-art similarity measures on both classification and clustering tasks.
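The evaluation setup named in the abstract (TF-IDF weighting feeding a KNN classifier that ranks neighbors by a similarity measure) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the BLAB-SM formula, so plain cosine similarity is used as a stand-in, and the function names (`fit_tfidf`, `vectorize`, `cosine`, `knn_predict`), the smoothed IDF variant, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def vectorize(tokens, idf):
    """Map a token list to a sparse TF-IDF vector (dict: term -> weight).

    Terms that were not seen when the IDF table was built are ignored.
    """
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def fit_tfidf(docs):
    """Build an IDF table from tokenized documents and vectorize them.

    Assumption: IDF is log(N / df) + 1, one common smoothed variant.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    return [vectorize(doc, idf) for doc in docs], idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors (stand-in for BLAB-SM)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(query, train_vecs, train_labels, k=3, sim=cosine):
    """Label a query by majority vote among its k most similar documents."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: sim(query, train_vecs[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy corpus (hypothetical; not drawn from Reuters or Web-KB).
train_docs = [
    ["oil", "price", "crude", "barrel"],
    ["crude", "oil", "export", "opec"],
    ["student", "course", "homework", "faculty"],
    ["faculty", "course", "professor", "student"],
]
train_labels = ["energy", "energy", "academic", "academic"]

train_vecs, idf = fit_tfidf(train_docs)
query_vec = vectorize(["opec", "oil", "price"], idf)
predicted = knn_predict(query_vec, train_vecs, train_labels, k=3)
print(predicted)  # energy
```

Because `knn_predict` takes the similarity function as the `sim` parameter, a different measure could be swapped in without changing the rest of the pipeline, which is the kind of comparison the abstract describes.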

Identifiers

pubmed: 34401474
doi: 10.7717/peerj-cs.641
pii: cs-641
pmc: PMC8330432

Publication types

Journal Article

Languages

eng

Pagination

e641

Copyright information

© 2021 Abdalla and Amer.

Conflict of interest statement

The authors declare that they have no competing interests.

References

PeerJ Comput Sci. 2019 May 13;5:e194
pubmed: 33816847
PeerJ Comput Sci. 2019 Dec 9;5:e242
pubmed: 33816895

Authors

Hassan I Abdalla (HI)

College of Technological Innovation, Zayed University, Abu Dhabi, Abu Dhabi, United Arab Emirates.

Ali A Amer (AA)

Computer Science Department, Taiz University, Taiz, Yemen.

MeSH classifications