A multi-view representation technique based on principal component analysis for enhanced short text clustering.


Journal

PloS one
ISSN: 1932-6203
Titre abrégé: PLoS One
Pays: United States
ID NLM: 101285081

Informations de publication

Date de publication:
2024
Historique:
received: 10 05 2024
accepted: 08 08 2024
medline: 23 8 2024
pubmed: 23 8 2024
entrez: 23 8 2024
Statut: epublish

Résumé

Clustering texts together is an essential task in data mining and information retrieval, whose aim is to group unlabeled texts into meaningful clusters that facilitate extracting and understanding useful information from large volumes of textual data. However, clustering short texts (STC) is complex because they typically contain sparse, ambiguous, noisy, and lacking information. One of the challenges for STC is finding a proper representation for short text documents to generate cohesive clusters. However, typically, STC considers only a single-view representation to do clustering. The single-view representation is inefficient for representing text due to its inability to represent different aspects of the target text. In this paper, we propose the most suitable multi-view representation (MVR) (by finding the best combination of different single-view representations) to enhance STC. Our work will explore different types of MVR based on different sets of single-view representation combinations. The combination of the single-view representations is done by a fixed length concatenation via Principal Component analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performances of various sets of MVRs on STC. Based on experimental results, the best combination of single-view representation as an effective for STC was the 5-views MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). Based on that, we can conclude that MVR improves the performance of STC; however, the design for MVR requires selective single-view representations.

Identifiants

pubmed: 39178180
doi: 10.1371/journal.pone.0309206
pii: PONE-D-24-18936
doi:

Types de publication

Journal Article

Langues

eng

Sous-ensembles de citation

IM

Pagination

e0309206

Informations de copyright

Copyright: © 2024 Ahmed et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Déclaration de conflit d'intérêts

The authors have declared that no competing interests exist.

Auteurs

Majid Hameed Ahmed (MH)

CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia.
Ministry of Higher Education and Scientific Research, Baghdad, Iraq.

Sabrina Tiun (S)

CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia.

Nazlia Omar (N)

CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia.

Nor Samsiah Sani (NS)

CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia.

Articles similaires

[Redispensing of expensive oral anticancer medicines: a practical application].

Lisanne N van Merendonk, Kübra Akgöl, Bastiaan Nuijen
1.00
Humans Antineoplastic Agents Administration, Oral Drug Costs Counterfeit Drugs

Smoking Cessation and Incident Cardiovascular Disease.

Jun Hwan Cho, Seung Yong Shin, Hoseob Kim et al.
1.00
Humans Male Smoking Cessation Cardiovascular Diseases Female
Humans United States Aged Cross-Sectional Studies Medicare Part C
1.00
Humans Yoga Low Back Pain Female Male

Classifications MeSH