A Novel Model on Reinforce K-Means Using Location Division Model and Outlier of Initial Value for Lowering Data Cost.

Keywords: K-means; data driven; data science; density data; hybrid; initial seed; location division; outliers; python data analysis

Journal

Entropy (Basel, Switzerland)

ISSN: 1099-4300

Abbreviated title: Entropy (Basel)

Country: Switzerland

NLM ID: 101243874

Publication information

Publication date:
17 Aug 2020

History:

received: 18 Jun 2020

revised: 27 Jul 2020

accepted: 11 Aug 2020

entrez: 8 Dec 2020

pubmed: 9 Dec 2020

medline: 9 Dec 2020

Status: epublish

Abstract

Today, semi-structured and unstructured data make up most of the data collected and analyzed for data analysis across various systems. Such data are densely distributed in space and usually contain outliers and noise. Clustering algorithms for classifying such data have been studied continuously, and the K-means algorithm is one of the most investigated. Researchers have pointed out several problems with it: the analyst chooses the number of clusters, K, arbitrarily; the connection of nodes in dense data produces biased classification results; and implementation costs rise and accuracy falls depending on how the initial centroids are selected. Most K-means researchers have also pointed out that outliers end up in external or other clusters instead of the appropriate ones when K is too large or too small. The present study therefore analyzed the problems with initial-centroid selection in the existing K-means algorithm and investigated a new K-means algorithm for selecting initial centroids. It proposed a method that cuts clustering calculation costs by choosing initial center points based on space division and outliers, so that no object is subordinate to the initial cluster center and dependence on that center is reduced. Since data containing outliers can lead to inappropriate results when the outliers are reflected in the choice of a cluster's center point, the study proposed an algorithm that minimizes the error rate of outliers, based on an improved algorithm for space division and distance measurement. The performance experiments show that the proposed algorithm lowered execution costs by about 13-14% compared with previous studies as the volume of clustering data or the number of clusters increased. It also recorded a lower frequency of outliers, a lower effectiveness index (which assesses performance deterioration caused by outliers), and a reduction of outliers by about 60%.
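
The abstract does not spell out the procedure, so the following is only an illustrative sketch of the general idea (seeding K-means from a division of space while keeping outliers out of the seed calculation), not the authors' algorithm. The function names `seed_centroids` and `kmeans`, the regular grid, the `bins` and `z_thresh` parameters, and the distance-from-mean outlier rule are all assumptions made for this example.

```python
import numpy as np

def seed_centroids(X, k, bins=4, z_thresh=3.0):
    """Pick k initial centroids from the densest grid cells, ignoring outliers."""
    # Assumed outlier rule: a point unusually far from the global mean.
    dist = np.linalg.norm(X - X.mean(axis=0), axis=1)
    inliers = X[dist <= dist.mean() + z_thresh * dist.std()]

    # Space division: map every inlier to a cell of a regular bins^d grid.
    lo, hi = inliers.min(axis=0), inliers.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    cells = np.minimum(((inliers - lo) / span * bins).astype(int), bins - 1)

    # Rank occupied cells by population and keep the k most populated.
    uniq, inverse, counts = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()
    if len(uniq) < k:
        raise ValueError("fewer occupied cells than clusters; increase bins")
    top = np.argsort(counts)[::-1][:k]

    # Each seed is the mean of the inliers that fall in one dense cell.
    return np.array([inliers[inverse == c].mean(axis=0) for c in top])

def kmeans(X, centroids, n_iter=100):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```

A call such as `kmeans(X, seed_centroids(X, k=2))` runs the standard iterations from the grid-based seeds. Because the seeds come from densely populated cells of inliers, a single far-away point cannot become an initial center, which is the effect the abstract attributes to handling outliers at initialization.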

Identifiers

DOI: 10.3390/e22080902 PMID: 33286671 PMC: PMC7517527

pubmed: 33286671

pii: e22080902

doi: 10.3390/e22080902

pmc: PMC7517527

Publication types

Journal Article

Languages

eng

Grants

Agency: National Research Foundation of Korea (NRF)

ID: NRF-2019R1G1A1002205

References

Neural Comput. 2017 Nov;29(11):3094-3117

pubmed: 28957026

Psychon Bull Rev. 2002 Dec;9(4):829-35

pubmed: 12613690

IEEE Trans Pattern Anal Mach Intell. 2007 Mar;29(3):394-410

pubmed: 17224611

IEEE Trans Pattern Anal Mach Intell. 2005 May;27(5):657-68

pubmed: 15875789

IEEE Trans Cybern. 2014 Dec;44(12):2405-17

pubmed: 25415946

IEEE Trans Pattern Anal Mach Intell. 2020 May;42(5):1191-1204

pubmed: 30640600

Authors

Se-Hoon Jung (SH)

Hansung Lee (H)

Jun-Ho Huh (JH)