UUM Electronic Theses and Dissertation
UUM ETD | Universiti Utara Malaysian Electronic Theses and Dissertation

Optimized clustering with modified K-means algorithm

Alibuhtto, Mohamed Cassim (2021) Optimized clustering with modified K-means algorithm. Doctoral thesis, Universiti Utara Malaysia.


Abstract

Huge data is a big challenge. Clustering techniques can find hidden patterns in huge data and extract useful information from it. Among these techniques, the k-means algorithm is the most commonly used. However, choosing the number of clusters (k) is a prominent problem in applying k-means. In most cases, when clustering huge data, k is pre-determined by the researcher, and an incorrectly chosen k can lead to a wrong interpretation of the clusters and increase the computational cost. In addition, huge data often contain correlated variables, which lead to an incorrect clustering process. To obtain the optimum number of clusters while also dealing with correlated variables in huge data, a modified k-means algorithm was proposed. The proposed algorithm uses a distance measure to compute the separation between groups, which accelerates the process of identifying an optimal number of clusters using k-means. Two distance measures were considered, namely the Euclidean and Manhattan distances. To deal with correlated variables, principal component analysis (PCA) was embedded in the proposed algorithm. The developed algorithms were tested on uncorrelated and correlated simulated data sets generated under various conditions. In addition, some real data sets were examined to validate the proposed algorithm. Empirical evidence from the simulated data sets indicated that the proposed modified k-means algorithm is able to recognise the optimum number of clusters for uncorrelated data sets, while the PCA-based modified k-means identified the optimum number of clusters for correlated data sets. The results also revealed that the modified k-means algorithm performs better at identifying the optimum number of clusters with the Euclidean distance than with the Manhattan distance. Testing on real data sets showed results consistent with those from the simulated ones. Overall, the proposed modified k-means algorithm is able to determine the optimum number of clusters for huge data.
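
The abstract does not spell out the thesis's procedure, so the following is only a minimal illustrative sketch of the general idea: run k-means for several candidate values of k and track a distance-based measure of between-group separation. Everything specific here is an assumption made for illustration and is not taken from the thesis: the function names, the farthest-first initialisation, the mean-based centroid update under both metrics, and the use of the smallest pairwise centroid distance as the separation measure. The PCA step is not shown.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def farthest_first(points, k, dist):
    # Assumed deterministic initialisation: start from the first point, then
    # repeatedly add the point farthest from its nearest chosen centroid.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, dist, max_iter=100):
    # Lloyd-style iterations: assign with the chosen distance, update with the mean.
    centroids = farthest_first(points, k, dist)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if updated == centroids:  # assignments no longer change the centroids
            break
        centroids = updated
    return centroids


def min_separation(centroids, dist):
    # Hypothetical separation measure: smallest distance between two centroids.
    return min(
        dist(a, b) for i, a in enumerate(centroids) for b in centroids[i + 1:]
    )


def separation_by_k(points, ks, dist):
    return {k: min_separation(kmeans(points, k, dist), dist) for k in ks}


# Toy data (not from the thesis): three 3x3 grids of points around known centres.
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + dx, cy + dy)
    for cx, cy in centres
    for dx in (-1.0, 0.0, 1.0)
    for dy in (-1.0, 0.0, 1.0)
]

sep_euclidean = separation_by_k(points, range(2, 6), euclidean)
sep_manhattan = separation_by_k(points, range(2, 6), manhattan)
```

On this toy data the separation stays large up to k = 3 and drops sharply at k = 4, where one true group is split in two. A drop of this kind is one plausible signal for stopping the search over k; the stopping rule actually used in the thesis is not given in the abstract.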

Item Type: Thesis (Doctoral)
Supervisor: Mahat, Nor Idayu
Item ID: 9556
Uncontrolled Keywords: Huge data, Distance measure, k-means clustering, Principal component analysis, Validation index.
Subjects: Q Science > QA Mathematics
Divisions: Awang Had Salleh Graduate School of Arts & Sciences
Date Deposited: 27 Jun 2022 07:05
Last Modified: 27 Jun 2022 07:05
URI: https://etd.uum.edu.my/id/eprint/9556
