Mathelinea, Devy
Unknown Affiliation


Comparative Evaluation of Data Clustering Accuracy through Integration of Dimensionality Reduction and Distance Metric
Hasugian, Paska Marto; Mathelinea, Devy; Simamora, Siska; Simangunsong, Pandi Barita Nauli
MATRIK : Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer Vol. 24 No. 3 (2025)
Publisher : Universitas Bumigora

DOI: 10.30812/matrik.v24i3.5057

Abstract

The primary issue in clustering analysis of multivariate data is low accuracy resulting from a mismatch between the distance metric used and the characteristics of the data. This study aims to comprehensively evaluate the effect of eight distance metrics in the K-Means algorithm integrated with the Principal Component Analysis (PCA) dimensionality reduction technique. The data were first transformed into two principal components using PCA, and K-Means was then applied with each distance metric. Performance was evaluated using five internal metrics: Silhouette Score, Davies-Bouldin Index, Sum of Squared Errors (SSE), Calinski-Harabasz Index, and Dunn Index. The results show that the Bray-Curtis distance provides the best performance, with a Silhouette Score of 0.4291 and an SSE of 30.3673. It is followed by Euclidean and Minkowski, which yield the highest Calinski-Harabasz Index value of 2239.85 and Dunn Index of 0.0108, respectively. In contrast, the Hamming distance yielded the lowest performance across all metrics, with a Silhouette Score of 0.0000 and an SSE of 1996.00. The ANOVA test revealed significant differences between the distance metrics, with p-values reported as <0.000 for all evaluation metrics, which was further supported by the Tukey HSD follow-up test results. These findings confirm the importance of selecting an appropriate distance metric in the clustering process to ensure the validity, efficiency, and interpretability of multivariate data analysis results.
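To make the idea of swapping the distance metric inside K-Means concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The function names, the toy data, the evenly spaced centroid initialisation, and the choice to update centroids as arithmetic means under every metric are all assumptions made for illustration. The PCA step is omitted: the toy points simply stand in for two-component PCA output. Only three of the eight metrics and only the SSE evaluation are shown.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def bray_curtis(a, b):
    # sum|x - y| / sum|x + y|; taken as 0 when the denominator is 0
    den = sum(abs(x + y) for x, y in zip(a, b))
    return sum(abs(x - y) for x, y in zip(a, b)) / den if den else 0.0

def kmeans(points, k, dist, iters=50):
    # Assumption: deterministic init from evenly spaced points, so the
    # sketch is reproducible (real runs usually use random or k-means++ init).
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid under the chosen metric.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step (assumption): arithmetic mean of the members;
        # an empty cluster keeps its previous centroid.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids

def sse(points, labels, centroids):
    # Sum of squared Euclidean distances to the assigned centroid.
    return sum(euclidean(p, centroids[l]) ** 2
               for p, l in zip(points, labels))

# Hypothetical toy data: two well-separated groups of non-negative 2-D points.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

for name, metric in [("euclidean", euclidean),
                     ("manhattan", manhattan),
                     ("bray_curtis", bray_curtis)]:
    labels, centroids = kmeans(points, 2, metric)
    print(name, labels, round(sse(points, labels, centroids), 4))
```

On data this cleanly separated, all three metrics produce the same partition. The differences reported in the abstract would only appear on real data where the metrics disagree about which centroid is nearest. Note also that mean-based centroid updates are only guaranteed to reduce the objective for squared Euclidean distance, so convergence under the other metrics is not guaranteed in general.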