JOURNAL OF APPLIED INFORMATICS AND COMPUTING
Vol. 7 No. 2 (2023): December 2023

Clustering Balinese Language Documents using the Balinese Stemmer Method and Mini Batch K-Means with K-Means++

Subali, Made Agus Putra
Sugiartha, I Gusti Rai Agung
Budiarta, Komang
Adnyana, I Made Budi



Article Info

Publish Date
05 Dec 2023

Abstract

Clustering aims to partition data into n groups such that items within a group are as similar as possible while similarity between groups is minimized. Among clustering methods, k-means is widely used because it is simple and can yield good clustering results. However, k-means is slow on high-dimensional datasets, and its results are sensitive to the initial choice of cluster centers. To address these limitations, this study uses mini-batch k-means to speed up the processing of high-dimensional data and k-means++ to improve the selection of initial cluster centers. The dataset comprises 300 Balinese-language news articles sourced from the https://balitv.tv/ website. Before clustering, the documents are stemmed with the Balinese stemmer method to improve recall. The clustering results show that most of the 300 documents are highly similar to one another. When the number of clusters (n) exceeds two, the documents are not distinctly separated because of their high structural similarity. This can be attributed to the relatively small number of words (attributes) produced. Future research will apply feature reduction and explore a clustering method capable of handling overlapping data.
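
As an illustration of the two techniques named in the abstract, the following is a minimal sketch of k-means++ seeding followed by mini-batch k-means updates, written in plain Python. It is not the authors' implementation: the function names (kmeans_pp_init, mini_batch_kmeans), the batch size, the iteration count, and the toy two-dimensional points are all assumptions made for this example. In the study, each point would instead be a vector built from the stemmed words of one article, and the abstract does not describe how those vectors were constructed.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick centers that tend to be far apart."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # All points coincide with a center; fall back to a uniform pick.
            centers.append(list(rng.choice(points)))
        else:
            # Sample the next center with probability proportional to d2.
            centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


def mini_batch_kmeans(points, k, batch_size=4, n_iter=100, seed=0):
    """Mini-batch k-means with per-center learning rates (hypothetical defaults)."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    counts = [0] * k  # number of points each center has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then update the centers.
        assigned = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # step size shrinks as the center matures
            centers[j] = [(1 - eta) * c + eta * x
                          for c, x in zip(centers[j], p)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Toy data: two well-separated groups standing in for document vectors.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (100, 100), (100, 101), (101, 100), (101, 101)]
toy_centers, toy_labels = mini_batch_kmeans(toy_points, k=2)
print(toy_labels)
```

Each iteration touches only a small random batch rather than the full dataset, which is where the speed-up over standard k-means comes from. The seeding step reduces, but does not eliminate, the sensitivity to initial centers.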

Copyrights © 2023






Journal Info

Abbrev

JAIC

Publisher

Subject

Computer Science & IT

Description

Journal of Applied Informatics and Computing (JAIC) Volume 2, Number 1, July 2018. Contains articles drawn from research results in the field of Applied Informatics and Computer Technology, with e-ISSN: 2548-9828. There are 3 articles that have been substantially reviewed by the editorial team and ...