Sitopu, Joni Wilson
Faculty of Engineering, Universitas Simalungun, Pematangsiantar


Evaluating Semantic Geometry of Indonesian News Texts: Agglomerative Clustering Study using IndoBERT Embeddings
Sitopu, Joni Wilson
ZERO: Jurnal Sains, Matematika dan Terapan Vol 9, No 3 (2025): Zero: Jurnal Sains Matematika dan Terapan
Publisher : UIN Sumatera Utara

DOI: 10.30829/zero.v9i3.26549

Abstract

This study evaluates how well several Agglomerative Clustering configurations reveal the Semantic Geometry of a large corpus of Indonesian news texts represented with IndoBERT embeddings. The IndoBERT transformer model addresses a limitation of traditional methods such as TF-IDF, which struggle to capture semantic equivalence across lexical variations. However, the study finds that the dense (homogeneous) nature of the embeddings calls for a careful clustering methodology. Using Cosine Similarity produced a highly uneven cluster distribution, with one cluster containing over 99% of the documents; because the vectors point in very similar directions, this measure could not distinguish thematic nuances. In contrast, combining Euclidean Distance with UMAP (Uniform Manifold Approximation and Projection) dimensionality reduction proved optimal among the configurations evaluated. UMAP, a non-linear technique, separated the finer structure of the data and yielded the most balanced cluster sizes (ranging from 4254 to 8204 documents) along with thematically representative clusters. Thematic profiling of the UMAP-Euclidean clusters identified five distinct main themes: Politics, Health & Technology, Macroeconomics & Finance, Economy & Industry, and Education & Social Issues. The study concludes that non-linear dimensionality reduction (UMAP) is a crucial step for clarifying the Semantic Geometry and achieving granular, meaningful clustering of IndoBERT embeddings.
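
The abstract attributes the failure of Cosine Similarity to high directional similarity among the embedding vectors. The sketch below illustrates that effect on toy data only. It does not use IndoBERT, UMAP, or the study's corpus, and it does not reproduce the paper's clustering. The names `make_toy_embeddings` and `dominant_share`, the shared offset of 5.0, and the example label lists are all hypothetical, chosen purely for illustration. The first part shows that vectors sharing a large common component have pairwise cosine similarities packed close to 1, which leaves a cosine-based clustering little contrast to work with. The second part shows one simple way to quantify the imbalance the abstract describes, as the share of documents in the largest cluster.

```python
import math
import random
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def make_toy_embeddings(n_docs=60, dim=32, offset=5.0, seed=0):
    """Hypothetical stand-in for dense embeddings (assumption, not real data):
    small random variation around a large offset shared by every vector."""
    rng = random.Random(seed)
    return [
        [offset + rng.gauss(0.0, 0.1) for _ in range(dim)]
        for _ in range(n_docs)
    ]


def dominant_share(labels):
    """Fraction of documents assigned to the largest cluster."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Part 1: pairwise cosine similarities of vectors that point almost the same way.
embeddings = make_toy_embeddings()
sims = [
    cosine_similarity(embeddings[i], embeddings[j])
    for i in range(len(embeddings))
    for j in range(i + 1, len(embeddings))
]
min_sim = min(sims)
print(f"lowest pairwise cosine similarity: {min_sim:.4f}")

# Part 2: imbalance measure on two hypothetical label assignments.
skewed = [0] * 995 + [1, 2, 3, 4, 4]      # one cluster holds almost everything
balanced = [k % 5 for k in range(1000)]   # five equally sized clusters
print(f"dominant share (skewed):   {dominant_share(skewed):.3f}")
print(f"dominant share (balanced): {dominant_share(balanced):.3f}")
```

A dominant share near 1 corresponds to the situation reported for Cosine Similarity, while a value near one divided by the number of clusters corresponds to a balanced partition such as the one reported for the UMAP-Euclidean configuration.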