Kamus Besar Bahasa Indonesia (KBBI) is a primary data source for research on word-meaning similarity in Indonesian. This study investigates how effectively word embedding methods and term frequency–inverse document frequency (TF-IDF) weighting assess the semantic similarity of synonym pairs. The objective is to measure the similarity of synonym pairs listed in KBBI by applying cosine similarity to representations built with TF-IDF weighting, several word embedding models, and latent semantic analysis (LSA). The methodology comprised data collection followed by text preprocessing: case folding, stopword removal, stemming, and tokenization. The processed data were then transformed into vector representations using TF-IDF and the word embedding models Word2Vec, fastText, GloVe, and Sentence-BERT (S-BERT, sentence-bidirectional encoder representations from transformers). LSA was applied to reduce the dimensionality of the vectors, cosine similarity was computed for each synonym pair, and the results were evaluated. The findings show that fastText significantly improved the similarity scores between synonym pairs, reaching an average similarity score of 0.901 across 30 synonym pairs. The evaluation yielded an accuracy of 0.88, a recall of 1.00, a precision of 0.81, and an F1 score of 0.90. These results suggest that fastText is more effective at improving the accuracy of synonym similarity measurements. Future research should expand the corpus and further explore word embeddings for semantic similarity tasks. This study contributes to the advancement of natural language processing and provides a potential foundation for more accurate language-based applications that assess word-meaning similarity in KBBI.
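As an illustration of the similarity and evaluation steps described above, the following is a minimal Python sketch and not the study's implementation. It builds TF-IDF vectors for tokenized texts, compares them with cosine similarity, and computes an F1 score from precision and recall. The function names, the smoothed IDF formula, and the toy tokens are assumptions made for this example; the preprocessing, embedding models, and LSA step are omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption of this sketch) keeps weights positive
        # even for terms that occur in every document.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical, already-preprocessed token lists (not taken from KBBI).
    docs = [
        ["big", "large", "size"],
        ["large", "big", "great"],
        ["small", "tiny", "size"],
    ]
    vecs = tfidf_vectors(docs)
    print("similar pair:   ", cosine_similarity(vecs[0], vecs[1]))
    print("dissimilar pair:", cosine_similarity(vecs[0], vecs[2]))
    # Consistency check of the reported metrics (precision 0.81, recall 1.00).
    print("F1:", round(f1_score(0.81, 1.00), 2))
```

With the reported precision of 0.81 and recall of 1.00, the harmonic mean is 2(0.81)(1.00)/(0.81 + 1.00) ≈ 0.895, which agrees with the reported F1 score of 0.90 after rounding.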
Copyrights © 2025