Kamila, Sabrina Adnin
Unknown Affiliation

Published: 2 Documents

Articles

Exploring a Large Language Model on the ChatGPT Platform for Indonesian Text Preprocessing Tasks
Suhaeni, Cici; Kamila, Sabrina Adnin; Fahira, Fani; Yusran, Muhammad; Alfa Dito, Gerry
Indonesian Journal of Statistics and Applications Vol 9 No 1 (2025)
Publisher: Statistics and Data Science Study Program, IPB University, in collaboration with the Forum Pendidikan Tinggi Statistika Indonesia (FORSTAT) and the Ikatan Statistisi Indonesia (ISI)

DOI: 10.29244/ijsa.v9i1p100-116

Abstract

Preprocessing is a crucial step in Natural Language Processing, especially for informal language varieties like colloquial Indonesian, which contain complex morphology, slang, abbreviations, and non-standard expressions. Traditional rule-based tools such as regex, IndoNLP, and Sastrawi are commonly used but often fall short in handling noisy, user-generated text. This study explores the capability of a Large Language Model, ChatGPT-o3, in performing Indonesian text preprocessing tasks, namely text cleaning, normalization, stopword removal, and stemming/lemmatization, and compares it to conventional rule-based approaches. Both preprocessing methods were applied and evaluated on two datasets: a small example dataset of five manually constructed sentences and a real-world dataset of 100 tweets about the Indonesian “Makan Bergizi Gratis” program. Results show that ChatGPT-o3 performs equally well in text cleaning and significantly better in normalization. However, rule-based methods like IndoNLP and Sastrawi still outperform ChatGPT-o3 in stopword removal and stemming. These findings indicate that while ChatGPT-o3 demonstrates strong contextual understanding and linguistic flexibility, it may underperform in rigid, token-based operations without fine-tuning. This study provides initial insights into using Large Language Models as an alternative preprocessing engine for Indonesian text and highlights the need for hybrid approaches or improved prompt design in future applications.
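The rule-based pipeline the abstract compares against (regex cleaning, slang normalization, stopword removal) can be sketched in pure Python. This is an illustrative sketch only: the slang dictionary and stopword set below are tiny stand-ins for the full lexicons that tools like IndoNLP and Sastrawi ship, and the function names are hypothetical, not from the paper.

```python
import re

# Tiny illustrative slang-to-standard map; real pipelines use much larger lexicons.
SLANG = {"gak": "tidak", "yg": "yang", "bgt": "banget"}
# Small stopword sample; Sastrawi/IndoNLP provide full Indonesian stopword lists.
STOPWORDS = {"yang", "dan", "di", "ke", "itu", "tidak"}

def clean(text):
    """Lowercase, then strip URLs, mentions, hashtags, and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+|#\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def normalize(tokens):
    """Replace slang tokens with their standard forms where known."""
    return [SLANG.get(t, t) for t in tokens]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def preprocess(text):
    return remove_stopwords(normalize(clean(text).split()))

# Example on a tweet-like input:
# preprocess("Program MBG gak jelas bgt! @user https://t.co/x")
# -> ["program", "mbg", "jelas", "banget"]
```

Stemming/lemmatization, the step where the abstract reports rule-based tools still winning, would follow as a final stage, e.g. via Sastrawi's stemmer.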
Pemodelan Topik pada Komentar YouTube Arra: Komparasi LDA dan K-Means Menggunakan Fitur Leksikal dan Semantik (Topic Modeling of Arra's YouTube Comments: Comparing LDA and K-Means Using Lexical and Semantic Features)
Nuradilla, Siti; Kamila, Sabrina Adnin; Zahra, Latifah; Suhaeni, Cici; Sartono, Bagus
Jurnal Informatika: Jurnal Pengembangan IT Vol 10, No 3 (2025)
Publisher : Politeknik Harapan Bersama

DOI: 10.30591/jpit.v10i3.8763

Abstract

YouTube has become a platform for sharing content, including positive material and stereotypes that often trigger debates. One noteworthy phenomenon is the video of Arra, a toddler known for her remarkable communication skills. This uniqueness has drawn significant attention and sparked debates about the mismatch between her age and cognitive development. The diverse comments on Arra’s videos reflect sharply differing perspectives among netizens, making manual analysis highly challenging. It is therefore important to examine the topics discussed by netizens to understand the dominant issues emerging in these discussions. Through this approach, the public can gain insights, and parents may receive valuable input regarding child-rearing practices. The main objective of this study is to explore the effectiveness of two topic-modeling methods, each combined with different text representations, in identifying key topics within comments by comparing the coherence performance of the models. This research applies topic modeling to analyze comments using two primary approaches: Latent Dirichlet Allocation (LDA) and K-Means clustering. The study involves data collection through comment crawling, followed by text preprocessing and text representation using TF-IDF and GloVe embeddings. LDA and K-Means are then used to identify the dominant topics appearing in the comments. The results show that LDA with TF-IDF achieved the highest coherence score of 0.662, although the resulting topics were still difficult to interpret due to overlap. Meanwhile, K-Means with GloVe 100D yielded a slightly lower coherence score of 0.6538 but produced more interpretable topics. Therefore, K-Means with GloVe 100D is considered a more balanced approach in terms of both coherence and topic readability.
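The TF-IDF representation step that feeds LDA and K-Means in the pipeline above can be sketched with the standard library alone. This is a minimal illustration using plain tf × log(N/df) weighting on already-preprocessed token lists; the study's actual vectorizer settings (smoothing, normalization, vocabulary cutoffs) are not specified in the abstract, and in practice a library such as scikit-learn's TfidfVectorizer would be used.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute per-document TF-IDF weights.

    docs: list of token lists (e.g. preprocessed YouTube comments).
    Returns one dict per document mapping token -> tf * log(N / df).
    """
    n = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

# A token appearing in every document gets weight 0 (log(N/N) = 0),
# so corpus-wide filler words contribute nothing to the clustering.
```

The resulting sparse vectors would then be passed to LDA (e.g. gensim) or, after replacing TF-IDF with averaged GloVe 100D embeddings, to K-Means, with topic quality compared via a coherence measure.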