The Impact of Text Preprocessing in Sarcasm Detection on Indonesian Social Media Contents
Jeremy, Nicholaus Hendrik
Engineering, MAthematics and Computer Science Journal (EMACS) Vol. 7 No. 2 (2025): EMACS
Publisher : Bina Nusantara University

DOI: 10.21512/emacsjournal.v7i2.13503

Abstract

Sarcasm conveys a message by stating its opposite. It is common on social media, where examples are plentiful. In natural language processing, sarcasm detection is difficult primarily because of the lack of context, and the colloquial style of social media communication adds another layer of difficulty. One sarcasm benchmark, IdSarcasm, has alleviated one key issue in the development of sarcasm detection. However, there has been no attempt to further preprocess the input before feeding it into the model. Pre-trained language models are typically trained on preprocessed corpora to ensure that the model is built on a quality dataset. Given the current condition of IdSarcasm, further preprocessing is necessary to ensure better quality. Specifically, the additional steps needed are handling HTML code, code-mixing, and colloquial writing, which consists of shortened forms, extended forms, spelling variations, and reduplication. Several scenarios are created to observe the effect of the additional preprocessing steps, and each step is also tested independently to observe its individual effect. We show that preprocessing remains important for data sourced from social media, and we recommend using IndoNLU’s IndoBERT or a large multilingual model for sarcasm classification.
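
To make the preprocessing steps named in the abstract concrete, the following is a minimal Python sketch of how HTML handling and some colloquial-writing normalization might look. It is not the authors' pipeline: the `SLANG` lexicon, the regular expressions, and the `preprocess` function are all hypothetical and chosen only for illustration, and code-mixing is not handled here because it would need a language-identification or translation resource.

```python
import html
import re

# Hypothetical mini-lexicon for shortened forms and spelling variations.
# A real system would need a much larger, curated resource.
SLANG = {"yg": "yang", "gk": "tidak", "tdk": "tidak", "bgt": "banget"}

TAG_RE = re.compile(r"<[^>]+>")           # leftover HTML tags
ELONG_RE = re.compile(r"(.)\1{2,}")       # a character repeated 3+ times
REDUP_RE = re.compile(r"\b([a-z]+)2\b")   # "jalan2"-style reduplication


def preprocess(text: str) -> str:
    """Apply a few illustrative normalization steps to one post."""
    text = html.unescape(text)            # HTML entities, e.g. "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)          # drop tags such as "<br>"
    text = text.lower()
    # Extended forms: collapse long character runs to a single character.
    # Assumption: this is crude and can damage words with genuine doubles.
    text = ELONG_RE.sub(r"\1", text)
    # Reduplication written with "2": "jalan2" -> "jalan-jalan".
    text = REDUP_RE.sub(r"\1-\1", text)
    # Shortened forms and spelling variations via dictionary lookup.
    tokens = [SLANG.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("Bagus bgt &amp; keren<br>yg ini, jalan2 teruuus"))
```

Each step is a separate line in `preprocess`, so a single step can be disabled to observe its effect in isolation, in the spirit of the per-step scenarios the abstract describes.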