JOIN (Jurnal Online Informatika)
Vol 10 No 1 (2025)

Study of the Application of Text Augmentation with Paraphrasing to Overcome Imbalanced Data in Indonesian Text Classification

Sari, Mutiara Indryan
Suadaa, Lya Hulliyyatus



Article Info

Publish Date
01 Apr 2025

Abstract

Data imbalance in text classification often leads to poor recognition of minority classes, because classifiers tend to favor majority categories. This study addresses data imbalance in Indonesian text classification by proposing a novel text augmentation approach based on paraphrasing with fine-tuned pre-trained models: IndoGPT2, IndoBART-v2, and mBART50. Unlike back-translation, which struggles with informal text, augmentation with pre-trained models significantly improves the F1 score of minority labels. Fine-tuned mBART50 outperforms back-translation and the other models by balancing semantic preservation and lexical diversity. However, the approach has limitations: a risk of overfitting because synthetic text lacks natural variation, restricted generalizability due to reliance on datasets such as ParaCotta, and the high computational cost of fine-tuning large models like mBART50. Future research should explore hybrid methods that integrate synthetic and real-world data to improve text quality and diversity, and should develop smaller, more efficient models to reduce computational demands. The findings underscore the potential of pre-trained models for text augmentation while emphasizing that dataset characteristics, language style, and augmentation volume must be considered to achieve optimal results.
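As an illustration of the general idea described in the abstract, the following is a minimal sketch of paraphrase-based oversampling together with a per-label F1 score. It is not the authors' pipeline. The names `augment_minority` and `f1_for_label` are hypothetical, the `paraphrase` argument stands in for a fine-tuned generator such as mBART50, and topping up every class to the size of the largest class is an assumption made here for simplicity (the abstract notes that augmentation volume should be chosen with care).

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def augment_minority(
    samples: List[Tuple[str, str]],
    paraphrase: Callable[[str], str],
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Append paraphrased copies of minority-class texts until every
    class has as many samples as the largest class (an assumption)."""
    rng = random.Random(seed)
    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    augmented = list(samples)  # original samples are kept unchanged
    for label, count in counts.items():
        pool = [text for text, lab in samples if lab == label]
        for _ in range(target - count):
            # `paraphrase` is a placeholder for a fine-tuned model call
            augmented.append((paraphrase(rng.choice(pool)), label))
    return augmented


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 score for a single label, e.g. the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In practice the augmented set would be used only for training, and the per-label F1 would be computed on an untouched test split so that synthetic text does not inflate the reported score.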

Copyrights © 2025






Journal Info

Abbrev

join

Subject

Computer Science & IT

Description

JOIN (Jurnal Online Informatika) is a scientific journal published by the Department of Informatics UIN Sunan Gunung Djati Bandung. This journal contains scientific papers from Academics, Researchers, and Practitioners about research on informatics. JOIN (Jurnal Online Informatika) is published ...