Jurnal INFOTEL
Vol 16 No 4 (2024): November 2024

A Random Oversampling and BERT-based Model Approach for Handling Imbalanced Data in Essay Answer Correction

Sani, Dian Ahkam



Article Info

Publish Date
20 Dec 2024

Abstract

Automated essay scoring has long been hampered by imbalanced datasets, in which the distribution of scores or labels is skewed towards certain categories. This imbalance can degrade the performance of machine learning models, which tend to be biased towards the majority class. One potential solution is oversampling, which balances the dataset by increasing the representation of the minority class. In this paper, we propose a novel approach that combines random oversampling with a BERT-base uncased model for essay answer correction. The research explores various text pre-processing scenarios to optimize model accuracy. Using a dataset of essay answers written in Indonesian by eighth-grade middle school students, our approach performs well in terms of precision, recall, F1-score, and accuracy compared with traditional methods such as Backpropagation Neural Network, Naïve Bayes, and Random Forest Classifier using FastText word embeddings pretrained on Wikipedia with a vector size of 300. The best performance was obtained using the BERT-base uncased model with a learning rate of 2e-5 and a simplified pre-processing approach: by retaining punctuation, numbers, and stop words, the model achieved a precision of 0.9463, a recall of 0.9377, an F1-score of 0.9346, and an accuracy of 94%.
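The random oversampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_oversample`, the toy answers, and the score labels are hypothetical, and the sketch assumes the simplest variant, in which minority-class samples are duplicated at random until every class matches the size of the largest class.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)

    # Group the samples by their label.
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)

    # Shuffle so that duplicated samples are not grouped by class.
    order = list(range(len(out_texts)))
    rng.shuffle(order)
    return [out_texts[i] for i in order], [out_labels[i] for i in order]


# Toy data (made up): score 2 is the majority class.
texts = ["a1", "a2", "a3", "a4", "b1", "c1", "c2"]
labels = [2, 2, 2, 2, 0, 1, 1]

bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))  # each class now has four samples
```

In practice, oversampling of this kind is usually applied to the training split only, so that duplicated answers do not leak into the evaluation data; whether the paper follows that protocol is not stated in the abstract.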

Copyrights © 2024






Journal Info

Abbrev

infotel

Publisher

Subject

Computer Science & IT; Electrical & Electronics Engineering

Description

Jurnal INFOTEL is a scientific journal published by Lembaga Penelitian dan Pengabdian Masyarakat (LPPM) of Institut Teknologi Telkom Purwokerto, Indonesia. Jurnal INFOTEL covers the field of informatics, telecommunication, and electronics. First published in 2009 for a printed version and published ...