Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
Vol 9 No 5 (2025): October 2025

Comparing Data Preprocessing Strategy on T5 Architecture to Classify ICD-10 Diagnosis

Lanang Wijayakusuma, I Gusti Ngurah
Sudarma, Made
Darma Putra, I Ketut Gede
Sudana, Oka
Astutik, Dian



Article Info

Publish Date
24 Oct 2025

Abstract

Manual ICD-10 coding in healthcare systems remains time-consuming, error-prone, and inefficient, particularly in resource-constrained settings. This study investigates how different preprocessing strategies affect the performance of the Text-to-Text Transfer Transformer (T5) model in classifying primary diagnoses from structured clinical data. A total of 7,263 clinical records were collected from two high-density regions in Bali (Badung and Gianyar) between January 2023 and March 2024 and converted into descriptive text prompts for model training. Four experimental scenarios combined different input features and label configurations, and each compared T5 with Random Oversampling against T5 with Easy Data Augmentation (EDA) plus Oversampling. T5 with Random Oversampling consistently outperformed the EDA-based configuration across all scenarios, with performance gaps ranging from 8% to 19%. Scenario 4, which excluded body system features and the semantically overlapping E860 label, gave the best-balanced results: 84.7% accuracy, 85.1% precision, 84.7% recall, and an 84.3% F1-score. The EDA-based approach, however, reduced training time by up to 72%, indicating a clear trade-off between performance and efficiency. Both configurations frequently misclassified semantically similar codes within the same ICD-10 categories, underscoring the difficulty of distinguishing clinically related diagnoses. Overall, the results suggest that careful selection of preprocessing strategies can improve transformer-based medical text classification, provided that model performance is weighed against training efficiency. This work may serve as an initial reference for developing more efficient semi-automated medical coding systems in the Indonesian regional healthcare context.
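
The two preprocessing steps the abstract names, converting structured records into descriptive text prompts and random oversampling, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names (`age`, `sex`, `complaint`), the prompt prefix `classify diagnosis:`, the labels `CODE_A` and `CODE_B`, and both function names are assumptions made only for illustration.

```python
import random
from collections import Counter


def record_to_prompt(record):
    """Turn one structured record (a dict) into a descriptive text prompt.

    The "classify diagnosis:" prefix is a hypothetical task prefix,
    not one taken from the paper.
    """
    parts = [
        f"{field}: {value}"
        for field, value in record.items()
        if value not in (None, "")  # skip missing fields
    ]
    return "classify diagnosis: " + "; ".join(parts)


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class examples until every
    class has as many examples as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    target = max(len(group) for group in by_label.values())
    out_texts, out_labels = [], []
    for label, group in by_label.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for text in group + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical toy data: three records of one class, one of another.
records = [
    {"age": 54, "sex": "F", "complaint": "fever"},
    {"age": 61, "sex": "M", "complaint": "cough"},
    {"age": 47, "sex": "F", "complaint": "headache"},
    {"age": 35, "sex": "M", "complaint": "fatigue"},
]
labels = ["CODE_A", "CODE_A", "CODE_A", "CODE_B"]

prompts = [record_to_prompt(r) for r in records]
balanced_texts, balanced_labels = random_oversample(prompts, labels)
print(Counter(balanced_labels))
```

In a setup like this, oversampling would normally be applied to the training split only, so that duplicated records do not leak into validation or test data.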

Copyrights © 2025






Journal Info

Abbrev

RESTI

Publisher

Subject

Computer Science & IT Engineering

Description

Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) is intended as a medium for scholarly study, publishing research results, ideas, and critical-analytical studies on research in Systems Engineering, Informatics Engineering/Information Technology, Informatics Management, and Information Systems. As part of the spirit ...