Astutik, Dian
Unknown Affiliation

Published: 1 Document
Articles

Comparing Data Preprocessing Strategy on T5 Architecture to Classify ICD-10 Diagnosis
Lanang Wijayakusuma, I Gusti Ngurah; Sudarma, Made; Darma Putra, I Ketut Gede; Sudana, Oka; Astutik, Dian
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol 9 No 5 (2025): October 2025
Publisher : Ikatan Ahli Informatika Indonesia (IAII)

DOI: 10.29207/resti.v9i5.6919

Abstract

Manual ICD-10 coding in healthcare systems remains time-consuming, error-prone, and inefficient, particularly in resource-constrained settings. This study investigates the effect of various preprocessing strategies on the performance of the Text-to-Text Transfer Transformer (T5) model for primary diagnosis classification using structured clinical data. A total of 7,263 clinical records were collected from two high-density regions in Bali (Badung and Gianyar) between January 2023 and March 2024, then converted into descriptive text prompts for model training. Four experimental scenarios combined variations of input features and label configurations, comparing T5 with Random Oversampling against T5 with Easy Data Augmentation (EDA) plus Oversampling. Results showed that T5 with Random Oversampling consistently outperformed the EDA-based configuration across all scenarios, with performance gaps ranging from 8% to 19%. Scenario 4, which excluded body system features and the semantically overlapping E860 label, achieved the best balance, reaching 84.7% accuracy, 85.1% precision, 84.7% recall, and 84.3% F1-score. Conversely, the EDA-based approach reduced training time by up to 72%, indicating a clear trade-off between performance and efficiency. Both configurations frequently misclassified semantically similar codes within the same ICD-10 categories, underscoring the difficulty of distinguishing clinically related diagnoses. Overall, the results suggest that careful selection of preprocessing strategies can enhance transformer-based medical text classification while balancing model performance and training efficiency. This work may serve as an initial reference for developing more efficient semi-automated medical coding systems in the Indonesian regional healthcare context.
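
The abstract names two preprocessing ideas without showing code: converting structured clinical records into descriptive text prompts for T5, and balancing the label distribution with random oversampling before training. As an illustration only, a minimal Python sketch of that flow using Hugging Face Transformers and imbalanced-learn is given below; the column names, prompt template, toy records, and t5-small checkpoint are assumptions for the sketch, not details taken from the paper.

```python
# Illustrative sketch only: column names, prompt wording, and the t5-small
# checkpoint are assumptions, not details from the paper.
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from transformers import T5TokenizerFast, T5ForConditionalGeneration

# Hypothetical structured clinical records (placeholder fields and codes).
records = pd.DataFrame({
    "complaint": ["vomiting and diarrhea", "fever with cough", "dehydration"],
    "age": [34, 7, 61],
    "sex": ["female", "male", "male"],
    "icd10": ["A09", "J06.9", "E86.0"],
})

# 1) Convert each structured record into a descriptive text prompt.
def to_prompt(row: pd.Series) -> str:
    return (f"classify diagnosis: patient aged {row['age']}, {row['sex']}, "
            f"chief complaint: {row['complaint']}")

records["prompt"] = records.apply(to_prompt, axis=1)

# 2) Balance the label distribution with random oversampling
#    (duplicates minority-class rows; no synthetic text is generated).
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(records[["prompt"]], records["icd10"])

# 3) Tokenize prompts and target ICD-10 codes for T5's text-to-text objective.
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer(list(X_res["prompt"]), padding=True, truncation=True,
                   max_length=128, return_tensors="pt")
targets = tokenizer(list(y_res), padding=True, truncation=True,
                    max_length=8, return_tensors="pt")

# Mask padding positions in the labels so they are ignored by the loss.
labels = targets.input_ids.clone()
labels[labels == tokenizer.pad_token_id] = -100

# One forward pass with labels yields the seq2seq training loss; a real run
# would wrap this in a training loop or the Trainer API over many epochs.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
print(f"toy training loss: {loss.item():.3f}")
```

A full experiment in the spirit of the study would replace the toy records with the clinical dataset, train for multiple epochs, and then evaluate accuracy, precision, recall, and F1-score on held-out data; the EDA variant would additionally apply text augmentation (e.g., synonym replacement) to the prompts before oversampling.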