Rasli, Ruziana Mohamad
Unknown Affiliation


Complex Word Identification in Indonesian Children’s Texts: An IndoBERT Baseline and Error Analysis
Lisnawita, Lisnawita; Bakar, Juhaida Abu; Rasli, Ruziana Mohamad; Costaner, Loneli; Guntoro, Guntoro
Jurnal Teknik Informatika (Jutif) Vol. 6 No. 6 (2025): JUTIF Volume 6, Number 6, Desember 2025
Publisher : Informatika, Universitas Jenderal Soedirman

DOI: 10.52436/1.jutif.2025.6.6.5501

Abstract

Complex Word Identification (CWI) is a crucial step in building text simplification systems, especially for Indonesian children’s reading materials, where unfamiliar vocabulary can hinder comprehension. This study formulates token-level CWI for Indonesian children’s texts and establishes two baselines: an interpretable rule-based model using linguistic features (e.g., length, syllable heuristics, and affix patterns) and an IndoBERT model fine-tuned for token classification. The study constructs and annotates a children’s text corpus and evaluates both approaches using standard classification metrics. On the test set (22,584 tokens), IndoBERT achieves an F1-score of 0.9972 for the CWI class, substantially outperforming the rule-based baseline (F1 = 0.8607). The IndoBERT system makes only 39 errors (23 false positives and 16 false negatives), indicating near-perfect performance under the evaluated setting. The study also provides an error analysis that highlights the remaining failure patterns and borderline cases that are difficult even for contextual models. The resulting benchmark and findings contribute to Informatics/Computer Science by providing a strong baseline and analysis for educational NLP in a low-resource language setting, supporting the development of Indonesian child-oriented NLP resources and downstream text simplification tools.
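
To make the rule-based baseline and the F1 metric concrete, the following is a minimal Python sketch of a token-level CWI rule using the three feature types named in the abstract (length, a syllable heuristic, and affix patterns). The thresholds, the affix lists, the two-of-three voting rule, and all function names are assumptions chosen for illustration; the abstract does not specify the authors' actual rules, and this is not their implementation.

```python
import re

# Hypothetical thresholds and affix lists, for illustration only.
# The abstract does not state the rules used in the study.
MIN_LENGTH = 9
MIN_SYLLABLES = 4
PREFIXES = ("meng", "meny", "mem", "men", "peng", "peny", "pem", "pen",
            "ber", "ter", "per", "di", "ke", "se")
SUFFIXES = ("kan", "nya", "an", "i")

VOWEL_GROUP = re.compile(r"[aiueo]+")


def count_syllables(word):
    """Crude syllable estimate: the number of vowel groups in the word."""
    return len(VOWEL_GROUP.findall(word.lower()))


def has_affix_pattern(word):
    """True if the word carries both a listed prefix and a listed suffix."""
    w = word.lower()
    has_prefix = any(w.startswith(p) and len(w) > len(p) + 2 for p in PREFIXES)
    has_suffix = any(w.endswith(s) and len(w) > len(s) + 2 for s in SUFFIXES)
    return has_prefix and has_suffix


def is_complex(word):
    """Label a token complex when at least two of the three rules fire."""
    votes = 0
    if len(word) >= MIN_LENGTH:
        votes += 1
    if count_syllables(word) >= MIN_SYLLABLES:
        votes += 1
    if has_affix_pattern(word):
        votes += 1
    return votes >= 2


def f1_score(gold, pred):
    """F1 for the positive (complex) class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy sentence; the gold labels are invented for this example.
    tokens = ["Adik", "membaca", "buku", "tentang", "keanekaragaman", "hewan"]
    gold = [False, False, False, False, True, False]
    pred = [is_complex(t) for t in tokens]
    print(list(zip(tokens, pred)))
    print("F1 (complex class):", f1_score(gold, pred))
```

Because every decision traces back to a named rule, a baseline of this kind is easy to inspect, which is the interpretability the abstract attributes to the rule-based model. The same F1 formula applies to the reported IndoBERT result, where the 23 false positives and 16 false negatives enter the denominator.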