Jurnal Teknik Informatika (JUTIF)
Vol. 6 No. 6 (2025): JUTIF Volume 6, Number 6, December 2025

Complex Word Identification in Indonesian Children’s Texts: An IndoBERT Baseline and Error Analysis

Lisnawita, Lisnawita
Bakar, Juhaida Abu
Rasli, Ruziana Mohamad
Costaner, Loneli
Guntoro, Guntoro



Article Info

Publish Date
05 Jan 2026

Abstract

Complex Word Identification (CWI) is a crucial step in building text simplification systems, especially for Indonesian children’s reading materials, where unfamiliar vocabulary can hinder comprehension. This study formulates token-level CWI for Indonesian children’s texts and establishes two baselines: an interpretable rule-based model using linguistic features (e.g., word length, syllable heuristics, and affix patterns) and an IndoBERT model fine-tuned for token classification. The study constructs and annotates a children’s text corpus and evaluates both approaches using standard classification metrics. On the test set (22,584 tokens), IndoBERT achieves an F1-score of 0.9972 for the CWI class, substantially outperforming the rule-based baseline (F1 = 0.8607). The IndoBERT system makes only 39 errors (23 false positives and 16 false negatives), indicating near-perfect performance under the evaluated setting. The study also provides an error analysis that highlights the remaining failure patterns and the borderline cases that are difficult even for contextual models. The resulting benchmark and findings contribute to Informatics/Computer Science by providing a strong baseline and analysis for educational NLP in a low-resource language setting, supporting the development of Indonesian child-oriented NLP resources and downstream text simplification tools.
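
To make the rule-based baseline and its evaluation more concrete, the following is a minimal Python sketch of a token-level classifier built on the three feature families named in the abstract (length, a syllable heuristic, and affix patterns), together with an F1 computation for the complex class. The thresholds, the affix lists, the vote-based decision rule, and all function names are illustrative assumptions; the abstract does not specify the actual rules used in the study.

```python
# Illustrative sketch only: thresholds, affix lists, and the voting rule are
# assumptions, not the rules used in the study.

MIN_LENGTH = 9        # hypothetical length threshold (characters)
MIN_SYLLABLES = 4     # hypothetical syllable threshold
PREFIXES = ("meng", "meny", "mem", "men", "peng", "peny", "pem", "pen",
            "ber", "ter", "per", "ke", "se", "di")
SUFFIXES = ("kan", "an", "nya", "i")
VOWELS = set("aiueo")


def count_vowel_syllables(word):
    """Approximate the syllable count as the number of vowel letters."""
    return sum(ch in VOWELS for ch in word.lower())


def has_affix_pattern(word):
    """Return True if the word carries both a listed prefix and a listed suffix."""
    w = word.lower()
    # Skip very short words so that tokens like "di" do not match by accident.
    return len(w) >= 6 and w.startswith(PREFIXES) and w.endswith(SUFFIXES)


def is_complex(word):
    """Flag a token as complex when at least two of the three rules fire."""
    votes = [
        len(word) >= MIN_LENGTH,
        count_vowel_syllables(word) >= MIN_SYLLABLES,
        has_affix_pattern(word),
    ]
    return sum(votes) >= 2


def f1_score_complex(gold, pred):
    """F1 for the positive (complex) class, given parallel 0/1 label lists."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


if __name__ == "__main__":
    tokens = "Ibu membaca buku di perpustakaan".split()
    labels = [int(is_complex(t)) for t in tokens]
    print(list(zip(tokens, labels)))  # only "perpustakaan" is flagged
```

A rule set of this kind is easy to inspect, because each decision can be traced to the individual rules that fired, but it cannot use sentence context. That limitation is one plausible reason for the gap between the rule-based and IndoBERT scores reported above.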

Copyright © 2025






Journal Info

Abbrev

jurnal

Publisher

Subject

Computer Science & IT

Description

Jurnal Teknik Informatika (JUTIF) is an Indonesian national journal that publishes high-quality research papers in the broad fields of Informatics, Information Systems, and Computer Science, which encompass software engineering, information system development, computer systems, computer network, ...