Articles

Complex Word Identification in Indonesian Children’s Texts: An IndoBERT Baseline and Error Analysis
Lisnawita, Lisnawita; Bakar, Juhaida Abu; Rasli, Ruziana Mohamad; Costaner, Loneli; Guntoro, Guntoro
Jurnal Teknik Informatika (Jutif) Vol. 6 No. 6 (2025): JUTIF Volume 6, Number 6, Desember 2025
Publisher : Informatika, Universitas Jenderal Soedirman

DOI: 10.52436/1.jutif.2025.6.6.5501

Abstract

Complex Word Identification (CWI) is a crucial step in building text simplification systems, especially for Indonesian children’s reading materials, where unfamiliar vocabulary can hinder comprehension. This study formulates token-level CWI for Indonesian children’s texts and establishes two baselines: an interpretable rule-based model using linguistic features (e.g., length, syllable heuristics, and affix patterns) and an IndoBERT model fine-tuned for token classification. The study constructs and annotates a children’s text corpus and evaluates both approaches using standard classification metrics. On the test set (22,584 tokens), IndoBERT achieves an F1-score of 0.9972 for the CWI class, substantially outperforming the rule-based baseline (F1 = 0.8607). The IndoBERT system makes only 39 errors (23 false positives and 16 false negatives), indicating near-perfect performance under the evaluated setting. Furthermore, the study provides an error analysis that highlights remaining failure patterns and borderline cases that are difficult even for contextual models. The resulting benchmark and findings contribute to Informatics/Computer Science by providing a strong baseline and analysis for educational NLP in a low-resource language setting, supporting the development of Indonesian child-oriented NLP resources and downstream text simplification tools.
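To make the idea of a rule-based baseline concrete, the following is a minimal sketch of how word length, a syllable heuristic, and affix patterns might be combined into a token-level flag. It is not the study's model: the function names, the affix lists, the thresholds (`max_len`, `max_syllables`), and the "at least two signals" voting rule are all assumptions chosen for illustration.

```python
# Illustrative rule-based complex-word flagger. Every threshold, affix list,
# and the voting rule below is an assumption of this sketch, not a value
# taken from the study.
VOWELS = set("aiueo")
PREFIXES = ("meng", "mem", "men", "ber", "ter", "per", "pe", "di", "ke")
SUFFIXES = ("kan", "nya", "an", "i")


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel letters (diphthongs are ignored)."""
    return sum(1 for ch in word.lower() if ch in VOWELS)


def has_affix_pair(word: str) -> bool:
    """True if the word starts with a listed prefix and ends with a listed suffix."""
    w = word.lower()
    return w.startswith(PREFIXES) and w.endswith(SUFFIXES)


def is_complex(word: str, max_len: int = 8, max_syllables: int = 3) -> bool:
    """Flag a token as complex when at least two of the three signals fire."""
    w = word.lower()
    if not w.isalpha():  # punctuation and numbers are never flagged
        return False
    signals = [
        len(w) > max_len,
        count_syllables(w) > max_syllables,
        has_affix_pair(w),
    ]
    return sum(signals) >= 2


def tag_tokens(tokens):
    """Return (token, label) pairs, with label 1 for complex and 0 otherwise."""
    return [(t, int(is_complex(t))) for t in tokens]


print(tag_tokens(["Adik", "mempertanggungjawabkan", "buku", "."]))
# [('Adik', 0), ('mempertanggungjawabkan', 1), ('buku', 0), ('.', 0)]
```

A rule set like this stays easy to inspect, since each flagged token can be traced back to the signals that fired, but it cannot use sentence context the way a fine-tuned token classifier can.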
Optimizing Random Forest for IoT Cyberattack Detection using SMOTE: A Study on CIC-IoT2023 Dataset
Guntoro Guntoro; Lisnawita Lisnawita; Loneli Costaner
MATRIK : Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer Vol. 25 No. 1 (2025)
Publisher : Universitas Bumigora

DOI: 10.30812/matrik.v25i1.5382

Abstract

The growing number of Internet of Things devices has led to an increased risk of complex and diverse cyberattacks. However, a significant challenge in this domain is the imbalanced class distribution in most Internet of Things datasets, which causes classification algorithms to be biased towards the majority class and hinders effective threat detection. This study addresses this issue by combining the Random Forest algorithm with the Synthetic Minority Oversampling Technique (SMOTE). The research aims to develop an effective model for detecting cyberattacks in Internet of Things environments by resolving class imbalance in the CIC-IoT2023 dataset. The methodology involves several stages, including data preprocessing and applying SMOTE for data balancing. The balanced dataset was then used to train a Random Forest model, whose performance was evaluated using accuracy, precision, recall, F1-score, and Cohen's Kappa. The results demonstrate the model's effectiveness: it achieves an accuracy of 99.01%, an F1-score of 98.96%, and a Cohen's Kappa of 98.92%. This marks a notable improvement in performance, particularly in detecting minority classes, compared to the model trained without SMOTE, which struggled to identify several less common attack types. The outcomes suggest that combining Random Forest with SMOTE can significantly enhance the development of intrusion detection systems by improving detection accuracy for all 33 attack types and reducing the risks associated with undetected threats. In conclusion, this study advances Internet of Things cybersecurity by presenting an effective and efficient method for addressing data imbalance in attack detection. Future research should focus on evaluating the model's robustness on more complex datasets and enhancing its performance for real-time deployment on resource-constrained Internet of Things devices.
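The balancing step rests on SMOTE's core idea: a synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates that interpolation in plain Python. It is a simplified illustration, not the study's pipeline: the function name `smote_sketch`, its parameters, and the toy data are hypothetical, and a real experiment would more likely rely on an established library implementation.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats) from one minority
    class; needs at least two points. Simplified sketch: base points are
    drawn at random rather than visited in turn.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nn = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        # New point lies on the segment between x and its chosen neighbour
        synthetic.append([a + lam * (b - a) for a, b in zip(x, nn)])
    return synthetic


# Toy minority class with two features (hypothetical values)
rare_attack = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
extra = smote_sketch(rare_attack, n_new=6, k=2, seed=1)
print(len(extra))  # 6
```

Because every synthetic point is interpolated between existing minority samples, oversampling is normally applied to the training split only, so that the test set still reflects the original class distribution.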