Journal : Building of Informatics, Technology and Science

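Indonesian Spam Classification with IndoBERT and XLM-RoBERTa: An Evaluation of Pooling, Stride, and Late Fusion (original title: Klasifikasi Spam Bahasa Indonesia dengan IndoBERT dan XLM-RoBERTa: Evaluasi Pooling, Stride, dan Late-Fusion)
Darmono, Darmono; Saputro, Rujianto Eko; Barkah, Azhari Shouni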
Building of Informatics, Technology and Science (BITS) Vol 7 No 2 (2025): September 2025
Publisher : Forum Kerjasama Pendidikan Tinggi

DOI: 10.47065/bits.v7i2.8034

Abstract

Spam detection for short Indonesian messages such as SMS and email remains challenging because of lexical variation, character obfuscation, and class imbalance. This study systematically evaluates which configuration best balances accuracy and efficiency for Indonesian spam filtering. We compare two pretrained backbones (IndoBERT and XLM-RoBERTa), two representation strategies (truncation versus chunking), several summarization schemes (pooling), and feature-fusion approaches. The system follows a simple feature-based design and is assessed using macro F1, spam-class recall, AUPRC (Area Under the Precision-Recall Curve), and two efficiency metrics: embedding build time and training latency. Results indicate that IndoBERT achieves superior binary classification performance with high efficiency, while XLM-RoBERTa scores slightly higher on AUPRC, making it more suitable for risk-ranking scenarios. Truncation combined with mean pooling consistently yields stable results. Late fusion provides only marginal improvements, but it remains relevant because it highlights the potential of domain-specific signals to improve robustness under heavy obfuscation. The final recommendation for production is IndoBERT with truncation, mean pooling, and embeddings only (no fused features). Limitations include the focus on short messages and the lack of evaluation under extreme obfuscation. Future work should explore character-level augmentation, cross-domain evaluation, and cost-sensitive threshold tuning.
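The recommended configuration combines truncation with mean pooling. The sketch below illustrates these two steps on toy data and is not the authors' implementation: the function names `truncate` and `mean_pool` are hypothetical, and the token vectors are made-up numbers standing in for the output of an encoder such as IndoBERT.

```python
from typing import List, Sequence


def truncate(token_ids: Sequence[int], max_len: int) -> List[int]:
    """Truncation strategy: keep only the first max_len tokens."""
    return list(token_ids[:max_len])


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy input (assumed values): three token vectors, the last one is padding.
vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
pooled = mean_pool(vectors, mask)  # one fixed-size vector per message
```

In a feature-based design like the one the abstract describes, a pooled vector of this kind would be the input to a separate lightweight classifier, with the encoder itself left unchanged.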