Document classification in low-resource languages remains a critical challenge due to the scarcity of annotated datasets, language-specific resources, and linguistic tools. This study investigates the effectiveness of zero-shot learning (ZSL) for multilingual document classification, with a specific focus on low-resource Southeast Asian languages: Javanese, Sundanese, and Malay. We adopt a zero-shot cross-lingual transfer approach, using English-labeled data as the source domain and evaluating on unseen target-language documents without any supervised fine-tuning in the target languages. Specifically, we employ two state-of-the-art multilingual transformer models, XLM-RoBERTa (XLM-R) and Multilingual T5 (mT5), to evaluate their ability to generalize across linguistically distant languages. Experimental results show that XLM-R achieves higher average accuracy (≈78%) and F1 score (≈0.76) than mT5 (≈74% accuracy, ≈0.72 F1 score), demonstrating stronger transferability and stability. Both models exhibit efficient inference speed and manageable computational costs, indicating potential for deployment in resource-constrained environments. The findings establish an early benchmark for zero-shot multilingual document classification in Southeast Asian languages and highlight the feasibility of inclusive NLP systems that bridge the data gap for underrepresented linguistic communities.
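To make the transfer setup concrete, the sketch below illustrates the general zero-shot cross-lingual inference step with the Hugging Face Transformers library: an XLM-R encoder whose classification head is assumed to have been fine-tuned only on English-labeled documents is applied directly to a target-language document. The label count, the omitted fine-tuning step, and the Javanese example sentence are illustrative assumptions, not details taken from the study itself.

```python
# Minimal sketch of zero-shot cross-lingual inference (assumptions noted above).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "xlm-roberta-base"  # same pretrained multilingual encoder family as in the study

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=4 is a hypothetical label set; in practice load weights
# fine-tuned on the English-labeled source-domain documents here.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)
model.eval()

# Illustrative Javanese document; it is classified directly,
# with no supervised fine-tuning in the target language.
doc = "Pamaréntah ngumumaké kabijakan ékonomi anyar kanggo taun ngarep."
inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(dim=-1).item()
print(predicted_label)
```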