Retrieval-Augmented Generation (RAG) combines the generative capabilities of language models with external document retrieval to answer questions grounded in reference texts. However, deploying RAG on low-resource devices such as Android smartphones is challenging because small language models (SLMs) have limited computational capacity and depend heavily on efficient chunking and retrieval. Although interest in on-device processing is growing, research on RAG configurations for SLMs under strict resource constraints, especially for domain-specific tasks, remains limited. This study therefore investigates which combinations of chunking technique, chunk size, overlap, and retrieval strategy best balance accuracy and speed on low-resource devices. The evaluation uses 148 Indonesian-language questions sourced from an official Hajj guidebook. The study consists of two phases: retrieval and generation. Retrieval is evaluated using BLEU, ROUGE-L, MRR, MAP, and Hit@k, while answer quality is measured with BERTScore. The experiments compare chunking methods (fixed-size or semantic), chunk sizes (128 or 256 tokens), overlaps (25, 50, or 100 tokens), and retrieval methods (dense, sparse, or hybrid). Results show that sparse retrieval with 256-token chunks and a 100-token overlap yields the best answer quality (F1 = 0.726), whereas 128-token chunks with the same overlap give the fastest generation time (69.737 seconds). The main contribution of this study is a systematic evaluation of RAG configurations for fully on-device SLMs using a domain-specific Hajj and Umrah dataset, which prior research has not explored. The findings provide practical guidance for designing efficient and accurate RAG-based question-answering systems on low-resource devices.
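To make the chunking parameters and two of the retrieval metrics concrete, the following is a minimal Python sketch, not the study's implementation. It assumes a document has already been tokenized into a list of tokens (the tokenizer used in the study is not stated here) and that each question has exactly one relevant chunk. The function names `chunk_fixed`, `hit_at_k`, and `mean_reciprocal_rank` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def chunk_fixed(tokens: Sequence[str], size: int, overlap: int) -> List[List[str]]:
    """Fixed-size chunking: windows of `size` tokens, consecutive windows
    sharing `overlap` tokens (e.g. size=256, overlap=100)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def hit_at_k(ranked_ids: Sequence[str], relevant_id: str, k: int) -> float:
    """1.0 if the relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """1/rank of the relevant chunk, or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], str]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_id) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


if __name__ == "__main__":
    # Assumption: plain integers rendered as strings stand in for real tokens.
    tokens = [str(i) for i in range(300)]
    for size in (128, 256):
        for overlap in (25, 50, 100):
            n = len(chunk_fixed(tokens, size, overlap))
            print(f"size={size} overlap={overlap} -> {n} chunks")
```

A larger overlap shrinks the step between windows, so the same document yields more chunks to index and search; this is one reason overlap can affect both retrieval quality and speed on a constrained device.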
Copyrights © 2026