Jurnal INFOTEL

Vol 17 No 4 (2025): November

Ary Suryadi (Universitas Nusa Mandiri, Indonesia)
Dedi Dwi Saputra (Universitas Siber Indonesia, Indonesia)
Windu Gata (Universitas Nusa Mandiri, Indonesia)
Riza Fahlapi (Universitas Bina Sarana Informatika, Indonesia)
Angge Firizkiansah (Universitas Sains Indonesia, Indonesia)
Nuryani Mawar Putri (Universitas Sains Indonesia, Indonesia)

Publish Date
03 Jan 2026

This paper introduces a novel framework for automated question-answering (QA) dataset construction, integrating information retrieval (IR) with a lightweight local large language model (LLM), SmolLM2-360M-Instruct, to ensure privacy and scalability for domain-specific applications. Addressing the limitations of manual dataset creation and cloud-based LLMs, our approach leverages PyPDF2 for robust PDF text extraction and a novel sentence segmentation algorithm to generate concise, contextually relevant QA pairs from domain-specific corpora. The framework employs IR techniques to align questions with precise answers, enhancing dataset quality while maintaining data privacy through localized processing. Rigorous evaluation using automated metrics and manual expert review confirms the high quality and semantic alignment of the generated QA pairs. This approach offers significant benefits for fine-tuning LLMs in niche domains, such as education and technical support, by providing scalable, privacy-preserving datasets that improve contextual understanding and adaptability. Our work contributes to efficient NLP dataset generation, offering a robust solution for advancing LLM performance in specialized real-world applications.
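
The pipeline described in the abstract (PDF text extraction, sentence segmentation, then retrieval to pair a question with its supporting sentence) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regex splitter stands in for their segmentation algorithm, the TF-IDF cosine ranking is an assumed stand-in for their IR step, and every function name here is hypothetical. Question generation by the local model (SmolLM2-360M-Instruct) is left out; the question is assumed to be supplied already.

```python
import math
import re
from collections import Counter


def extract_text(pdf_path):
    """Read all page text from a PDF with PyPDF2 (imported lazily)."""
    from PyPDF2 import PdfReader  # third-party; only needed for real PDFs
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(sentences):
    """Return one TF-IDF vector per sentence plus the IDF table."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in d.items()} for d in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def best_answer(question, sentences, vectors, idf):
    """Return the sentence most similar to the question, with its score."""
    if not sentences:
        return None, 0.0
    counts = Counter(tokenize(question))
    q = {t: tf * idf.get(t, 0.0) for t, tf in counts.items()}
    scores = [cosine(q, v) for v in vectors]
    i = max(range(len(sentences)), key=scores.__getitem__)
    return sentences[i], scores[i]


# Usage on an in-memory string; for a real file, start from extract_text(path).
text = (
    "PyPDF2 extracts text from PDF files. "
    "The local model runs on the user's machine. "
    "Retrieval aligns each question with a supporting sentence."
)
sentences = split_sentences(text)
vectors, idf = build_index(sentences)
answer, score = best_answer(
    "What extracts text from PDF files?", sentences, vectors, idf
)
```

Everything in this sketch runs locally, which mirrors the privacy argument in the abstract: no document text leaves the machine. A production pipeline would likely need a more careful segmenter and a stronger retriever than plain TF-IDF.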

Journal Info

Jurnal INFOTEL

Abbrev

infotel

Publisher

Universitas Telkom

Subject

Computer Science & IT; Electrical & Electronics Engineering

Description

Jurnal INFOTEL is a scientific journal published by Lembaga Penelitian dan Pengabdian Masyarakat (LPPM) of Institut Teknologi Telkom Purwokerto, Indonesia. Jurnal INFOTEL covers the field of informatics, telecommunication, and electronics. First published in 2009 for a printed version and published ...

Privacy-Preserving Automated QA Dataset Generation for Fine-Tuning LLMs with Local Models and Information Retrieval
