This author has published in the following journal:
Jurnal INFOTEL
Riza Fahlapi
Universitas Bina Sarana Informatika, Indonesia

Published: 1 document
Articles

Privacy-Preserving Automated QA Dataset Generation for Fine-Tuning LLMs with Local Models and Information Retrieval
Ary Suryadi; Dedi Dwi Saputra; Windu Gata; Riza Fahlapi; Angge Firizkiansah; Nuryani Mawar Putri
JURNAL INFOTEL Vol 17 No 4 (2025): November
Publisher : LPPM INSTITUT TEKNOLOGI TELKOM PURWOKERTO

DOI: 10.20895/infotel.v17i4.1388

Abstract

This paper introduces a novel framework for automated question-answering (QA) dataset construction, integrating information retrieval (IR) with a lightweight local large language model (LLM), SmolLM2-360M-Instruct, to ensure privacy and scalability for domain-specific applications. Addressing the limitations of manual dataset creation and cloud-based LLMs, our approach leverages PyPDF2 for robust PDF text extraction and a novel sentence segmentation algorithm to generate concise, contextually relevant QA pairs from domain-specific corpora. The framework employs IR techniques to align questions with precise answers, enhancing dataset quality while maintaining data privacy through localized processing. Rigorous evaluation using automated metrics and manual expert review confirms the high quality and semantic alignment of the generated QA pairs. This approach offers significant benefits for fine-tuning LLMs in niche domains, such as education and technical support, by providing scalable, privacy-preserving datasets that improve contextual understanding and adaptability. Our work contributes to efficient NLP dataset generation, offering a robust solution for advancing LLM performance in specialized real-world applications.
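
The abstract does not spell out the segmentation algorithm or the IR method, so the following is only a minimal sketch of the "segment, then retrieve" idea under stated assumptions. It substitutes a naive regex sentence splitter and TF-IDF cosine similarity for the authors' components, and it leaves out both the PyPDF2 extraction step and question generation with SmolLM2-360M-Instruct. The function names (`split_sentences`, `best_answer`, and the helpers) are hypothetical and do not come from the paper.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: list[list[str]]) -> tuple[list[dict[str, float]], dict[str, float]]:
    """Build sparse TF-IDF vectors (smoothed IDF) for tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_answer(question: str, sentences: list[str]) -> tuple[str, float]:
    """Return the sentence most similar to the question, with its score."""
    if not sentences:
        raise ValueError("sentences must not be empty")
    vectors, idf = tfidf_vectors([tokenize(s) for s in sentences])
    # Question terms that never occur in the corpus get weight 0.
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(question)).items()}
    scores = [cosine(q_vec, v) for v in vectors]
    best = max(range(len(sentences)), key=scores.__getitem__)
    return sentences[best], scores[best]


if __name__ == "__main__":
    # Illustrative text only; a real pipeline would use text extracted from a PDF.
    corpus = (
        "PyPDF2 extracts text from PDF files. "
        "The local model runs on the user's machine. "
        "Retrieval aligns each question with an answer sentence."
    )
    answer, score = best_answer("Where does the local model run?", split_sentences(corpus))
    print(answer, round(score, 3))
```

In a fuller pipeline, the questions would presumably come from the local model rather than being written by hand, and the retrieval score could serve as a filter that discards weakly aligned QA pairs. That filtering step is an assumption as well, not something the abstract states.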