This paper introduces a novel framework for automated question-answering (QA) dataset construction that integrates information retrieval (IR) with a lightweight, locally run large language model (LLM), SmolLM2-360M-Instruct, to ensure privacy and scalability in domain-specific applications. To address the limitations of manual dataset creation and of cloud-based LLMs, our approach uses PyPDF2 for PDF text extraction and a novel sentence segmentation algorithm to generate concise, contextually relevant QA pairs from domain-specific corpora. IR techniques align each question with a precise answer, which improves dataset quality, while localized processing keeps the data private. Evaluation with automated metrics and manual expert review confirms the high quality and semantic alignment of the generated QA pairs. The resulting scalable, privacy-preserving datasets support fine-tuning LLMs in niche domains such as education and technical support, where they improve contextual understanding and adaptability. Our work contributes to efficient NLP dataset generation and offers a practical route to better LLM performance in specialized real-world applications.
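To make the segmentation and alignment steps concrete, the following is a minimal sketch and not the method used in this work. It assumes the text has already been extracted from a PDF (for example with PyPDF2), substitutes a naive regex splitter for the paper's segmentation algorithm, and substitutes a simple TF-IDF overlap score for the IR component. The function names `split_sentences`, `tokenize`, and `best_answer` are hypothetical, and the question is supplied by hand rather than generated by the LLM.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(text):
    """Lowercase and keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_answer(question, sentences):
    """Return the sentence with the highest TF-IDF-weighted overlap with the question.

    Stand-in for the IR alignment step; returns None for an empty sentence list.
    """
    if not sentences:
        return None
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)

    # Document frequency: in how many sentences each term appears.
    df = Counter()
    for d in docs:
        df.update(d.keys())

    def idf(term):
        # Smoothed inverse document frequency.
        return math.log((1 + n) / (1 + df[term])) + 1.0

    q = Counter(tokenize(question))

    def score(d):
        # Weighted dot product, normalised by the sentence vector length.
        dot = sum(q[t] * d[t] * idf(t) ** 2 for t in q)
        norm = math.sqrt(sum((c * idf(t)) ** 2 for t, c in d.items()))
        return dot / norm if norm else 0.0

    scores = [score(d) for d in docs]
    return sentences[max(range(n), key=scores.__getitem__)]


if __name__ == "__main__":
    corpus = (
        "Photosynthesis converts light into chemical energy. "
        "Mitochondria produce ATP in cells. "
        "The exam is on Friday."
    )
    sentences = split_sentences(corpus)
    print(best_answer("What do mitochondria produce?", sentences))
```

In a full pipeline the question would come from the local LLM and the retrieved sentence would be stored as its answer; any stronger retriever could replace the scoring function without changing the surrounding structure.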
Copyright © 2025