Journal of Applied Data Sciences
Vol 7, No 2: May 2026

A Hybrid TF-IDF and Knowledge Graph-Enhanced Retrieval-Augmented Generation Framework with Large Language Models for Domain-Aware Question Answering

Utami, Lilyani Asri
Rachmi, Hilda
Hidayatulloh, Syarif



Article Info

Publish Date
17 Mar 2026

Abstract

This study develops a domain-aware legal Question-Answering (QA) system tailored for Indonesia’s Micro, Small, and Medium Enterprises (MSMEs) by proposing a hybrid Retrieval-Augmented Generation (RAG) framework that integrates Term Frequency–Inverse Document Frequency (TF-IDF), Knowledge Graph (KG), and Large Language Model (LLM) components. In this framework, TF-IDF performs lexical-level retrieval, identifying the most relevant documents based on keyword weighting; the KG enriches this retrieval with semantic relationships among legal entities, enabling deeper contextual understanding; and the LLM generates coherent responses conditioned on both the lexical and the semantically grounded evidence. Together, these components strengthen factual grounding during retrieval and improve contextual reasoning during generation. Methodologically, the system processes a curated dataset of 1,400 legal question–answer pairs collected from national legal repositories, including legislation, government regulations, and MSME digitalization guidelines. The process comprises text preprocessing, keyword extraction using TF-IDF, semantic enrichment through a KG that maps legal entities and their relationships, and answer generation via an LLM within the RAG pipeline. The system was evaluated using Precision, Recall, F1-Score, Bilingual Evaluation Understudy (BLEU), and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics, and validated by five legal experts. Results show an accuracy improvement from 76.5% to 83.5% after integrating the KG, with a Precision of 0.853, Recall of 0.877, and F1-Score of 0.865. The generative evaluation yielded a BLEU score of 0.9276 and a ROUGE-L score of 0.9301, indicating strong linguistic and semantic alignment between system outputs and expert-authored references.
The study concludes that this approach offers a practical foundation for building AI-based legal assistance tools and highlights future opportunities for expansion to other legal domains and multilingual RAG applications.
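
The retrieval side of the framework described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus `DOCS`, the adjacency-list graph `KG`, the helper names (`build_index`, `expand_query`, `retrieve`, `build_prompt`), and the choice of KG-driven query expansion with cosine-scored TF-IDF vectors are all assumptions made for illustration. The LLM call is left out; the sketch stops at assembling the prompt that a generator would receive.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; the study's dataset is not reproduced here.
DOCS = {
    "d1": "MSME business license registration requires an identification number",
    "d2": "Government regulation on tax incentives for micro and small enterprises",
    "d3": "Guidelines for digital payment adoption by small enterprises",
}

# Hypothetical knowledge graph as an adjacency list: entity -> related entities.
KG = {
    "permit": ["license", "registration"],
    "tax": ["incentives"],
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Return smoothed IDF weights and one TF-IDF vector per document."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(t for toks in tokenized.values() for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        vectors[doc_id] = {t: (f / len(toks)) * idf[t] for t, f in tf.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_query(tokens, kg):
    """Append the KG neighbours of each query token (assumed enrichment step)."""
    expanded = list(tokens)
    for token in tokens:
        for related in kg.get(token, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


def retrieve(query, idf, vectors, kg, top_k=2):
    """Rank documents by TF-IDF cosine similarity to the KG-expanded query."""
    tokens = expand_query(tokenize(query), kg)
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values())
    if total == 0:
        return []
    query_vec = {t: (f / total) * idf[t] for t, f in tf.items()}
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:top_k] if score > 0.0]


def build_prompt(query, hits, docs):
    """Assemble the retrieved passages into a prompt for a generator model."""
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id, _ in hits)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


idf, vectors = build_index(DOCS)
question = "How do I get a permit?"
hits = retrieve(question, idf, vectors, KG)
print(build_prompt(question, hits, DOCS))
```

In this toy setup the word "permit" does not occur in any document, so lexical matching alone returns nothing; the graph links it to "license" and "registration", which lets the TF-IDF ranker reach the licensing document. This is one plausible reading of how a KG can supply semantic relationships that keyword weighting misses, under the assumptions stated above.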

Copyrights © 2026






Journal Info

Abbrev

JADS

Publisher

Subject

Computer Science & IT; Control & Systems Engineering; Decision Sciences, Operations Research & Management

Description

One of the current hot topics in science is data: how can datasets be used in scientific and scholarly research in a more reliable, citable and accountable way? Data is of paramount importance to scientific progress, yet most research data remains private. Enhancing the transparency of the processes ...