Du, Yu
Unknown Affiliation

Published: 1 document

Evaluating document chunking approaches for retrieval augmented generation in editorial content
Lavarec, Erwann; Du, Yu
IAES International Journal of Artificial Intelligence (IJ-AI), Vol 15, No 2: April 2026
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/ijai.v15.i2.pp1909-1918

Abstract

Retrieval-augmented generation (RAG) systems promise grounded answers from large language models (LLMs), yet performance depends critically on how source documents are segmented before indexing. This study investigates how pre-index chunking strategies affect both retrieval accuracy and answer quality in domain-specific scenarios. We curated a corpus of software-as-a-service (SaaS) editorial content and constructed a high-quality evaluation dataset of 2,419 question-answer (QA) pairs generated through automated prompting and quality control. We compared four chunking approaches: fixed-size, structure-aware recursive, semantic, and LLM-based. Our evaluation protocol assessed retrieval through document localization, semantic similarity, and context relevance, while generation quality was scored against chain-of-thought (CoT) criteria by LLM judges. Results demonstrate that recursive chunking consistently outperforms the other approaches across all metrics. Smaller chunks improve document localization, while moderately larger chunks enhance semantic alignment and generation scores. LLM-based chunking variants show competitive performance but do not exceed the top recursive configurations on our dataset. These findings indicate that preserving document structure through recursive chunking is beneficial for practical RAG implementations, providing actionable guidance for chunk-size selection while highlighting token-budget constraints in current long-context models.
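To make the structure-aware recursive strategy concrete, below is a minimal illustrative sketch of how such a chunker typically works: try the coarsest structural separator first (paragraph breaks, then line breaks, then sentence boundaries, then spaces), and recurse to finer separators only when a piece still exceeds the size budget. The separator list, the character budget, and the function name are illustrative assumptions for this sketch, not parameters or code from the paper.

```python
# Illustrative recursive (structure-aware) chunker. The separator hierarchy
# and the max_chars budget are assumptions, not values from the paper.
SEPARATORS = ["\n\n", "\n", ". ", " "]

def recursive_chunk(text, max_chars=200, separators=SEPARATORS):
    """Return chunks no longer than max_chars, preferring structural breaks."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No structure left: fall back to a hard character split.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) == 1:
        # This separator is absent at this level; try the next finer one.
        return recursive_chunk(text, max_chars, rest)
    chunks, buffer = [], ""
    for piece in pieces:
        # Greedily pack adjacent pieces back together up to the budget,
        # restoring the separator between them. (As a simplification, a
        # chunk's trailing separator is dropped.)
        candidate = (buffer + sep + piece) if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate
        else:
            if buffer:
                chunks.append(buffer)
            buffer = ""
            if len(piece) > max_chars:
                # A single piece can itself exceed the budget; recurse on
                # finer separators to split it further.
                chunks.extend(recursive_chunk(piece, max_chars, rest))
            else:
                buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks

doc = ("Retrieval-augmented generation grounds answers in retrieved text.\n\n"
       "Chunking decides what the retriever can see. Small chunks localize "
       "well. Larger chunks carry more context for the generator.")
for c in recursive_chunk(doc, max_chars=80):
    print(repr(c))
```

Because paragraph and sentence boundaries are exhausted before any hard character split, chunks tend to stay semantically coherent, which is the property the abstract credits for recursive chunking's strong retrieval and generation scores.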