Tackling RAG Challenges: A Developer's Pursuit of Citation Accuracy and Halving Hallucinations
Building a robust Retrieval-Augmented Generation (RAG) pipeline is a common objective for many developers, often forming a key developer OKR (Objective and Key Result) in the pursuit of more intelligent applications. However, even with sophisticated setups, persistent issues like inaccurate citations and LLM hallucinations can severely impact user experience. A recent GitHub discussion highlighted these very challenges in a custom Parent-Child RAG pipeline, offering valuable community insights and practical solutions.
The Sophisticated Setup and Its Snags
IchNarA, the discussion author, detailed a RAG stack designed for a local university study assistant. Their architecture included:
- Document Processing: MinerU + PyMuPDF to Markdown.
- Chunking: Custom ParentChildChunker (MarkdownHeaderTextSplitter + RecursiveCharacterTextSplitter) with large parent sections (~300-2400 chars) and smaller child chunks (~500 chars).
- Vector Store: FAISS (multilingual-e5-base) + BM25 hybrid with RRF fusion.
- Reranking: Cross-encoder (mmarco-mMiniLMv2-L12-H384-v1).
- Context Building: Retrieve → rerank → parent expansion (using ParentStore) → limited to ~9000 chars.
- Generation: LangGraph pipeline (rewrite → retrieve → rerank → expand → generate) with Gemma3:4B (Ollama), low temperature (0.0-0.1), and a repeat penalty (1.15).
Despite this advanced setup, two critical issues emerged: wrong/inconsistent page citations (model cites pages that don't contain information, or UI shows different pages) and occasional hallucinations + repetition (model repeats phrases or adds ungrounded information).
Solving the Citation Conundrum
The core problem identified by the community was a mismatch between the retrieval units (child chunks) and the generation/citation units (expanded parents). When the LLM generates from parent content, the UI's source_docs are still derived from child chunks, leading to incorrect page references.
Gecko51's Solution: Aligning source_docs
The immediate fix suggested was to modify the _expand_node function in rag_graph.py. Instead of populating source_docs from reranked child docs, it should rebuild them from the expanded parent list. A more refined approach involves capturing child-level page references before expansion and reattaching them to the parent documents.
# Before parent expansion - save child page refs
child_page_map: dict[str, list[int]] = {}
for doc, _ in reranked_docs: # Assuming reranked_docs is available here
pid = doc.metadata.get("parent_id")
page = doc.metadata.get("page")
if pid and page is not None:
child_page_map.setdefault(pid, []).append(page)
# ... (perform parent expansion) ...
for parent in expanded_parents: # Assuming expanded_parents is the result
pid = parent.metadata.get("parent_id") or parent.metadata.get("id")
matched = child_page_map.get(pid, [])
parent.metadata["cited_pages"] = sorted(set(matched))
This ensures that the source_docs returned to the UI carry the precise child-level page information, significantly improving citation accuracy.
Musaabhasan's Evidence Table Approach
For more robust citation, musaabhasan proposed keeping retrieval, generation, and citation units distinct. This involves passing structured evidence (e.g., an evidence table with ID, page, text span) to the LLM and post-validating every citation against this table.
Taming Hallucinations and Repetition
Hallucinations and repetition, especially with smaller local models like Gemma3:4B, often stem from context overload and insufficient model constraints.
- Context Size: The 9000-character context budget for a 4B model is quite aggressive. Gecko51 recommended reducing it to 5000-6000 characters.
- Strict System Prompt: Reinforcing grounding instructions in the system prompt for
_generate_nodeis crucial.
system_prompt = (
"Answer the question using ONLY the context provided. "
"If the answer is not in the context, say: I don't know. "
"Do not add any information that is not explicitly stated in the context."
)
repeat_penalty=1.15, Gecko51 suggested ensuring this is correctly applied in Ollama parameters, noting that num_ctx=8192 combined with this could still lead to loops in dense contexts.These adjustments help the LLM stay grounded and prevent it from generating ungrounded or repetitive text, aligning with the developer OKR of building reliable AI tools.
Key Takeaways for Developers
This discussion underscores that even with a sophisticated RAG architecture, meticulous alignment between document processing, retrieval, and generation is paramount. For developers optimizing RAG pipelines, focusing on:
- Ensuring citation metadata accurately reflects the content used for generation.
- Managing context windows appropriate for the LLM size.
- Implementing strict grounding prompts and repetition penalties.
can significantly improve the reliability and user experience of their AI applications, directly contributing to their development objectives.
