Optimizing RAG Pipelines: Solutions for Citation Mismatch, LLM Hallucinations, and Repetition in Custom Stacks

Building a robust Retrieval-Augmented Generation (RAG) pipeline is a common objective for many developers, often forming a key developer OKR (Objective and Key Result) in the pursuit of more intelligent applications. However, even with sophisticated setups, persistent issues like inaccurate citations and LLM hallucinations can severely impact user experience. A recent GitHub discussion highlighted these very challenges in a custom Parent-Child RAG pipeline, offering valuable community insights and practical solutions.

A developer analyzing a RAG pipeline diagram, highlighting issues with data flow and citations.

The Sophisticated Setup and Its Snags

IchNarA, the discussion author, detailed a RAG stack designed for a local university study assistant. Their architecture included:

Document Processing: MinerU + PyMuPDF to Markdown.
Chunking: Custom ParentChildChunker (MarkdownHeaderTextSplitter + RecursiveCharacterTextSplitter) with large parent sections (~300-2400 chars) and smaller child chunks (~500 chars).
Vector Store: FAISS (multilingual-e5-base) + BM25 hybrid with RRF fusion.
Reranking: Cross-encoder (mmarco-mMiniLMv2-L12-H384-v1).
Context Building: Retrieve → rerank → parent expansion (using ParentStore) → limited to ~9000 chars.
Generation: LangGraph pipeline (rewrite → retrieve → rerank → expand → generate) with Gemma3:4B (Ollama), low temperature (0.0-0.1), and a repeat penalty (1.15).

Despite this advanced setup, two critical issues emerged: wrong/inconsistent page citations (model cites pages that don't contain information, or UI shows different pages) and occasional hallucinations + repetition (model repeats phrases or adds ungrounded information).

A visual representation of accurate and inaccurate document citations in a knowledge base.

Solving the Citation Conundrum

The core problem identified by the community was a mismatch between the retrieval units (child chunks) and the generation/citation units (expanded parents). When the LLM generates from parent content, the UI's source_docs are still derived from child chunks, leading to incorrect page references.

Gecko51's Solution: Aligning source_docs

The immediate fix suggested was to modify the _expand_node function in rag_graph.py. Instead of populating source_docs from reranked child docs, it should rebuild them from the expanded parent list. A more refined approach involves capturing child-level page references before expansion and reattaching them to the parent documents.

# Before parent expansion - save child page refs
child_page_map: dict[str, list[int]] = {}
for doc, _ in reranked_docs: # Assuming reranked_docs is available here
    pid = doc.metadata.get("parent_id")
    page = doc.metadata.get("page")
    if pid and page is not None:
        child_page_map.setdefault(pid, []).append(page)

# ... (perform parent expansion) ...

for parent in expanded_parents: # Assuming expanded_parents is the result
    pid = parent.metadata.get("parent_id") or parent.metadata.get("id")
    matched = child_page_map.get(pid, [])
    parent.metadata["cited_pages"] = sorted(set(matched))

This ensures that the source_docs returned to the UI carry the precise child-level page information, significantly improving citation accuracy.

Musaabhasan's Evidence Table Approach

For more robust citation, musaabhasan proposed keeping retrieval, generation, and citation units distinct. This involves passing structured evidence (e.g., an evidence table with ID, page, text span) to the LLM and post-validating every citation against this table.

Taming Hallucinations and Repetition

Hallucinations and repetition, especially with smaller local models like Gemma3:4B, often stem from context overload and insufficient model constraints.

Context Size: The 9000-character context budget for a 4B model is quite aggressive. Gecko51 recommended reducing it to 5000-6000 characters.
Strict System Prompt: Reinforcing grounding instructions in the system prompt for _generate_node is crucial.

system_prompt = (
"Answer the question using ONLY the context provided. "
"If the answer is not in the context, say: I don't know. "
"Do not add any information that is not explicitly stated in the context."
)

Repeat Penalty: While IchNarA already used repeat_penalty=1.15, Gecko51 suggested ensuring this is correctly applied in Ollama parameters, noting that num_ctx=8192 combined with this could still lead to loops in dense contexts.

These adjustments help the LLM stay grounded and prevent it from generating ungrounded or repetitive text, aligning with the developer OKR of building reliable AI tools.

Key Takeaways for Developers

This discussion underscores that even with a sophisticated RAG architecture, meticulous alignment between document processing, retrieval, and generation is paramount. For developers optimizing RAG pipelines, focusing on:

Ensuring citation metadata accurately reflects the content used for generation.
Managing context windows appropriate for the LLM size.
Implementing strict grounding prompts and repetition penalties.

can significantly improve the reliability and user experience of their AI applications, directly contributing to their development objectives.

Tackling RAG Challenges: A Developer's Pursuit of Citation Accuracy and Halving Hallucinations

The Sophisticated Setup and Its Snags

Solving the Citation Conundrum

Taming Hallucinations and Repetition

Key Takeaways for Developers

See Also

Gamification

Performance Review

Contributions Analytics

Work Quality Analytics

Actionable Alerts

Retrospective Insights

|