RAG Pipeline: Fix Citations, Stop Hallucinations, Boost Developer OKRs

Building a robust Retrieval-Augmented Generation (RAG) pipeline is a common objective for many developers, often forming a key developer OKR in the pursuit of more intelligent applications. However, even with sophisticated setups, persistent issues like inaccurate citations and LLM hallucinations can severely impact user experience and undermine the reliability of AI-powered tools. A recent GitHub discussion highlighted these very challenges in a custom Parent-Child RAG pipeline, offering valuable community insights and practical solutions for enhancing development-integrations.

The Sophisticated Setup and Its Snags

IchNarA, the discussion author, detailed an impressive RAG stack designed for a local university study assistant. Their architecture was anything but basic, featuring:

Document Processing: MinerU + PyMuPDF to Markdown for high-quality text extraction.
Chunking: A custom ParentChildChunker using MarkdownHeaderTextSplitter and RecursiveCharacterTextSplitter. This created larger parent sections (~300-2400 chars) for context and smaller child chunks (~500 chars) for precise retrieval.
Vector Store: A hybrid approach combining FAISS (for dense vectors from multilingual-e5-base embeddings) and BM25 (for sparse keyword matching), fused with Reciprocal Rank Fusion (RRF).
Reranking: A cross-encoder (mmarco-mMiniLMv2-L12-H384-v1) to refine retrieval results.
Context Building: A multi-stage process of retrieve → rerank → parent expansion, limited to approximately 9000 characters.
Generation: A LangGraph pipeline (rewrite → retrieve → rerank → expand → generate) powered by Gemma3:4B via Ollama, with a low temperature (0.0-0.1) and a repeat penalty (1.15).

Despite this advanced tooling, two critical issues emerged that directly impacted user trust and the system's overall effectiveness: wrong/inconsistent page citations (where the model cited pages that didn't contain the claimed information, or the UI showed different pages) and occasional hallucinations + repetition (the model repeating phrases or adding plausible but ungrounded information).

Illustration demonstrating parent-child chunking with child-level page references being preserved and attached to expanded parent documents.

Unpacking the Citation Conundrum: When Sources Misalign

The core problem, as identified by the community, wasn't necessarily poor retrieval, but a fundamental misalignment between the units used for retrieval, generation, and citation. IchNarA's setup retrieved precise child chunks, but then expanded to larger parent documents for the LLM's generation context. Critically, the source_docs passed to the UI for citation still originated from these smaller child chunks, while the LLM's answer was based on the broader parent content. This discrepancy led to citations pointing to child chunks that might not fully encompass the model's generated statement, or worse, pointing to pages that didn't contain the information the LLM inferred from the expanded parent.

This highlights a crucial distinction in sophisticated RAG pipelines:

Retrieval Unit: The smallest, most granular piece of information used to find relevant data (e.g., child chunks).
Generation Unit: The broader context provided to the LLM to synthesize an answer (e.g., parent documents).
Citation Unit: The exact, verifiable reference point for a factual claim (e.g., specific page numbers or text spans).

When these units are not meticulously aligned, citation accuracy suffers, directly impacting the reliability of your development-integrations.

The Fix: Precision in Citation Alignment

The most robust solution involves preserving the granular page references from the child chunks and explicitly associating them with the larger parent documents used for generation. Here's a refined approach:

Pre-capture Child Page References: Before expanding to parent documents, iterate through the initially reranked child chunks. For each child chunk, extract its parent_id and its precise page number. Store these in a map, associating parent IDs with a list of all child-level pages that contributed to that parent's retrieval.
Attach to Parent Metadata: After expanding to the parent documents, iterate through these parents. For each parent, retrieve the list of precise child-level pages from your pre-captured map. Attach this list (e.g., as parent.metadata["cited_pages"]) to the parent document itself.
Derive UI Source Docs from Parents: Ensure that the source_docs returned to the UI are now derived from these enriched parent documents. The UI can then display the parent document's content, but reference the specific cited_pages that were originally matched at the child level. This provides both broad context and granular citation accuracy.

This approach ensures that even when the LLM generates from a larger context, the citation mechanism retains the precision of the initial child-level hits. The _expand_node function in rag_graph.py would need modification to implement this logic, ensuring that the source_docs passed to the UI accurately reflect the pages that *triggered* the retrieval, even if the LLM generates from a broader parent.

Visualizing an LLM constrained by a strict system prompt and optimized context window to prevent hallucinations and repetitive output.

Prompt Engineering for Citation Integrity

Beyond architectural changes, the LLM's instructions are paramount. A highly explicit system prompt is essential:

"Only cite page numbers that appear in the provided context. If you cannot confirm the exact page, do not cite it."

Some advanced pipelines even feed the LLM an 'evidence table' mapping specific text spans to page numbers, forcing it to cite from this structured evidence. Post-validation of citations against this evidence table can further reduce errors, rejecting or regenerating answers with ungrounded citations.

Taming the LLM: Halting Hallucinations and Repetition

Small, local models like Gemma3:4B are powerful but require careful management to prevent common pitfalls like hallucinations and repetitive output. Several strategies emerged from the discussion:

Context Window Management: IchNarA was pushing Gemma3:4B with a context budget of ~9000 characters, close to its 8k token limit. For smaller models, this can lead to 'context density' issues, where the model struggles to process and prioritize information effectively. Reducing the context to 5000-6000 characters often significantly improves output quality and reduces the likelihood of the model getting 'lost' or repetitive.
Repeat Penalty: While IchNarA already used repeat_penalty=1.15, slightly lower values like 1.1 (or tuning based on specific model behavior) can be even more effective in preventing the model from looping on phrases, especially when combined with a tighter context window. This is typically configured directly in your Ollama parameters.
Strict Grounding Prompts: Reinforce the LLM's adherence to the provided context. A system prompt like: "Answer the question using ONLY the context provided. If the answer is not in the context, say: I could not find this information in the uploaded documents. Do not add any information that is not explicitly stated in the context." is crucial. This explicit instruction minimizes the model's tendency to 'fill in the blanks' with its pre-trained knowledge, a common source of hallucinations.
Temperature and Output Length: While IchNarA's temperature (0.0-0.1) is already very low, capping the output length can prevent runaway generation. For critical applications, forcing a simple answer schema (e.g., answer, cited evidence IDs, and an uncertainty note) can further constrain the model and make its output easier to validate.

Impact on Productivity and Delivery

For dev teams, product managers, and CTOs, the implications of these fixes extend beyond mere technical elegance. A RAG pipeline that consistently delivers accurate, grounded, and non-repetitive answers directly contributes to:

Improved User Trust: Reliable citations mean users can verify information, fostering confidence in the AI assistant.
Reduced Debugging & Maintenance: Fewer hallucinations and citation errors mean less time spent by developers identifying and fixing model misbehavior, freeing up resources for new features. This directly impacts developer OKRs related to efficiency and quality.
Faster Feature Delivery: A stable and predictable RAG core allows for quicker iteration and deployment of new AI-powered functionalities, accelerating overall project delivery.
Enhanced Tooling & Integrations: By making the RAG system more robust, it becomes a more valuable component in your broader suite of development-integrations, enabling more sophisticated applications across the organization.

The journey to a perfectly grounded RAG system is iterative, but by addressing these core challenges with meticulous engineering and community insights, organizations can build AI tools that are not just intelligent, but also trustworthy and highly effective.

Precision RAG: Fixing Citations & Hallucinations for Stronger Developer OKRs

The Sophisticated Setup and Its Snags

Unpacking the Citation Conundrum: When Sources Misalign

The Fix: Precision in Citation Alignment

Prompt Engineering for Citation Integrity

Taming the LLM: Halting Hallucinations and Repetition

Impact on Productivity and Delivery

See Also

Gamification

Performance Review

Contributions Analytics

Work Quality Analytics

Actionable Alerts

Retrospective Insights

|