๐ŸŽฏ Context & Problem

RAG (Retrieval-Augmented Generation) systems are increasingly used in production to ground LLM responses in real data. However, even with retrieval, LLMs can still hallucinate โ€” generating plausible-sounding but factually incorrect or unsupported content.

The challenge is twofold: detecting hallucinations reliably (without human review at scale) and correcting them automatically in a feedback loop. Existing evaluation tools either require labeled datasets or don't integrate easily into existing RAG pipelines.

This project builds a complete, modular evaluation framework that can be plugged into any RAG pipeline to measure groundedness, detect hallucinations, and trigger corrective generation.

๐Ÿ—๏ธ Technical Architecture

The system follows a judge-refine loop orchestrated by LangGraph, with ChromaDB as the vector store and Claude API as the evaluation judge.

User Query → Retriever (ChromaDB) → Context Chunks
      ↓
Generator (LLM) → RAG Response
      ↓
Claude API (Judge) → Hallucination Score → Pass / Refine
      ↓ (if Refine)
Corrective Generator → Grounded Response
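The retrieval step ranks stored chunks by cosine similarity to the query embedding. A minimal pure-Python sketch of that scoring follows; the chunks and 3-dimensional "embeddings" are made up for illustration, and in the actual system ChromaDB performs this search over real embeddings:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, k=2):
    # Rank (chunk, embedding) pairs by similarity; return the top-k chunks.
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Toy corpus purely for illustration.
store = [
    ("Photosynthesis converts light to energy.", [0.9, 0.1, 0.0]),
    ("The mitochondria is the powerhouse.", [0.1, 0.9, 0.0]),
    ("Paris is the capital of France.", [0.0, 0.1, 0.9]),
]
top = retrieve([0.8, 0.2, 0.1], store, k=1)
```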
  • LangGraph orchestrates the evaluation-correction loop as a stateful graph โ€” each node is a distinct step (retrieve, generate, judge, refine)
  • Claude API as Judge evaluates each response on 3 dimensions: groundedness, faithfulness, and completeness โ€” returning a structured JSON score
  • ChromaDB stores document embeddings and serves as the retrieval backend with cosine similarity search
  • FastAPI exposes the evaluation pipeline as a REST API, making it pluggable into any existing system
  • The loop runs up to N iterations โ€” stopping when the score exceeds a configurable threshold or max attempts are reached
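The loop's stopping logic (pass the threshold or exhaust the attempt budget) can be sketched in plain Python. The stub generator and judge below stand in for the RAG generator and the Claude API, and the 0.8 threshold, field names, and attempt count are illustrative, not the production configuration:

```python
def judge_refine_loop(query, context, generate, judge, threshold=0.8, max_attempts=3):
    """Regenerate until the judge's groundedness score passes or attempts run out."""
    response = generate(query, context, feedback=None)
    for attempt in range(1, max_attempts + 1):
        verdict = judge(response, context)  # e.g. {"groundedness": 0.9, "issues": [...]}
        if verdict["groundedness"] >= threshold:
            return response, verdict, attempt
        # Feed the judge's diagnosis back into a corrective generation pass.
        response = generate(query, context, feedback=verdict["issues"])
    return response, verdict, max_attempts

# Stubs purely for illustration: the corrective pass produces a grounded answer.
def fake_generate(query, context, feedback=None):
    return "grounded answer" if feedback else "hallucinated answer"

def fake_judge(response, context):
    ok = response == "grounded answer"
    return {"groundedness": 0.95 if ok else 0.4,
            "issues": [] if ok else ["claim not supported by context"]}

final, verdict, attempts = judge_refine_loop("q", "ctx", fake_generate, fake_judge)
```

In the real system each of these steps is a LangGraph node, with the pass/refine decision expressed as a conditional edge rather than an in-function loop.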

๐Ÿ“Š Results & Key Insights

  • 3 evaluation dimensions (groundedness, faithfulness, completeness)
  • ~2 avg. correction loops to reach a passing score
  • REST API-first design, pluggable into any RAG pipeline
  • The judge-refine loop consistently improved groundedness scores across test sets of educational content
  • Structured JSON output from Claude API enabled fine-grained diagnosis โ€” identifying which claim was hallucinated, not just a binary pass/fail
  • Related to research published at ISMIS 2026: "Diagnosing Hallucinations in RAG-Based Educational Recommendation Systems"
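The per-claim diagnosis is possible because the judge returns a claim-level verdict rather than a single scalar. A sketch of consuming such output; this JSON shape and its field names are hypothetical, and the real judge schema may differ:

```python
import json

# Hypothetical structured judge verdict; field names are illustrative.
raw = json.dumps({
    "claims": [
        {"text": "The course covers linear algebra.", "supported": True},
        {"text": "It was published in 2019.", "supported": False,
         "reason": "publication year not present in retrieved context"},
    ],
    "scores": {"groundedness": 0.5, "faithfulness": 0.5, "completeness": 1.0},
})

def unsupported_claims(verdict_json):
    # Return the claims the judge flagged as ungrounded, so the corrective
    # generator can target them specifically instead of rewriting everything.
    verdict = json.loads(verdict_json)
    return [c for c in verdict["claims"] if not c["supported"]]

flagged = unsupported_claims(raw)
```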

๐Ÿ”— Related Work

  • Directly related to the ISMIS 2026 paper on hallucination diagnosis in educational RAG systems
  • Built on top of the BERT4Rec + RAG Recommender pipeline as the evaluation layer
  • Techniques applied in production at Inokufu to validate RAG-generated educational recommendations