🎯 Context & Problem
RAG (Retrieval-Augmented Generation) systems are increasingly used in production to ground LLM responses in real data. However, even with retrieval, LLMs can still hallucinate, generating plausible-sounding but factually incorrect or unsupported content.
The challenge is twofold: detecting hallucinations reliably (without human review at scale) and correcting them automatically in a feedback loop. Existing evaluation tools either require labeled datasets or don't integrate easily into existing RAG pipelines.
This project builds a complete, modular evaluation framework that can be plugged into any RAG pipeline to measure groundedness, detect hallucinations, and trigger corrective generation.
🏗️ Technical Architecture
The system follows a judge-refine loop orchestrated by LangGraph, with ChromaDB as the vector store and the Claude API as the evaluation judge. Minimal sketches of each component follow the list below.
- LangGraph orchestrates the evaluation-correction loop as a stateful graph: each node is a distinct step (retrieve, generate, judge, refine)
- Claude API as Judge evaluates each response on three dimensions (groundedness, faithfulness, and completeness), returning a structured JSON score
- ChromaDB stores document embeddings and serves as the retrieval backend with cosine similarity search
- FastAPI exposes the evaluation pipeline as a REST API, making it pluggable into any existing system
- The loop runs up to N iterations, stopping when the judge's score exceeds a configurable threshold or the maximum number of attempts is reached
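A minimal sketch of how the loop could be wired as a LangGraph state graph. Node bodies are stubbed, and the state fields, threshold value, and attempt cap are illustrative assumptions rather than the project's exact configuration:

```python
# Sketch of the judge-refine loop as a LangGraph state graph.
# All names, the 0.8 threshold, and the attempt cap are illustrative.
from typing import TypedDict

from langgraph.graph import END, StateGraph

class LoopState(TypedDict):
    question: str
    context: str
    answer: str
    score: float   # judge's groundedness score in [0, 1]
    attempts: int  # refine passes completed so far

def retrieve(state: LoopState) -> dict:
    # Would query the vector store for relevant chunks (see the ChromaDB sketch below).
    return {"context": "...retrieved chunks..."}

def generate(state: LoopState) -> dict:
    # Would call the LLM with question + context to draft an answer.
    return {"answer": "...draft answer..."}

def judge(state: LoopState) -> dict:
    # Would call the Claude judge (see the judge sketch below) and parse its JSON score.
    return {"score": 0.5}

def refine(state: LoopState) -> dict:
    # Would re-prompt the generator with the judge's diagnosis attached.
    return {"answer": "...revised answer...", "attempts": state["attempts"] + 1}

THRESHOLD = 0.8   # configurable in the real pipeline; this value is made up
MAX_ATTEMPTS = 3  # likewise illustrative

def should_refine(state: LoopState) -> str:
    # Stop once the score clears the threshold or attempts are exhausted.
    if state["score"] >= THRESHOLD or state["attempts"] >= MAX_ATTEMPTS:
        return "done"
    return "refine"

graph = StateGraph(LoopState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_node("judge", judge)
graph.add_node("refine", refine)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", "judge")
graph.add_conditional_edges("judge", should_refine, {"refine": "refine", "done": END})
graph.add_edge("refine", "judge")
app = graph.compile()

# Usage: final = app.invoke({"question": "What is RAG?", "attempts": 0})
```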
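The judge call itself could look like the following. The prompt wording, model id, and score schema are assumptions for illustration, not the production prompt:

```python
# Hedged sketch of the Claude-as-judge call; prompt, model id, and schema are assumptions.
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are an evaluation judge. Score the answer against the context on
groundedness, faithfulness, and completeness (each 0.0-1.0), and list any claims the
context does not support. Reply with JSON only, shaped like:
{{"groundedness": 0.0, "faithfulness": 0.0, "completeness": 0.0, "unsupported_claims": []}}

Context:
{context}

Answer:
{answer}"""

def judge_response(context: str, answer: str) -> dict:
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(context=context, answer=answer),
        }],
    )
    # The "JSON only" instruction makes the reply directly parseable.
    return json.loads(message.content[0].text)
```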
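Retrieval against ChromaDB with cosine similarity might be set up like this; the collection name and sample passages are placeholders:

```python
# Minimal ChromaDB retrieval sketch; collection name and documents are placeholders.
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist
collection = client.create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"},  # cosine similarity search
)

collection.add(
    ids=["doc-1", "doc-2"],
    documents=["First source passage...", "Second source passage..."],
)

results = collection.query(query_texts=["user question"], n_results=2)
context = "\n\n".join(results["documents"][0])  # top matches for the first query
```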
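And the FastAPI surface could be as small as one POST route; the path and request/response fields here are assumptions:

```python
# Sketch of the REST surface; route path and field names are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

api = FastAPI(title="RAG evaluation service")

class EvalRequest(BaseModel):
    question: str

class EvalResponse(BaseModel):
    answer: str
    score: float
    attempts: int

@api.post("/evaluate", response_model=EvalResponse)
def evaluate(req: EvalRequest) -> EvalResponse:
    # Would invoke the compiled LangGraph app from the loop sketch:
    # state = app.invoke({"question": req.question, "attempts": 0})
    state = {"answer": "...", "score": 0.9, "attempts": 1}  # stubbed result
    return EvalResponse(**state)
```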
📊 Results & Key Insights
- The judge-refine loop consistently improved groundedness scores across test sets of educational content
- Structured JSON output from the Claude API enabled fine-grained diagnosis, identifying which claim was hallucinated rather than returning a binary pass/fail (an illustrative verdict follows this list)
- Related to research published at ISMIS 2026: "Diagnosing Hallucinations in RAG-Based Educational Recommendation Systems"
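For illustration, a parsed judge verdict might look like the dict below; the values and the flagged claim are invented, and the field names follow the judge sketch above rather than a confirmed production schema:

```python
# Invented example of a parsed verdict, showing why per-claim diagnosis
# is more actionable than a binary pass/fail.
verdict = {
    "groundedness": 0.62,
    "faithfulness": 0.71,
    "completeness": 0.85,
    "unsupported_claims": ["The course includes a certification exam."],
}
```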
🔗 Related Work
- Directly related to the ISMIS 2026 paper on hallucination diagnosis in educational RAG systems
- Built on top of the BERT4Rec + RAG Recommender pipeline as the evaluation layer
- Techniques applied in production at Inokufu to validate RAG-generated educational recommendations