๐ŸŽฏ Context & Problem

RAG (Retrieval-Augmented Generation) systems are increasingly used in production to ground LLM responses in real data. However, even with retrieval, LLMs can still hallucinate โ€” generating plausible-sounding but factually incorrect or unsupported content.

The challenge is twofold: detecting hallucinations reliably (without human review at scale) and correcting them automatically in a feedback loop. Existing evaluation tools either require labeled datasets or don't integrate easily into existing RAG pipelines.

This project builds a complete, modular evaluation framework that can be plugged into any RAG pipeline to measure groundedness, detect hallucinations, and trigger corrective generation.

๐Ÿ—๏ธ Technical Architecture

The system follows a judge-refine loop orchestrated by LangGraph, with ChromaDB as the vector store and Claude API as the evaluation judge.

User Query → Retriever (ChromaDB) → Context Chunks
      ↓
Generator (LLM) → RAG Response
      ↓
Claude API (Judge) → Hallucination Score → Pass / Refine
      ↓ (if Refine)
Corrective Generator → Grounded Response
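The retrieval step ranks stored chunks by cosine similarity to the query embedding. A minimal pure-Python sketch of that scoring follows; the chunks and 3-dimensional "embeddings" are made up for illustration, and in the actual system ChromaDB performs this search over real embeddings:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, k=2):
    # Rank (chunk, embedding) pairs by similarity; return the top-k chunks.
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Toy corpus purely for illustration.
store = [
    ("Photosynthesis converts light to energy.", [0.9, 0.1, 0.0]),
    ("The mitochondria is the powerhouse.", [0.1, 0.9, 0.0]),
    ("Paris is the capital of France.", [0.0, 0.1, 0.9]),
]
top = retrieve([0.8, 0.2, 0.1], store, k=1)
```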
  • LangGraph orchestrates the evaluation-correction loop as a stateful graph โ€” each node is a distinct step (retrieve, generate, judge, refine)
  • Claude API as Judge evaluates each response on 3 dimensions: groundedness, faithfulness, and completeness โ€” returning a structured JSON score
  • ChromaDB stores document embeddings and serves as the retrieval backend with cosine similarity search
  • FastAPI exposes the evaluation pipeline as a REST API, making it pluggable into any existing system
  • The loop runs up to N iterations โ€” stopping when the score exceeds a configurable threshold or max attempts are reached
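The loop's stopping logic (pass the threshold or exhaust the attempt budget) can be sketched in plain Python. The stub generator and judge below stand in for the RAG generator and the Claude API, and the 0.8 threshold, field names, and attempt count are illustrative, not the production configuration:

```python
def judge_refine_loop(query, context, generate, judge, threshold=0.8, max_attempts=3):
    """Regenerate until the judge's groundedness score passes or attempts run out."""
    response = generate(query, context, feedback=None)
    for attempt in range(1, max_attempts + 1):
        verdict = judge(response, context)  # e.g. {"groundedness": 0.9, "issues": [...]}
        if verdict["groundedness"] >= threshold:
            return response, verdict, attempt
        # Feed the judge's diagnosis back into a corrective generation pass.
        response = generate(query, context, feedback=verdict["issues"])
    return response, verdict, max_attempts

# Stubs purely for illustration: the corrective pass produces a grounded answer.
def fake_generate(query, context, feedback=None):
    return "grounded answer" if feedback else "hallucinated answer"

def fake_judge(response, context):
    ok = response == "grounded answer"
    return {"groundedness": 0.95 if ok else 0.4,
            "issues": [] if ok else ["claim not supported by context"]}

final, verdict, attempts = judge_refine_loop("q", "ctx", fake_generate, fake_judge)
```

In the real system each of these steps is a LangGraph node, with the pass/refine decision expressed as a conditional edge rather than an in-function loop.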

๐Ÿ“Š Results & Key Insights

  • 3 evaluation dimensions (groundedness, faithfulness, completeness)
  • ~2 avg. correction loops to reach a passing score
  • REST API-first design, pluggable into any RAG pipeline
  • The judge-refine loop consistently improved groundedness scores across test sets of educational content
  • Structured JSON output from Claude API enabled fine-grained diagnosis โ€” identifying which claim was hallucinated, not just a binary pass/fail
  • Related to research published at ISMIS 2026: "Diagnosing Hallucinations in RAG-Based Educational Recommendation Systems"
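The per-claim diagnosis is possible because the judge returns a claim-level verdict rather than a single scalar. A sketch of consuming such output; this JSON shape and its field names are hypothetical, and the real judge schema may differ:

```python
import json

# Hypothetical structured judge verdict; field names are illustrative.
raw = json.dumps({
    "claims": [
        {"text": "The course covers linear algebra.", "supported": True},
        {"text": "It was published in 2019.", "supported": False,
         "reason": "publication year not present in retrieved context"},
    ],
    "scores": {"groundedness": 0.5, "faithfulness": 0.5, "completeness": 1.0},
})

def unsupported_claims(verdict_json):
    # Return the claims the judge flagged as ungrounded, so the corrective
    # generator can target them specifically instead of rewriting everything.
    verdict = json.loads(verdict_json)
    return [c for c in verdict["claims"] if not c["supported"]]

flagged = unsupported_claims(raw)
```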

๐Ÿ”— Related Work

  • Directly related to the ISMIS 2026 paper on hallucination diagnosis in educational RAG systems
  • Built on top of the BERT4Rec + RAG Recommender pipeline as the evaluation layer
  • Techniques applied in production at Inokufu to validate RAG-generated educational recommendations