# Lab 4: Evaluate with RAGAS
You built a RAG pipeline. It returns answers. But here is the uncomfortable question: are the answers actually good?
“It seems to work” is not a metric. If you shipped a web app and your only test was “I clicked around and it seemed fine,” you would get fired. RAG pipelines deserve the same rigour.
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that scores your pipeline on the dimensions that matter. This lab walks you through evaluating a real pipeline, interpreting the scores, and knowing exactly what to fix when scores are low.
## Prerequisites

Before starting this lab, you need:
- Python 3.9+ installed
- pip package manager
- A working RAG pipeline from Lab 1 (or any pipeline that can produce a question, retrieved contexts, and a generated answer)
- An OpenAI API key (RAGAS uses an LLM as a judge — a small number of calls, typically under $0.10 for this lab)
## Step 1: Install RAGAS

Install the RAGAS library and its dependencies:

```bash
pip install ragas datasets
```

RAGAS depends on langchain and openai under the hood. If you completed Lab 1, you already have those installed. If not, add them:

```bash
pip install langchain openai
```

Set your OpenAI API key as an environment variable:

```bash
export OPENAI_API_KEY="sk-your-key-here"
```

Verify the installation:

```python
import ragas
print(ragas.__version__)
```

You should see a version number printed. You are ready to go.
## Step 2: Create a Test Dataset

RAGAS evaluates your pipeline on a set of test cases. Each test case is a dictionary with four fields:
| Field | What It Is | Example |
|---|---|---|
| question | The user's query | "What is ChromaDB?" |
| answer | The answer your pipeline generated | "ChromaDB is a vector database…" |
| contexts | The chunks your retriever returned (list of strings) | ["ChromaDB stores embeddings…"] |
| ground_truth | The correct answer (written by a human) | "ChromaDB is an open-source embedding database…" |
Here is a test dataset with 6 examples. In a real project, you would create 20 to 50 of these. For this lab, 6 is enough to see how scoring works:
```python
from datasets import Dataset

test_data = {
    "question": [
        "What is ChromaDB and how does it store data?",
        "How does cosine similarity work?",
        "What is the difference between RAG and fine-tuning?",
        "What embedding models are available for free?",
        "How does HNSW indexing work?",
        "What is hybrid search?",
    ],
    "answer": [
        "ChromaDB is a vector database that stores embeddings using DuckDB as its backend. It supports both in-memory and persistent storage modes, making it easy to prototype locally.",
        "Cosine similarity measures the angle between two vectors. If the angle is small, the vectors point in roughly the same direction, meaning the sentences are semantically similar. A score of 1.0 means identical direction.",
        "RAG retrieves relevant documents at query time and feeds them to the LLM as context. Fine-tuning modifies the model's weights by training on your data. RAG is cheaper and easier to update. Fine-tuning gives you more control over the model's style and behaviour.",
        "There are several free embedding models you can use. Hugging Face hosts models like BAAI/bge-small-en and sentence-transformers/all-MiniLM-L6-v2. These run locally and produce high-quality vectors for most use cases.",
        "HNSW stands for Hierarchical Navigable Small World. It builds a multi-layer graph where the top layers have long-range connections for fast navigation and the bottom layers have short-range connections for precision. It runs in approximately O(log n) time.",
        "Hybrid search combines semantic search with keyword search. The semantic component finds chunks that are similar in meaning using vector similarity. The keyword component finds exact term matches using algorithms like BM25. The scores are combined with configurable weights.",
    ],
    "contexts": [
        ["ChromaDB stores embeddings using DuckDB as its default backend. It supports both in-memory and persistent storage modes. For persistent storage, data is saved to a local directory."],
        ["Cosine similarity measures the angle between two vectors. A score of 1.0 means identical direction, meaning the sentences have the same meaning. A score of 0 means the vectors are orthogonal — completely unrelated."],
        ["Retrieval-Augmented Generation combines search with generation. Instead of memorising facts, the model looks them up at query time.", "Fine-tuning allows you to customise a model's behaviour by training it on your own data. This modifies the model weights permanently."],
        ["Hugging Face hosts thousands of free embedding models. Popular choices include BAAI/bge-small-en (384 dimensions) and sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)."],
        ["HNSW (Hierarchical Navigable Small World) is a graph-based index structure. It creates multiple layers: upper layers for fast coarse navigation and lower layers for precise nearest-neighbour search. Lookup time is approximately O(log n)."],
        ["Hybrid search combines dense vector search (semantic) with sparse keyword search (BM25). This catches both meaning-based and exact-term matches. Weights are typically 60% semantic, 40% keyword."],
    ],
    "ground_truth": [
        "ChromaDB is an open-source embedding database that uses DuckDB as its default storage backend. It supports in-memory and persistent storage modes.",
        "Cosine similarity measures the cosine of the angle between two vectors. Values range from -1 to 1, where 1 means identical direction (same meaning) and 0 means no relation.",
        "RAG retrieves external documents at inference time and provides them as context. Fine-tuning trains the model on custom data, modifying its weights. RAG is easier to update and cheaper, while fine-tuning offers more style control.",
        "Free embedding models include BAAI/bge-small-en and sentence-transformers/all-MiniLM-L6-v2, both available on Hugging Face and runnable locally.",
        "HNSW is a multi-layer graph index. Upper layers provide fast approximate search via long-range connections. Lower layers provide precise results. Runs in O(log n) time.",
        "Hybrid search combines semantic vector search with keyword-based BM25 search to capture both meaning and exact term matches.",
    ],
}

dataset = Dataset.from_dict(test_data)
print(f"Test dataset created with {len(dataset)} examples")
```

## Step 3: Run RAGAS Evaluation

Now run the evaluation. RAGAS will score each test case on four metrics:
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Run evaluation
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)

# Print overall scores
print(result)
```

This takes 1 to 3 minutes depending on your dataset size. RAGAS makes LLM calls to judge each metric — this is why you need an API key.

The output looks like this:

```python
{
    'faithfulness': 0.8833,
    'answer_relevancy': 0.9127,
    'context_precision': 0.8500,
    'context_recall': 0.7917
}
```

To see per-question scores (which is where the real insights are):

```python
df = result.to_pandas()
print(df.to_string())
```

This gives you a row per test case with individual scores. This is the most useful view. Overall averages hide problems. A pipeline with 0.85 average faithfulness might have three questions at 1.0 and one at 0.4 — that one at 0.4 is where your pipeline is hallucinating.
## Step 4: Interpret the Scores

Here is what each metric measures and what the scores mean:
### Faithfulness (Is the answer grounded in the context?)

| Score Range | Meaning |
|---|---|
| 0.9 – 1.0 | Excellent. Every claim in the answer comes from the retrieved chunks. |
| 0.7 – 0.9 | Good. Most claims are grounded, but some may be inferred or added by the LLM. |
| Below 0.7 | Problem. The LLM is adding information not present in the context — hallucinating. |
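Conceptually, faithfulness is the fraction of claims in the answer that the retrieved context supports. RAGAS uses an LLM judge to extract and verify the claims; the arithmetic itself is just a ratio. A toy illustration where the supported/unsupported labels are assigned by hand:

```python
# Toy illustration only: RAGAS extracts claims from the answer and checks
# each against the context with an LLM judge. Here we label them by hand.
claims = [
    ("ChromaDB uses DuckDB as its default backend", True),   # stated in context
    ("ChromaDB was first released in 2023", False),          # not in the context
]
supported = sum(1 for _, is_supported in claims if is_supported)
faithfulness_score = supported / len(claims)
print(faithfulness_score)  # 0.5
```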
### Answer Relevancy (Does the answer address the question?)

| Score Range | Meaning |
|---|---|
| 0.9 – 1.0 | The answer directly and completely addresses what was asked. |
| 0.7 – 0.9 | The answer is mostly on topic but may include extra or tangential information. |
| Below 0.7 | The answer misses the point of the question or goes off topic. |
### Context Precision (Are the retrieved chunks relevant?)

| Score Range | Meaning |
|---|---|
| 0.9 – 1.0 | Every retrieved chunk is relevant to the question. Your retriever is sharp. |
| 0.7 – 0.9 | Most chunks are relevant, but some noise is present. |
| Below 0.7 | Your retriever is pulling in irrelevant chunks. This hurts the LLM’s ability to answer. |
### Context Recall (Did you retrieve everything you needed?)

| Score Range | Meaning |
|---|---|
| 0.9 – 1.0 | The retrieved context covers all the information needed to answer correctly. |
| 0.7 – 0.9 | Most of the needed information was retrieved, but some was missed. |
| Below 0.7 | Important information is missing from the retrieved context. The answer will be incomplete. |
## Step 5: Diagnose and Improve

This is where RAGAS becomes truly useful. Each low score points to a specific component in your pipeline that needs fixing.
### If Faithfulness Is Low

The problem: Your LLM is making things up instead of sticking to the provided context.
What to fix:
- Strengthen your system prompt: add explicit instructions like “Only use information from the provided context. If the context does not contain the answer, say so.”
- Reduce temperature to 0.0 or 0.1 — less creative, more faithful
- Check if your chunks are too short. If a chunk is only a sentence fragment, the LLM may fill in gaps with invented information.
```python
# Example: tighter system prompt
system_prompt = """Answer the question using ONLY the context provided below.
If the context does not contain enough information, respond with:
"I don't have enough information to answer this question."
Do NOT add any information beyond what is in the context."""
```

### If Answer Relevancy Is Low

The problem: The answer is going off-topic or not addressing what was asked.
What to fix:
- Check if the retrieved chunks are on-topic (look at context precision)
- If chunks are relevant but the answer is not, your prompt may need work — explicitly instruct the LLM to answer the specific question asked
- Try adding the query at both the beginning and end of your prompt (the “lost in the middle” pattern)
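The "lost in the middle" pattern from the last bullet is plain prompt assembly. A minimal sketch (the template wording is a suggestion, not a RAGAS requirement):

```python
def build_prompt(query: str, contexts: list[str]) -> str:
    # State the question before AND after the context so it is never
    # buried in the middle of a long prompt ("lost in the middle").
    context_block = "\n\n".join(contexts)
    return (
        f"Question: {query}\n\n"
        f"Context:\n{context_block}\n\n"
        f"Now answer the question: {query}"
    )

prompt = build_prompt("What is ChromaDB?", ["ChromaDB stores embeddings..."])
print(prompt)
```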
### If Context Precision Is Low

The problem: Your retriever is pulling in irrelevant chunks that dilute the good ones.
What to fix:
- Add a re-ranker (see Lab 3) to filter out noise after initial retrieval
- Reduce top-K from 5 to 3 — fewer but more relevant chunks
- Try hybrid search instead of pure semantic search
- Add metadata filtering if your documents span multiple topics
```python
# Example: reduce top-K and add a relevance threshold
results = vector_store.similarity_search_with_score(query, k=10)

# Filter by minimum similarity score.
# Note: some vector stores return a distance here (lower is better) rather
# than a similarity (higher is better); check your store's convention.
filtered = [(doc, score) for doc, score in results if score > 0.7]

# Take only the top 3
top_results = filtered[:3]
```

### If Context Recall Is Low

The problem: Important information exists in your corpus but your retriever is not finding it.
What to fix:
- Your chunks may be too large (burying key sentences in long paragraphs) or too small (splitting important context across chunks)
- Try increasing chunk overlap so key sentences appear in multiple chunks
- Try a different embedding model — some models handle your domain better than others
- Add more documents to your corpus if the information is genuinely missing
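To see why overlap helps recall, here is a minimal sliding-window chunker (a deliberate simplification of what library text splitters do): with overlap, the last characters of each chunk reappear at the start of the next, so a key sentence near a boundary lands intact in at least one chunk.

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    # Each chunk starts (chunk_size - overlap) characters after the previous
    # one, so the final `overlap` characters of a chunk repeat in the next.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Varied dummy text so the overlap is visible
text = "".join(chr(65 + i % 26) for i in range(250))
chunks = chunk_text(text, chunk_size=100, overlap=20)

print(len(chunks))                          # 4
print(chunks[0][-20:] == chunks[1][:20])    # True: tail of one = head of next
```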
### The Improvement Loop

Run evaluation, fix the weakest metric, run evaluation again. Repeat. This is the loop:

Score → Find lowest metric → Identify the component → Fix it → Re-score

Keep a log of your changes and scores:
```python
# Simple evaluation log
import json
from datetime import datetime

log_entry = {
    "timestamp": datetime.now().isoformat(),
    "changes": "Added re-ranker, reduced top-K from 5 to 3",
    "scores": {
        "faithfulness": 0.92,
        "answer_relevancy": 0.91,
        "context_precision": 0.88,
        "context_recall": 0.85,
    },
}

with open("eval_log.jsonl", "a") as f:
    f.write(json.dumps(log_entry) + "\n")
```

## What You Built
In this lab you:
- Installed RAGAS and set up an evaluation environment
- Created a test dataset with questions, answers, contexts, and ground truths
- Ran a full evaluation across four metrics
- Learned to interpret scores and understand what each one tells you about your pipeline
- Built a diagnostic framework — when a score is low, you now know exactly which component to fix
This is the difference between a demo and a product. Demos work when you try them. Products work when your users try them. Evaluation is how you bridge that gap.
## Sources
- Shahul Es et al. (2023) — “RAGAS: Automated Evaluation of Retrieval Augmented Generation” (paper)
- RAGAS Documentation
- RAGAS GitHub Repository