
The Prompt Layer

Chapter 6 of 8
Engineer · ~13 min

Retrieved chunks are just raw ingredients. The prompt is the recipe that tells the AI what to cook.


After this chapter, you will be able to: craft a RAG prompt template that turns retrieved chunks into grounded, cited answers — and understand why prompt structure matters as much as retrieval quality.


The Problem with “Just Give It the Chunks”


You have spent the last five chapters building a pipeline that finds the right pieces of information. You have chunks. You have embeddings. You have retrieval scores. Now what?

The tempting answer is: paste the chunks into the AI and ask your question. Done, right?

Not even close. Without clear instructions, an LLM will do unpredictable things with your retrieved context. Sometimes it ignores the chunks entirely and answers from its training data. Sometimes it blends chunk content with made-up facts. Sometimes it uses the chunks but presents the information as if it knew it all along — no citations, no source attribution, no way for the user to verify anything.

The prompt is where you take control. It is the difference between a demo that sometimes works and a system you would actually trust.


Every RAG prompt has three sections, assembled in a specific order before being sent to the model.


Section 1 — The System Prompt sets the rules. It tells the AI who it is, how it should behave, and what it must never do. This is where you say things like: “Only answer using the provided context. If the context does not contain the answer, say so.” This section stays the same for every query.

Section 2 — The Retrieved Context contains the chunks your retrieval pipeline found. These are the specific pieces of information the AI should use to answer. You format them clearly — usually numbered, with source metadata attached — so the model can reference them.

Section 3 — The User Query is the actual question being asked. It goes last, right before the model generates its response.

The order matters. The system prompt establishes the rules before the model sees any content. The context provides the evidence. The query tells the model what to do with that evidence. [src: anthropic_prompt_docs]

PLAIN ENGLISH
A RAG prompt is three parts in order: instructions (system prompt), evidence (retrieved chunks), and the question (user query). The AI reads the rules before it sees the content.
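The three-section assembly can be sketched in a few lines of Python. A minimal sketch — the function name, chunk format, and system prompt wording here are illustrative, not from any particular library:

```python
# Illustrative sketch of RAG prompt assembly (names are hypothetical).
SYSTEM_PROMPT = (
    "You are a helpful assistant. Only answer using the provided context. "
    "If the context does not contain the answer, say so."
)

def assemble_prompt(chunks: list[dict], user_query: str) -> str:
    """Join the three sections in order: rules, evidence, question."""
    context = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return f"{SYSTEM_PROMPT}\n\nCONTEXT:\n{context}\n\nUSER: {user_query}"

prompt = assemble_prompt(
    [{"source": "notes.md", "text": "The deadline is March 15th."}],
    "When is the deadline?",
)
```

Note that the order is fixed in the f-string itself: the rules always precede the evidence, and the question always comes last.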

Here is something most tutorials skip: LLMs do not pay equal attention to every part of a long prompt. Research by Liu et al. (2023) found that models perform best when the most relevant information appears at the beginning or end of the context window. Information placed in the middle gets less attention — the model is more likely to miss it or give it less weight. [src: liu2023lost]

Think of it like reading a long email. You remember the opening line and the last thing you read. The stuff in the middle? It blurs together.

This has a direct practical consequence for how you order your retrieved chunks:

  • Put your highest-scoring chunks first. The model pays the most attention to the beginning of the context.
  • Put the second-best chunks last. The model also pays strong attention to the end.
  • Put lower-relevance chunks in the middle. If they get partially ignored, you lose the least.

This is not a theory. It is a measured effect that impacts answer quality in production systems. If you have five retrieved chunks ranked by relevance, order them: 1, 3, 5, 4, 2 — best at the start, second-best at the end. [src: liu2023lost]
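That 1, 3, 5, 4, 2 ordering generalises to any number of chunks. A minimal sketch, assuming the chunks arrive already sorted best-first:

```python
def reorder_for_attention(ranked_chunks: list) -> list:
    """Counter the lost-in-the-middle effect: odd ranks go to the front,
    even ranks to the back (reversed), so the best chunk is first,
    the second-best is last, and the weakest sit in the middle."""
    front = ranked_chunks[0::2]   # ranks 1, 3, 5, ...
    back = ranked_chunks[1::2]    # ranks 2, 4, 6, ...
    return front + back[::-1]

print(reorder_for_attention([1, 2, 3, 4, 5]))  # → [1, 3, 5, 4, 2]
```

With five chunks this reproduces exactly the 1, 3, 5, 4, 2 order described above.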


Every LLM has a context window — a maximum number of tokens it can handle in a single request. Everything you send — system prompt, retrieved chunks, user query — plus everything the model generates in response, must fit inside that window.

Here is the math for a model with an 8,000-token context window:

| Component                    | Typical token count  |
| ---------------------------- | -------------------- |
| System prompt                | ~200–500 tokens      |
| Each retrieved chunk         | ~150–300 tokens      |
| User query                   | ~20–50 tokens        |
| Reserved for model response  | ~500–1,000 tokens    |
| Available for chunks         | ~6,000–7,000 tokens  |

If each chunk averages 200 tokens, you could technically fit 30–35 chunks. But you should not. More chunks means more noise. The model has to sort through more content to find what matters, and the lost-in-the-middle effect gets worse.

The practical rule: aim for 3 to 5 high-quality chunks. That is around 600–1,000 tokens of context — plenty for the model to give a grounded answer, little enough to avoid dilution. If your retrieval is good, 3 chunks is often better than 10.
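The budget arithmetic can be made explicit in code. This is a sketch under one loud assumption: `len(text) // 4` is a rough "four characters per token" heuristic for English, not a real count — production systems should use the model's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return len(text) // 4

def fit_chunks(chunks: list[str], window: int = 8000,
               system_tokens: int = 500, query_tokens: int = 50,
               response_reserve: int = 1000, max_chunks: int = 5) -> list[str]:
    """Keep top-ranked chunks that fit the remaining budget, capped at max_chunks."""
    budget = window - system_tokens - query_tokens - response_reserve
    kept = []
    for chunk in chunks[:max_chunks]:
        cost = estimate_tokens(chunk)
        if cost > budget:
            break
        kept.append(chunk)
        budget -= cost
    return kept
```

The `max_chunks=5` cap encodes the "3 to 5 high-quality chunks" rule: even when the window could technically hold 30 chunks, the function stops early.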

WATCH OUT
More chunks is not better. Stuffing 20 chunks into the context dilutes relevance, worsens the lost-in-the-middle effect, and eats up tokens you need for the model’s response.

Citation Formatting — Making the AI Show Its Sources


One of the most important things your prompt can do is force the model to cite where its answer came from. Without citation instructions, the AI presents retrieved information as if it knew it all along. The user has no way to verify anything.

The fix is simple: tell the model to cite, and give it something to cite. Number your chunks in the context section, then instruct the model to reference those numbers.

Here is what that looks like in practice:

CONTEXT:
[1] (meeting-notes-jan.md) The project deadline was moved to March 15th.
[2] (budget-2024.xlsx) Q1 budget was approved at $45,000.
[3] (meeting-notes-feb.md) The team agreed to hire two contractors.
INSTRUCTIONS: Answer the user's question using ONLY the context above.
Cite your sources using [1], [2], etc.

With this structure, the model generates responses like: “The project deadline is March 15th [1], and the team plans to hire two contractors to meet it [3].” The user can verify each claim by checking the referenced source.
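You can also verify on the way out that the model actually cited something. A small post-check, assuming the bracketed-number convention shown above (the function name is hypothetical):

```python
import re

def extract_citations(answer: str, num_chunks: int) -> set[int]:
    """Collect [n] citations, dropping any that point at a chunk that doesn't exist."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return {n for n in cited if 1 <= n <= num_chunks}

answer = "The deadline is March 15th [1], and two contractors will be hired [3]."
print(sorted(extract_citations(answer, num_chunks=3)))  # → [1, 3]
```

An empty result is a useful signal: either the model ignored the citation instruction, or it hallucinated chunk numbers outside the range you provided.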


You do not need to write a new prompt from scratch for every application. Most RAG use cases fall into one of three patterns.

Template 1 — Question & Answer Over Documents


This is the most common pattern. The user asks a question, and the system answers strictly from the provided documents.

SYSTEM: You are a helpful assistant that answers questions based
on the provided documents. Use ONLY the information in the CONTEXT
section below. If the answer is not in the context, say "I don't
have enough information to answer that." Always cite your sources
using the document labels provided.
CONTEXT:
{retrieved_chunks}
USER: {user_question}

Template 2 — Summarisation Across Documents

The user wants a summary of information across multiple documents, with clear attribution.

SYSTEM: You are a research assistant. Summarise the key points
from the documents below. Organise your summary by topic. Cite
each point with the source document label. Do not add information
that is not in the provided documents.
CONTEXT:
{retrieved_chunks}
USER: Summarise what these documents say about {topic}.

Template 3 — Conversational Chatbot with Memory


For multi-turn conversations, you include the conversation history as an additional section.

SYSTEM: You are a helpful chatbot that answers questions about
the user's documents. Use the CONTEXT to answer. If you used
information from a previous turn and it is still relevant, you
may reference it. Always cite new information from CONTEXT.
CONVERSATION HISTORY:
{previous_turns}
CONTEXT:
{retrieved_chunks}
USER: {user_question}

The key difference in the conversational template is the conversation history section. This lets the model understand follow-up questions like “What about the second point?” without losing track of what was discussed before.
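With chat-style APIs, the same sections map onto a list of role-tagged messages, and the history slots in between. A minimal sketch (the `system`/`user`/`assistant` role names follow the common chat-API convention; the helper itself is hypothetical):

```python
def build_messages(system: str, history: list[dict],
                   chunks: list[str], question: str) -> list[dict]:
    """System rules first, prior turns next, then context + question last."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    final_turn = f"CONTEXT:\n{context}\n\nQUESTION: {question}"
    return [{"role": "system", "content": system},
            *history,
            {"role": "user", "content": final_turn}]

msgs = build_messages(
    "Answer from CONTEXT only. Cite chunk numbers.",
    [{"role": "user", "content": "What databases were compared?"},
     {"role": "assistant", "content": "ChromaDB, Qdrant, and FAISS [1]."}],
    ["FAISS is a library, not a database."],
    "What about the second point?",
)
```

Because the history sits between the rules and the final turn, the model can resolve "the second point" against the earlier answer while still grounding any new claims in the freshly retrieved context.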


Try It: The Prompt Builder

Build your own RAG prompt interactively. Write a system prompt, add retrieved chunks, enter a query, and see how the final assembled prompt looks — complete with token counts per section.

[Chunk 1] ChromaDB stores embeddings using DuckDB as its default backend. It supports both in-memory and persistent storage modes. For persistent storage, data is saved to a local directory.
[Chunk 2] Qdrant is a vector database built in Rust that supports both dense and sparse vectors. It offers a free cloud tier with 1GB of storage and built-in hybrid search capabilities.
[Chunk 3] FAISS (Facebook AI Similarity Search) is a library for efficient similarity search. Unlike ChromaDB and Qdrant, it's not a database — it's a library that requires you to handle persistence yourself.

Tip: LLMs pay more attention to chunks at the beginning and end. Try putting your most important chunk first or last.

The fully assembled prompt (~242 tokens in total):
SYSTEM: You are a helpful assistant that answers questions based on the provided context.
Only use information from the context below. If the context doesn't contain the answer, say "I don't have enough information to answer that."
Always cite which chunk your answer is based on.

---
CONTEXT:
[Chunk 1]: ChromaDB stores embeddings using DuckDB as its default backend. It supports both in-memory and persistent storage modes. For persistent storage, data is saved to a local directory.

[Chunk 2]: Qdrant is a vector database built in Rust that supports both dense and sparse vectors. It offers a free cloud tier with 1GB of storage and built-in hybrid search capabilities.

[Chunk 3]: FAISS (Facebook AI Similarity Search) is a library for efficient similarity search. Unlike ChromaDB and Qdrant, it's not a database — it's a library that requires you to handle persistence yourself.
---

USER QUESTION: Which vector database should I use for a quick prototype?

Experiment with reordering chunks. Move your best chunk to the middle — then move it back to the top. Watch how the structure changes. In a real system, this reordering can measurably affect answer quality.


Open the Playground and craft the prompt template your chatbot will use. Start with Template 1 (Q&A over documents). Set the system prompt to instruct the model to answer only from your notes and always cite the source chunk. Run a test query and check: does the answer come from your chunks? Does it cite them? If not, adjust your system prompt until it does.


Q1

What are the three main sections of a RAG prompt, in the order they are assembled?

Q2

According to the 'lost in the middle' research, where should you place your most relevant retrieved chunks?


You now have the third critical layer of your RAG pipeline. Retrieval finds the right information. The prompt tells the AI exactly how to use it. You know how to structure a prompt with system instructions, retrieved context, and the user query. You know why chunk ordering matters and how to force the model to cite its sources. You have a working prompt template for your chatbot.

But here is the question you cannot yet answer: is your chatbot actually giving good answers? “It seems to work” is not a metric. Next, you will learn how to measure quality — and how to diagnose exactly what is wrong when the answers fall short.




