
Data Ingestion & Chunking

Chapter 2 of 8 • Explorer • ~15 min

You can’t feed a library into an AI all at once. Chunking is how you cut it into pieces the AI can actually use.

After this chapter, you’ll be able to: Split any document into well-sized chunks and understand the tradeoffs of each strategy.


Why Can’t You Just Paste the Whole Document?


Every LLM has a context window — the maximum amount of text it can process at once. Think of it as the AI’s desk: it can only look at so many pages at the same time.

Even models with large context windows (100k+ tokens) have problems with very long inputs:

  • Cost — you pay per token. Sending an entire book for every question is expensive.
  • Noise — the more irrelevant text you include, the worse the answer gets. The AI gets distracted by content that isn’t related to the question.
  • The “lost in the middle” problem — research shows that LLMs pay less attention to information in the middle of long contexts. [src: liu2023lost]

The solution: split your documents into smaller, self-contained pieces called chunks. Then, at query time, only send the relevant chunks — not the entire document.

PLAIN ENGLISH
Chunking is slicing a big document into small, self-contained pieces so your AI only reads what is relevant to the question.

A good chunk has two properties:

  1. Self-contained — it makes sense on its own, without needing the surrounding text
  2. Focused — it’s about one idea, so it can be accurately matched to a relevant question

A bad chunk is either too small (a sentence fragment with no context) or too large (three different topics crammed together).


Fixed-Size Chunking

The simplest approach: cut every N characters, regardless of content.

How it works: Set a chunk size (e.g., 500 characters). Split the text at every 500-character mark.

Pros:

  • ✅ Fastest setup
  • ✅ Predictable chunk count
  • ✅ Easy to debug

Cons:

  • ⚠️ Can cut sentences in half
  • ⚠️ Splits ideas mid-thought
  • ⚠️ Ignores semantic boundaries
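The tradeoffs above are easy to see in code. Here is a minimal sketch of a fixed-size splitter (the function name and the 500-character default are illustrative, not from a specific library):

```python
def fixed_size_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Split text every chunk_size characters, ignoring content.

    Fast and predictable, but will happily cut a sentence (or a word)
    in half at every boundary.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = fixed_size_chunks("a" * 1200, chunk_size=500)
# 3 chunks of 500, 500, and 200 characters
```

The last chunk is whatever remains, so chunk sizes are uniform except for the tail.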

Sentence-Based Chunking

Split at sentence boundaries instead of arbitrary positions.

How it works: Group sentences together until you hit the size limit. Start a new chunk at the next sentence.

Pros:

  • ✅ Preserves sentence boundaries
  • ✅ More natural chunk flow

Cons:

  • ⚠️ Chunk sizes are less uniform
  • ⚠️ Long sentences can create oversized chunks
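The grouping loop described above can be sketched like this. Note the naive regex sentence split is an assumption for illustration; a production system would use a proper sentence tokenizer:

```python
import re


def sentence_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Group whole sentences into chunks until the next sentence
    would push the chunk past max_chars, then start a new chunk."""
    # Naive split on ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` still becomes its own oversized chunk, which is exactly the con listed above.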

Sliding Window with Overlap

The key insight: when you cut between two chunks, the idea at the boundary gets split. Overlap fixes this by sharing content between adjacent chunks.

How it works: Same as fixed-size, but each chunk starts N characters before the previous one ended. Those N characters appear in both chunks.

Pros:

  • ✅ Reduces boundary information loss
  • ✅ Strong default for production systems [src: langchain_docs]

Cons:

  • ⚠️ Increases chunk count and embedding cost
  • ⚠️ Adds some redundancy
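A minimal sketch of the sliding-window variant, assuming character-based sizes (names are illustrative). Each chunk starts `overlap` characters before the previous one ended:

```python
def sliding_window_chunks(
    text: str, chunk_size: int = 500, overlap: int = 50
) -> list[str]:
    """Fixed-size chunks where each chunk re-includes the last
    `overlap` characters of the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, overlap=2)
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij'] — each chunk repeats 2 chars
```

The repeated characters are the redundancy cost noted above: the same text is embedded more than once.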

Semantic Chunking

The smartest (and hardest) approach: split where the topic changes.

How it works: Use embeddings to measure how similar adjacent sentences are. When similarity drops sharply, that’s a topic boundary — split there.

Pros:

  • ✅ Aligns chunks to real topic boundaries
  • ✅ Highest retrieval quality in many datasets

Cons:

  • ⚠️ Slower ingestion
  • ⚠️ More implementation complexity
  • ⚠️ Requires embedding-powered splitting
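The similarity-drop idea can be sketched as follows. Here `embed` is a stand-in for any sentence-embedding model (it is not defined here), and the 0.6 threshold is an arbitrary assumption you would tune on real data:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_chunks(sentences, embed, threshold=0.6):
    """Start a new chunk wherever similarity between adjacent
    sentences drops below `threshold` (a topic boundary)."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append(" ".join(current))   # topic changed: close chunk
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

The per-sentence embedding calls are why ingestion is slower than the other strategies.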
[Diagram: Four ways to split the same document. Each creates different chunks.]

  • Too small (< 100 chars): Chunks lack context. “The answer is 42” means nothing without the question.
  • Too large (> 2000 chars): Chunks contain multiple topics. Searching for “refund policy” returns a chunk that’s 80% about shipping.
  • Just right (200–800 chars): Chunks are self-contained and focused. In token terms, a common rule of thumb is 200–500 tokens. [src: langchain_docs]

The overlap should typically be 10–20% of the chunk size. So if your chunks are 500 characters, use 50–100 characters of overlap.

WATCH OUT
Choosing a chunk size that is too small is the most common beginner mistake. If your chunks are under 100 characters, retrieval quality drops sharply because each chunk lacks enough context.

Paste any text below and experiment with different strategies and sizes. Watch how the chunks change.

Try It: Live Text Splitter

7 chunks | ~322 tokens total

Retrieval-Augmented Generation (RAG) is a technique that combines the power of large language models with external knowledge retrieval. Instead of relying solely on what the model learned during training, RAG systems first search through a collection of documents to find relevant information, then use that information to generate more accurate and up-to-date responses. The key insight behind RAG is simple: language models are great at understanding and generating text, but they have a fixed knowledge cutoff and can hallucinate facts. By giving the model access to external documents at query time, we get the best of both worlds — the model's language understanding plus real, verifiable information. RAG has become the most popular approach for building AI applications that need access to specific knowledge bases, such as customer support bots, internal documentation search, and question-answering systems over private data. It's simpler and cheaper than fine-tuning, and the knowledge base can be updated without retraining the model.

Chunk 1 • 200 chars • ~50 tokens
Retrieval-Augmented Generation (RAG) is a technique that combines the power of large language models with external knowledge retrieval. Instead of relying solely on what the model learned during train

Chunk 2 • 200 chars • ~50 tokens
y on what the model learned during training, RAG systems first search through a collection of documents to find relevant information, then use that information to generate more accurate and up-to-date

Chunk 3 • 200 chars • ~50 tokens
to generate more accurate and up-to-date responses. The key insight behind RAG is simple: language models are great at understanding and generating text, but they have a fixed knowledge cutoff and ca

Chunk 4 • 200 chars • ~50 tokens
hey have a fixed knowledge cutoff and can hallucinate facts. By giving the model access to external documents at query time, we get the best of both worlds — the model's language understanding plus re

Chunk 5 • 200 chars • ~50 tokens
e model's language understanding plus real, verifiable information. RAG has become the most popular approach for building AI applications that need access to specific knowledge bases, such as custome

Chunk 6 • 200 chars • ~50 tokens
pecific knowledge bases, such as customer support bots, internal documentation search, and question-answering systems over private data. It's simpler and cheaper than fine-tuning, and the knowledge ba

Chunk 7 • 87 chars • ~22 tokens
r than fine-tuning, and the knowledge base can be updated without retraining the model.

Here’s a practical decision framework based on what most production teams actually use:

Situation | Recommended Strategy
Getting started, need something working today | Fixed-size with overlap
Your docs are well-structured prose (articles, reports) | Sentence-based
Your docs have mixed content (tables, lists, narrative) | Sliding window with overlap
You have time to invest and need maximum retrieval quality | Semantic chunking
Code files or structured data | Fixed-size by logical unit (function, class)

The practical default for 90% of production systems: sliding window chunking with 512 tokens and 10–20% overlap. It’s fast, predictable, and good enough unless you have specific reasons to do otherwise. [src: langchain_docs]


Here are the numbers that actually work in production, not theoretical optima:

Use Case | Chunk Size | Overlap | Reasoning
Dense technical docs (API reference) | 256–400 tokens | 10% | Each section is short and specific
Long-form articles and reports | 500–800 tokens | 15% | Need context around each topic
Conversational transcripts | 300–500 tokens | 20% | Speaker turns need surrounding context
Legal or compliance documents | 400–600 tokens | 20% | Claims span multiple sentences
Code files | By function/class | 0% | Logical units, not arbitrary splits

Note: “tokens” ≠ “characters”. On average, 1 token ≈ 4 characters in English. A 512-token chunk is roughly 2,000 characters or 300–350 words.
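That conversion can be wrapped in a tiny helper. This is only the ~4-chars-per-token heuristic for English, not a real tokenizer; for exact counts, use your model's own tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate via the ~4 characters/token rule of thumb
    for English prose. Real counts vary by model and language."""
    return max(1, len(text) // 4)


estimate_tokens("a" * 2000)  # ≈ 500 tokens, i.e. roughly a 512-token chunk
```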

TIP
When in doubt, start with 512 tokens and 50-token overlap. Run your actual queries against it. If you get irrelevant results, try smaller chunks. If chunks lack context, try larger ones. Tune on real data, not theory.

When you create chunks, you should attach metadata — extra information like:

  • Source filename — which document this chunk came from
  • Page number — where in the document
  • Section title — which heading it falls under
  • Creation date — when the source was written

This metadata is critical later. When your RAG system retrieves a chunk, metadata lets you cite the source: “According to enterprise_terms_v4.md, Section 7.2…” Without metadata, you have a chunk with no provenance — and no way to tell the user where the information came from.


In this chapter, you learned how to take raw text and turn it into searchable pieces. In the Playground, you can now:

  1. Upload or paste a document
  2. Choose a chunking strategy
  3. Adjust size and overlap
  4. See your chunks ready for the next step

Next up: those chunks are just text. To search them by meaning (not just keywords), we need to convert them into numbers. That’s what embeddings do.


Q1

Why is overlap used in sliding window chunking?

Q2

What happens when chunks are too small?




Sources: