
Data Ingestion & Chunking

Chapter 2 of 8 • Explorer • ~15 min

You can’t feed a library into an AI all at once. Chunking is how you cut it into pieces the AI can actually use.

After this chapter, you’ll be able to: Split any document into well-sized chunks and understand the tradeoffs of each strategy.


Why Can’t You Just Paste the Whole Document?


Every LLM has a context window — the maximum amount of text it can process at once. Think of it as the AI’s desk: it can only look at so many pages at the same time.

Even models with large context windows (100k+ tokens) have problems with very long inputs:

  • Cost — you pay per token. Sending an entire book for every question is expensive.
  • Noise — the more irrelevant text you include, the worse the answer gets. The AI gets distracted by content that isn’t related to the question.
  • The “lost in the middle” problem — research shows that LLMs pay less attention to information in the middle of long contexts. [src: liu2023lost]

The solution: split your documents into smaller, self-contained pieces called chunks. Then, at query time, only send the relevant chunks — not the entire document.

PLAIN ENGLISH
Chunking is slicing a big document into small, self-contained pieces so your AI only reads what is relevant to the question.

A good chunk has two properties:

  1. Self-contained — it makes sense on its own, without needing the surrounding text
  2. Focused — it’s about one idea, so it can be accurately matched to a relevant question

A bad chunk is either too small (a sentence fragment with no context) or too large (three different topics crammed together).


Fixed-Size Chunking

The simplest approach: cut every N characters, regardless of content.

How it works: Set a chunk size (e.g., 500 characters). Split the text at every 500-character mark.

Pros:

  • ✅ Fastest setup
  • ✅ Predictable chunk count
  • ✅ Easy to debug

Cons:

  • ⚠️ Can cut sentences in half
  • ⚠️ Splits ideas mid-thought
  • ⚠️ Ignores semantic boundaries
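The tradeoffs above are easy to see in code. Here is a minimal sketch of a fixed-size splitter (the function name and the 500-character default are illustrative, not from a specific library):

```python
def fixed_size_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Split text every chunk_size characters, ignoring content.

    Fast and predictable, but will happily cut a sentence (or a word)
    in half at every boundary.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = fixed_size_chunks("a" * 1200, chunk_size=500)
# 3 chunks of 500, 500, and 200 characters
```

The last chunk is whatever remains, so chunk sizes are uniform except for the tail.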

Sentence-Based Chunking

Split at sentence boundaries instead of arbitrary positions.

How it works: Group sentences together until you hit the size limit. Start a new chunk at the next sentence.

Pros:

  • ✅ Preserves sentence boundaries
  • ✅ More natural chunk flow

Cons:

  • ⚠️ Chunk sizes are less uniform
  • ⚠️ Long sentences can create oversized chunks
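The grouping loop described above can be sketched like this. Note the naive regex sentence split is an assumption for illustration; a production system would use a proper sentence tokenizer:

```python
import re


def sentence_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Group whole sentences into chunks until the next sentence
    would push the chunk past max_chars, then start a new chunk."""
    # Naive split on ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` still becomes its own oversized chunk, which is exactly the con listed above.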

Sliding Window with Overlap

The key insight: when you cut between two chunks, the idea at the boundary gets split. Overlap fixes this by sharing content between adjacent chunks.

How it works: Same as fixed-size, but each chunk starts N characters before the previous one ended. Those N characters appear in both chunks.

Pros:

  • ✅ Reduces boundary information loss
  • ✅ Strong default for production systems [src: langchain_docs]

Cons:

  • ⚠️ Increases chunk count and embedding cost
  • ⚠️ Adds some redundancy
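A minimal sketch of the sliding-window variant, assuming character-based sizes (names are illustrative). Each chunk starts `overlap` characters before the previous one ended:

```python
def sliding_window_chunks(
    text: str, chunk_size: int = 500, overlap: int = 50
) -> list[str]:
    """Fixed-size chunks where each chunk re-includes the last
    `overlap` characters of the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, overlap=2)
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij'] — each chunk repeats 2 chars
```

The repeated characters are the redundancy cost noted above: the same text is embedded more than once.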

Semantic Chunking

The smartest (and hardest) approach: split where the topic changes.

How it works: Use embeddings to measure how similar adjacent sentences are. When similarity drops sharply, that’s a topic boundary — split there.

Pros:

  • ✅ Aligns chunks to real topic boundaries
  • ✅ Highest retrieval quality in many datasets

Cons:

  • ⚠️ Slower ingestion
  • ⚠️ More implementation complexity
  • ⚠️ Requires embedding-powered splitting
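The similarity-drop idea can be sketched as follows. Here `embed` is a stand-in for any sentence-embedding model (it is not defined here), and the 0.6 threshold is an arbitrary assumption you would tune on real data:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_chunks(sentences, embed, threshold=0.6):
    """Start a new chunk wherever similarity between adjacent
    sentences drops below `threshold` (a topic boundary)."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append(" ".join(current))   # topic changed: close chunk
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

The per-sentence embedding calls are why ingestion is slower than the other strategies.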
[Diagram: Four ways to split the same document. Each creates different chunks.]

  • Too small (< 100 chars): Chunks lack context. “The answer is 42” means nothing without the question.
  • Too large (> 2000 chars): Chunks contain multiple topics. Searching for “refund policy” returns a chunk that’s 80% about shipping.
  • Just right (200–800 chars): Chunks are self-contained and focused. In token terms, a common rule of thumb is 200–500 tokens. [src: langchain_docs]

The overlap should typically be 10–20% of the chunk size. So if your chunks are 500 characters, use 50–100 characters of overlap.

WATCH OUT
Choosing a chunk size that is too small is the most common beginner mistake. If your chunks are under 100 characters, retrieval quality drops sharply because each chunk lacks enough context.

Paste any text below and experiment with different strategies and sizes. Watch how the chunks change.

Try It: Live Text Splitter

7 chunks | ~322 tokens total

Retrieval-Augmented Generation (RAG) is a technique that combines the power of large language models with external knowledge retrieval. Instead of relying solely on what the model learned during training, RAG systems first search through a collection of documents to find relevant information, then use that information to generate more accurate and up-to-date responses. The key insight behind RAG is simple: language models are great at understanding and generating text, but they have a fixed knowledge cutoff and can hallucinate facts. By giving the model access to external documents at query time, we get the best of both worlds — the model's language understanding plus real, verifiable information. RAG has become the most popular approach for building AI applications that need access to specific knowledge bases, such as customer support bots, internal documentation search, and question-answering systems over private data. It's simpler and cheaper than fine-tuning, and the knowledge base can be updated without retraining the model.

Chunk 1 • 200 chars • ~50 tokens
Retrieval-Augmented Generation (RAG) is a technique that combines the power of large language models with external knowledge retrieval. Instead of relying solely on what the model learned during train

Chunk 2 • 200 chars • ~50 tokens
y on what the model learned during training, RAG systems first search through a collection of documents to find relevant information, then use that information to generate more accurate and up-to-date

Chunk 3 • 200 chars • ~50 tokens
to generate more accurate and up-to-date responses. The key insight behind RAG is simple: language models are great at understanding and generating text, but they have a fixed knowledge cutoff and ca

Chunk 4 • 200 chars • ~50 tokens
hey have a fixed knowledge cutoff and can hallucinate facts. By giving the model access to external documents at query time, we get the best of both worlds — the model's language understanding plus re

Chunk 5 • 200 chars • ~50 tokens
e model's language understanding plus real, verifiable information. RAG has become the most popular approach for building AI applications that need access to specific knowledge bases, such as custome

Chunk 6 • 200 chars • ~50 tokens
pecific knowledge bases, such as customer support bots, internal documentation search, and question-answering systems over private data. It's simpler and cheaper than fine-tuning, and the knowledge ba

Chunk 7 • 87 chars • ~22 tokens
r than fine-tuning, and the knowledge base can be updated without retraining the model.

Here’s a practical decision framework based on what most production teams actually use:

Situation | Recommended Strategy
Getting started, need something working today | Fixed-size with overlap
Your docs are well-structured prose (articles, reports) | Sentence-based
Your docs have mixed content (tables, lists, narrative) | Sliding window with overlap
You have time to invest and need maximum retrieval quality | Semantic chunking
Code files or structured data | Fixed-size by logical unit (function, class)

The practical default for 90% of production systems: sliding window chunking with 512 tokens and 10–20% overlap. It’s fast, predictable, and good enough unless you have specific reasons to do otherwise. [src: langchain_docs]


Here are the numbers that actually work in production, not theoretical optima:

Use Case | Chunk Size | Overlap | Reasoning
Dense technical docs (API reference) | 256–400 tokens | 10% | Each section is short and specific
Long-form articles and reports | 500–800 tokens | 15% | Need context around each topic
Conversational transcripts | 300–500 tokens | 20% | Speaker turns need surrounding context
Legal or compliance documents | 400–600 tokens | 20% | Claims span multiple sentences
Code files | By function/class | 0% | Logical units, not arbitrary splits

Note: “tokens” ≠ “characters”. On average, 1 token ≈ 4 characters in English. A 512-token chunk is roughly 2,000 characters or 300–350 words.
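That conversion can be wrapped in a tiny helper. This is only the ~4-chars-per-token heuristic for English, not a real tokenizer; for exact counts, use your model's own tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate via the ~4 characters/token rule of thumb
    for English prose. Real counts vary by model and language."""
    return max(1, len(text) // 4)


estimate_tokens("a" * 2000)  # ≈ 500 tokens, i.e. roughly a 512-token chunk
```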

TIP
When in doubt, start with 512 tokens and 50-token overlap. Run your actual queries against it. If you get irrelevant results, try smaller chunks. If chunks lack context, try larger ones. Tune on real data, not theory.

When you create chunks, you should attach metadata — extra information like:

  • Source filename — which document this chunk came from
  • Page number — where in the document
  • Section title — which heading it falls under
  • Creation date — when the source was written

This metadata is critical later. When your RAG system retrieves a chunk, metadata lets you cite the source: “According to enterprise_terms_v4.md, Section 7.2…” Without metadata, you have a chunk with no provenance — and no way to tell the user where the information came from.


In this chapter, you learned how to take raw text and turn it into searchable pieces. In the Playground, you can now:

  1. Upload or paste a document
  2. Choose a chunking strategy
  3. Adjust size and overlap
  4. See your chunks ready for the next step

Next up: those chunks are just text. To search them by meaning (not just keywords), we need to convert them into numbers. That’s what embeddings do.


Q1

Why is overlap used in sliding window chunking?

Q2

What happens when chunks are too small?




Sources: