Chunking Strategies
Before you can search your documents, you need to split them into chunks. The size and overlap of those chunks dramatically affect retrieval quality.
Why chunk at all? Embedding models have token limits (typically 512-8192 tokens). A token is roughly 3/4 of a word -- so 100 tokens is about 75 words. A 50-page document won't fit into one embedding. We split it into smaller pieces, embed each piece separately, then search across all chunks. The art is choosing the right chunk size and overlap.
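The 3/4-words-per-token figure is only a heuristic, but it is handy for sizing chunks before you have a real tokenizer in the loop. A minimal sketch (the function name and the 4/3 ratio are assumptions drawn from the rule of thumb above, not from any tokenizer's actual behavior):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~3/4-word-per-token heuristic,
    i.e. about 4/3 tokens per word. Real counts vary by tokenizer."""
    words = len(text.split())
    return round(words * 4 / 3)

# A 75-word passage lands right around the 100-token mark.
seventy_five_words = " ".join(["word"] * 75)
print(estimate_tokens(seventy_five_words))
```

For production sizing, count tokens with the actual tokenizer of your embedding model rather than a word-count heuristic.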
Rules of thumb: Start with 200-500 token chunks and 10-20% overlap. Good overlap means repeating the last 10-20% of each chunk at the beginning of the next one -- for example, a 100-word chunk with 15-word overlap. This ensures sentences that fall on the boundary between two chunks are not lost. For technical docs, use larger chunks. For Q&A, use smaller chunks. Always test with real queries -- the "best" chunk size depends on your data and questions.
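The sliding-window idea above can be sketched in a few lines. This is a word-based chunker (a real system would usually count tokens instead); the function name and defaults mirror the 100-word/15-word example and are illustrative, not a library API:

```python
def chunk_words(text: str, chunk_size: int = 100, overlap: int = 15) -> list[str]:
    """Split text into word-based chunks, repeating the last `overlap`
    words of each chunk at the start of the next so that sentences
    falling on a boundary are not lost."""
    words = text.split()
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covered the tail of the text
    return chunks
```

With the defaults, each chunk's final 15 words reappear as the next chunk's first 15 words, which is the 15% overlap the rule of thumb suggests.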
Chunk Size Tradeoffs
Small Chunks (50-200 words)
More precise retrieval. Better for specific factual questions. Faster embedding. But may lose context needed to understand the passage.
Large Chunks (200-500 words)
More context preserved. Better for complex questions requiring reasoning. But may include irrelevant info that confuses the LLM.
Too Small (<50 words)
Chunks become meaningless fragments. "The cat sat on" tells the LLM nothing useful. Retrieval becomes noise.
Too Large (>500 words)
Dilutes relevance. A chunk about 10 topics matches everything poorly. Also wastes LLM context window tokens.
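The relevance-dilution effect is easy to demonstrate with a crude word-overlap score. This toy uses Jaccard similarity over word sets as a stand-in for real embedding similarity (all strings and the scoring choice are illustrative assumptions): padding a focused chunk with unrelated topics lowers its match score against the same query.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity -- a crude stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

query = "how do cats sleep"
focused = "cats sleep up to sixteen hours a day"
# Same relevant sentence, buried among unrelated topics:
diluted = focused + " dogs bark loudly parrots mimic speech fish swim in schools"

print(jaccard(query, focused), jaccard(query, diluted))
```

The diluted chunk contains exactly the same answer, yet scores lower because the extra off-topic words grow the denominator. Real embedding models show the same qualitative behavior: a multi-topic chunk sits farther from any single-topic query.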
Chunking Pipeline
1. Load raw documents (PDFs, web pages, text files)
2. Split documents into chunks using the chosen strategy and size
3. Embed each chunk using an embedding model
4. Store chunk vectors and metadata in the vector database
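The pipeline can be sketched end to end with toy stand-ins at each step. Everything here is illustrative: the loader returns hard-coded strings, the embedder is a letter-frequency vector rather than a real model (such as one from sentence-transformers), and the "vector database" is a plain list. Only the load, split, embed, store order is taken from the pipeline above.

```python
def load_documents() -> list[str]:
    # Stand-in for loading PDFs, web pages, or text files.
    return ["cats sleep sixteen hours a day " * 20,
            "embedding models have token limits " * 20]

def split(doc: str, size: int = 50) -> list[str]:
    # Naive fixed-size word chunking (no overlap, for brevity).
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(chunk: str) -> list[float]:
    # Toy embedding: 26-dim letter-frequency vector, stand-in for a real model.
    vec = [0.0] * 26
    for ch in chunk.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

vector_db = []  # stand-in for a real vector database
for doc_id, doc in enumerate(load_documents()):
    for chunk in split(doc):
        vector_db.append({"doc_id": doc_id, "text": chunk, "vector": embed(chunk)})

print(len(vector_db), "chunks stored")
```

Storing the source `doc_id` (and, in practice, page numbers or section titles) alongside each vector is what lets you show citations with retrieved results later.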