Shamim Shams

Token Limits Explained: How to Chunk and Process Large Documents


Your 500-page contract review just threw a context length error. The model has a 200k token context window, and you've still managed to overflow it. Welcome to the practical side of token limits.

Most introductions to this topic start with "tokens are pieces of text." That's true, but it's the wrong thing to know first. What matters is this: every LLM call has a ceiling, you'll hit it more often than you expect, and the strategy you use when you do determines whether your application returns useful output or quietly fails.

What Tokens Are (and Why the Count Is Never What You Think)

A token isn't a character, and it isn't a word. It's roughly a word-piece: a syllable, a common word, or a punctuation cluster. The approximate conversion for English prose: 1 token ≈ 4 characters, or about 0.75 words. A 1,000-word document runs around 1,300–1,500 tokens.
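As a rough sketch of that rule of thumb, the helper below estimates a token count from both ratios and returns the range between them. The name `estimate_tokens` is illustrative, and the ratios are the approximations quoted above, not the output of a real tokenizer.

```python
def estimate_tokens(text: str) -> tuple[int, int]:
    """Return a (low, high) token estimate for English prose.

    Uses two rules of thumb: about 4 characters per token and
    about 0.75 words per token. A planning heuristic only.
    """
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    low, high = sorted((round(by_chars), round(by_words)))
    return low, high


sample = "word " * 1000  # 1,000 short words, 5,000 characters
print(estimate_tokens(sample))  # (1250, 1333)
```

The two estimates disagree, which is the point: treat the result as a range for planning, and count with the model's own tokenizer before relying on it.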

Code runs denser. An import block, a class definition, or a multi-line dict literal will eat more tokens than a prose paragraph of the same length. JSON is the worst offender; all those braces and quotes add up fast.

Not all tokenizers behave the same way. GPT-4 and GPT-3.5 use the cl100k_base encoding, exposed through OpenAI's tiktoken library; Claude runs a proprietary tokenizer with similar ratios but different behavior on edge cases. Llama models ship their own vocabularies: SentencePiece for Llama 2, a tiktoken-style BPE for Llama 3. The 1,300-per-1,000-words estimate holds roughly across all three families: close enough to plan around, not precise enough to trust at the margins.

To count exactly for OpenAI-compatible models:

# Requires: pip install tiktoken
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

with open("contract.txt") as f:
    doc = f.read()

print(f"Token count: {count_tokens(doc)}")

The Claude SDK exposes this natively, without an extra library:

import anthropic

client = anthropic.Anthropic()

with open("contract.txt") as f:
    doc = f.read()

response = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": doc}]
)
print(f"Token count: {response.input_tokens}")

The Numbers That Actually Matter

Context windows get announced with headline numbers ("1 million tokens!"), but for transformation and summarization work the number to design around is the output limit, not the input. Output caps are a small fraction of the context window: often in the 4k–16k range, and well under the input size even on models that allow more.

Model               Context window   Max output
GPT-4o              128k             16k
Claude Sonnet 4.5   200k             64k
Gemini 1.5 Pro      1M               8k
Llama 3.1 70B       128k             4k

A 200k input window means you can technically send the entire manuscript of a novel. It doesn't mean you'll get a novel-length response back. If you're transforming or summarizing a large document, plan around output, not input.
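To make "plan around output" concrete, here is a small arithmetic sketch. The function name `calls_needed` and the 10% headroom below the hard cap are assumptions for illustration, not figures from any provider.

```python
import math


def calls_needed(
    expected_output_tokens: int,
    max_output_tokens: int,
    headroom: float = 0.9,
) -> int:
    """Minimum number of responses needed to emit the expected output.

    headroom keeps each response below the hard cap so it is less
    likely to be cut off mid-sentence (an assumed safety margin).
    """
    usable = int(max_output_tokens * headroom)
    if usable <= 0:
        raise ValueError("max_output_tokens is too small")
    return math.ceil(expected_output_tokens / usable)


# Rewriting a 150k-token document at roughly 1:1 length with an 8k output cap.
print(calls_needed(150_000, 8_000))  # 21
```

The input fits in one request, but the output needs about 21 of them. That gap is why output limits drive the design.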

There's another constraint that doesn't get enough attention: performance degrades across very long contexts. Models tested on "needle in a haystack" benchmarks consistently underperform on information buried in the middle of a long context. A 200k window doesn't guarantee 200k of equal attention — that's a documented behavior, not a footnote, and it shapes whether chunking is worth the extra engineering.

When Does This Actually Break?

I built a contract analysis tool: upload a PDF, get a risk assessment. The first batch ran fine; contracts were 10–20 pages. Then a client sent a master services agreement stack — 200+ pages, boilerplate and all. The full document fit within the 200k window. The output was thin. Critical clauses from section 38 came through clearly; clauses from sections 12–15 didn't register. When I tested on a smaller model with a 32k limit, it errored immediately. The large-window model didn't error; it just quietly omitted things without saying so.

That's the failure mode that gets you. Errors are obvious. Quiet omissions aren't.

My rule: if the document exceeds 30% of the model's context window, chunk it. Past 60%, chunking isn't optional.
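That rule is easy to encode as a pre-flight check. This is a minimal sketch; the function name and the returned labels are illustrative, and the thresholds are the 30% and 60% figures from the rule above.

```python
def chunking_decision(doc_tokens: int, context_window: int) -> str:
    """Apply the 30% / 60% rule of thumb to a document's token count."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    ratio = doc_tokens / context_window
    if ratio > 0.60:
        return "must_chunk"
    if ratio > 0.30:
        return "should_chunk"
    return "send_whole"


print(chunking_decision(50_000, 200_000))   # 25% of the window: send_whole
print(chunking_decision(90_000, 200_000))   # 45%: should_chunk
print(chunking_decision(150_000, 200_000))  # 75%: must_chunk
```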

Three Chunking Strategies

Fixed-Size Chunking

Split every N tokens, with optional overlap between adjacent chunks. Simple, predictable, fast.

import tiktoken

def chunk_by_tokens(
    text: str,
    chunk_size: int = 1000,
    overlap: int = 100,
    model: str = "gpt-4"
) -> list[str]:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)

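    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would stop the loop from advancing
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        # Decoding a slice can split a multi-byte character at the edges.
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break  # last chunk reached; don't emit an overlap-only tail
        start += chunk_size - overlap  # advance by the stride

    return chunks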

The trade-off: fixed splitting doesn't respect sentence or paragraph boundaries. A chunk can end mid-sentence, cutting off context the model needs to interpret the next chunk correctly. Fine for uniform data; fragile on prose.
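The index arithmetic behind fixed-size chunking is easier to see without a tokenizer in the way. The sketch below (the name `chunk_spans` is illustrative) returns only the start and end token positions. It stops once a chunk reaches the end of the input, so no tail made purely of overlap is emitted.

```python
def chunk_spans(
    n_tokens: int,
    chunk_size: int = 1000,
    overlap: int = 100,
) -> list[tuple[int, int]]:
    """Return (start, end) token index pairs for fixed-size chunking."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each new chunk advances
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


print(chunk_spans(2500))  # [(0, 1000), (900, 1900), (1800, 2500)]
```

Each chunk starts 900 tokens after the previous one, so adjacent chunks share 100 tokens. None of these boundaries know where a sentence ends.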

Sentence-Aware Chunking

The naive version splits blindly on token count. The sentence-aware version accumulates sentences until it hits a threshold, then starts a new chunk:

import re

def chunk_by_sentences(
    text: str,
    max_tokens: int = 1000,
    overlap_sentences: int = 2
) -> list[str]:
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())

    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0

    for sentence in sentences:
        estimated_tokens = len(sentence.split()) * 1.33

        if current_tokens + estimated_tokens > max_tokens and current:
            chunks.append(" ".join(current))
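            # Carry the last few sentences forward. Guard the zero case:
            # current[-0:] would keep the whole list instead of none of it.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []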
            current_tokens = sum(len(s.split()) * 1.33 for s in current)

        current.append(sentence)
        current_tokens += estimated_tokens

    if current:
        chunks.append(" ".join(current))

    return chunks

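Sentences stay intact. The cost is an extra regex pass, and the token counts here are estimated from word counts (about 1.33 tokens per word) rather than measured with a tokenizer, so leave some headroom below the real limit.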

Paragraph-Based Chunking

For structured documents — reports, articles, technical specifications — paragraph boundaries are a reliable semantic proxy:

def chunk_by_paragraphs(
    text: str,
    max_tokens: int = 1000,
    min_tokens: int = 100
) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = len(para.split()) * 1.33

        if para_tokens > max_tokens:
            if current:
                chunks.append("\n\n".join(current))
                current = []
                current_tokens = 0
            words = para.split()
            word_chunk_size = int(max_tokens / 1.33)
            for i in range(0, len(words), word_chunk_size):
                chunks.append(" ".join(words[i:i + word_chunk_size]))
            continue

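        # Flush only once the current chunk has reached min_tokens. This avoids
        # tiny chunks, at the price of occasionally overshooting max_tokens.
        if current_tokens + para_tokens > max_tokens and current_tokens >= min_tokens: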
            chunks.append("\n\n".join(current))
            current = []
            current_tokens = 0

        current.append(para)
        current_tokens += para_tokens

    if current:
        chunks.append("\n\n".join(current))

    return chunks

This is what I reach for first. It respects the author's own structural divisions, which are usually the right boundaries for the model too.

Why Overlap Exists

Overlap means duplicating the tail of one chunk at the head of the next. A sentence that says "as stated in the previous clause" means nothing when the model hasn't seen that clause. That's the problem overlap solves.

How much? For most prose: 10–15% of chunk size. For dense legal or technical documents with heavy cross-references, I've gone up to 20%. For code, skip it entirely — a function is either complete in the chunk or it isn't. Better to split on function boundaries and leave overlap at zero.
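For Python source, splitting on function boundaries can be sketched with the standard library's `ast` module. This is one possible approach under simple assumptions: it handles top-level definitions only, the name `split_by_definition` is illustrative, and comments or blank lines between top-level statements are dropped.

```python
import ast


def split_by_definition(source: str) -> list[str]:
    """Split Python source into chunks on top-level definition boundaries.

    Each top-level function or class becomes one chunk. Consecutive
    other statements (imports, constants) are grouped into one chunk.
    """
    lines = source.splitlines()
    chunks: list[str] = []
    pending: list[str] = []  # non-definition statements not yet flushed

    for node in ast.parse(source).body:
        first = node.lineno
        is_def = isinstance(
            node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
        )
        if is_def and node.decorator_list:
            # Start at the first decorator so it stays with its definition.
            first = min(d.lineno for d in node.decorator_list)
        segment = "\n".join(lines[first - 1 : node.end_lineno])

        if is_def:
            if pending:
                chunks.append("\n".join(pending))
                pending = []
            chunks.append(segment)
        else:
            pending.append(segment)

    if pending:
        chunks.append("\n".join(pending))
    return chunks


sample = (
    "import os\n"
    "LIMIT = 10\n"
    "\n"
    "def load(path):\n"
    "    return open(path).read()\n"
    "\n"
    "class Parser:\n"
    "    def run(self):\n"
    "        return LIMIT\n"
)
print(len(split_by_definition(sample)))  # 3
```

Every chunk is a complete unit with zero overlap. A definition that is itself larger than the budget would still need a second splitting pass.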

Overlap doesn't make your chunks redundant. It makes them independently legible.

Build a Map-Reduce Pipeline

Once you have chunks, the standard pattern is map-reduce: run the same question against each chunk independently, then synthesize the per-chunk results.

import anthropic

client = anthropic.Anthropic()

def map_reduce(
    chunks: list[str],
    chunk_prompt: str,
    reduce_prompt: str,
    model: str = "claude-sonnet-4-6"
) -> str:
    chunk_results: list[str] = []

    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i + 1}/{len(chunks)}...")
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": f"{chunk_prompt}\n\n---\n\n{chunk}"
                }
            ]
        )
        chunk_results.append(response.content[0].text)

    combined = "\n\n---\n\n".join(chunk_results)
    final = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": f"{reduce_prompt}\n\n---\n\n{combined}"
            }
        ]
    )

    return final.content[0].text

A contract that splits into roughly 40 paragraph-based chunks would be processed like this:

with open("contract.txt") as f:
    doc = f.read()

chunks = chunk_by_paragraphs(doc, max_tokens=2000)

summary = map_reduce(
    chunks,
    chunk_prompt="Extract any obligations, deadlines, or risk clauses from this section:",
    reduce_prompt=(
        "Combine the following extracted clauses into a single risk summary. "
        "Remove duplicates. Organize by severity:"
    )
)

print(summary)

What Actually Goes Wrong

Context loss is the most common failure. A sentence that says "the above section" or "as described earlier" inside a chunk means nothing when the model hasn't seen the earlier section. For documents with dense cross-references (legal agreements, multi-chapter technical specs), inject a brief document summary into the system prompt for every chunk. Two sentences describing the document's overall structure are usually enough.
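One way to wire that in is to build the request arguments in a small helper and pass the summary through the Messages API's `system` parameter. The helper name `build_chunk_request` and the wording of the system prompt are assumptions for illustration. The returned dict is meant to be unpacked into `client.messages.create(**kwargs)`.

```python
def build_chunk_request(
    chunk: str,
    chunk_prompt: str,
    doc_summary: str,
    model: str,
    max_tokens: int = 1024,
) -> dict:
    """Build keyword arguments for client.messages.create(**kwargs)."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        # The same short overview accompanies every chunk.
        "system": (
            "You are reviewing one section of a longer document. "
            f"Document overview: {doc_summary}"
        ),
        "messages": [
            {"role": "user", "content": f"{chunk_prompt}\n\n---\n\n{chunk}"}
        ],
    }


kwargs = build_chunk_request(
    chunk="12.3 The Supplier shall remain liable as set out above.",
    chunk_prompt="Extract any obligations, deadlines, or risk clauses:",
    doc_summary="Master services agreement; liability caps are defined in section 4.",
    model="claude-sonnet-4-6",
)
print(kwargs["system"])
```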

The reduce step runs out of room. You're synthesizing 40 chunks of legal language with max_tokens=512. The output gets truncated — sometimes silently. Set max_tokens=2048 or higher for anything beyond a handful of chunks.
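Truncation can at least be detected. Anthropic's Messages API reports a `stop_reason` on each response, and the value `"max_tokens"` means the output hit the cap. The check below is a minimal sketch. The `SimpleNamespace` objects are stand-ins so the example runs without an API call.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """True when the response stopped because it hit max_tokens."""
    return getattr(response, "stop_reason", None) == "max_tokens"


# Stand-in objects for illustration; a real call returns a Message object.
complete = SimpleNamespace(stop_reason="end_turn")
cut_off = SimpleNamespace(stop_reason="max_tokens")

print(was_truncated(complete), was_truncated(cut_off))  # False True
```

When the reduce step comes back truncated, the usual options are to raise max_tokens or to reduce in two stages over smaller groups of chunk results.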

Order matters in ways that aren't obvious. Chunks are processed independently in the map phase, but document structure carries meaning: a clause in section 1 often qualifies clauses in sections 10–20. Either pass chunk index metadata in your prompts or sequence the reduce step explicitly by document order.
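A sketch of one way to keep document order while still running the map phase concurrently: `Executor.map` returns results in input order, and each result can be labeled with its position before the reduce step. The names `map_in_order` and `worker` are hypothetical; `worker` stands for whatever function sends one chunk to the model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def map_in_order(
    chunks: list[str],
    worker: Callable[[str], str],
    max_workers: int = 4,
) -> list[str]:
    """Run worker over chunks concurrently and label results by position."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, even when the
        # underlying calls finish out of order.
        results = list(pool.map(worker, chunks))
    total = len(results)
    return [
        f"[Chunk {i} of {total}]\n{result}"
        for i, result in enumerate(results, start=1)
    ]


# str.upper stands in for a model call in this example.
labeled = map_in_order(["first", "second", "third"], str.upper)
print(labeled[1])
```

With the position written into each result, the reduce prompt can tell the model that the chunks appear in document order.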

Token counting in the wrong place is where I've lost the most debugging time. Count after constructing the full prompt — system prompt, injected context, then the chunk. The system prompt and any prepended context eat into your budget before the document ever appears.
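A budget check along those lines might look like the sketch below. It assumes the response shares the context window with the input, which is the common arrangement but worth confirming for your model. The token counter is passed in, so you can plug in tiktoken or the SDK's counter; the word-count lambda is only a stand-in, and `remaining_budget` is an illustrative name.

```python
from typing import Callable


def remaining_budget(
    system_prompt: str,
    injected_context: str,
    chunk: str,
    context_window: int,
    max_output_tokens: int,
    count_tokens: Callable[[str], int],
) -> int:
    """Tokens left after the full prompt and the reserved output.

    A negative result means the request will not fit.
    """
    parts = (system_prompt, injected_context, chunk)
    full_prompt = "\n\n".join(part for part in parts if part)
    used = count_tokens(full_prompt)  # count the assembled prompt, not the chunk
    return context_window - max_output_tokens - used


def word_count(text: str) -> int:
    """Stand-in counter; swap in a real tokenizer in practice."""
    return len(text.split())


left = remaining_budget(
    system_prompt="You review contracts.",
    injected_context="Overview: master services agreement.",
    chunk="The Supplier shall deliver within 30 days.",
    context_window=100,
    max_output_tokens=20,
    count_tokens=word_count,
)
print(left)  # 66
```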

Wrapping Up

Match the chunking strategy to the document structure. Fixed-size works for uniform data. Sentence-aware works for narrative prose. Paragraph-based works for structured reports. Add overlap when context bleeds across section boundaries, and set your reduce step max_tokens generously.

Map-reduce handles most document processing problems. When it doesn't — when you need the model to reason across the full document rather than per section — the answer is retrieval: embed your chunks, store them in a vector database, and pull only the relevant ones per query. That's a different pattern, and it gets expensive at scale. Worth understanding both before you choose one.