Context Engineering: Why Reducing LLM Token Usage Isn't About Shorter Prompts
When developers want to reduce LLM token usage, they almost always start by trying to make their prompts shorter. Tighter wording. Fewer examples. Compressed instructions.
That helps a little. But it misses where the real waste is.
Token optimization is fundamentally a context-engineering problem, not a prompt-shortening problem. The goal isn't "fewest tokens possible" — it's "smallest set of high-signal tokens for the current step."
Here's why that distinction matters, and how to act on it.
Your prompt is 50 tokens. Your context is 50,000.
In a typical coding agent session, your actual instruction might be 30–80 tokens. But the full input to the model includes your system prompt, conversation history, tool definitions, file contents, and previous responses. That's often 20,000–100,000+ tokens.
Shaving 20 tokens off your prompt saves essentially nothing when the real weight is in the context surrounding it.
The three biggest context-level levers are session management, just-in-time retrieval, and repo memory.
Session management: stop paying for stale history
Every message in a conversation stays in the context window. By message 30, you're paying for 29 previous messages — including failed attempts, irrelevant explorations, and superseded plans — on every single turn.
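The arithmetic compounds quickly. A toy calculation (the 800-token average message size is an illustrative assumption, not a measured value) shows why one long thread re-bills far more history than three shorter phased sessions:

```python
# Toy arithmetic: cumulative input tokens billed across a conversation.
# At turn k, the model re-reads all k messages accumulated so far.
# MSG_TOKENS = 800 is an assumed average message size, for illustration only.
MSG_TOKENS = 800

def thread_input_tokens(n_messages: int) -> int:
    """Total input tokens billed over a thread: turn k sends k messages of context."""
    return sum(k * MSG_TOKENS for k in range(1, n_messages + 1))

one_thread = thread_input_tokens(30)   # one 30-message session
phased = 3 * thread_input_tokens(10)   # three fresh 10-message sessions

print(one_thread, phased)  # -> 372000 132000
```

Same number of messages, but the single long thread bills nearly three times as many input tokens, because every stale message rides along on every later turn.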
The fix is to split work into phases and reset context between them:
- Discovery session: Explore the codebase, understand the problem, write a spec
- Implementation session: Start fresh with just the spec, implement the solution
- Verification session: Start fresh again, run tests, review the output
Anthropic explicitly recommends this pattern. A clean session with a better prompt usually beats a long thread full of failed attempts. Stale context doesn't just cost money — it dilutes the model's attention, making the quality of later responses worse.
Think of it this way: if you're paying per token, every stale message is a recurring subscription fee for information the model no longer needs.
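The phase-split workflow above can be sketched in a few lines. Here `run_session` is a hypothetical stand-in for any LLM client call that starts a brand-new conversation; the key point is that only the spec, not the exploration history, crosses session boundaries:

```python
# Sketch of the discovery -> implementation -> verification split.
# `run_session` is a hypothetical placeholder for an LLM client call; each
# invocation starts a fresh conversation with no carried-over history.
def run_session(system: str, task: str) -> str:
    """Placeholder: one fresh conversation, returns the model's final output."""
    return f"<output of: {task.split('.')[0]}>"

def phased_workflow(repo_summary: str) -> str:
    # Phase 1 (discovery): explore and produce a spec. Only the spec survives;
    # all the dead-end exploration stays behind in the discarded session.
    spec = run_session(system=repo_summary, task="Explore the codebase and write a spec.")

    # Phase 2 (implementation): fresh context seeded with just the spec.
    patch = run_session(system=spec, task="Implement the spec.")

    # Phase 3 (verification): fresh context again for an unbiased review.
    return run_session(system=spec, task=f"Run tests and review this patch:\n{patch}")
```

Each phase pays only for the spec plus its own work, never for the previous phase's false starts.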
Just-in-time retrieval: stop dumping your entire repo
Many developers (and tools) take a brute-force approach to context: dumping entire files, full directory listings, or even whole repositories into the prompt. This is expensive and counterproductive.
Research consistently shows that long contexts have diminishing returns. Relevant information buried in the middle of a long context is used less reliably than the same information presented in a focused, shorter context.
The alternative is just-in-time retrieval — pulling in exactly the information the model needs, right when it needs it:
- Targeted file reads instead of directory dumps — read the specific function, not the whole file
- Code intelligence / LSP navigation — use "go to definition" style lookups instead of grep-and-dump
- Hooks that collapse verbose output — instead of pasting 500 lines of test output, pass only the failures
- Iterative retrieval — let the model ask for what it needs in stages
The RepoCoder research found that iterative repository retrieval improved code completion accuracy by more than 10% over in-file completion and beat vanilla RAG approaches. Less context, better results, lower cost.
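The "collapse verbose output" hook is the easiest of these to adopt. A minimal sketch (the failure markers are assumptions; tune them to your test runner's actual output format):

```python
# Minimal sketch of an output-collapsing hook: forward only the failure lines
# from a test run to the model, not the full log.
# The marker strings below are illustrative assumptions, not a standard.
def collapse_test_output(raw: str, markers=("FAILED", "ERROR", "assert")) -> str:
    """Keep only lines that signal a problem, plus a one-line summary."""
    lines = raw.splitlines()
    failures = [line for line in lines if any(m in line for m in markers)]
    if not failures:
        return f"All {len(lines)} output lines clean; nothing to report."
    return f"{len(failures)} failing line(s) of {len(lines)} total:\n" + "\n".join(failures)

log = "test_a PASSED\ntest_b FAILED - IndexError\ntest_c PASSED"
print(collapse_test_output(log))
```

A 500-line log with three failures collapses to four lines, and the model sees exactly the signal it needs to act on.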
Repo memory: put durable knowledge outside the chat
Some information — your project's architecture, coding conventions, build commands, tech stack choices — is relevant to every session. If you type it into every conversation, you're paying for it repeatedly and inconsistently.
Instead, put it in durable, structured locations:
- CLAUDE.md (or equivalent config file): Stable architecture rules, conventions, and "how this repo works" summaries. These load automatically every session.
- Skills / Playbooks: Workflow-specific instructions (deploy procedures, review checklists) that load on demand via progressive disclosure — their full content only enters the context when relevant.
Early research on repo-level instruction files found that projects using them saw roughly 17% lower output token usage and 29% lower median runtime on PR-sized tasks. The model wastes fewer tokens figuring out what you could have just told it upfront.
The key constraint: keep these files concise. Since they load every session, every extra line is a recurring cost. Put essentials in the main config file and move specialized workflows out into separate, on-demand files.
The attention budget
There's a subtler reason context engineering matters beyond raw cost: models have a limited attention budget.
Even within a 200K-token context window, the model doesn't attend equally to every token. Information near the beginning and end gets more attention than information buried in the middle. Irrelevant context doesn't just cost money — it actively competes for attention with the information that matters.
This is why a focused 5,000-token context often produces better results than a comprehensive 50,000-token context. You're not just saving money; you're making the model more effective.
Before and after: a real context audit
Here's what a context audit looks like in practice:
Before (a real agent session pattern):
- System prompt with full project docs: 8,000 tokens
- 12 tool definitions (most unused): 15,000 tokens
- Conversation history with 6 dead-end attempts: 25,000 tokens
- Full file dumps of 3 files (only 1 relevant): 12,000 tokens
- Total per turn: ~60,000 input tokens
After (same task, context-engineered):
- Concise CLAUDE.md loaded automatically: 1,200 tokens
- 4 relevant tool definitions: 5,000 tokens
- Fresh session with clear spec: 800 tokens
- Targeted retrieval of relevant functions: 2,000 tokens
- Total per turn: ~9,000 input tokens
Same task. Same quality output. 85% fewer tokens per turn.
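Running this kind of audit is just bookkeeping: tally each context component per turn and compare. A small helper using the article's own estimates:

```python
# Context audit helper: tally per-turn token costs by component and report them.
# The component figures are the article's estimates, reproduced for the comparison.
def audit(components: dict) -> int:
    """Print components largest-first and return the per-turn total."""
    total = sum(components.values())
    for name, tokens in sorted(components.items(), key=lambda kv: -kv[1]):
        print(f"{tokens:>7,}  {name}")
    print(f"{total:>7,}  total per turn")
    return total

before = audit({
    "system prompt + full project docs": 8_000,
    "12 tool definitions (most unused)": 15_000,
    "history incl. 6 dead-end attempts": 25_000,
    "full dumps of 3 files (1 relevant)": 12_000,
})
after = audit({
    "concise CLAUDE.md": 1_200,
    "4 relevant tool definitions": 5_000,
    "fresh session with clear spec": 800,
    "targeted retrieval of functions": 2_000,
})
print(f"reduction: {1 - after / before:.0%}")  # -> reduction: 85%
```

Sorting largest-first makes the audit actionable: the top one or two components are where the next fix should go.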
References
- Andrej Karpathy on Context Engineering — The distinction between prompt engineering and context engineering
- RepoCoder: Repository-Level Code Completion — Iterative retrieval showing >10% improvement over in-file completion
- Anthropic: Claude Code Best Practices — Session management, CLAUDE.md usage, and context optimization
- Lost in the Middle: How Language Models Use Long Contexts — Research on diminishing returns of long context and attention distribution
For more on the strategies mentioned here, see our complete guide to LLM token optimization strategies or our Claude Code efficiency tips for CLAUDE.md best practices. You can also explore designing for prompt cache hits to make your stable context even cheaper.