Context Engineering: Why Reducing LLM Token Usage Isn't About Shorter Prompts
When developers want to reduce LLM token usage, they almost always start by trying to make their prompts shorter. Tighter wording. Fewer examples. Compressed instructions.
That helps a little. But it misses where the real waste is.
Token optimization is fundamentally a context-engineering problem, not a prompt-shortening problem. The goal isn't "fewest tokens possible" — it's "smallest set of high-signal tokens for the current step."
Here's why that distinction matters, and how to act on it.
Your prompt is 50 tokens. Your context is 50,000.
In a typical coding agent session, your actual instruction might be 30–80 tokens. But the full input to the model includes your system prompt, conversation history, tool definitions, file contents, and previous responses. That's often 20,000–100,000+ tokens.
Shaving 20 tokens off your prompt saves essentially nothing when the real weight is in the context surrounding it.
The three biggest context-level levers are session management, just-in-time retrieval, and repo memory.
Session management: stop paying for stale history
Every message in a conversation stays in the context window. By message 30, you're paying for 29 previous messages — including failed attempts, irrelevant explorations, and superseded plans — on every single turn.
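The arithmetic compounds quickly. A toy calculation (the 800-token average message size is an illustrative assumption, not a measured value) shows why one long thread re-bills far more history than three shorter phased sessions:

```python
# Toy arithmetic: cumulative input tokens billed across a conversation.
# At turn k, the model re-reads all k messages accumulated so far.
# MSG_TOKENS = 800 is an assumed average message size, for illustration only.
MSG_TOKENS = 800

def thread_input_tokens(n_messages: int) -> int:
    """Total input tokens billed over a thread: turn k sends k messages of context."""
    return sum(k * MSG_TOKENS for k in range(1, n_messages + 1))

one_thread = thread_input_tokens(30)   # one 30-message session
phased = 3 * thread_input_tokens(10)   # three fresh 10-message sessions

print(one_thread, phased)  # -> 372000 132000
```

Same number of messages, but the single long thread bills nearly three times as many input tokens, because every stale message rides along on every later turn.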
The fix is to split work into phases and reset context between them:
- Discovery session: Explore the codebase, understand the problem, write a spec
- Implementation session: Start fresh with just the spec, implement the solution
- Verification session: Start fresh again, run tests, review the output
Anthropic explicitly recommends this pattern. A clean session with a better prompt usually beats a long thread full of failed attempts. Stale context doesn't just cost money — it dilutes the model's attention, making the quality of later responses worse.
Think of it this way: if you're paying per token, every stale message is a recurring subscription fee for information the model no longer needs.
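The phase-split workflow above can be sketched in a few lines. Here `run_session` is a hypothetical stand-in for any LLM client call that starts a brand-new conversation; the key point is that only the spec, not the exploration history, crosses session boundaries:

```python
# Sketch of the discovery -> implementation -> verification split.
# `run_session` is a hypothetical placeholder for an LLM client call; each
# invocation starts a fresh conversation with no carried-over history.
def run_session(system: str, task: str) -> str:
    """Placeholder: one fresh conversation, returns the model's final output."""
    return f"<output of: {task.split('.')[0]}>"

def phased_workflow(repo_summary: str) -> str:
    # Phase 1 (discovery): explore and produce a spec. Only the spec survives;
    # all the dead-end exploration stays behind in the discarded session.
    spec = run_session(system=repo_summary, task="Explore the codebase and write a spec.")

    # Phase 2 (implementation): fresh context seeded with just the spec.
    patch = run_session(system=spec, task="Implement the spec.")

    # Phase 3 (verification): fresh context again for an unbiased review.
    return run_session(system=spec, task=f"Run tests and review this patch:\n{patch}")
```

Each phase pays only for the spec plus its own work, never for the previous phase's false starts.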
Just-in-time retrieval: stop dumping your entire repo
Many developers (and tools) take a brute-force approach to context: dumping entire files, full directory listings, or even whole repositories into the prompt. This is expensive and counterproductive.
Research consistently shows that long contexts have diminishing returns. Relevant information buried in the middle of a long context is used less reliably than the same information presented in a focused, shorter context.
The alternative is just-in-time retrieval — pulling in exactly the information the model needs, right when it needs it:
- Targeted file reads instead of directory dumps — read the specific function, not the whole file
- Code intelligence / LSP navigation — use "go to definition" style lookups instead of grep-and-dump
- Hooks that collapse verbose output — instead of pasting 500 lines of test output, pass only the failures
- Iterative retrieval — let the model ask for what it needs in stages
The RepoCoder research found that iterative repository retrieval improved code completion accuracy by more than 10% over in-file completion and beat vanilla RAG approaches. Less context, better results, lower cost.
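The "collapse verbose output" hook is the easiest of these to adopt. A minimal sketch (the failure markers are assumptions; tune them to your test runner's actual output format):

```python
# Minimal sketch of an output-collapsing hook: forward only the failure lines
# from a test run to the model, not the full log.
# The marker strings below are illustrative assumptions, not a standard.
def collapse_test_output(raw: str, markers=("FAILED", "ERROR", "assert")) -> str:
    """Keep only lines that signal a problem, plus a one-line summary."""
    lines = raw.splitlines()
    failures = [line for line in lines if any(m in line for m in markers)]
    if not failures:
        return f"All {len(lines)} output lines clean; nothing to report."
    return f"{len(failures)} failing line(s) of {len(lines)} total:\n" + "\n".join(failures)

log = "test_a PASSED\ntest_b FAILED - IndexError\ntest_c PASSED"
print(collapse_test_output(log))
```

A 500-line log with three failures collapses to four lines, and the model sees exactly the signal it needs to act on.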
Repo memory: put durable knowledge outside the chat
Some information — your project's architecture, coding conventions, build commands, tech stack choices — is relevant to every session. If you type it into every conversation, you're paying for it repeatedly and inconsistently.
Instead, put it in durable, structured locations:
- CLAUDE.md (or equivalent config file): Stable architecture rules, conventions, and "how this repo works" summaries. These load automatically every session.
- Skills / Playbooks: Workflow-specific instructions (deploy procedures, review checklists) that load on demand via progressive disclosure — their full content only enters the context when relevant.
Early research on repo-level instruction files found that projects using them saw roughly 17% lower output token usage and 29% lower median runtime on PR-sized tasks. The model wastes fewer tokens figuring out what you could have just told it upfront.
The key constraint: keep these files concise. Since they load every session, every extra line is a recurring cost. Put essentials in the main config file and move specialized workflows out into separate, on-demand files.
The attention budget
There's a subtler reason context engineering matters beyond raw cost: models have a limited attention budget.
Even within a 200K-token context window, the model doesn't attend equally to every token. Information near the beginning and end gets more attention than information buried in the middle. Irrelevant context doesn't just cost money — it actively competes for attention with the information that matters.
This is why a focused 5,000-token context often produces better results than a comprehensive 50,000-token context. You're not just saving money; you're making the model more effective.
Before and after: a real context audit
Here's what a context audit looks like in practice:
Before (a real agent session pattern):
- System prompt with full project docs: 8,000 tokens
- 12 tool definitions (most unused): 15,000 tokens
- Conversation history with 6 dead-end attempts: 25,000 tokens
- Full file dumps of 3 files (only 1 relevant): 12,000 tokens
- Total per turn: ~60,000 input tokens
After (same task, context-engineered):
- Concise CLAUDE.md loaded automatically: 1,200 tokens
- 4 relevant tool definitions: 5,000 tokens
- Fresh session with clear spec: 800 tokens
- Targeted retrieval of relevant functions: 2,000 tokens
- Total per turn: ~9,000 input tokens
Same task. Same quality output. 85% fewer tokens per turn.
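Running this kind of audit is just bookkeeping: tally each context component per turn and compare. A small helper using the article's own estimates:

```python
# Context audit helper: tally per-turn token costs by component and report them.
# The component figures are the article's estimates, reproduced for the comparison.
def audit(components: dict) -> int:
    """Print components largest-first and return the per-turn total."""
    total = sum(components.values())
    for name, tokens in sorted(components.items(), key=lambda kv: -kv[1]):
        print(f"{tokens:>7,}  {name}")
    print(f"{total:>7,}  total per turn")
    return total

before = audit({
    "system prompt + full project docs": 8_000,
    "12 tool definitions (most unused)": 15_000,
    "history incl. 6 dead-end attempts": 25_000,
    "full dumps of 3 files (1 relevant)": 12_000,
})
after = audit({
    "concise CLAUDE.md": 1_200,
    "4 relevant tool definitions": 5_000,
    "fresh session with clear spec": 800,
    "targeted retrieval of functions": 2_000,
})
print(f"reduction: {1 - after / before:.0%}")  # -> reduction: 85%
```

Sorting largest-first makes the audit actionable: the top one or two components are where the next fix should go.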
References
- Andrej Karpathy on Context Engineering — The distinction between prompt engineering and context engineering
- RepoCoder: Repository-Level Code Completion — Iterative retrieval showing >10% improvement over in-file completion
- Anthropic: Claude Code Best Practices — Session management, CLAUDE.md usage, and context optimization
- Lost in the Middle: How Language Models Use Long Contexts — Research on diminishing returns of long context and attention distribution
For more on the strategies mentioned here, see our complete guide to LLM token optimization strategies or our Claude Code efficiency tips for CLAUDE.md best practices. You can also explore designing for prompt cache hits to make your stable context even cheaper.