Designing for Prompt Cache Hits: How to Save 90% on LLM Input Tokens

Prompt caching is the most powerful cost-reduction feature available on modern LLM APIs. Anthropic's cache reads cost 0.1x the base input price — a 90% discount. OpenAI automatically caches repeated prefixes for a 50% input discount — helpful, but less dramatic than Anthropic's 90%. Google Gemini offers both implicit (automatic) and explicit caching at 75–90% savings.

But many teams enable caching and then wonder why their cache hit rates are low. The problem is almost always the same: their prompts aren't designed for caching.

Caching isn't a toggle you flip. It's an architecture you build around.

How prompt caching works (the critical detail)

Prompt caching works by storing the processed internal state of your input prefix. On subsequent requests, if the prefix matches exactly, that cached state is reused, skipping the expensive prefill computation over those tokens.

The critical detail: cache hits require 100% identical prefix segments. If even one token in the cached portion changes between requests, the entire segment is a cache miss. You pay the full input price plus a cache write cost for re-caching.

There are also minimum token thresholds for caching to work at all. Anthropic requires at least 1,024 tokens for Sonnet and Opus models (2,048 for Haiku). OpenAI requires at least 1,024 tokens with 128-token increments. Prompts shorter than these minimums won't be cached regardless of how stable they are.

This means cache design is really about maximizing the size of your stable prefix — the portion of your prompt that stays identical across requests.

The cache write premium

Caching isn't free on the write side. Anthropic charges 1.25x the base input price for cache writes with the default 5-minute TTL. That means the first request with a new prefix costs more than a regular request. Caching only pays off when you get subsequent cache reads at 0.1x.

The break-even is simple: one cache write at 1.25x followed by one cache read at 0.1x gives you a blended cost of 0.675x, already a net win. After two reads, you're at 0.483x. The more reads per write, the closer you get to the 0.1x floor. At the default multipliers, the break-even hit rate is only about 22%; at 85%+, the savings are substantial.
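To see where the break-even falls for any hit rate, here's a minimal sketch of the blended-cost arithmetic, assuming Anthropic's default 1.25x write and 0.1x read multipliers:

```python
def blended_cost_multiplier(hit_rate: float,
                            write_mult: float = 1.25,
                            read_mult: float = 0.1) -> float:
    """Expected input cost per request as a multiple of the base price:
    hits pay the read discount, misses pay the cache-write premium."""
    return hit_rate * read_mult + (1 - hit_rate) * write_mult

# Break-even (blended cost = 1.0x) lands at roughly a 22% hit rate:
for rate in (0.2, 0.5, 0.85):
    print(f"hit rate {rate:.0%}: {blended_cost_multiplier(rate):.3f}x")
```

At a 50% hit rate this gives the 0.675x figure above; at 85% it's down around 0.27x.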

The stable prefix pattern

The fundamental design pattern for cache-friendly prompts is simple:

[Stable content — cached]    → System instructions, background, tool definitions
[Semi-stable content]        → Few-shot examples, reference docs
[Variable content — not cached] → User input, conversation history

Everything that stays the same across requests goes first. Everything that changes goes last.

This seems obvious, but most developers structure their prompts the other way around — putting the user's question first and the context after. Inverting this order is often the single biggest improvement in cache hit rates.
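As a concrete sketch, here's the stable-first layout as an Anthropic Messages API payload. This only builds the request dict (no network call), and the system strings and model name are placeholders:

```python
SYSTEM_RULES = "You are a support agent. Answer in JSON with keys ..."  # stable
STYLE_GUIDE = "Tone: concise and friendly. Cite the docs when ..."      # stable

def build_request(user_message: str) -> dict:
    """Stable content first, per-request content last. The cache_control
    marker on the final stable block asks the API to cache everything
    up to and including that block."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": SYSTEM_RULES},
            {"type": "text", "text": STYLE_GUIDE,
             "cache_control": {"type": "ephemeral"}},  # prefix cached here
        ],
        # Variable content sits after the cached prefix:
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("Where is my order #1234?")
```

Every request reuses the same byte-identical system blocks, so only the final `messages` turn varies.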

What to put in your stable prefix

The best candidates for caching are:

  • System instructions: Your model's persona, rules, constraints, output format requirements. These rarely change between requests.
  • Tool definitions: If you're using function calling, tool schemas are typically identical across requests. At 500–2,000 tokens per tool, caching 10 tools saves 5,000–20,000 tokens per request.
  • Background context: Project documentation, API references, style guides — anything that provides context but doesn't change per-request.
  • Few-shot examples: If you use consistent examples, they're prime caching material. Just don't shuffle them between requests (see below).
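Tool definitions can be folded into the same cached prefix: in Anthropic's API, a cache_control marker on the last tool caches the entire (fixed-order) tool list. A sketch with hypothetical tool schemas:

```python
def with_cached_tools(tools: list[dict]) -> list[dict]:
    """Mark the final tool so the whole tool list is cached as one prefix
    segment. Copies the dicts so the caller's definitions stay untouched."""
    if not tools:
        return tools
    marked = [dict(t) for t in tools]
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked

TOOLS = [  # hypothetical tool schemas
    {"name": "get_order", "description": "Look up an order by id",
     "input_schema": {"type": "object",
                      "properties": {"id": {"type": "string"}}}},
    {"name": "issue_refund", "description": "Refund an order",
     "input_schema": {"type": "object",
                      "properties": {"id": {"type": "string"}}}},
]

cached_tools = with_cached_tools(TOOLS)
```

The key discipline is that the list's contents and order never change between requests.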

Cache-busting mistakes

These are the most common patterns that accidentally destroy cache hit rates:

Timestamps in system prompts. "Today's date is March 6, 2026" in your system prompt means the cache invalidates every day. If you need the model to know the date, put it in the variable section after the cached prefix, or update it less frequently.

Shuffled few-shot examples. If you randomize the order of your examples on each request "for variety," every order is a unique prefix. Pick a fixed order and stick with it.

Dynamic tool lists. If your available tools change between requests — some tools enabled, some disabled — the tool definition section changes and the cache misses. Either load all tools consistently or use the on-demand tool loading pattern (see reducing tool overhead).

Per-user context in the prefix. Putting user-specific data (name, preferences, history) into the system prompt means each user gets a unique prefix. Move user context to the variable section.

Version strings or build hashes. Embedding deployment metadata in your prompt invalidates the cache on every deploy.

Multi-tier caching with breakpoints

Different parts of your prompt change at different rates:

  • System instructions: change rarely (monthly)
  • Tool definitions: change occasionally (weekly)
  • Background docs: change sometimes (as docs update)
  • User conversation: changes every request

Anthropic supports explicit cache breakpoints (up to 4 per request) that let you cache these tiers independently. A change in your background docs doesn't invalidate the cache for your system instructions and tool definitions.

Anthropic also offers an extended 1-hour TTL at 2x write cost (vs the default 5-minute TTL at 1.25x). This is useful for prompts that are called less frequently but remain stable — the higher write cost pays off if you'd otherwise keep re-caching the same content every 5 minutes.

The pattern:

[System instructions]           → Breakpoint 1 (stable for months)
[Tool definitions]              → Breakpoint 2 (stable for weeks)
[Background docs / examples]    → Breakpoint 3 (stable for days)
[User input + conversation]     → Not cached

If you update your background docs, only that segment re-caches. The first two tiers still hit.
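Here's a sketch of that tiered layout as an Anthropic Messages API payload, again payload construction only. The `"ttl": "1h"` field follows Anthropic's extended-TTL syntax; the strings and model name are placeholders:

```python
SYSTEM_INSTRUCTIONS = "Persona, rules, output format ..."  # changes monthly
TOOL_GUIDANCE = "How and when to call each tool ..."       # changes weekly

def build_tiered_request(user_turn: str, background_docs: str) -> dict:
    """Three cache breakpoints (Anthropic allows up to 4), slowest-changing
    first. The 1-hour TTL on the most stable tier costs 2x to write but
    survives quiet gaps that would expire the default 5-minute cache."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": SYSTEM_INSTRUCTIONS,        # tier 1
             "cache_control": {"type": "ephemeral", "ttl": "1h"}},
            {"type": "text", "text": TOOL_GUIDANCE,              # tier 2
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": background_docs,            # tier 3
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [{"role": "user", "content": user_turn}],    # not cached
    }

tiered = build_tiered_request("Summarize ticket #42", "Docs v3 ...")
```

Editing `background_docs` re-writes only tier 3; tiers 1 and 2 keep hitting.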

Measuring cache performance

You can't optimize what you don't measure. Key metrics for prompt caching:

  • Cache hit rate: What percentage of requests hit the cache vs. miss? Aim for 80%+ on steady-state traffic.
  • Cache read vs. write tokens: In your API usage dashboard, compare cached read tokens to cache write tokens. High write-to-read ratios indicate frequent cache misses.
  • Cost per request before vs. after: Track the actual cost impact. A well-designed caching setup can reduce input token costs by 70–90%.

Both Anthropic and OpenAI provide usage breakdowns that separate cached from uncached token counts. Monitor these regularly to catch cache-busting regressions.
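For example, Anthropic responses report `cache_read_input_tokens` and `cache_creation_input_tokens` in the usage object; aggregating them gives a token-weighted hit rate. A sketch with made-up numbers:

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Token-weighted hit rate: cached reads over all prefix tokens
    that went through the cache (reads + writes)."""
    reads = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    writes = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    total = reads + writes
    return reads / total if total else 0.0

sample_usages = [  # made-up usage objects in Anthropic's response shape
    {"input_tokens": 5000, "cache_read_input_tokens": 15000,
     "cache_creation_input_tokens": 0},
    {"input_tokens": 5000, "cache_read_input_tokens": 0,
     "cache_creation_input_tokens": 15000},
]
print(f"hit rate: {cache_hit_rate(sample_usages):.0%}")
```

Tracking this per deploy makes cache-busting regressions show up as a sudden drop.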

Provider comparison

Caching works differently across providers. Here's a quick comparison:

| Feature | Anthropic | OpenAI | Google Gemini |
|---|---|---|---|
| Opt-in | Explicit (cache_control) | Automatic | Both implicit and explicit |
| Read discount | 90% (0.1x) | 50% (0.5x) | 75–90% |
| Write premium | 1.25x (5min TTL) / 2x (1hr TTL) | None | Storage: ~$4.50/MTok/hr |
| TTL | 5 minutes or 1 hour | 5–10 minutes | Configurable (default 1hr) |
| Min tokens | 1,024 (Sonnet/Opus) / 2,048 (Haiku) | 1,024 (128-token increments) | Varies by model |
| Max breakpoints | 4 per request | N/A (automatic) | N/A |

The key tradeoff: Anthropic gives you the deepest discount (90%) but requires explicit opt-in and charges for writes. OpenAI is zero-effort but the discount is smaller (50%). Google sits between — automatic caching is available, but explicit caching with configurable TTLs offers deeper savings with ongoing storage costs.

The economics at scale

Let's make this concrete. Consider a production app making 10,000 API calls per day with 20,000 input tokens per call:

Without caching:

  • 200M input tokens/day at full price

With well-designed caching (15,000 stable tokens, 5,000 variable, 85% hit rate), each call breaks down as:

  • 15,000 tokens × 85% = 12,750 tokens at 0.1x price (cache reads)
  • 15,000 tokens × 15% = 2,250 tokens at 1.25x price (cache misses, which re-write the cache)
  • 5,000 tokens always at full price (variable portion)
  • Effective cost reduction: ~55% on input tokens

At Claude Sonnet's $3/MTok input price, that's the difference between spending roughly $600/day and $275/day on input tokens alone. Over a year, you're looking at nearly $120,000 in savings, from an architecture change, not a feature cut.
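The arithmetic above, sketched out under the same assumptions (a $3/MTok base input price and the 1.25x write premium on misses):

```python
PRICE_PER_MTOK = 3.0  # assumed Claude Sonnet base input price, $/MTok
CALLS, STABLE, VARIABLE, HIT_RATE = 10_000, 15_000, 5_000, 0.85

def daily_cost(tokens_equiv_per_call: float) -> float:
    """Daily input spend in dollars for a given effective token count."""
    return CALLS * tokens_equiv_per_call / 1_000_000 * PRICE_PER_MTOK

uncached = daily_cost(STABLE + VARIABLE)
cached = daily_cost(STABLE * HIT_RATE * 0.1           # cache reads
                    + STABLE * (1 - HIT_RATE) * 1.25  # misses re-write the cache
                    + VARIABLE)                       # always full price
print(f"${uncached:.0f}/day -> ${cached:.0f}/day, "
      f"{1 - cached / uncached:.0%} cheaper")
```

Swap in your own call volume, prefix split, and hit rate to estimate your ceiling.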

This post is part of our complete LLM token optimization strategy guide. For related topics, see reducing OpenAI and Claude API token costs and cutting MCP and tool overhead.