Designing for Prompt Cache Hits: How to Save 90% on LLM Input Tokens

Prompt caching is the most powerful cost-reduction feature available on modern LLM APIs. Anthropic's cache reads cost 0.1x the base input price — a 90% discount. OpenAI automatically caches repeated prefixes for a 50% input discount — helpful, but less dramatic than Anthropic's 90%. Google Gemini offers both implicit (automatic) and explicit caching at 75–90% savings.

But many teams enable caching and then wonder why their cache hit rates are low. The problem is almost always the same: their prompts aren't designed for caching.

Caching isn't a toggle you flip. It's an architecture you build around.

How prompt caching works (the critical detail)

Prompt caching works by storing the processed internal state of your input prefix. On subsequent requests, if the prefix matches exactly, that cached state is reused, skipping the expensive prefill computation over those tokens.

The critical detail: cache hits require 100% identical prefix segments. If even one token in the cached portion changes between requests, the entire segment is a cache miss. You pay the full input price plus a cache write cost for re-caching.

There are also minimum token thresholds for caching to work at all. Anthropic requires at least 1,024 tokens for Sonnet and Opus models (2,048 for Haiku). OpenAI requires at least 1,024 tokens with 128-token increments. Prompts shorter than these minimums won't be cached regardless of how stable they are.

This means cache design is really about maximizing the size of your stable prefix — the portion of your prompt that stays identical across requests.

The cache write premium

Caching isn't free on the write side. Anthropic charges 1.25x the base input price for cache writes with the default 5-minute TTL. That means the first request with a new prefix costs more than a regular request. Caching only pays off when you get subsequent cache reads at 0.1x.

The break-even is simple: one cache write at 1.25x followed by one cache read at 0.1x gives you a blended cost of 0.675x, already a net win. After two reads, you're at 0.483x. The more reads per write, the closer you get to the 0.1x floor. At the default multipliers, the break-even hit rate is only about 22%; at 85%+, the savings are substantial.
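To see where the break-even falls for any hit rate, here's a minimal sketch of the blended-cost arithmetic, assuming Anthropic's default 1.25x write and 0.1x read multipliers:

```python
def blended_cost_multiplier(hit_rate: float,
                            write_mult: float = 1.25,
                            read_mult: float = 0.1) -> float:
    """Expected input cost per request as a multiple of the base price:
    hits pay the read discount, misses pay the cache-write premium."""
    return hit_rate * read_mult + (1 - hit_rate) * write_mult

# Break-even (blended cost = 1.0x) lands at roughly a 22% hit rate:
for rate in (0.2, 0.5, 0.85):
    print(f"hit rate {rate:.0%}: {blended_cost_multiplier(rate):.3f}x")
```

At a 50% hit rate this gives the 0.675x figure above; at 85% it's down around 0.27x.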

The stable prefix pattern

The fundamental design pattern for cache-friendly prompts is simple:

[Stable content — cached]    → System instructions, background, tool definitions
[Semi-stable content]        → Few-shot examples, reference docs
[Variable content — not cached] → User input, conversation history

Everything that stays the same across requests goes first. Everything that changes goes last.

This seems obvious, but most developers structure their prompts the other way around — putting the user's question first and the context after. Inverting this order is often the single biggest improvement in cache hit rates.
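As a concrete sketch, here's the stable-first layout as an Anthropic Messages API payload. This only builds the request dict (no network call), and the system strings and model name are placeholders:

```python
SYSTEM_RULES = "You are a support agent. Answer in JSON with keys ..."  # stable
STYLE_GUIDE = "Tone: concise and friendly. Cite the docs when ..."      # stable

def build_request(user_message: str) -> dict:
    """Stable content first, per-request content last. The cache_control
    marker on the final stable block asks the API to cache everything
    up to and including that block."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": SYSTEM_RULES},
            {"type": "text", "text": STYLE_GUIDE,
             "cache_control": {"type": "ephemeral"}},  # prefix cached here
        ],
        # Variable content sits after the cached prefix:
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("Where is my order #1234?")
```

Every request reuses the same byte-identical system blocks, so only the final `messages` turn varies.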

What to put in your stable prefix

The best candidates for caching are:

  • System instructions: Your model's persona, rules, constraints, output format requirements. These rarely change between requests.
  • Tool definitions: If you're using function calling, tool schemas are typically identical across requests. At 500–2,000 tokens per tool, caching 10 tools saves 5,000–20,000 tokens per request.
  • Background context: Project documentation, API references, style guides — anything that provides context but doesn't change per-request.
  • Few-shot examples: If you use consistent examples, they're prime caching material. Just don't shuffle them between requests (see below).
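Tool definitions can be folded into the same cached prefix: in Anthropic's API, a cache_control marker on the last tool caches the entire (fixed-order) tool list. A sketch with hypothetical tool schemas:

```python
def with_cached_tools(tools: list[dict]) -> list[dict]:
    """Mark the final tool so the whole tool list is cached as one prefix
    segment. Copies the dicts so the caller's definitions stay untouched."""
    if not tools:
        return tools
    marked = [dict(t) for t in tools]
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked

TOOLS = [  # hypothetical tool schemas
    {"name": "get_order", "description": "Look up an order by id",
     "input_schema": {"type": "object",
                      "properties": {"id": {"type": "string"}}}},
    {"name": "issue_refund", "description": "Refund an order",
     "input_schema": {"type": "object",
                      "properties": {"id": {"type": "string"}}}},
]

cached_tools = with_cached_tools(TOOLS)
```

The key discipline is that the list's contents and order never change between requests.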

Cache-busting mistakes

These are the most common patterns that accidentally destroy cache hit rates:

Timestamps in system prompts. "Today's date is March 6, 2026" in your system prompt means the cache invalidates every day. If you need the model to know the date, put it in the variable section after the cached prefix, or update it less frequently.

Shuffled few-shot examples. If you randomize the order of your examples on each request "for variety," every order is a unique prefix. Pick a fixed order and stick with it.

Dynamic tool lists. If your available tools change between requests — some tools enabled, some disabled — the tool definition section changes and the cache misses. Either load all tools consistently or use the on-demand tool loading pattern (see reducing tool overhead).

Per-user context in the prefix. Putting user-specific data (name, preferences, history) into the system prompt means each user gets a unique prefix. Move user context to the variable section.

Version strings or build hashes. Embedding deployment metadata in your prompt invalidates the cache on every deploy.

Multi-tier caching with breakpoints

Different parts of your prompt change at different rates:

  • System instructions: change rarely (monthly)
  • Tool definitions: change occasionally (weekly)
  • Background docs: change sometimes (as docs update)
  • User conversation: changes every request

Anthropic supports explicit cache breakpoints (up to 4 per request) that let you cache these tiers independently. A change in your background docs doesn't invalidate the cache for your system instructions and tool definitions.

Anthropic also offers an extended 1-hour TTL at 2x write cost (vs the default 5-minute TTL at 1.25x). This is useful for prompts that are called less frequently but remain stable — the higher write cost pays off if you'd otherwise keep re-caching the same content every 5 minutes.

The pattern:

[System instructions]           → Breakpoint 1 (stable for months)
[Tool definitions]              → Breakpoint 2 (stable for weeks)
[Background docs / examples]    → Breakpoint 3 (stable for days)
[User input + conversation]     → Not cached

If you update your background docs, only that segment re-caches. The first two tiers still hit.
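Here's a sketch of that tiered layout as an Anthropic Messages API payload, again payload construction only. The `"ttl": "1h"` field follows Anthropic's extended-TTL syntax; the strings and model name are placeholders:

```python
SYSTEM_INSTRUCTIONS = "Persona, rules, output format ..."  # changes monthly
TOOL_GUIDANCE = "How and when to call each tool ..."       # changes weekly

def build_tiered_request(user_turn: str, background_docs: str) -> dict:
    """Three cache breakpoints (Anthropic allows up to 4), slowest-changing
    first. The 1-hour TTL on the most stable tier costs 2x to write but
    survives quiet gaps that would expire the default 5-minute cache."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": SYSTEM_INSTRUCTIONS,        # tier 1
             "cache_control": {"type": "ephemeral", "ttl": "1h"}},
            {"type": "text", "text": TOOL_GUIDANCE,              # tier 2
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": background_docs,            # tier 3
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [{"role": "user", "content": user_turn}],    # not cached
    }

tiered = build_tiered_request("Summarize ticket #42", "Docs v3 ...")
```

Editing `background_docs` re-writes only tier 3; tiers 1 and 2 keep hitting.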

Measuring cache performance

You can't optimize what you don't measure. Key metrics for prompt caching:

  • Cache hit rate: What percentage of requests hit the cache vs. miss? Aim for 80%+ on steady-state traffic.
  • Cache read vs. write tokens: In your API usage dashboard, compare cached read tokens to cache write tokens. High write-to-read ratios indicate frequent cache misses.
  • Cost per request before vs. after: Track the actual cost impact. A well-designed caching setup can reduce input token costs by 70–90%.

Both Anthropic and OpenAI provide usage breakdowns that separate cached from uncached token counts. Monitor these regularly to catch cache-busting regressions.
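For example, Anthropic responses report `cache_read_input_tokens` and `cache_creation_input_tokens` in the usage object; aggregating them gives a token-weighted hit rate. A sketch with made-up numbers:

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Token-weighted hit rate: cached reads over all prefix tokens
    that went through the cache (reads + writes)."""
    reads = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    writes = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    total = reads + writes
    return reads / total if total else 0.0

sample_usages = [  # made-up usage objects in Anthropic's response shape
    {"input_tokens": 5000, "cache_read_input_tokens": 15000,
     "cache_creation_input_tokens": 0},
    {"input_tokens": 5000, "cache_read_input_tokens": 0,
     "cache_creation_input_tokens": 15000},
]
print(f"hit rate: {cache_hit_rate(sample_usages):.0%}")
```

Tracking this per deploy makes cache-busting regressions show up as a sudden drop.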

Provider comparison

Caching works differently across providers. Here's a quick comparison:

| Feature | Anthropic | OpenAI | Google Gemini |
|---|---|---|---|
| Opt-in | Explicit (cache_control) | Automatic | Both implicit and explicit |
| Read discount | 90% (0.1x) | 50% (0.5x) | 75–90% |
| Write premium | 1.25x (5min TTL) / 2x (1hr TTL) | None | Storage: ~$4.50/MTok/hr |
| TTL | 5 minutes or 1 hour | 5–10 minutes | Configurable (default 1hr) |
| Min tokens | 1,024 (Sonnet/Opus) / 2,048 (Haiku) | 1,024 (128-token increments) | Varies by model |
| Max breakpoints | 4 per request | N/A (automatic) | N/A |

The key tradeoff: Anthropic gives you the deepest discount (90%) but requires explicit opt-in and charges for writes. OpenAI is zero-effort but the discount is smaller (50%). Google sits between — automatic caching is available, but explicit caching with configurable TTLs offers deeper savings with ongoing storage costs.

The economics at scale

Let's make this concrete. Consider a production app making 10,000 API calls per day with 20,000 input tokens per call:

Without caching:

  • 200M input tokens/day at full price

With well-designed caching (15,000 stable tokens, 5,000 variable, 85% hit rate), each call breaks down as:

  • 15,000 tokens × 85% = 12,750 tokens at 0.1x price (cache reads)
  • 15,000 tokens × 15% = 2,250 tokens at 1.25x price (cache misses, which re-write the cache)
  • 5,000 tokens always at full price (variable portion)
  • Effective cost reduction: ~55% on input tokens

At Claude Sonnet's $3/MTok input price, that's the difference between spending roughly $600/day and $275/day on input tokens alone. Over a year, you're looking at nearly $120,000 in savings, from an architecture change, not a feature cut.
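The arithmetic above, sketched out under the same assumptions (a $3/MTok base input price and the 1.25x write premium on misses):

```python
PRICE_PER_MTOK = 3.0  # assumed Claude Sonnet base input price, $/MTok
CALLS, STABLE, VARIABLE, HIT_RATE = 10_000, 15_000, 5_000, 0.85

def daily_cost(tokens_equiv_per_call: float) -> float:
    """Daily input spend in dollars for a given effective token count."""
    return CALLS * tokens_equiv_per_call / 1_000_000 * PRICE_PER_MTOK

uncached = daily_cost(STABLE + VARIABLE)
cached = daily_cost(STABLE * HIT_RATE * 0.1           # cache reads
                    + STABLE * (1 - HIT_RATE) * 1.25  # misses re-write the cache
                    + VARIABLE)                       # always full price
print(f"${uncached:.0f}/day -> ${cached:.0f}/day, "
      f"{1 - cached / uncached:.0%} cheaper")
```

Swap in your own call volume, prefix split, and hit rate to estimate your ceiling.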

This post is part of our complete LLM token optimization strategy guide. For related topics, see reducing OpenAI and Claude API token costs and cutting MCP and tool overhead.