Token-Efficient Prompting Patterns: Chain of Draft, Output Formats, and Prompt Compression
Most token optimization advice focuses on what you send to the model — trimming context, caching prefixes, pruning tools. But the way you prompt the model matters just as much. The right prompting patterns can reduce token usage by 50–90% with minimal impact on quality.
This guide covers four techniques that are underused in production: Chain of Draft, output format optimization, prompt compression, and semantic caching.
Chain of Draft: 92% fewer reasoning tokens
Chain of Thought (CoT) prompting — asking the model to "think step by step" — has been the default technique for improving accuracy on complex tasks since 2022. But CoT is expensive. The model generates long, verbose reasoning traces that consume output tokens (the most expensive kind).
Chain of Draft (CoD) is a 2025 technique that matches CoT accuracy while using as little as 7.6% of the tokens — a 92% reduction.
The idea is simple: instead of "think step by step," you instruct the model to keep each reasoning step to a minimum — roughly 5 words per step. The model still reasons through the problem, but it drafts each step rather than writing an essay about it.
A simplified prompt addition:
Think step by step, but for each step, write only the minimal
necessary text (ideally ~5 words). Skip explanations.
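In practice, the instruction can simply be prepended as a system message. A minimal sketch, assuming a standard chat-style messages payload (the helper name is illustrative; the "####" answer separator follows the pattern used in the CoD paper):

```python
# Chain of Draft: attach a drafting instruction as the system message.
# The "####" separator marks where the final answer should appear.
COD_INSTRUCTION = (
    "Think step by step, but for each step, write only the minimal "
    "necessary text (ideally ~5 words). Skip explanations. "
    "Return the final answer after '####'."
)

def cod_messages(question: str) -> list[dict]:
    """Build a chat payload that elicits drafted, not narrated, reasoning."""
    return [
        {"role": "system", "content": COD_INSTRUCTION},
        {"role": "user", "content": question},
    ]

messages = cod_messages("A train travels 120 km in 1.5 hours. Average speed?")
```

The same wrapper works with any chat-completion API; only the instruction text changes relative to a classic CoT prompt.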
Research on arithmetic, common-sense, and symbolic reasoning benchmarks showed CoD matched or surpassed CoT accuracy across the board. The savings come entirely from shorter reasoning traces — the model still follows the same logical steps, it just doesn't narrate them.
When CoT isn't worth the tokens
Here's an important nuance: reasoning models (OpenAI's o3-mini, o4-mini, and similar) already do internal chain-of-thought. Adding explicit CoT prompting to these models yields only a 2.9–3.1% accuracy improvement — far too small to justify the 10–13x token overhead. Save explicit reasoning prompts for standard models.
Related techniques worth knowing:
- Concise Chain-of-Thought (CCoT): Reduces response length by ~49% on multiple-choice tasks by prompting for brief justifications
- Focused Chain-of-Thought (F-CoT): Reduces token usage 2–3x via structured input that focuses the model's attention
Output format: JSON costs 2x more than you think
The format you request for structured output has a significant impact on token count. JSON — the default choice for most developers — is one of the most token-expensive formats available.
JSON's overhead comes from its syntax: curly braces, square brackets, colons, commas, and quoted keys all consume tokens. For the same data, JSON typically uses about twice as many tokens as more compact alternatives.
Consider a simple data structure in different formats:
{"name": "Alice", "age": 30, "role": "engineer"}
name: Alice
age: 30
role: engineer
Alice\t30\tengineer
The JSON version uses roughly 19 tokens. The YAML version uses about 13. The TSV version uses about 7. At scale — millions of API calls processing structured data — this adds up fast.
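A quick way to see the difference is to serialize the same record in each format and compare sizes. Character counts here are only a rough proxy; exact token counts depend on the tokenizer:

```python
import json

record = {"name": "Alice", "age": 30, "role": "engineer"}

as_json = json.dumps(record)  # quoted keys, braces, commas
as_yaml = "\n".join(f"{k}: {v}" for k, v in record.items())  # minimal YAML-style
as_tsv = "\t".join(str(v) for v in record.values())  # values only, schema implied

# Character counts as a rough proxy for token counts.
for label, text in [("json", as_json), ("yaml", as_yaml), ("tsv", as_tsv)]:
    print(f"{label}: {len(text)} chars")
```

Note that the TSV version drops the keys entirely, which only works when producer and consumer agree on the column order out of band.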
TOON (Token-Oriented Object Notation) is a format designed specifically for LLM efficiency, released in late 2024. It strips JSON's syntactic overhead while remaining machine-parseable, reducing token usage by 30–60% compared to JSON.
When to care about output format
This optimization matters most when:
- You're processing high volumes of structured data (classification results, extracted entities, metadata)
- The structured output is an intermediate step (not shown to users)
- You're running batch processing pipelines where token costs compound
When the output goes directly to a frontend or a typed API, JSON's compatibility benefits usually outweigh its token cost. But for internal pipelines, consider YAML or TSV as drop-in replacements.
Prompt compression: 20x reduction for long contexts
Prompt compression tools like LLMLingua analyze your prompt and remove tokens that contribute least to the model's understanding. The results can be dramatic: an 800-token prompt compressed to 40 tokens (95% reduction) while preserving the model's ability to answer correctly.
More conservatively, 2–5x compression ratios are typical in production while maintaining quality. The technique works by:
- Scoring each token in the prompt by how much it contributes to the model's predictions
- Removing low-contribution tokens (articles, filler words, redundant context)
- Passing the compressed prompt to the target model
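The score-and-prune idea can be illustrated with a toy sketch. This is not LLMLingua itself: real compressors score tokens with a small language model, whereas here a hand-picked stopword list stands in for "low-contribution" tokens:

```python
# Toy illustration of score-and-prune prompt compression.
# A stopword list stands in for a real model's token-contribution scores.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "that", "and", "in", "it",
             "this", "very", "please", "really"}

def naive_compress(prompt: str, keep_ratio: float = 0.6) -> str:
    words = prompt.split()
    # Rank indices so non-stopwords come first (stable sort keeps word order
    # within each group), then keep only the top `keep_ratio` fraction.
    ranked = sorted(range(len(words)),
                    key=lambda i: words[i].lower() in STOPWORDS)
    budget = max(1, int(len(words) * keep_ratio))
    keep = sorted(ranked[:budget])  # restore original word order
    return " ".join(words[i] for i in keep)

prompt = "Please summarize the key findings of the report in a single sentence."
print(naive_compress(prompt, keep_ratio=0.5))
# → "summarize key findings report single sentence."
```

The compressed prompt remains intelligible to the model even though it is no longer grammatical English, which is exactly the property real compressors exploit.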
This is most useful for RAG pipelines where retrieved chunks contain a lot of surrounding text that's not directly relevant. Compressing retrieved context before injection can significantly reduce input token costs.
The tradeoff: compression adds a preprocessing step (typically using a small model like GPT-2 or a dedicated compression model). This adds latency and a small compute cost. For real-time applications, the latency may not be acceptable. For batch workloads or high-volume RAG systems, the net savings are substantial.
Semantic caching: save tokens you've already spent
Semantic caching is an application-level optimization — distinct from the provider-level prompt caching covered in our cache architecture guide.
Provider prompt caching requires exact prefix matches. Semantic caching uses embeddings to match queries that are semantically similar, even if they're worded differently. "How do I reset my password?" and "I forgot my password, what do I do?" would be a cache miss for prompt caching but a hit for semantic caching.
Implementation typically uses a vector database (Redis, Pinecone, Weaviate) to store query-response pairs:
- On each new query, generate an embedding
- Search for similar queries in the cache (cosine similarity above a threshold)
- If found, return the cached response without calling the LLM
- If not, call the LLM and store the result
In high-repetition workloads (customer support, FAQ-style queries), semantic caching has demonstrated up to 73% cost reduction. The key consideration is setting the similarity threshold correctly — too low and you serve stale or incorrect cached responses; too high and you rarely get hits.
The strongest approach is double caching: provider-level prompt caching for the stable prefix (system instructions, tools) combined with application-level semantic caching for repeated queries. These stack — you can get both the 90% discount on cached prefixes and avoid the API call entirely for semantically similar questions.
References
- Chain of Draft: Thinking Faster by Writing Less — The original CoD paper demonstrating 92% token reduction while matching CoT accuracy
- TOON: Token-Oriented Object Notation — 30–60% token reduction over JSON for structured data
- LLMLingua: Compressing Prompts for Accelerated Inference — Up to 20x prompt compression while preserving model accuracy
- Redis: LLM Token Optimization with Semantic Caching — Semantic caching achieving ~73% cost reduction
- Concise Chain-of-Thought Prompting — 49% response length reduction on multiple-choice tasks
This guide is part of our complete LLM token optimization strategy guide. For related topics, see designing for prompt cache hits and reducing OpenAI and Claude API token costs.