
Cached Tokens Explained: Save 50-90% on AI Costs

Learn how cached tokens cut AI API costs, when prompt caching applies, and how to design GPT, Claude, and Gemini workflows for 50-90% savings.

By AI Pricing Guru Editorial Team

Cached tokens are one of the easiest AI cost savings to miss.

Most teams understand that AI APIs charge for input tokens and output tokens. Fewer teams model the third line item that now matters for serious production workloads: cached input tokens.

That is a problem, because cached tokens can turn an expensive agent, coding assistant, RAG app, or support bot into something much easier to scale. In many current rate cards, repeated prompt text is priced 50-90% lower than normal input tokens.

If you are sending the same system prompt, policy instructions, product catalog, repo context, or retrieval wrapper again and again, caching may be the difference between a workflow that feels too expensive and one that is commercially viable.

For live provider rates, keep the OpenAI pricing page, Anthropic pricing page, Google Gemini pricing page, and token cost calculator open while you read.

The Short Version

Provider / model | Standard input | Cached input | Effective savings
GPT-5.4 mini | $0.75 / 1M | $0.075 / 1M | 90%
GPT-5.4 | $2.50 / 1M | $0.25 / 1M | 90%
Claude Sonnet 4.6 | $3.00 / 1M | $0.30 / 1M | 90%
Gemini 2.5 Flash | $0.30 / 1M | $0.03 / 1M | 90%
GPT-4.1 | $2.00 / 1M | $0.50 / 1M | 75%

The exact mechanics differ by provider, but the pricing lesson is consistent:

Repeated input is often much cheaper than fresh input.

That means your AI cost model should separate three things:

  1. Fresh input tokens — new user text, new documents, new retrieved chunks
  2. Cached input tokens — repeated prompt prefixes or reused context
  3. Output tokens — what the model generates

If you only estimate total input tokens, you will overestimate some workloads and underestimate others.
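To make that concrete, here is a minimal sketch of the three-line-item model in Python. The token counts and rates are placeholders, not any provider's real numbers; substitute the figures from the rate card you actually use.

def request_cost(fresh_input, cached_input, output,
                 input_rate, cached_rate, output_rate):
    # Token counts are per request; rates are USD per 1M tokens.
    return (fresh_input * input_rate
            + cached_input * cached_rate
            + output * output_rate) / 1_000_000

# Illustrative only: 20K-token cached prefix, 3K fresh input, 1K output
# at $1.00 / $0.10 / $4.00 per 1M tokens.
print(request_cost(3_000, 20_000, 1_000, 1.00, 0.10, 4.00))  # about $0.009 per request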

What Are Cached Tokens?

Cached tokens are input tokens the provider recognizes as repeated from a previous request and bills at a discounted rate.

In plain English: if you keep sending the same beginning of a prompt, the provider may not need to process it from scratch every time. Instead, it can reuse cached computation and charge you less for that repeated portion.

A typical production prompt might look like this:

[system instructions]
[company policy]
[formatting rules]
[tool descriptions]
[retrieval context]
[user question]

The first several blocks often stay stable across requests. The user question changes. Some retrieval context changes. The formatting rules and tool descriptions may be identical all day.

Prompt caching rewards that structure.

The more stable the repeated prefix, the more tokens can potentially be billed at the cached-input rate instead of the full-input rate.

Why Cached Tokens Matter More in 2026

Three trends make caching much more important than it used to be.

1. Prompts are getting longer

Modern AI apps do not just send a short user message. They send system instructions, examples, safety rules, tool schemas, memory, retrieved documents, and conversation history.

That makes input cost a bigger part of the bill.

If you are building with long-context models, read Understanding Context Windows and Their Cost Impact next. Context size and caching are tightly connected.

2. Agent workflows reuse a lot of context

Agents often run multiple calls against the same task state:

  • plan the task
  • call a tool
  • inspect the result
  • update the plan
  • call another tool
  • write the final answer

Without caching, every turn can resend the same instructions and tool definitions at full price. With caching, those repeated prefixes can become much cheaper.

This is especially relevant for coding tools, support agents, research assistants, and internal automation.

3. Output is still expensive, so input savings compound

Cached tokens do not make output cheaper. But they lower the repeated-input side of the bill, which gives you more room to spend where quality matters.

For example, a support agent may need a strong model for final answers, but it should not pay full price every time it sends the same policy preamble.

That is the core value of caching: spend less on repetition, preserve budget for reasoning and output.

A Simple Cost Example

Imagine a coding assistant that sends the following on every request:

  • 60,000 tokens of stable repo instructions, tool descriptions, and project context
  • 5,000 tokens of fresh user request and fresh file snippets
  • 4,000 output tokens

Now assume the workload runs on GPT-5.4 mini:

  • Standard input: $0.75 / 1M tokens
  • Cached input: $0.075 / 1M tokens
  • Output: $4.50 / 1M tokens

First request, before caching helps

Input cost:

  • 65,000 input tokens × $0.75 / 1M = $0.04875

Output cost:

  • 4,000 output tokens × $4.50 / 1M = $0.01800

Total:

  • $0.06675 per request

That is not huge, but it adds up quickly at scale.

Repeat request, with 60K tokens cached

Fresh input cost:

  • 5,000 tokens × $0.75 / 1M = $0.00375

Cached input cost:

  • 60,000 tokens × $0.075 / 1M = $0.00450

Output cost:

  • 4,000 tokens × $4.50 / 1M = $0.01800

Total:

  • $0.02625 per request

That is a 61% lower total request cost even though output did not get cheaper.

The larger and more stable your prompt prefix is, the more caching matters.
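If you want to check those numbers yourself, the arithmetic is short (same rates as above, in USD per 1M tokens):

uncached = (65_000 * 0.75 + 4_000 * 4.50) / 1_000_000                # 0.06675
cached = (5_000 * 0.75 + 60_000 * 0.075 + 4_000 * 4.50) / 1_000_000  # about 0.02625
print(f"savings: {1 - cached / uncached:.0%}")                       # prints: savings: 61%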

Where Prompt Caching Helps Most

Cached tokens are not equally valuable everywhere. They matter most when you repeatedly send a large amount of stable context.

Coding assistants

Coding tools often reuse:

  • repository summaries
  • coding standards
  • tool definitions
  • system prompts
  • test instructions
  • project architecture notes

That makes them natural caching candidates.

If your team is comparing coding tools and model APIs, our Best AI for Coding in 2026 guide gives the broader product-level view.

RAG applications

RAG apps can benefit when the prompt wrapper is stable even if retrieved chunks change.

Examples:

  • answer format rules
  • citation requirements
  • role instructions
  • compliance constraints
  • tool schemas

The retrieved content may be fresh, but the surrounding structure can still be cache-friendly.

Customer support agents

Support bots often include the same policy, escalation rules, brand voice, and response format on every call.

Caching helps because you should not pay full input price repeatedly for instructions that barely change.

Batch enrichment jobs

If you process thousands of similar records with the same prompt template, caching can lower the cost of the repeated instructions.

For offline jobs, also compare batch discounts where available. Prompt caching and batch pricing can stack conceptually: one reduces repeated input cost, the other reduces the overall rate for async work.

Where Cached Tokens Do Not Help Much

Caching is powerful, but it is not magic.

It helps less when:

  • every prompt is completely different
  • the repeated prefix is tiny
  • you constantly reorder instructions
  • you inject changing values near the start of the prompt
  • output tokens dominate the bill
  • the provider does not cache your prompt shape

This last point matters. Prompt caching usually depends on exact or near-exact reuse of a prefix. If your app inserts timestamps, random IDs, user-specific metadata, or changing retrieval content at the top of the prompt, you may accidentally break caching.

How to Design Prompts for Better Caching

The best caching strategy is mostly prompt hygiene.

1. Put stable content first

Order the prompt from the parts least likely to change to the parts most likely to change:

  1. system instructions
  2. developer rules
  3. tool schemas
  4. response format
  5. stable examples
  6. user-specific or request-specific content

Do not put volatile data at the top if you can avoid it.
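One way to enforce that ordering is to assemble the prompt from named blocks, stable ones first. This is only a sketch with made-up block contents; the point is that the joined prefix stays byte-identical across requests while request-specific content lands at the end.

# Hypothetical stable blocks; in a real app these are your own text.
SYSTEM_INSTRUCTIONS = "You are a support assistant for Acme. Follow policy strictly."
TOOL_SCHEMAS = "Tools: lookup_order(order_id), issue_refund(order_id, amount)."
RESPONSE_FORMAT = "Reply in three short paragraphs and cite the policy section."

# Joined once and reused verbatim, so repeated requests share the same leading tokens.
STABLE_PREFIX = "\n\n".join([SYSTEM_INSTRUCTIONS, TOOL_SCHEMAS, RESPONSE_FORMAT])

def build_prompt(fresh_context: str, user_message: str) -> str:
    # Volatile content goes last, after the stable prefix.
    return "\n\n".join([STABLE_PREFIX, fresh_context, user_message])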

2. Keep wording stable

If you rewrite the system prompt every deploy, caching effectiveness can drop.

That does not mean you should never improve prompts. It means you should treat prompt changes like code changes: deliberate, versioned, and measured.

3. Separate stable and fresh retrieval context

In RAG apps, do not mix stable instructions with changing document chunks in one messy block.

Use a structure like:

[stable instructions]
[stable output format]
[stable citation rules]
[fresh retrieved passages]
[fresh user question]

This gives the provider a better chance to cache the repeated prefix.
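In a chat-style API the same separation might look like the sketch below. The message shape is generic rather than any specific SDK's types, and the instructions are invented for illustration.

def build_messages(retrieved_passages: list[str], question: str) -> list[dict]:
    # Stable wrapper: identical on every call, so it can act as a cacheable prefix.
    stable_system = (
        "Answer using only the provided passages. "
        "Cite passage numbers in square brackets. "
        "If the passages do not contain the answer, say so."
    )
    # Fresh content: changes on every call, so it goes last.
    fresh_context = "\n\n".join(
        f"[{i + 1}] {p}" for i, p in enumerate(retrieved_passages)
    )
    return [
        {"role": "system", "content": stable_system},
        {"role": "user", "content": fresh_context + "\n\n" + question},
    ]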

4. Avoid timestamps and random IDs in the prefix

A timestamp at the top of the prompt can make every request look different.

If you need request metadata, put it later or only include it when the model truly needs it.
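A small before-and-after sketch of the same idea, with made-up content:

from datetime import datetime, timezone

STABLE_INSTRUCTIONS = "You are a billing assistant. Follow the refund policy below."
user_message = "Why was I charged twice this month?"
timestamp = datetime.now(timezone.utc).isoformat()

# Breaks prefix reuse: the very first tokens differ on every request.
bad_prompt = f"Request time: {timestamp}\n\n{STABLE_INSTRUCTIONS}\n\n{user_message}"

# Keeps the stable instructions as the shared prefix; the timestamp rides with the fresh tail.
good_prompt = f"{STABLE_INSTRUCTIONS}\n\nRequest time: {timestamp}\n\n{user_message}"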

5. Measure first-run and repeat-run cost separately

A cached workflow has two distinct cost profiles:

  • the first run, where most input may be uncached
  • repeat runs, where stable input may be discounted

Do not judge a workflow only by the first request.

Cached Tokens vs Fine-Tuning vs RAG

Caching is often confused with other optimization tools.

Here is the difference:

Technique | Main goal | Best for
Prompt caching | Lower repeated input cost | Stable prompts, agents, tools, support bots
RAG | Add fresh external knowledge | Docs, support, knowledge bases, search
Fine-tuning | Change model behavior or style | Repeated task patterns, classification, tone
Model routing | Use cheaper models when possible | High-volume apps with mixed difficulty

Caching is usually the lowest-friction optimization because it does not require training data or a new architecture. You are mostly arranging prompts so repeated content stays repeatable.

How Much Can You Really Save?

Provider rate cards often advertise cached-input prices that are 50-90% cheaper than standard input. But total request savings depend on your mix.

If input is most of your bill and much of it is repeated, caching can be dramatic.

If output dominates your bill, caching still helps, but the total percentage savings will be smaller.

A practical way to estimate savings (there is a code sketch after these steps):

  1. Count average input tokens per request
  2. Split them into stable repeated tokens and fresh tokens
  3. Apply cached-input pricing to the repeated portion
  4. Apply normal input pricing to fresh tokens
  5. Add output cost separately
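Here are those five steps as a rough sketch. The rates and token counts in the example call are illustrative, not a quote for any model.

def estimate_savings(total_input, stable_share, output,
                     input_rate, cached_rate, output_rate):
    # stable_share: fraction of input you expect to be a repeated, cache-eligible prefix.
    # Rates are USD per 1M tokens; token counts are per request.
    stable = total_input * stable_share
    fresh = total_input - stable
    without_cache = (total_input * input_rate + output * output_rate) / 1_000_000
    with_cache = (fresh * input_rate + stable * cached_rate + output * output_rate) / 1_000_000
    return without_cache, with_cache, 1 - with_cache / without_cache

# Example: 40K input tokens (80% stable), 2K output,
# at $2.50 / $0.25 / $10.00 per 1M tokens.
print(estimate_savings(40_000, 0.80, 2_000, 2.50, 0.25, 10.00))  # about (0.12, 0.048, 0.6): a 60% saving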

You can do that manually or use our token cost calculator to model the same workload across providers.

For a broader budgeting walkthrough, read How to Calculate AI API Costs.

Bottom Line

Cached tokens are not a niche billing detail anymore. They are one of the main levers for controlling AI API costs.

The rule is simple:

Put stable prompt content first, keep it stable, and measure cached-input cost separately from fresh-input cost.

If you do that, many production workflows become much cheaper without changing models or lowering quality.

Caching will not fix every AI bill. You still need model routing, output limits, retrieval discipline, and good prompt design. But if your app sends the same context again and again, cached tokens are probably the first savings lever to check.

FAQ

Are cached tokens automatic?

Sometimes. Some providers apply caching automatically to eligible repeated prefixes, while others expose more explicit prompt-caching behavior. Always check the current provider docs before assuming a workload qualifies.

Do cached tokens reduce output cost?

No. Cached-token discounts apply to repeated input. Output tokens are still billed at the model’s normal output rate.

What is the easiest way to improve cache hit rate?

Put stable instructions, tool schemas, and formatting rules at the beginning of the prompt. Move request-specific data, timestamps, retrieved documents, and user messages later.

Is prompt caching worth it for small prompts?

Usually less so. If your prompt is only a few hundred tokens, output cost and model choice may matter more. Caching becomes more valuable as repeated input grows.

Can cached tokens make premium models affordable?

They can help a lot, especially for long-context and agent workflows. But premium models still have expensive output rates, so you should combine caching with output limits and cheaper-model routing where possible.