How to Calculate AI API Costs (2026)
A practical guide to estimating OpenAI, Claude, Gemini, and DeepSeek API spend, with simple formulas, worked examples, and common budgeting mistakes.
Most teams do not overspend on AI because the per-token rates are confusing. They overspend because they guess.
The fix is simple: break your usage into input tokens, cached input tokens, and output tokens, then multiply each by the model’s price.
This guide shows the exact formula, the mistakes that blow up budgets, and a few worked examples using current April 2026 pricing from OpenAI, Anthropic, Google, and DeepSeek.
If you want the short version, here it is:
AI API cost = (input tokens × input rate) + (cached input tokens × cached rate) + (output tokens × output rate)
Everything else is just getting the right token counts.
Step 1: Know the Three Token Buckets
Every serious API bill is driven by three numbers:
- Input tokens: what you send to the model
- Cached input tokens: repeated prompt/context that gets discounted
- Output tokens: what the model sends back
This matters because the three buckets are priced differently.
| Model | Input / 1M | Cached Input / 1M | Output / 1M |
|---|---|---|---|
| GPT-5.4 mini | $0.75 | $0.075 | $4.50 |
| Claude Sonnet 4.6 | $3.00 | $0.30 | $15.00 |
| Gemini 3.1 Flash-Lite | $0.25 | $0.03 | $1.50 |
| DeepSeek V3.2 | $0.28 | $0.028 | $0.42 |
Two quick takeaways:
- Output is usually the expensive part. If your app generates long answers, your budget can spike fast.
- Caching changes the math. Reused system prompts, long instructions, and stable context often get dramatically cheaper on repeat calls.
For current model-by-model pricing, see our OpenAI pricing page, Anthropic pricing page, Google AI pricing page, and DeepSeek pricing page.
Step 2: Use the Core Formula
Here is the formula in plain English:
Monthly cost =
(uncached input tokens / 1,000,000 × input price)
+ (cached input tokens / 1,000,000 × cached input price)
+ (output tokens / 1,000,000 × output price)
Or as shorthand:
cost = (input × in_rate) + (cached × cache_rate) + (output × out_rate)
If your provider gives you a batch discount, apply that after the normal calculation for the batch-eligible traffic.
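If you prefer to see the formula as code, here is a minimal sketch in Python. The helper name and the `batch_multiplier` knob are illustrative, not part of any provider SDK; rates are dollars per 1M tokens, as in the table above.

```python
# Minimal sketch of the core cost formula. Rates are dollars per 1M
# tokens; batch_multiplier is an illustrative knob for batch-eligible
# traffic (e.g. 0.5 for a 50% discount), applied after the normal math.
def estimate_cost(uncached_in, cached_in, out,
                  in_rate, cache_rate, out_rate,
                  batch_multiplier=1.0):
    per_m = 1_000_000
    cost = (uncached_in / per_m * in_rate
            + cached_in / per_m * cache_rate
            + out / per_m * out_rate)
    return cost * batch_multiplier

# GPT-5.4 mini rates from the table: $0.75 / $0.075 / $4.50 per 1M
print(round(estimate_cost(30e6, 70e6, 20e6, 0.75, 0.075, 4.50), 2))  # 117.75
```

The same function covers all three buckets, so the rest of this guide is really about feeding it honest token counts.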
Step 3: Estimate Tokens Per Request
You do not need perfect forecasting on day one. You just need a reasonable baseline.
A good starting point:
- 1 token ≈ 0.75 English words
- 1,000 tokens ≈ 750 words
- Code, JSON, logs, and tool outputs often use more tokens than plain English
For budgeting, measure these separately:
- system prompt
- retrieved context or attached docs
- conversation history
- tool output
- final answer
That last one matters more than most teams expect. A model that writes 1,200-token answers instead of 250-token answers can multiply your bill even if the input stays the same.
If you need a refresher on token basics, read What Is a Token in AI?. If you want to test your own numbers, use our token calculator or the more detailed token cost calculator.
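For quick budgeting, the word-count rule of thumb above is enough to get a baseline. Here is a tiny sketch; real tokenizers vary by model and by content type, so treat this as an estimate, not a billing-accurate count.

```python
# Rough token estimate from a word count, using the
# 1 token ~= 0.75 English words rule of thumb.
# Real tokenizers differ, especially for code, JSON, and logs.
def estimate_tokens(words, words_per_token=0.75):
    return int(words / words_per_token)

print(estimate_tokens(750))  # 1000
```

Measure the system prompt, retrieved context, history, tool output, and final answer separately with this, and you already have the inputs for the core formula.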
Worked Example 1: A Cheap Support Bot
Let’s say your support bot handles a month of traffic with:
- 50 million input tokens
- 10 million output tokens
- no meaningful caching yet
GPT-5.4 nano
- Input: 50M × $0.20 = $10.00
- Output: 10M × $1.25 = $12.50
- Total: $22.50/month
DeepSeek V3.2
- Input: 50M × $0.28 = $14.00
- Output: 10M × $0.42 = $4.20
- Total: $18.20/month
Claude Haiku 4.5
- Input: 50M × $1.00 = $50.00
- Output: 10M × $5.00 = $50.00
- Total: $100.00/month
What this tells you
If your workload is simple and high volume, sticker-price differences compound quickly. Claude Haiku may still be the right choice if quality is better for your use case, but the budget gap is real. For a pure FAQ or routing bot, DeepSeek or GPT-5.4 nano is usually the place to start.
For a broader cheap-model ranking, see our cheapest AI API guide.
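As a sanity check, the three totals above can be reproduced in a few lines. The rates are the ones quoted in this example; the table is just a convenient way to run all three models at once.

```python
# Re-running Worked Example 1: 50M input, 10M output, no caching.
# Rates (dollars per 1M tokens) are the ones quoted in the example.
MODELS = {
    "GPT-5.4 nano":     (0.20, 1.25),
    "DeepSeek V3.2":    (0.28, 0.42),
    "Claude Haiku 4.5": (1.00, 5.00),
}

for name, (in_rate, out_rate) in MODELS.items():
    # Token counts are already expressed in millions, so no /1M needed.
    total = 50 * in_rate + 10 * out_rate
    print(f"{name}: ${total:.2f}/month")
```

Running this prints $22.50, $18.20, and $100.00, matching the example.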
Worked Example 2: A Coding Assistant with Cached Context
Now let’s model a more expensive app: a coding assistant with lots of repeated repo context.
Monthly usage:
- 30M uncached input tokens
- 70M cached input tokens
- 20M output tokens
GPT-5.4 mini
- Uncached input: 30M × $0.75 = $22.50
- Cached input: 70M × $0.075 = $5.25
- Output: 20M × $4.50 = $90.00
- Total: $117.75/month
Claude Sonnet 4.6
- Uncached input: 30M × $3.00 = $90.00
- Cached input: 70M × $0.30 = $21.00
- Output: 20M × $15.00 = $300.00
- Total: $411.00/month
What this tells you
Caching helps both models, but it does not erase the output-cost gap. In this example, output tokens dominate the bill.
That is why many teams end up with a hybrid setup:
- use Claude Sonnet for the hardest coding tasks
- route simpler edits, summaries, and tests to cheaper models
- trim output length aggressively
We break down the tradeoff in more detail in OpenAI vs Anthropic API Pricing (2026) and Best AI API for Developers.
Worked Example 3: A RAG App with Heavy Reuse
RAG and document-analysis apps often have a lot of repeated instructions and retrieval scaffolding.
Monthly usage:
- 40M uncached input tokens
- 160M cached input tokens
- 15M output tokens
Gemini 2.5 Flash
- Uncached input: 40M × $0.30 = $12.00
- Cached input: 160M × $0.03 = $4.80
- Output: 15M × $2.50 = $37.50
- Total: $54.30/month
Why this matters
When you can keep prompts stable and reuse them, cached-input pricing can turn a seemingly expensive workflow into a reasonable one. That is especially true for:
- RAG apps
- copilots with stable system prompts
- tools that reuse the same schemas or instructions
- multi-turn assistants that keep a repeated prefix
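One useful way to see this is to treat the cache hit rate as a variable. The sketch below reuses the Example 3 numbers (200M total input, 15M output, at the $0.30 / $0.03 / $2.50 rates quoted above) and shows how the bill moves as more of the input becomes cacheable:

```python
# How the Example 3 bill moves with cache hit rate.
# Rates per 1M tokens from the example: $0.30 in / $0.03 cached / $2.50 out.
# 200M total input tokens, 15M output tokens (expressed in millions).
def monthly_cost(cache_fraction, total_in=200, out=15,
                 in_rate=0.30, cache_rate=0.03, out_rate=2.50):
    cached = total_in * cache_fraction
    uncached = total_in - cached
    return uncached * in_rate + cached * cache_rate + out * out_rate

for frac in (0.0, 0.5, 0.8):
    print(f"{frac:.0%} cached: ${monthly_cost(frac):.2f}")
```

At 0% cached the bill is $97.50; at the example's 80% it drops to $54.30. Stable prompts are an engineering lever, not just an accounting detail.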
If you are evaluating Google right now, our Gemini pricing guide and Google free-tier update article are the two best places to start.
The 5 Budgeting Mistakes That Hurt Most
1. Ignoring output length
A model that is cheap on input can still be expensive if it talks too much. This is one reason DeepSeek can look surprisingly strong in cost comparisons: its output rate is unusually low.
If you are forecasting spend, do not stop at prompt size. Measure average completion length too.
2. Treating all input as uncached
Many teams budget using only the standard input rate. That can overstate costs for apps with repeated instructions, but it can also hide engineering opportunity.
If you can make your prefix stable, your real cost often drops materially.
3. Forgetting conversation growth
Chat apps get more expensive over time because history accumulates. A request that starts at 1,500 input tokens can turn into 15,000 after a long session.
Budgeting per request without budgeting for session length is a classic mistake.
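A quick sketch makes the growth obvious. The prefix and per-turn sizes below are illustrative assumptions, not measurements; the point is that history is resent (and billed) on every turn:

```python
# Why chat sessions get expensive: history is resent on every turn.
# Illustrative numbers: 1,200-token prefix, 300 tokens added per turn.
def session_input_tokens(turns, prefix=1_200, per_turn=300):
    """Total input tokens billed across a whole session."""
    return sum(prefix + per_turn * t for t in range(1, turns + 1))

print(session_input_tokens(1))   # 1500 on turn one, as in the example
print(session_input_tokens(20))  # 87000 billed across a 20-turn session
```

The first request costs you 1,500 input tokens; the whole 20-turn session costs 87,000. Budgeting per session, not per request, is what catches this.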
4. Sending raw tool output back to the model
Logs, JSON blobs, stack traces, and scraped HTML are token-heavy. They can dominate the input side before you notice.
Good practice:
- summarize tool output before reinserting it
- strip irrelevant fields from JSON
- cap retrieval chunks
- avoid sending the same large block twice
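The first two items on that list are a few lines of code. This is a hedged sketch with hypothetical field names; the idea is simply to whitelist what the model actually needs before the tool result goes back into the prompt:

```python
import json

# Trim a tool result before reinserting it into the prompt.
# Field names are hypothetical; keep only what the model needs.
def trim_tool_output(raw_json, keep_fields=("status", "summary", "error")):
    data = json.loads(raw_json)
    trimmed = {k: v for k, v in data.items() if k in keep_fields}
    return json.dumps(trimmed)

raw = '{"status": "ok", "summary": "3 rows updated", "debug_trace": "...", "raw_html": "<div>...</div>"}'
print(trim_tool_output(raw))  # only status and summary survive
```

Dropping debug traces and raw HTML before they hit the context window is one of the cheapest cost optimizations available.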
5. Not adding a safety buffer
Your real bill will almost never match your spreadsheet exactly. Retries, fallbacks, longer-than-expected answers, and prompt revisions all add variance.
A good rule of thumb is to add a 10% to 30% buffer to your initial forecast.
A Simple Forecasting Process for New Apps
If you are launching something new, use this five-step workflow:
1. Measure one realistic request
Count:
- uncached input tokens
- cached/repeated input tokens
- output tokens
2. Estimate monthly volume
Multiply by expected requests, sessions, or active users.
3. Run the formula for two or three models
Do not model only your favorite provider. Compare at least one premium model and one budget model.
4. Stress-test long sessions
Model the 90th percentile case, not just the happy path.
5. Re-price after the first week
Once you have production logs, recalculate with real usage rather than estimates.
That last step is where most cost surprises get fixed.
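The five steps above fit in one back-of-envelope function. Every token count and volume below is an illustrative assumption; the rates are the GPT-5.4 mini numbers from the table:

```python
# The five-step forecast as code. Per-request token counts and the
# request volume are illustrative assumptions; rates are GPT-5.4 mini
# ($0.75 / $0.075 / $4.50 per 1M). Buffer covers retries and variance.
def forecast(requests, in_tokens, cached_tokens, out_tokens,
             in_rate, cache_rate, out_rate, buffer=0.2):
    per_m = 1_000_000
    base = requests * (in_tokens * in_rate
                       + cached_tokens * cache_rate
                       + out_tokens * out_rate) / per_m
    return base * (1 + buffer)  # step 5 happens later, with real logs

# Typical request vs a 90th-percentile long session (step 4)
typical = forecast(100_000, 1_500, 2_000, 400, 0.75, 0.075, 4.50)
p90     = forecast(100_000, 6_000, 8_000, 1_200, 0.75, 0.075, 4.50)
print(f"typical: ${typical:,.2f}   p90: ${p90:,.2f}")
```

With these assumptions the typical scenario lands around $369/month and the p90 scenario around $1,260/month, which is exactly the kind of gap the stress test is meant to surface.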
When Batch Discounts Matter
If your workload is asynchronous (overnight backfills, report generation, large-scale classification), batch discounts can cut costs dramatically.
For example, a workload costing $41.25 at standard GPT-5.4 mini rates would drop to about $20.63 with a 50% batch discount.
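That discount math is a one-line check: a 50% batch discount is a 0.5 multiplier applied after the standard calculation.

```python
# A 50% batch discount is a 0.5 multiplier on the standard price.
standard = 41.25
batch = standard * 0.5
print(batch)  # 20.625, i.e. about $20.63
```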
Batch is not useful for real-time chat, but it is excellent for:
- nightly content generation
- large document tagging jobs
- analytics summaries
- CRM enrichment
- support-ticket classification
This is one reason it is worth separating interactive traffic from offline traffic in your budget model.
Which Model Is Cheapest for Cost Forecasting?
There is no single winner, because the answer depends on your token shape.
- Short prompts, long answers: DeepSeek often looks strongest because output is so cheap.
- Coding and agent workflows: GPT-5.4 mini can be the safer baseline because it balances price and capability well.
- High-quality code generation: Claude Sonnet may justify its premium, but only if the quality lift actually saves engineering time.
- RAG with stable prompts: Gemini Flash tiers can look much better once cached input is included.
The important point is this: do the math with your own token mix. Provider rankings change once you model real usage.
FAQ
How much should I budget before launch?
For a new product, I would usually model three scenarios: expected usage, strong adoption, and a bad-case high-output scenario. Then add a 10% to 30% buffer.
Is per-request pricing a good enough estimate?
Only for very simple apps. For chat, agent, and RAG products, session growth and repeated context matter too much.
Are cached tokens always automatic?
No. Caching behavior varies by provider and model family. Always confirm how your chosen API handles repeated prefixes and what discounted rate actually applies.
What is the fastest way to compare providers?
Use our calculator for quick checks, then verify the exact model on the provider page: OpenAI, Anthropic, Google, or DeepSeek.
Bottom Line
If you can estimate three numbers (uncached input, cached input, and output), you can forecast your AI API costs well enough to make smart decisions.
Most teams do not need a perfect finance model. They need a repeatable pricing habit:
- measure tokens
- split input from output
- account for caching
- compare at least two providers
- update the model with real production data
That is enough to avoid most expensive mistakes.
For more help, try the token calculator, compare all current models on our pricing hub, and read our best AI models in 2026 guide if you are still deciding which tier to use.