How to Calculate AI API Costs (2026)
A practical guide to estimating OpenAI, Claude, Gemini, and DeepSeek API spend, with simple formulas, worked examples, and common budgeting mistakes.
Most teams do not overspend on AI because the per-token rates are confusing. They overspend because they guess.
The fix is simple: break your usage into input tokens, cached input tokens, and output tokens, then multiply each by the model’s price.
This guide shows the exact formula, the mistakes that blow up budgets, and a few worked examples using current April 2026 pricing from OpenAI, Anthropic, Google, and DeepSeek.
If you want the short version, here it is:
AI API cost = (input tokens × input rate) + (cached input tokens × cached rate) + (output tokens × output rate)
Everything else is just getting the right token counts.
Step 1: Know the Three Token Buckets
Every serious API bill is driven by three numbers:
- Input tokens: what you send to the model
- Cached input tokens: repeated prompt/context that gets discounted
- Output tokens: what the model sends back
This matters because the three buckets are priced differently.
| Model | Input / 1M | Cached Input / 1M | Output / 1M |
|---|---|---|---|
| GPT-5.4 mini | $0.75 | $0.075 | $4.50 |
| Claude Sonnet 4.6 | $3.00 | $0.30 | $15.00 |
| Gemini 3.1 Flash-Lite | $0.25 | $0.03 | $1.50 |
| DeepSeek V3.2 | $0.28 | $0.028 | $0.42 |
Two quick takeaways:
- Output is usually the expensive part. If your app generates long answers, your budget can spike fast.
- Caching changes the math. Reused system prompts, long instructions, and stable context often get dramatically cheaper on repeat calls.
For current model-by-model pricing, see our OpenAI pricing page, Anthropic pricing page, Google AI pricing page, and DeepSeek pricing page.
Step 2: Use the Core Formula
Here is the formula in plain English:
Monthly cost =
(uncached input tokens / 1,000,000 × input price)
+ (cached input tokens / 1,000,000 × cached input price)
+ (output tokens / 1,000,000 × output price)
Or as shorthand:
cost = (input × in_rate) + (cached × cache_rate) + (output × out_rate)
If your provider gives you a batch discount, apply that after the normal calculation for the batch-eligible traffic.
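If you prefer to see the formula as code, here is a minimal sketch in Python. The helper name and the `batch_multiplier` knob are illustrative, not part of any provider SDK; rates are dollars per 1M tokens, as in the table above.

```python
# Minimal sketch of the core cost formula. Rates are dollars per 1M
# tokens; batch_multiplier is an illustrative knob for batch-eligible
# traffic (e.g. 0.5 for a 50% discount), applied after the normal math.
def estimate_cost(uncached_in, cached_in, out,
                  in_rate, cache_rate, out_rate,
                  batch_multiplier=1.0):
    per_m = 1_000_000
    cost = (uncached_in / per_m * in_rate
            + cached_in / per_m * cache_rate
            + out / per_m * out_rate)
    return cost * batch_multiplier

# GPT-5.4 mini rates from the table: $0.75 / $0.075 / $4.50 per 1M
print(round(estimate_cost(30e6, 70e6, 20e6, 0.75, 0.075, 4.50), 2))  # 117.75
```

The same function covers all three buckets, so the rest of this guide is really about feeding it honest token counts.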
Step 3: Estimate Tokens Per Request
You do not need perfect forecasting on day one. You just need a reasonable baseline.
A good starting point:
- 1 token ≈ 0.75 English words
- 1,000 tokens ≈ 750 words
- Code, JSON, logs, and tool outputs often use more tokens than plain English
For budgeting, measure these separately:
- system prompt
- retrieved context or attached docs
- conversation history
- tool output
- final answer
That last one matters more than most teams expect. A model that writes 1,200-token answers instead of 250-token answers can multiply your bill even if the input stays the same.
If you need a refresher on token basics, read What Is a Token in AI?. If you want to test your own numbers, use our token calculator or the more detailed token cost calculator.
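For quick budgeting, the word-count rule of thumb above is enough to get a baseline. Here is a tiny sketch; real tokenizers vary by model and by content type, so treat this as an estimate, not a billing-accurate count.

```python
# Rough token estimate from a word count, using the
# 1 token ~= 0.75 English words rule of thumb.
# Real tokenizers differ, especially for code, JSON, and logs.
def estimate_tokens(words, words_per_token=0.75):
    return int(words / words_per_token)

print(estimate_tokens(750))  # 1000
```

Measure the system prompt, retrieved context, history, tool output, and final answer separately with this, and you already have the inputs for the core formula.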
Worked Example 1: A Cheap Support Bot
Let’s say your support bot handles a month of traffic with:
- 50 million input tokens
- 10 million output tokens
- no meaningful caching yet
GPT-5.4 nano
- Input: 50M × $0.20 = $10.00
- Output: 10M × $1.25 = $12.50
- Total: $22.50/month
DeepSeek V3.2
- Input: 50M × $0.28 = $14.00
- Output: 10M × $0.42 = $4.20
- Total: $18.20/month
Claude Haiku 4.5
- Input: 50M × $1.00 = $50.00
- Output: 10M × $5.00 = $50.00
- Total: $100.00/month
What this tells you
If your workload is simple and high volume, sticker-price differences compound quickly. Claude Haiku may still be the right choice if quality is better for your use case, but the budget gap is real. For a pure FAQ or routing bot, DeepSeek or GPT-5.4 nano is usually the place to start.
For a broader cheap-model ranking, see our cheapest AI API guide.
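As a sanity check, the three totals above can be reproduced in a few lines. The rates are the ones quoted in this example; the table is just a convenient way to run all three models at once.

```python
# Re-running Worked Example 1: 50M input, 10M output, no caching.
# Rates (dollars per 1M tokens) are the ones quoted in the example.
MODELS = {
    "GPT-5.4 nano":     (0.20, 1.25),
    "DeepSeek V3.2":    (0.28, 0.42),
    "Claude Haiku 4.5": (1.00, 5.00),
}

for name, (in_rate, out_rate) in MODELS.items():
    # Token counts are already expressed in millions, so no /1M needed.
    total = 50 * in_rate + 10 * out_rate
    print(f"{name}: ${total:.2f}/month")
```

Running this prints $22.50, $18.20, and $100.00, matching the example.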
Worked Example 2: A Coding Assistant with Cached Context
Now let’s model a more expensive app: a coding assistant with lots of repeated repo context.
Monthly usage:
- 30M uncached input tokens
- 70M cached input tokens
- 20M output tokens
GPT-5.4 mini
- Uncached input: 30M × $0.75 = $22.50
- Cached input: 70M × $0.075 = $5.25
- Output: 20M × $4.50 = $90.00
- Total: $117.75/month
Claude Sonnet 4.6
- Uncached input: 30M × $3.00 = $90.00
- Cached input: 70M × $0.30 = $21.00
- Output: 20M × $15.00 = $300.00
- Total: $411.00/month
What this tells you
Caching helps both models, but it does not erase the output-cost gap. In this example, output tokens dominate the bill.
That is why many teams end up with a hybrid setup:
- use Claude Sonnet for the hardest coding tasks
- route simpler edits, summaries, and tests to cheaper models
- trim output length aggressively
We break down the tradeoff in more detail in OpenAI vs Anthropic API Pricing (2026) and Best AI API for Developers.
Worked Example 3: A RAG App with Heavy Reuse
RAG and document-analysis apps often have a lot of repeated instructions and retrieval scaffolding.
Monthly usage:
- 40M uncached input tokens
- 160M cached input tokens
- 15M output tokens
Gemini 2.5 Flash
- Uncached input: 40M × $0.30 = $12.00
- Cached input: 160M × $0.03 = $4.80
- Output: 15M × $2.50 = $37.50
- Total: $54.30/month
Why this matters
When you can keep prompts stable and reuse them, cached-input pricing can turn a seemingly expensive workflow into a reasonable one. That is especially true for:
- RAG apps
- copilots with stable system prompts
- tools that reuse the same schemas or instructions
- multi-turn assistants that keep a repeated prefix
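One useful way to see this is to treat the cache hit rate as a variable. The sketch below reuses the Example 3 numbers (200M total input, 15M output, at the $0.30 / $0.03 / $2.50 rates quoted above) and shows how the bill moves as more of the input becomes cacheable:

```python
# How the Example 3 bill moves with cache hit rate.
# Rates per 1M tokens from the example: $0.30 in / $0.03 cached / $2.50 out.
# 200M total input tokens, 15M output tokens (expressed in millions).
def monthly_cost(cache_fraction, total_in=200, out=15,
                 in_rate=0.30, cache_rate=0.03, out_rate=2.50):
    cached = total_in * cache_fraction
    uncached = total_in - cached
    return uncached * in_rate + cached * cache_rate + out * out_rate

for frac in (0.0, 0.5, 0.8):
    print(f"{frac:.0%} cached: ${monthly_cost(frac):.2f}")
```

At 0% cached the bill is $97.50; at the example's 80% it drops to $54.30. Stable prompts are an engineering lever, not just an accounting detail.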
If you are evaluating Google right now, our Gemini pricing guide and Google free-tier update article are the two best places to start.
The 5 Budgeting Mistakes That Hurt Most
1. Ignoring output length
A model that is cheap on input can still be expensive if it talks too much. This is one reason DeepSeek can look surprisingly strong in cost comparisons: its output rate is unusually low.
If you are forecasting spend, do not stop at prompt size. Measure average completion length too.
2. Treating all input as uncached
Many teams budget using only the standard input rate. That can overstate costs for apps with repeated instructions, but it can also hide engineering opportunity.
If you can make your prefix stable, your real cost often drops materially.
3. Forgetting conversation growth
Chat apps get more expensive over time because history accumulates. A request that starts at 1,500 input tokens can turn into 15,000 after a long session.
Budgeting per request without budgeting for session length is a classic mistake.
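A quick sketch makes the growth obvious. The prefix and per-turn sizes below are illustrative assumptions, not measurements; the point is that history is resent (and billed) on every turn:

```python
# Why chat sessions get expensive: history is resent on every turn.
# Illustrative numbers: 1,200-token prefix, 300 tokens added per turn.
def session_input_tokens(turns, prefix=1_200, per_turn=300):
    """Total input tokens billed across a whole session."""
    return sum(prefix + per_turn * t for t in range(1, turns + 1))

print(session_input_tokens(1))   # 1500 on turn one, as in the example
print(session_input_tokens(20))  # 87000 billed across a 20-turn session
```

The first request costs you 1,500 input tokens; the whole 20-turn session costs 87,000. Budgeting per session, not per request, is what catches this.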
4. Sending raw tool output back to the model
Logs, JSON blobs, stack traces, and scraped HTML are token-heavy. They can dominate the input side before you notice.
Good practice:
- summarize tool output before reinserting it
- strip irrelevant fields from JSON
- cap retrieval chunks
- avoid sending the same large block twice
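The first two items on that list are a few lines of code. This is a hedged sketch with hypothetical field names; the idea is simply to whitelist what the model actually needs before the tool result goes back into the prompt:

```python
import json

# Trim a tool result before reinserting it into the prompt.
# Field names are hypothetical; keep only what the model needs.
def trim_tool_output(raw_json, keep_fields=("status", "summary", "error")):
    data = json.loads(raw_json)
    trimmed = {k: v for k, v in data.items() if k in keep_fields}
    return json.dumps(trimmed)

raw = '{"status": "ok", "summary": "3 rows updated", "debug_trace": "...", "raw_html": "<div>...</div>"}'
print(trim_tool_output(raw))  # only status and summary survive
```

Dropping debug traces and raw HTML before they hit the context window is one of the cheapest cost optimizations available.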
5. Not adding a safety buffer
Your real bill will almost never match your spreadsheet exactly. Retries, fallbacks, longer-than-expected answers, and prompt revisions all add variance.
A good rule of thumb is to add a 10% to 30% buffer to your initial forecast.
A Simple Forecasting Process for New Apps
If you are launching something new, use this five-step workflow:
1. Measure one realistic request
Count:
- uncached input tokens
- cached/repeated input tokens
- output tokens
2. Estimate monthly volume
Multiply by expected requests, sessions, or active users.
3. Run the formula for two or three models
Do not model only your favorite provider. Compare at least one premium model and one budget model.
4. Stress-test long sessions
Model the 90th percentile case, not just the happy path.
5. Re-price after the first week
Once you have production logs, recalculate with real usage rather than estimates.
That last step is where most cost surprises get fixed.
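The five steps above fit in one back-of-envelope function. Every token count and volume below is an illustrative assumption; the rates are the GPT-5.4 mini numbers from the table:

```python
# The five-step forecast as code. Per-request token counts and the
# request volume are illustrative assumptions; rates are GPT-5.4 mini
# ($0.75 / $0.075 / $4.50 per 1M). Buffer covers retries and variance.
def forecast(requests, in_tokens, cached_tokens, out_tokens,
             in_rate, cache_rate, out_rate, buffer=0.2):
    per_m = 1_000_000
    base = requests * (in_tokens * in_rate
                       + cached_tokens * cache_rate
                       + out_tokens * out_rate) / per_m
    return base * (1 + buffer)  # step 5 happens later, with real logs

# Typical request vs a 90th-percentile long session (step 4)
typical = forecast(100_000, 1_500, 2_000, 400, 0.75, 0.075, 4.50)
p90     = forecast(100_000, 6_000, 8_000, 1_200, 0.75, 0.075, 4.50)
print(f"typical: ${typical:,.2f}   p90: ${p90:,.2f}")
```

With these assumptions the typical scenario lands around $369/month and the p90 scenario around $1,260/month, which is exactly the kind of gap the stress test is meant to surface.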
When Batch Discounts Matter
If your workload is asynchronous (overnight backfills, report generation, large-scale classification), batch discounts can cut costs dramatically.
For example, a workload costing $41.25 at standard GPT-5.4 mini rates would drop to about $20.63 with a 50% batch discount.
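That discount math is a one-line check: a 50% batch discount is a 0.5 multiplier applied after the standard calculation.

```python
# A 50% batch discount is a 0.5 multiplier on the standard price.
standard = 41.25
batch = standard * 0.5
print(batch)  # 20.625, i.e. about $20.63
```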
Batch is not useful for real-time chat, but it is excellent for:
- nightly content generation
- large document tagging jobs
- analytics summaries
- CRM enrichment
- support-ticket classification
This is one reason it is worth separating interactive traffic from offline traffic in your budget model.
Which Model Is Cheapest for Cost Forecasting?
There is no single winner, because the answer depends on your token shape.
- Short prompts, long answers: DeepSeek often looks strongest because output is so cheap.
- Coding and agent workflows: GPT-5.4 mini can be the safer baseline because it balances price and capability well.
- High-quality code generation: Claude Sonnet may justify its premium, but only if the quality lift actually saves engineering time.
- RAG with stable prompts: Gemini Flash tiers can look much better once cached input is included.
The important point is this: do the math with your own token mix. Provider rankings change once you model real usage.
FAQ
How much should I budget before launch?
For a new product, I would usually model three scenarios: expected usage, strong adoption, and a bad-case high-output scenario. Then add a 10% to 30% buffer.
Is per-request pricing a good enough estimate?
Only for very simple apps. For chat, agent, and RAG products, session growth and repeated context matter too much.
Are cached tokens always automatic?
No. Caching behavior varies by provider and model family. Always confirm how your chosen API handles repeated prefixes and what discounted rate actually applies.
What is the fastest way to compare providers?
Use our calculator for quick checks, then verify the exact model on the provider page: OpenAI, Anthropic, Google, or DeepSeek.
Bottom Line
If you can estimate three numbers (uncached input, cached input, and output), you can forecast your AI API costs well enough to make smart decisions.
Most teams do not need a perfect finance model. They need a repeatable pricing habit:
- measure tokens
- split input from output
- account for caching
- compare at least two providers
- update the model with real production data
That is enough to avoid most expensive mistakes.
For more help, try the token calculator, compare all current models on our pricing hub, and read our best AI models in 2026 guide if you are still deciding which tier to use.