AI Pricing Week in Review: June 1-4, 2026
AI pricing week in review: Gemma 4 12B, OpenAI on AWS, Uber AI coding caps, and hardware cost pressure from AI demand.
By AI Pricing Guru Editorial Team
The most useful AI pricing news this week was not a simple price cut. It was a set of signals about where AI costs are moving: more local multimodal models, more enterprise procurement through cloud platforms, tighter AI coding budgets, and more pressure on the hardware supply chain.
That matters because asking “which model is cheapest per token?” is no longer enough. The better question is which parts of a workflow justify premium API calls, which can run on cheaper hosted models, and which can move to local hardware.
This short week gave four practical signals:
| Story | What changed | Pricing impact |
|---|---|---|
| Google introduced Gemma 4 12B | A mid-sized multimodal model with native audio and vision, designed for laptops with 16GB of memory | More multimodal prototyping and some production work can avoid per-token API bills |
| OpenAI expanded frontier models and Codex on AWS | OpenAI capabilities are available through AWS environments, including Codex on Amazon Bedrock | Procurement, governance, and cloud commitments may matter as much as list token prices |
| Uber capped AI coding tool spend | Bloomberg reported a $1,500 monthly token-spend cap per AI coding tool for employees | Enterprise coding-agent budgets are becoming explicit, measurable, and limited |
| AI demand squeezed hardware pricing | Tom’s Hardware reported 32GB of DDR5 now starting at around $375 as AI infrastructure demand competes for memory supply | Local inference is attractive, but hardware cost has to be included in the model-routing decision |
1. Gemma 4 12B makes local multimodal AI more realistic
Google introduced Gemma 4 12B on June 3. The model sits between smaller edge-friendly Gemma models and the larger 26B Mixture-of-Experts tier. The important pricing angle is not a hosted API discount. It is that Google is pushing useful multimodal capability into hardware many developers already own.
Google says Gemma 4 12B is:
- a unified multimodal model without separate vision or audio encoders
- capable of native audio and vision input
- close to the larger 26B model on standard benchmarks
- small enough to run locally with 16GB of VRAM or unified memory
- released under Apache 2.0
- available through tools such as LM Studio, Ollama, Hugging Face, llama.cpp, MLX, SGLang, vLLM, Google Cloud, Cloud Run, and GKE
For pricing, the local angle is the main event. Every task that can run acceptably on a local Gemma model is a task that does not need a hosted frontier model call.
That does not make Gemma free. Local inference has real costs:
- developer hardware
- GPU or high-memory laptop refresh cycles
- electricity
- operational support
- slower throughput than managed API endpoints in some cases
- engineering time for deployment, monitoring, and evaluation
But for repeated internal workflows, the economics can still be attractive. If a team is processing meeting audio, screenshots, support images, internal docs, app logs, or lightweight agent tasks, a local 12B multimodal model can become the first pass before escalation to Gemini Pro, GPT-5.5, Claude Opus, or another premium route.
The likely cost pattern is:
- Run cheap local multimodal classification or extraction first.
- Escalate uncertain cases to a hosted model.
- Use the premium model only where judgment, depth, or reliability justifies the bill.
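The escalation pattern above can be sketched as a small router. This is a minimal illustration, not a production design: the `Tier` type, the `route` function, and the idea that each model wrapper returns a confidence score are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str                                # e.g. "local", "hosted", "premium"
    run: Callable[[str], Tuple[str, float]]  # hypothetical wrapper: returns (answer, confidence 0..1)
    min_confidence: float                    # accept the answer at or above this score


def route(task: str, tiers: List[Tier]) -> Tuple[str, str]:
    """Try tiers from cheapest to most expensive; return (answer, tier_name)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    answer = ""
    for tier in tiers:
        answer, confidence = tier.run(task)
        if confidence >= tier.min_confidence:
            return answer, tier.name
    # No tier was confident enough: keep the last (most capable) tier's answer.
    return answer, tiers[-1].name
```

In practice the confidence signal is the hard part. It might come from a classifier score, a validation check on the extracted fields, or a sampled human review, and the threshold should be tuned against a real workload.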
Current tracked hosted prices show why that routing decision matters:
| Hosted model | Input / 1M | Cached input / 1M | Output / 1M | Practical role |
|---|---|---|---|---|
| Gemini 3.1 Flash-Lite | $0.25 | $0.025 | $1.50 | High-volume Gemini route |
| Gemini 3 Flash | $0.50 | $0.05 | $3.00 | Fast production work |
| Gemini 3 Pro | $2.00 | $0.20 | $12.00 | Strong flagship reasoning |
| GPT-5.4 mini | $0.75 | $0.075 | $4.50 | Cheap OpenAI production route |
| GPT-5.5 | $5.00 | $0.50 | $30.00 | Premium reasoning and agent steps |
If local Gemma 4 12B can replace even a slice of repetitive multimodal requests, it can materially reduce hosted inference spend. If it only handles weak first-pass triage and sends most requests onward, the savings will be smaller.
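To make the "slice of requests" point concrete, here is a rough sketch of monthly hosted spend as a function of the share of requests a local model absorbs. The prices are copied from the table above; the function names, request volume, and token counts are illustrative assumptions, not measured data.

```python
# USD per 1M tokens, copied from the tracked hosted-price table above.
PRICES = {
    "Gemini 3 Flash": {"input": 0.50, "cached": 0.05, "output": 3.00},
    "GPT-5.5": {"input": 5.00, "cached": 0.50, "output": 30.00},
}


def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Cost in USD of one hosted request; cached_tokens is the part of the input billed at the cached rate."""
    p = PRICES[model]
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * p["input"]
        + cached_tokens * p["cached"]
        + output_tokens * p["output"]
    )
    return total / 1_000_000


def monthly_hosted_spend(model, requests, input_tokens, output_tokens, local_share):
    """Hosted bill when a fraction local_share of requests never leaves local hardware."""
    hosted_requests = requests * (1 - local_share)
    return hosted_requests * request_cost(model, input_tokens, output_tokens)


# Hypothetical workload: 1M requests a month, 2,000 input and 500 output tokens each.
baseline = monthly_hosted_spend("Gemini 3 Flash", 1_000_000, 2_000, 500, local_share=0.0)
with_local = monthly_hosted_spend("Gemini 3 Flash", 1_000_000, 2_000, 500, local_share=0.4)
print(f"hosted only: ${baseline:,.2f}  with 40% local: ${with_local:,.2f}")
```

The sketch deliberately leaves out the local side of the ledger. Hardware, electricity, and engineering time have to be netted against the avoided hosted spend before calling it a saving.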
The buyer move: benchmark local Gemma against one real workflow, not a generic demo. Measure accuracy, latency, hardware cost, engineering time, and avoided API tokens together.
For broader Google rates, use our Google Gemini pricing page and model your own workload in the AI token calculator.
2. OpenAI on AWS is a procurement change, not a confirmed price cut
OpenAI also announced that frontier models and Codex are available through AWS environments. The practical enterprise pitch is clear: OpenAI usage can now run through familiar AWS security, compliance, procurement, billing, and governance workflows.
That matters. Large companies rarely block AI adoption because a model is unavailable. They block it because billing, vendor approval, security review, logging, data handling, and access controls are not ready.
OpenAI on AWS changes that buying path. Codex on Amazon Bedrock is especially relevant for software teams because AI coding agents can burn tokens quickly. A single coding workflow may inspect files, plan, call tools, retry edits, write tests, summarize changes, and run follow-up checks. That token loop is very different from a single chat completion.
The pricing nuance: this is not yet a reason to assume OpenAI got cheaper per token. Treat it as a distribution and procurement change unless AWS or OpenAI publishes a separate rate card for the exact model and region you plan to use.
Current tracked OpenAI public prices remain the baseline:
| OpenAI model | Input / 1M | Cached input / 1M | Output / 1M | Budget role |
|---|---|---|---|---|
| GPT-5.5 | $5.00 | $0.50 | $30.00 | Premium reasoning, hard coding, high-value agent work |
| GPT-5.4 | $2.50 | $0.25 | $15.00 | Quality-sensitive default |
| GPT-5.4 mini | $0.75 | $0.075 | $4.50 | Routine production tasks |
| GPT-5.4 nano | $0.20 | $0.02 | $1.25 | Classification, tagging, background work |
AWS availability can still improve effective economics in three ways:
- existing AWS commitments may absorb or simplify spend
- procurement and security review may move faster
- centralized logging and governance may reduce operational overhead
Those are real business benefits, but they are not the same as lower unit prices. Finance teams should separate token cost from procurement value.
If your company tests OpenAI through AWS, set up reporting before broad adoption:
- tokens by model
- cached input share
- output tokens per completed task
- Codex run duration
- retries and failed runs
- spend by team, repo, and workflow
- cost per merged pull request or resolved ticket
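A few of the metrics above can be derived from per-run usage records. The sketch below assumes a hypothetical `Run` record that you would populate from your own billing export or logs; neither the field names nor the `summarize` helper correspond to an AWS or OpenAI schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    team: str
    model: str
    input_tokens: int
    cached_tokens: int  # portion of input_tokens billed at the cached rate
    output_tokens: int
    cost_usd: float
    merged_pr: bool     # did this run end in a merged pull request?


def summarize(runs):
    """Aggregate spend, cached input share, and cost per merged PR by team."""
    totals = defaultdict(lambda: {"cost_usd": 0.0, "input": 0, "cached": 0, "merged_prs": 0})
    for r in runs:
        t = totals[r.team]
        t["cost_usd"] += r.cost_usd
        t["input"] += r.input_tokens
        t["cached"] += r.cached_tokens
        t["merged_prs"] += int(r.merged_pr)

    report = {}
    for team, t in totals.items():
        report[team] = {
            "cost_usd": t["cost_usd"],
            "cached_input_share": t["cached"] / t["input"] if t["input"] else 0.0,
            # None means the team spent money but merged nothing in this period.
            "cost_per_merged_pr": t["cost_usd"] / t["merged_prs"] if t["merged_prs"] else None,
        }
    return report
```

The same grouping works for repo or workflow instead of team. The important design choice is charging failed and retried runs to the same bucket, so cost per merged pull request reflects everything it took to get there.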
For live OpenAI rates, use our OpenAI pricing page and the deeper OpenAI API pricing guide.
3. Uber’s AI coding cap is a useful enterprise benchmark
The week’s clearest budget number came from Uber. Bloomberg reported, and Simon Willison highlighted, that Uber is limiting employees to $1,500 in monthly token spending per AI coding tool. The cap applies to agentic coding software such as Cursor and Claude Code, and the limit is per tool rather than one combined AI budget.
That number is useful because it gives buyers a real enterprise reference point. A $1,500 monthly cap is not a casual software subscription. It is a serious per-user productivity budget.
Annualized, the theoretical ceiling for engineers who spend up to the cap on one or two tools is:
| Assumption | Annualized cap |
|---|---|
| One AI coding tool at $1,500/month | $18,000 per engineer |
| Two AI coding tools at $1,500/month each | $36,000 per engineer |
| Ten engineers using two tools at cap | $360,000 per year |
| One hundred engineers using two tools at cap | $3.6 million per year |
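The table is simple multiplication, and a short helper makes it easy to rerun with your own headcount or cap. The only sourced number here is the reported $1,500 monthly cap; the function name and parameters are illustrative.

```python
MONTHLY_CAP_USD = 1_500  # per tool, per employee, as reported by Bloomberg


def annual_ceiling(engineers: int, tools: int, monthly_cap: int = MONTHLY_CAP_USD) -> int:
    """Theoretical yearly maximum if every engineer hits the cap on every tool."""
    return engineers * tools * monthly_cap * 12


for engineers, tools in [(1, 1), (1, 2), (10, 2), (100, 2)]:
    print(f"{engineers} engineer(s), {tools} tool(s): ${annual_ceiling(engineers, tools):,}")
```

This is a ceiling, not a forecast. Actual spend depends on how many engineers come anywhere near the cap.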
The right conclusion is not that every company should copy Uber’s number. The useful point is that coding-agent spend has become large enough to need explicit policy.
Teams should stop treating coding AI as a flat subscription category. The new buying unit is closer to:
- cost per developer per month
- cost per accepted code change
- cost per merged pull request
- cost per incident fixed
- cost per test failure resolved
- cost per engineering hour saved
That also changes model routing. Claude Code, Cursor, Codex, GitHub Copilot, and other coding agents can all be valuable, but they do not need the most expensive model for every step. Repository search, file summarization, boilerplate edits, test-output classification, and lint-fix loops can often run on cheaper models than high-level architecture planning or hard debugging.
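As a rough illustration of why per-step routing matters, the sketch below prices one coding-agent run two ways: every step on the premium model, and each step on a cheaper model where that is plausible. The prices come from the tracked OpenAI table earlier in this article; the step list, token counts, and model assignments are hypothetical, and cached-input discounts are ignored for simplicity.

```python
# USD per 1M tokens as (input, output), from the tracked OpenAI table in this article.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.4 mini": (0.75, 4.50),
    "GPT-5.4 nano": (0.20, 1.25),
}

# Hypothetical steps of one agent run: (step, input tokens, output tokens, routed model).
STEPS = [
    ("repository search", 40_000, 1_000, "GPT-5.4 nano"),
    ("file summarization", 60_000, 4_000, "GPT-5.4 mini"),
    ("architecture planning", 20_000, 6_000, "GPT-5.5"),
    ("lint-fix loop", 30_000, 3_000, "GPT-5.4 mini"),
]


def step_cost(model, input_tokens, output_tokens):
    """Cost in USD of one step at list prices, with no cached input."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def run_cost(steps, force_model=None):
    """Total run cost; force_model prices every step on a single model."""
    return sum(step_cost(force_model or model, i, o) for _, i, o, model in steps)


print(f"routed:      ${run_cost(STEPS):.3f}")
print(f"all premium: ${run_cost(STEPS, force_model='GPT-5.5'):.3f}")
```

With these made-up token counts the routed run costs about $0.39 against $1.17 on the premium model alone. The ratio matters more than the absolute numbers, and it only holds if the cheaper models produce output that is actually accepted.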
If you are buying coding tools for a team, set budget guardrails early:
- define caps by role and workflow
- separate experimental use from production engineering use
- track token spend by project
- compare plan pricing against API passthrough pricing
- review whether multiple tools duplicate the same cost center
- measure accepted output, not just generated output
This is also where provider pages matter. Compare model economics across Anthropic pricing, OpenAI pricing, and GitHub Copilot pricing before choosing a default coding stack.
4. Hardware pressure complicates local AI economics
The local-inference story has a second side. Tom’s Hardware reported that 32GB of DDR5 memory now starts at around $375, with AI demand continuing to squeeze the PC-building market.
That is not an API price announcement, but it belongs in an AI pricing roundup because local AI is not free just because tokens disappear from the invoice.
When teams compare hosted API calls against local models, they should include hardware sensitivity:
| Cost bucket | Hosted API route | Local inference route |
|---|---|---|
| Marginal usage | Per-token or per-request bill | Mostly electricity and hardware utilization |
| Upfront spend | Low | Higher hardware or GPU commitment |
| Scaling | Provider handles capacity | Team handles hardware, queues, and upgrades |
| Governance | Provider and cloud controls | Internal controls and deployment process |
| Latency | Depends on provider and region | Can be excellent for local interactive use |
| Cost risk | Token loops, retries, output length | Hardware shortages, underutilization, maintenance |
Gemma 4 12B makes local multimodal AI more accessible, but if memory and GPU prices keep rising, the breakeven point moves. A team with unused developer laptops may have a very different answer from a team that needs to buy new high-memory machines.
The practical move is to calculate total cost per workflow:
- Estimate hosted API cost using real input, output, cache, retry, and escalation rates.
- Estimate local cost using hardware purchase price, expected lifespan, utilization, support time, and electricity.
- Add quality cost: human review, failed tasks, retries, and latency.
- Recheck the math quarterly because both token prices and hardware prices move.
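The checklist above reduces to a per-workflow monthly comparison. The sketch below is one way to set it up, under stated assumptions: all function names and parameters are illustrative, hardware is amortized in a straight line over its lifespan, and the hardware figures in the example are placeholders rather than quoted prices.

```python
def hosted_monthly_cost(requests, cost_per_request, retry_rate=0.0, quality_cost=0.0):
    """Hosted route: retries are billed like extra requests; quality_cost covers human review."""
    return requests * (1 + retry_rate) * cost_per_request + quality_cost


def local_monthly_cost(hardware_price, lifespan_months, utilization_share,
                       support_hours, hourly_rate, electricity, quality_cost=0.0):
    """Local route: straight-line amortization plus support time, power, and review.

    utilization_share is the fraction of the machine charged to this workflow.
    Use 1.0 when the hardware is bought solely for it.
    """
    amortized = hardware_price / lifespan_months * utilization_share
    return amortized + support_hours * hourly_rate + electricity + quality_cost


def breakeven_requests(local_cost, cost_per_request, retry_rate=0.0):
    """Monthly request volume at which hosted spend equals the local monthly cost."""
    return local_cost / (cost_per_request * (1 + retry_rate))


# Placeholder inputs: replace every number with figures from your own workload.
hosted = hosted_monthly_cost(requests=100_000, cost_per_request=0.0025, retry_rate=0.1)
local = local_monthly_cost(hardware_price=3_600, lifespan_months=36, utilization_share=0.5,
                           support_hours=2, hourly_rate=100, electricity=15)
print(f"hosted ${hosted:,.2f}/month vs local ${local:,.2f}/month")
print(f"breakeven near {breakeven_requests(local, 0.0025, retry_rate=0.1):,.0f} requests/month")
```

Because a higher hardware price flows straight into the amortized term, rerunning this with current memory and GPU quotes shows how far the breakeven volume moves.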
For many teams, the best answer will be hybrid. Local models handle private, repetitive, or latency-sensitive first-pass tasks. Hosted models handle high-value reasoning, customer-facing quality, and large-scale burst capacity.
Where to look first this week
If you run multimodal workflows, test Gemma 4 12B against one specific workload before buying more hosted capacity. Focus on tasks where good-enough local extraction or classification reduces premium model calls.
If your company is AWS-first, evaluate OpenAI on AWS as a governance and procurement path. Do not assume a price cut until you see the exact Bedrock or AWS commercial terms.
If your engineers use coding agents, create a real budget model now. Uber’s cap is a reminder that agentic coding is not a $20/month category at enterprise scale.
If you are moving work local, include hardware inflation in the calculation. Local inference can be cheaper, but memory and GPU constraints are part of the bill.
The week in one read
June 1-4 was a workflow-cost week.
Google’s Gemma 4 12B showed that useful local multimodal AI is moving onto laptops. OpenAI on AWS made frontier models and Codex easier to buy and govern through enterprise cloud channels. Uber’s coding-tool cap made AI agent budgets concrete. Hardware price pressure reminded everyone that local inference has its own economics.
The best cost strategy is not all-local or all-API. It is routing discipline: local or cheap models for repeatable work, cached mid-tier models for production volume, and premium models only where the task earns the spend.
Sources: Google Gemma 4 12B announcement, OpenAI on AWS announcement, Simon Willison on Uber’s AI coding cap, and Tom’s Hardware on DDR5 pricing pressure.