Groq API pricing is built for a different buying decision than pricing from OpenAI, Anthropic, or Google. You are not paying for Groq’s own frontier model family. You are paying for very fast hosted inference on open and open-adjacent models such as Llama, Qwen, and GPT OSS variants.
That makes Groq especially interesting for teams with high request volume, latency-sensitive interfaces, or workloads where a smaller open model passes evals. It is also easy to overstate the savings. Groq can be dramatically cheaper than premium frontier models, but the right comparison is the exact model you would otherwise use, not the provider logo.
Using AI Pricing Guru’s tracked pricing data updated on June 30, 2026, Groq’s current public model ladder starts at Llama 3.1 8B Instant at $0.05 input / $0.08 output per million tokens and rises to Llama 3.3 70B Versatile at $0.59 input / $0.79 output. The GPT OSS routes add useful middle tiers, including GPT OSS 20B at $0.075 input / $0.30 output and GPT OSS 120B at $0.15 input / $0.60 output.
For live rates, keep the Groq pricing page, full AI pricing table, and token cost calculator open while modeling your own traffic. For direct provider comparisons, see Groq vs OpenAI pricing, OpenAI pricing, and Anthropic pricing.
Groq API Pricing: Quick Reference
All prices below are USD per 1 million tokens from the current AI Pricing Guru tracker.
| Groq model | Status | Input | Cached input | Output | Best fit |
|---|---|---|---|---|---|
| Llama 3.1 8B Instant | Active | $0.05 | n/a | $0.08 | Cheapest fast route, classification, routing, simple support |
| GPT OSS 20B | Active | $0.075 | $0.0375 | $0.30 | Low-cost general text with cached-input support |
| GPT OSS Safeguard 20B | Preview | $0.075 | n/a | $0.30 | Safety and moderation-style checks |
| Llama 4 Scout 17B 16E Instruct | Preview | $0.11 | n/a | $0.34 | Fast preview Llama 4 workloads |
| GPT OSS 120B | Active | $0.15 | $0.075 | $0.60 | Larger general text route at low cost |
| Qwen3 32B | Preview | $0.29 | n/a | $0.59 | Coding, multilingual, structured text experiments |
| Llama 3.3 70B Versatile | Active | $0.59 | n/a | $0.79 | Stronger open model when small routes fail |
The headline takeaways:
- Groq’s cheapest tracked model is very cheap: $0.05 input and $0.08 output per million tokens.
- GPT OSS 20B and GPT OSS 120B are the most useful middle tiers because they pair low prices with larger models than the 8B route.
- Only the GPT OSS 20B and 120B rows currently show tracked cached-input pricing in our data.
- Preview routes can be attractive, but production teams should watch availability, rate limits, and behavior changes.
- Groq’s biggest advantage is not only token price. It is low-latency inference for workloads where speed changes the user experience.
Which Groq Model Should You Use?
Use Llama 3.1 8B Instant for cheap utility calls
Llama 3.1 8B Instant is Groq’s lowest-cost tracked model at $0.05 input / $0.08 output per million tokens. It is the first route to test when you need speed, volume, and predictable output formats more than premium reasoning.
Good fits include:
- intent detection
- basic classification
- short summaries
- query rewriting
- title and metadata generation
- low-risk support drafts
- routing before a stronger model
- first-pass extraction from clean text
This is not the model to trust with difficult reasoning, nuanced policy interpretation, or high-value final answers. Its job is to keep infrastructure calls cheap and fast. If the task is easy to check or can escalate when confidence is low, start here.
Use GPT OSS 20B for stronger low-cost text
GPT OSS 20B costs $0.075 input, $0.0375 cached input, and $0.30 output per million tokens in the current tracker. It is still inexpensive, but it gives you a larger route than Llama 3.1 8B Instant.
This is a sensible default for:
- support response drafts
- structured summaries
- lightweight RAG answers
- content classification with explanations
- internal workflow automation
- simple coding triage
- batch cleanup and transformation
The cached-input rate matters if you repeatedly send the same system prompt, policy, schema, retrieval scaffold, or tool instructions. At $0.0375 per million cached input tokens, half the standard input rate and one-eighth of the output rate, repeated context can become a minor cost compared with output.
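As a rough illustration of the cache economics, the sketch below blends the standard and cached input rates. The `cached_fraction` parameter and the 100-million-token volume are assumptions for illustration only; how much of your input is actually billed at the cached rate depends on Groq's caching behavior and your prompt structure, which this article does not specify.

```python
def input_cost(input_millions, input_price, cached_price=None, cached_fraction=0.0):
    """Return input cost in USD; prices are USD per 1 million tokens.

    cached_fraction is a hypothetical share of input tokens billed at the
    cached rate. Models without a tracked cached rate bill everything in full.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if cached_price is None:
        return input_millions * input_price
    cached = input_millions * cached_fraction
    fresh = input_millions - cached
    return fresh * input_price + cached * cached_price


# GPT OSS 20B rates from the tracker; volume and cache share are assumptions.
print(input_cost(100, 0.075))                # no caching
print(input_cost(100, 0.075, 0.0375, 0.8))   # 80% of input billed as cached
```

Under these assumed numbers the input bill drops from about $7.50 to about $4.50. The saving is real but capped at half of the input cost, so output volume usually matters more.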
Use GPT OSS 120B when quality matters more
GPT OSS 120B sits at $0.15 input, $0.075 cached input, and $0.60 output. That puts it in the same tracked token-price band as some low-cost proprietary models, while keeping Groq’s speed advantage in play.
Use it when 20B is not reliable enough but you still want to avoid jumping to a frontier model:
- higher-quality support drafts
- agent planning steps
- code explanation and triage
- dense document summaries
- search answer synthesis
- data cleaning with more edge cases
- first-pass analysis before escalation
GPT OSS 120B is often the practical “try this before a premium model” tier. If it passes your evals, the cost difference versus GPT-5.4, Claude Opus, or other frontier routes can be large.
Use Qwen3 32B and Llama 4 Scout for experiments
Qwen3 32B is priced at $0.29 input / $0.59 output, while Llama 4 Scout 17B 16E Instruct is $0.11 input / $0.34 output. Both are preview in the current tracker, so treat them as candidates to benchmark rather than permanent defaults.
Qwen3 32B is worth testing for coding, multilingual text, structured output, and technical workflows. Llama 4 Scout is worth testing when you want a newer Llama-family route with low input cost and very fast responses.
Preview status does not mean “do not use.” It means you should verify behavior, latency, rate limits, and output quality before relying on the model for customer-facing production traffic.
Use Llama 3.3 70B Versatile for stronger open-model quality
Llama 3.3 70B Versatile is the most expensive Groq model in the current tracked table at $0.59 input / $0.79 output. It can still be inexpensive compared with premium frontier models, but it is no longer the ultra-cheap layer.
Choose it when:
- 8B and 20B routes fail quality checks
- you want stronger open-model behavior
- output accuracy matters more than the lowest bill
- latency still matters
- you want a non-frontier fallback before using OpenAI, Anthropic, or Google
For many teams, Llama 3.3 70B is an escalation route inside the Groq stack. It should not automatically be the first route for every request.
Example Monthly Costs
Token price becomes easier to reason about when you model a real workload. Assume a support and internal automation product uses 100 million input tokens and 50 million output tokens per month.
| Groq model | Monthly token cost |
|---|---|
| Llama 3.1 8B Instant | $9.00 |
| GPT OSS 20B | $22.50 |
| GPT OSS Safeguard 20B | $22.50 |
| Llama 4 Scout 17B 16E Instruct | $28.00 |
| GPT OSS 120B | $45.00 |
| Qwen3 32B | $58.50 |
| Llama 3.3 70B Versatile | $98.50 |
The same traffic pattern can cost under $10 per month on the cheapest Groq route, under $50 on GPT OSS 120B, or just under $100 on Llama 3.3 70B. That spread is why model routing matters.
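The table above is plain arithmetic: input volume times input price plus output volume times output price. A minimal sketch for rerunning it with your own volumes follows. The prices are copied from the quick-reference table, and the function and variable names are illustrative.

```python
# USD per 1 million tokens: (input, output), from the quick-reference table.
PRICES = {
    "Llama 3.1 8B Instant": (0.05, 0.08),
    "GPT OSS 20B": (0.075, 0.30),
    "GPT OSS Safeguard 20B": (0.075, 0.30),
    "Llama 4 Scout 17B 16E Instruct": (0.11, 0.34),
    "GPT OSS 120B": (0.15, 0.60),
    "Qwen3 32B": (0.29, 0.59),
    "Llama 3.3 70B Versatile": (0.59, 0.79),
}


def monthly_cost(model, input_millions, output_millions):
    """Monthly token cost in USD for volumes given in millions of tokens."""
    input_price, output_price = PRICES[model]
    return round(input_millions * input_price + output_millions * output_price, 2)


# 100 million input tokens and 50 million output tokens per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 100, 50):,.2f}")
```

Changing the two volume arguments to `1000` and `100` gives the figures for a workload of 1 billion input tokens and 100 million output tokens.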
Now assume a heavier extraction workload with 1 billion input tokens and 100 million output tokens per month.
| Groq model | Monthly token cost |
|---|---|
| Llama 3.1 8B Instant | $58.00 |
| GPT OSS 20B | $105.00 |
| GPT OSS Safeguard 20B | $105.00 |
| Llama 4 Scout 17B 16E Instruct | $144.00 |
| GPT OSS 120B | $210.00 |
| Qwen3 32B | $349.00 |
| Llama 3.3 70B Versatile | $669.00 |
For extraction-heavy systems, input price dominates: in this workload, input tokens account for roughly 71% to 88% of the bill, depending on the model. Llama 3.1 8B Instant is extremely hard to beat if it produces valid structured output. If it fails too often, GPT OSS 20B and GPT OSS 120B are still cheap escalation steps.
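One way to check whether input or output drives a given workload is to compute the input share of the bill. The helper below is an illustrative sketch, not part of any pricing tool.

```python
def input_share(input_price, output_price, input_millions, output_millions):
    """Fraction of the token bill that comes from input tokens."""
    input_cost = input_price * input_millions
    output_cost = output_price * output_millions
    return input_cost / (input_cost + output_cost)


# Llama 3.1 8B Instant at $0.05 input / $0.08 output per million tokens.
extraction = input_share(0.05, 0.08, 1000, 100)  # 1B input, 100M output
support = input_share(0.05, 0.08, 100, 50)       # 100M input, 50M output
print(f"extraction workload: {extraction:.0%} of cost is input")
print(f"support workload: {support:.0%} of cost is input")
```

With the same model, the extraction workload spends about 86% of its bill on input, while the earlier support workload spends about 56%. The traffic shape, not only the price list, decides which rate to optimize.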
Hidden Groq Costs to Watch
Output tokens can erase cheap-input savings
Groq’s input prices are low, but output still matters. A model at $0.15 input / $0.60 output charges four times as much for an output token as for an input token. If your app generates long answers, verbose explanations, markdown tables, or repeated summaries, output can become the larger line item.
Control output length with concise prompts, clear schemas, and strict response budgets. For extraction, ask for compact JSON. For support drafts, ask for a reply that fits the channel.
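To see where output overtakes input on a single request, the sketch below computes the break-even response length and compares a concise reply with a verbose one. The rates are the GPT OSS 120B rates from the tracker. The 1,000-token prompt and the 150- and 800-token replies are hypothetical examples.

```python
def breakeven_output_tokens(input_tokens, input_price, output_price):
    """Output length at which output cost equals input cost for one request."""
    return input_tokens * input_price / output_price


def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD of one request; prices are USD per 1 million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# GPT OSS 120B: $0.15 input / $0.60 output per million tokens.
print(breakeven_output_tokens(1_000, 0.15, 0.60))  # roughly 250 tokens
print(request_cost(1_000, 150, 0.15, 0.60))        # concise reply
print(request_cost(1_000, 800, 0.15, 0.60))        # verbose reply
```

With a 1,000-token prompt, any reply longer than about 250 tokens costs more than the prompt did. In this example the verbose reply makes the request more than twice as expensive as the concise one.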
Quality failures have a cost
The cheapest model is not cheapest if it requires retries, human cleanup, or frequent escalation. A failed classification that sends a customer to the wrong workflow can cost more than the token bill. A bad code patch can cost more than a premium model call.
Measure cost per accepted result, not just cost per million tokens. Run a representative eval set before moving production traffic.
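Cost per accepted result can be estimated as total spend divided by the number of accepted outputs. The sketch below is one simple model of that idea. The acceptance rates, per-call token costs, and the $0.05 failure-handling cost are all hypothetical placeholders; replace them with numbers from your own eval set.

```python
def cost_per_accepted(token_cost_per_call, acceptance_rate, failure_cost=0.0):
    """Total spend divided by accepted results.

    Every call pays token_cost_per_call. Each rejected call also pays
    failure_cost (for example human cleanup or an escalation call).
    """
    if not 0.0 < acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in (0, 1]")
    spend_per_call = token_cost_per_call + (1.0 - acceptance_rate) * failure_cost
    return spend_per_call / acceptance_rate


# Hypothetical numbers: a cheap route that fails more often vs a pricier one.
cheap = cost_per_accepted(0.0001, acceptance_rate=0.80, failure_cost=0.05)
strong = cost_per_accepted(0.0006, acceptance_rate=0.98, failure_cost=0.05)
print(f"cheap route:  ${cheap:.6f} per accepted result")
print(f"strong route: ${strong:.6f} per accepted result")
```

Under these assumed numbers the route with the lower token price ends up several times more expensive per accepted result, because failure handling dominates the token bill.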
Preview models need monitoring
Preview models can be useful, but they deserve closer operational monitoring. Track output quality, latency, refusal behavior, JSON validity, and rate-limit behavior. If a preview route is part of your production stack, keep a fallback on an Active (non-preview) route, either with the same provider or with another one.
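JSON validity is one of the easier signals to track automatically. The sketch below measures it over a sample of raw outputs and flags when a fallback may be warranted. The 0.95 threshold and the sample outputs are assumptions for illustration, and a real monitor would also track latency, refusals, and rate-limit errors.

```python
import json


def json_validity_rate(outputs):
    """Fraction of raw model outputs that parse as JSON."""
    if not outputs:
        raise ValueError("need at least one output to measure")
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except (json.JSONDecodeError, TypeError):
            pass  # count as invalid
    return valid / len(outputs)


def should_fall_back(outputs, min_valid_rate=0.95):
    """True when the validity rate drops below a (hypothetical) threshold."""
    return json_validity_rate(outputs) < min_valid_rate


# Made-up sample of outputs from a preview route.
sample = ['{"intent": "refund"}', '{"intent": ', "Sure! Here is the JSON"]
print(json_validity_rate(sample), should_fall_back(sample))
```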
Cached input is not universal
In the current tracker, cached-input rates are present for GPT OSS 20B and GPT OSS 120B, not every Groq model. If your workload depends on repeated large context, test the exact model and API path instead of assuming every route has the same cache economics.
Best Groq Model by Use Case
| Use case | Start with | Escalate to |
|---|---|---|
| Intent routing and tagging | Llama 3.1 8B Instant | GPT OSS 20B |
| Basic support drafts | GPT OSS 20B | GPT OSS 120B |
| High-volume extraction | Llama 3.1 8B Instant | GPT OSS 20B or 120B |
| Coding triage | GPT OSS 120B or Qwen3 32B | OpenAI GPT-5.4 mini or Claude |
| Realtime search assistant | Llama 4 Scout or GPT OSS 20B | GPT OSS 120B |
| Safety checks | GPT OSS Safeguard 20B | A dedicated moderation or policy route |
| Stronger open-model answers | Llama 3.3 70B Versatile | OpenAI, Anthropic, or Google premium model |
The best Groq setup is usually layered:
- Use Llama 3.1 8B Instant for cheap, fast, easily checked tasks.
- Use GPT OSS 20B as the low-cost default for general text.
- Use GPT OSS 120B or Qwen3 32B when quality needs to rise.
- Use Llama 3.3 70B Versatile when you want stronger open-model behavior.
- Escalate to OpenAI, Anthropic, Google, or DeepSeek only when the Groq route fails the task.
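The layered setup above can be sketched as a simple escalation loop. Everything here is illustrative: `call_model` and `passes_check` are hypothetical callables you would supply, the ladder uses the display names from this article rather than actual API model IDs, and the stub at the bottom stands in for a real client.

```python
GROQ_LADDER = [
    "Llama 3.1 8B Instant",
    "GPT OSS 20B",
    "GPT OSS 120B",
    "Llama 3.3 70B Versatile",
]


def route(prompt, call_model, passes_check, ladder=GROQ_LADDER):
    """Try each model in order and return the first answer that passes.

    Returns (model, answer), or (None, None) when every rung fails so the
    caller can escalate to a premium provider.
    """
    for model in ladder:
        answer = call_model(model, prompt)
        if passes_check(answer):
            return model, answer
    return None, None


# Stub client for demonstration only: pretends the two smallest routes fail.
def fake_call(model, prompt):
    return "ok" if model in ("GPT OSS 120B", "Llama 3.3 70B Versatile") else ""


print(route("Summarize this ticket", fake_call, passes_check=bool))
```

In practice the check would be a schema validator, a confidence threshold, or an eval-derived rule, and easy tasks would exit at the first rung so that only the hard remainder pays for larger models.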
Groq vs Other Providers
Compared with OpenAI, Groq’s advantage is low-cost, low-latency open-model inference. OpenAI’s advantage is a deeper proprietary model ladder, multimodal breadth, tool behavior, enterprise familiarity, and higher quality ceilings. The practical strategy is often Groq for bulk traffic and OpenAI for escalation.
Compared with Anthropic, Groq is much cheaper on listed token price for many text workloads. Claude can still win when writing quality, coding judgment, long-context reasoning, or agent behavior matters enough to reduce retries.
Compared with Google Gemini, Groq is usually stronger as a fast inference host for open models, while Gemini is stronger when you want Google’s multimodal stack, long-context behavior, and the Vertex AI procurement path.
Compared with DeepSeek, Groq competes on speed and hosted inference. DeepSeek can win on raw low-cost model pricing in some routes, especially if its own models pass your evals. Groq wins when the specific hosted model and latency profile fit the product better.
FAQ
What is the cheapest Groq API model?
Llama 3.1 8B Instant is the cheapest Groq model in the current AI Pricing Guru tracker at $0.05 per million input tokens and $0.08 per million output tokens.
Does Groq have cached input pricing?
Some tracked Groq routes do. GPT OSS 20B shows $0.0375 cached input per million tokens, and GPT OSS 120B shows $0.075 cached input. Other Groq rows in the current tracker do not show cached-input pricing.
Is Groq cheaper than OpenAI?
Often, yes, especially compared with OpenAI’s premium GPT-5.4 and GPT-5.5 tiers. But OpenAI’s smallest models can be competitive for simple workloads. Compare the exact model pair with the token calculator.
Can Groq replace frontier models?
Not for every task. Groq can replace many bulk text, routing, classification, extraction, and low-risk assistant calls when open models pass evals. Hard reasoning, sensitive outputs, premium agents, and complex coding may still need OpenAI, Anthropic, Google, or another stronger route.
Bottom Line
Groq is best when speed and cost both matter, the workload is text-first, and success can be measured with evals. Start with the cheapest route that passes, then add stronger Groq models and premium-provider escalation only where needed.
For most teams, Groq should be a routing layer, not a religion. Use it to absorb cheap, fast, high-volume work. Keep stronger models available for hard or risky requests. Then model your actual traffic with the AI token cost calculator against the live Groq pricing page.
Last updated: June 30, 2026, using AI Pricing Guru’s tracked pricing data.