Groq API Pricing Guide 2026

Groq API pricing is built for a different buying decision than OpenAI, Anthropic, or Google. You are not paying for Groq’s own frontier model family. You are paying for very fast hosted inference on open and open-adjacent models such as Llama, Qwen, and GPT OSS variants.

That makes Groq especially interesting for teams with high request volume, latency-sensitive interfaces, or workloads where a smaller open model passes evals. The savings are easy to overstate, though. Groq can be dramatically cheaper than premium frontier models, but the right comparison is against the exact model you would otherwise use, not the provider logo.

In AI Pricing Guru’s tracked pricing data, updated on June 30, 2026, Groq’s current public model ladder starts at Llama 3.1 8B Instant at $0.05 input / $0.08 output per million tokens and rises to Llama 3.3 70B Versatile at $0.59 input / $0.79 output per million tokens. The GPT OSS routes add useful middle tiers, including GPT OSS 20B at $0.075 input / $0.30 output and GPT OSS 120B at $0.15 input / $0.60 output.

For live rates, keep the Groq pricing page, full AI pricing table, and token cost calculator open while modeling your own traffic. For direct provider comparisons, see Groq vs OpenAI pricing, OpenAI pricing, and Anthropic pricing.

Groq API Pricing: Quick Reference

All prices below are USD per 1 million tokens from the current AI Pricing Guru tracker.

Groq model	Status	Input	Cached input	Output	Best fit
Llama 3.1 8B Instant	Active	$0.05	n/a	$0.08	Cheapest fast route, classification, routing, simple support
GPT OSS 20B	Active	$0.075	$0.0375	$0.30	Low-cost general text with cached-input support
GPT OSS Safeguard 20B	Preview	$0.075	n/a	$0.30	Safety and moderation-style checks
Llama 4 Scout 17B 16E Instruct	Preview	$0.11	n/a	$0.34	Fast preview Llama 4 workloads
GPT OSS 120B	Active	$0.15	$0.075	$0.60	Larger general text route at low cost
Qwen3 32B	Preview	$0.29	n/a	$0.59	Coding, multilingual, structured text experiments
Llama 3.3 70B Versatile	Active	$0.59	n/a	$0.79	Stronger open model when small routes fail

The headline takeaways:

Groq’s cheapest tracked model is very cheap: $0.05 input and $0.08 output per million tokens.
GPT OSS 20B and GPT OSS 120B are the most useful middle tiers because they combine low prices with larger models than the cheapest 8B route.
Only the GPT OSS 20B and 120B rows currently show tracked cached-input pricing in our data.
Preview routes can be attractive, but production teams should watch availability, rate limits, and behavior changes.
Groq’s biggest advantage is not token price alone. It is low-latency inference for workloads where speed changes the user experience.

Which Groq Model Should You Use?

Use Llama 3.1 8B Instant for cheap utility calls

Llama 3.1 8B Instant is Groq’s lowest-cost tracked model at $0.05 input / $0.08 output per million tokens. It is the first route to test when you need speed, volume, and predictable output formats more than premium reasoning.

Good fits include:

intent detection
basic classification
short summaries
query rewriting
title and metadata generation
low-risk support drafts
routing before a stronger model
first-pass extraction from clean text

This is not the model to trust with difficult reasoning, nuanced policy interpretation, or high-value final answers. Its job is to keep infrastructure calls cheap and fast. If the task is easy to check or can escalate when confidence is low, start here.

Use GPT OSS 20B for stronger low-cost text

GPT OSS 20B costs $0.075 input, $0.0375 cached input, and $0.30 output per million tokens in the current tracker. It is still inexpensive, but it gives you a larger model than Llama 3.1 8B Instant.

This is a sensible default for:

support response drafts
structured summaries
lightweight RAG answers
content classification with explanations
internal workflow automation
simple coding triage
batch cleanup and transformation

The cached-input rate matters if you repeatedly send the same system prompt, policy, schema, retrieval scaffold, or tool instructions. At $0.0375 per million cached input tokens, repeated context can become almost negligible compared with output.
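As a rough sketch of how the cached rate changes the input bill, the helper below blends the two GPT OSS 20B input rates from the tracker. The function name `blended_input_cost` is hypothetical and not part of any Groq SDK, and the 80% cached share is an illustrative assumption; this guide does not cover how tokens qualify for the cached rate.

```python
def blended_input_cost(input_tokens, cached_fraction, input_price, cached_price):
    """Return the USD input cost when part of the input is billed at the cached rate.

    Prices are USD per 1 million tokens. cached_fraction is the share of input
    tokens assumed to be billed at the cached rate (an assumption, not a
    measured cache-hit rate).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    millions = input_tokens / 1_000_000
    cached_cost = millions * cached_fraction * cached_price
    uncached_cost = millions * (1.0 - cached_fraction) * input_price
    return round(cached_cost + uncached_cost, 2)


if __name__ == "__main__":
    # GPT OSS 20B rates from the tracker: $0.075 input, $0.0375 cached input.
    no_cache = blended_input_cost(100_000_000, 0.0, 0.075, 0.0375)
    mostly_cached = blended_input_cost(100_000_000, 0.8, 0.075, 0.0375)
    print(f"No caching: ${no_cache:.2f}, 80% cached: ${mostly_cached:.2f}")
```

Under that assumption, 100 million input tokens cost $4.50 instead of $7.50 at the uncached rate.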

Use GPT OSS 120B when quality matters more

GPT OSS 120B sits at $0.15 input, $0.075 cached input, and $0.60 output. That puts it in the same tracked token-price band as some low-cost proprietary models, while keeping Groq’s speed advantage in play.

Use it when 20B is not reliable enough but you still want to avoid jumping to a frontier model:

higher-quality support drafts
agent planning steps
code explanation and triage
dense document summaries
search answer synthesis
data cleaning with more edge cases
first-pass analysis before escalation

GPT OSS 120B is often the practical “try this before a premium model” tier. If it passes your evals, the cost difference versus GPT-5.4, Claude Opus, or other frontier routes can be large.

Use Qwen3 32B and Llama 4 Scout for experiments

Qwen3 32B is priced at $0.29 input / $0.59 output, while Llama 4 Scout 17B 16E Instruct is $0.11 input / $0.34 output. Both are marked Preview in the current tracker, so treat them as candidates to benchmark rather than permanent defaults.

Qwen3 32B is worth testing for coding, multilingual text, structured output, and technical workflows. Llama 4 Scout is worth testing when you want a newer Llama-family route with low input cost and very fast responses.

Preview status does not mean “do not use.” It means you should verify behavior, latency, rate limits, and output quality before relying on the model for customer-facing production traffic.

Use Llama 3.3 70B Versatile for stronger open-model quality

Llama 3.3 70B Versatile is the most expensive Groq model in the current tracked table at $0.59 input / $0.79 output. It can still be inexpensive compared with premium frontier models, but it is no longer the ultra-cheap layer.

Choose it when:

8B and 20B routes fail quality checks
you want stronger open-model behavior
output accuracy matters more than the lowest bill
latency still matters
you want a non-frontier fallback before using OpenAI, Anthropic, or Google

For many teams, Llama 3.3 70B is an escalation route inside the Groq stack. It should not automatically be the first route for every request.

Example Monthly Costs

Token price becomes easier to reason about when you model a real workload. Assume a support and internal automation product uses 100 million input tokens and 50 million output tokens per month.

Groq model	Monthly token cost
Llama 3.1 8B Instant	$9.00
GPT OSS 20B	$22.50
GPT OSS Safeguard 20B	$22.50
Llama 4 Scout 17B 16E Instruct	$28.00
GPT OSS 120B	$45.00
Qwen3 32B	$58.50
Llama 3.3 70B Versatile	$98.50

The same traffic pattern can cost under $10 per month on the cheapest Groq route, under $50 on GPT OSS 120B, or just under $100 on Llama 3.3 70B. That spread is why model routing matters.
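The figures in the table above come from multiplying token volume (in millions) by the per-million rates. A minimal sketch of that arithmetic follows; the `PRICES` dictionary and `monthly_cost` function are illustrative names, and the rates are copied from the quick-reference table rather than fetched from any live source.

```python
# USD per 1 million tokens as (input, output), copied from the quick-reference table.
PRICES = {
    "Llama 3.1 8B Instant": (0.05, 0.08),
    "GPT OSS 20B": (0.075, 0.30),
    "GPT OSS Safeguard 20B": (0.075, 0.30),
    "Llama 4 Scout 17B 16E Instruct": (0.11, 0.34),
    "GPT OSS 120B": (0.15, 0.60),
    "Qwen3 32B": (0.29, 0.59),
    "Llama 3.3 70B Versatile": (0.59, 0.79),
}


def monthly_cost(model, input_tokens, output_tokens):
    """Return the monthly token cost in USD, rounded to cents."""
    input_price, output_price = PRICES[model]
    cost = (input_tokens / 1_000_000) * input_price
    cost += (output_tokens / 1_000_000) * output_price
    return round(cost, 2)


if __name__ == "__main__":
    # 100 million input tokens and 50 million output tokens per month.
    for name in PRICES:
        print(f"{name}: ${monthly_cost(name, 100_000_000, 50_000_000):.2f}")
```

Swapping in your own token volumes gives a first estimate. It ignores cached-input discounts, retries, and any non-token charges.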

Now assume a heavier extraction workload with 1 billion input tokens and 100 million output tokens per month.

Groq model	Monthly token cost
Llama 3.1 8B Instant	$58.00
GPT OSS 20B	$105.00
GPT OSS Safeguard 20B	$105.00
Llama 4 Scout 17B 16E Instruct	$144.00
GPT OSS 120B	$210.00
Qwen3 32B	$349.00
Llama 3.3 70B Versatile	$669.00

For extraction-heavy systems, input price dominates. Llama 3.1 8B Instant is extremely hard to beat if it produces valid structured output. If it fails too often, GPT OSS 20B and GPT OSS 120B are still cheap escalation steps.
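To see how much escalation the cheapest route can absorb, the sketch below adds the cost of re-running a share of traffic on a stronger route. It assumes every request first runs on the cheap route and is billed even when it fails, and that escalated requests use the same token counts. Both function names are hypothetical, and the 10% escalation rate is an illustrative input, not a measured failure rate.

```python
def cost_with_escalation(base_cost, escalation_cost, escalation_rate):
    """Monthly USD cost when a share of traffic is re-run on a stronger route.

    base_cost: monthly cost of running all traffic on the cheap route.
    escalation_cost: monthly cost of running all traffic on the stronger route.
    escalation_rate: share of requests (0 to 1) that are re-run after failing.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return round(base_cost + escalation_rate * escalation_cost, 2)


def breakeven_escalation_rate(base_cost, escalation_cost):
    """Escalation rate at which the two-step setup costs as much as using the stronger route alone."""
    return (escalation_cost - base_cost) / escalation_cost


if __name__ == "__main__":
    # Extraction workload from the table: $58.00 on Llama 3.1 8B Instant, $105.00 on GPT OSS 20B.
    print(cost_with_escalation(58.00, 105.00, 0.10))
    print(round(breakeven_escalation_rate(58.00, 105.00), 3))
```

With these table values, a 10% escalation rate brings the blended bill to $68.50, and the two-step setup stays cheaper than GPT OSS 20B alone until roughly 45% of requests need escalation.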

Hidden Groq Costs to Watch

Output tokens can erase cheap-input savings

Groq’s input prices are low, but output still matters. A model at $0.15 input / $0.60 output charges four times as much per output token as per input token. If your app generates long answers, verbose explanations, markdown tables, or repeated summaries, output can become the larger line item.

Control output length with concise prompts, clear schemas, and strict response budgets. For extraction, ask for compact JSON. For support drafts, ask for a reply that fits the channel.
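One way to decide whether a response budget is worth enforcing is to check what share of the bill comes from output. The sketch below uses two hypothetical helpers, `output_share` and `breakeven_output_ratio`, with the $0.15 / $0.60 rates from the example above.

```python
def output_share(input_tokens, output_tokens, input_price, output_price):
    """Return the fraction of the token bill that comes from output tokens."""
    input_cost = input_tokens * input_price
    output_cost = output_tokens * output_price
    total = input_cost + output_cost
    return output_cost / total if total else 0.0


def breakeven_output_ratio(input_price, output_price):
    """Output-to-input token ratio at which output cost equals input cost."""
    return input_price / output_price


if __name__ == "__main__":
    # 100 million input and 50 million output tokens at $0.15 / $0.60 per million.
    print(round(output_share(100_000_000, 50_000_000, 0.15, 0.60), 3))
    print(breakeven_output_ratio(0.15, 0.60))
```

At those rates, output becomes the larger line item once a workload generates more than one output token for every four input tokens. In the 100 million / 50 million example, output is about two thirds of the bill.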

Quality failures have a cost

The cheapest model is not cheapest if it requires retries, human cleanup, or frequent escalation. A failed classification that sends a customer to the wrong workflow can cost more than the token bill. A bad code patch can cost more than a premium model call.

Measure cost per accepted result, not just cost per million tokens. Run a representative eval set before moving production traffic.
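A minimal sketch of that metric is shown below. It assumes each attempt is accepted independently with the same probability and that failures are retried until one is accepted, so the expected number of attempts is one divided by the acceptance rate. The function name and all numbers in the usage example are illustrative assumptions, not measured results.

```python
def cost_per_accepted_result(cost_per_call, acceptance_rate, cleanup_cost_per_failure=0.0):
    """Expected USD cost to obtain one accepted result.

    Assumes independent attempts retried until accepted, so the expected
    number of attempts is 1 / acceptance_rate and the expected number of
    failures is that value minus one.
    """
    if not 0.0 < acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be greater than 0 and at most 1")
    expected_attempts = 1.0 / acceptance_rate
    expected_failures = expected_attempts - 1.0
    return expected_attempts * cost_per_call + expected_failures * cleanup_cost_per_failure


if __name__ == "__main__":
    # Hypothetical inputs: a cheap route that fails half the time versus a
    # route that costs ten times as much per call but is accepted 95% of the time.
    cheap = cost_per_accepted_result(0.0001, 0.50, cleanup_cost_per_failure=0.01)
    strong = cost_per_accepted_result(0.001, 0.95, cleanup_cost_per_failure=0.01)
    print(f"cheap: ${cheap:.4f}, strong: ${strong:.4f}")
```

With those made-up inputs, the cheap route costs about $0.0102 per accepted result and the stronger route about $0.0016, because cleanup after failures outweighs the token price. Replace the inputs with acceptance rates from your own eval set.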

Preview models need monitoring

Preview models can be useful, but they deserve closer operational monitoring. Track output quality, latency, refusal behavior, JSON validity, and rate-limit behavior. If a preview route is part of your production stack, keep an active fallback route in the same provider or another provider.
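A lightweight way to track some of these signals is a per-route counter that your request wrapper updates after each call. The sketch below covers JSON validity, latency, and rate-limit counts only. The `RouteStats` class, its thresholds, and the way a rate-limited call is reported are all assumptions for illustration; it does not call any provider API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RouteStats:
    """Running counters for one model route."""

    calls: int = 0
    valid_json: int = 0
    rate_limited: int = 0
    latencies_ms: list = field(default_factory=list)

    def record(self, raw_output, latency_ms, rate_limited=False):
        """Record one call. Rate-limited calls count as calls without valid output."""
        self.calls += 1
        self.latencies_ms.append(latency_ms)
        if rate_limited:
            self.rate_limited += 1
            return
        try:
            json.loads(raw_output)
            self.valid_json += 1
        except (TypeError, ValueError):
            # Not parseable as JSON (json.JSONDecodeError is a ValueError).
            pass

    @property
    def json_validity(self):
        """Share of all recorded calls that returned parseable JSON."""
        return self.valid_json / self.calls if self.calls else 0.0

    def should_fall_back(self, min_validity=0.95, min_calls=20):
        """True once enough calls are recorded and validity is below the threshold."""
        return self.calls >= min_calls and self.json_validity < min_validity
```

The 95% validity threshold and 20-call minimum are placeholders. Tune them against your own traffic, and add refusal and quality checks that fit your task.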

Cached input is not universal

In the current tracker, cached-input rates are present for GPT OSS 20B and GPT OSS 120B, not for every Groq model. If your workload depends on repeated large context, test the exact model and API path instead of assuming every route has the same cache economics.

Best Groq Model by Use Case

Use case	Start with	Escalate to
Intent routing and tagging	Llama 3.1 8B Instant	GPT OSS 20B
Basic support drafts	GPT OSS 20B	GPT OSS 120B
High-volume extraction	Llama 3.1 8B Instant	GPT OSS 20B or 120B
Coding triage	GPT OSS 120B or Qwen3 32B	OpenAI GPT-5.4 mini or Claude
Realtime search assistant	Llama 4 Scout or GPT OSS 20B	GPT OSS 120B
Safety checks	GPT OSS Safeguard 20B	A dedicated moderation or policy route
Stronger open-model answers	Llama 3.3 70B Versatile	OpenAI, Anthropic, or Google premium model

The best Groq setup is usually layered:

Use Llama 3.1 8B Instant for cheap, fast, easily checked tasks.
Use GPT OSS 20B as the low-cost default for general text.
Use GPT OSS 120B or Qwen3 32B when quality needs to rise.
Use Llama 3.3 70B Versatile when you want stronger open-model behavior.
Escalate to OpenAI, Anthropic, Google, or DeepSeek only when the Groq route fails the task.
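The layering above can be sketched as an ordered list of routes tried in turn until one passes a task-specific check. In this sketch, each route is a plain Python callable that you supply (for example, a wrapper around your own client code), and `run_with_escalation` and `is_acceptable` are hypothetical names. Nothing here calls a real provider API.

```python
from typing import Callable, Optional, Sequence, Tuple

# A route is a display name plus a callable that takes a prompt and returns text.
Route = Tuple[str, Callable[[str], str]]


def run_with_escalation(
    prompt: str,
    routes: Sequence[Route],
    is_acceptable: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try routes from cheapest to strongest and return the first accepted output.

    Returns (route_name, output), or (None, None) if every route fails.
    """
    for name, call in routes:
        try:
            output = call(prompt)
        except Exception:
            # Broad catch for brevity: treat any error (timeout, rate limit)
            # as a failed attempt and move to the next route.
            continue
        if is_acceptable(output):
            return name, output
    return None, None


if __name__ == "__main__":
    # Stub callables stand in for real model calls.
    routes = [
        ("Llama 3.1 8B Instant", lambda p: ""),
        ("GPT OSS 20B", lambda p: '{"intent": "refund"}'),
    ]
    print(run_with_escalation("Classify this ticket", routes, lambda out: out.startswith("{")))
```

Ordering the list by price keeps most traffic on the cheapest route that passes. The acceptance check is where your eval criteria, such as schema validation or a confidence threshold, would go.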

Groq vs Other Providers

Compared with OpenAI, Groq’s advantage is low-cost, low-latency open-model inference. OpenAI’s advantage is a deeper proprietary model ladder, multimodal breadth, tool behavior, enterprise familiarity, and higher quality ceilings. The practical strategy is often Groq for bulk traffic and OpenAI for escalation.

Compared with Anthropic, Groq is much cheaper on listed token price for many text workloads. Claude can still win when writing quality, coding judgment, long-context reasoning, or agent behavior matters enough to reduce retries.

Compared with Google Gemini, Groq is usually stronger as a fast inference host for open models, while Gemini is stronger when you want Google’s multimodal stack, long-context behavior, and Vertex AI procurement path.

Compared with DeepSeek, Groq competes on speed and hosted inference. DeepSeek can win on raw low-cost model pricing in some routes, especially if its own models pass your evals. Groq wins when the specific hosted model and latency profile fit the product better.

FAQ

What is the cheapest Groq API model?

Llama 3.1 8B Instant is the cheapest Groq model in the current AI Pricing Guru tracker at $0.05 per million input tokens and $0.08 per million output tokens.

Does Groq have cached input pricing?

Some tracked Groq routes do. GPT OSS 20B shows $0.0375 cached input per million tokens, and GPT OSS 120B shows $0.075 cached input. Other Groq rows in the current tracker do not show cached-input pricing.

Is Groq cheaper than OpenAI?

Often, yes, especially compared with OpenAI’s premium GPT-5.4 and GPT-5.5 tiers. But OpenAI’s smallest models can be competitive for simple workloads. Compare the exact model pair with the token calculator.

Can Groq replace frontier models?

Not for every task. Groq can replace many bulk text, routing, classification, extraction, and low-risk assistant calls when open models pass evals. Hard reasoning, sensitive outputs, premium agents, and complex coding may still need OpenAI, Anthropic, Google, or another stronger route.

Bottom Line

Groq is best when speed and cost both matter, the workload is text-first, and success can be measured with evals. Start with the cheapest route that passes, then add stronger Groq models and premium-provider escalation only where needed.

For most teams, Groq should be a routing layer, not a religion. Use it to absorb cheap, fast, high-volume work. Keep stronger models available for hard or risky requests. Then model your actual traffic with the AI token cost calculator against the live Groq pricing page.

Last updated: June 30, 2026, using AI Pricing Guru’s tracked pricing data.