
OpenAI Launches Realtime Voice Models — Pricing Impact & What It Means (May 2026)

OpenAI launched GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper in the API. Here are the prices, the cost impact, and who should consider switching.

By AI Pricing Guru Editorial Team

OpenAI launched three new realtime audio models in the API today: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper.

The headline is not just better voices. OpenAI is pushing voice AI from “talk to a bot” into voice agents that can reason, translate, transcribe, call tools, and recover mid-conversation. For pricing teams, the important part is that OpenAI now has a clearer three-model stack for production voice apps:

  • GPT-Realtime-2 for speech-in, speech-out agents with GPT-5-class reasoning
  • GPT-Realtime-Translate for live multilingual speech translation
  • GPT-Realtime-Whisper for streaming speech-to-text

All three are available in the Realtime API now.

OpenAI Realtime Voice Pricing (May 2026)

Model | Price | Best use case
GPT-Realtime-2 audio input | $32 / 1M audio input tokens | Live voice agents that listen continuously
GPT-Realtime-2 cached audio input | $0.40 / 1M cached input tokens | Reused audio/context where caching applies
GPT-Realtime-2 audio output | $64 / 1M audio output tokens | Spoken responses from the agent
GPT-Realtime-2 text input | $4 / 1M tokens | Text context, tools, instructions
GPT-Realtime-2 text output | $24 / 1M tokens | Text responses inside realtime sessions
GPT-Realtime-Translate | $0.034 / minute | Live speech translation
GPT-Realtime-Whisper | $0.017 / minute | Streaming transcription

For the full OpenAI rate card, see our OpenAI API pricing page. You can also model text-token workloads in our token calculator, though minute-based voice models need a separate time-based estimate.
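As a rough sketch of how the token-based rates above combine in practice, here is a per-session estimator. The rates come from the table; the token counts in the example call are illustrative assumptions, not OpenAI-published figures.

```python
# Per-session cost sketch for GPT-Realtime-2, using the rates in the table above.
RATES_PER_1M = {
    "audio_in": 32.00,
    "audio_in_cached": 0.40,
    "audio_out": 64.00,
    "text_in": 4.00,
    "text_out": 24.00,
}

def session_cost(tokens: dict) -> float:
    """Sum dollar cost across token buckets; tokens maps bucket name -> count."""
    return sum(RATES_PER_1M[k] * v / 1_000_000 for k, v in tokens.items())

# Hypothetical 5-minute support call: mostly audio, with cached instructions.
example = {
    "audio_in": 40_000,
    "audio_in_cached": 20_000,
    "audio_out": 30_000,
    "text_in": 2_000,
    "text_out": 1_000,
}
print(f"${session_cost(example):.2f}")  # → $3.24
```

Even at these modest token counts, audio output dominates: it is priced at twice the audio input rate.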

What Changed

GPT-Realtime-2: voice agents with stronger reasoning

GPT-Realtime-2 is the main launch. OpenAI describes it as its first voice model with GPT-5-class reasoning. It is built for live conversations where the model needs to listen, think, use tools, handle interruptions, and respond naturally.

Key changes:

  • 128K context window, up from 32K for prior realtime workflows
  • Parallel tool calls so agents can check multiple systems while keeping the user informed
  • Preambles such as “let me check that” while the model works
  • Better recovery behavior when an action fails or the user changes direction
  • Stronger domain vocabulary for proper nouns, healthcare terms, specialized products, and support workflows
  • Adjustable reasoning effort: minimal, low, medium, high, and xhigh, with low as the default

OpenAI says GPT-Realtime-2 at high reasoning scores 15.2% higher on Big Bench Audio than GPT-Realtime-1.5, while the xhigh setting scores 13.8% higher on Audio MultiChallenge. Zillow also reported a 26-point lift in call success rate after prompt optimization on its hardest adversarial benchmark, moving from 69% to 95%.

That matters because voice agents fail differently from chatbots. A chatbot can pause. A voice agent must keep the conversation alive while tools run, users interrupt, names are misheard, and context changes.

GPT-Realtime-Translate: live translation as an API product

GPT-Realtime-Translate translates speech from 70+ input languages into 13 output languages while keeping pace with the speaker.

At $0.034 per minute, the unit economics are easy to understand:

Usage | Translation cost
10 minutes | $0.34
1 hour | $2.04
1,000 hours | $2,040

This is the model to watch for support centers, education platforms, travel apps, creator tools, and live events. If the quality is good enough, many businesses will stop treating multilingual voice support as a separate staffing or localization project and start treating it as an API feature.

GPT-Realtime-Whisper: streaming transcription

GPT-Realtime-Whisper is a new low-latency speech-to-text model that transcribes while someone is speaking.

At $0.017 per minute, the cost is:

Usage | Transcription cost
10 minutes | $0.17
1 hour | $1.02
1,000 hours | $1,020

That pricing targets live captions, meeting notes, call-center summaries, classroom accessibility, broadcast captions, recruiting notes, healthcare intake, and customer-support workflows where waiting until the end of a recording is too slow.

Pricing Impact: Voice Gets a Premium Stack

OpenAI’s new voice pricing makes the product segmentation clearer.

Realtime agents are premium. GPT-Realtime-2 audio tokens cost $32 / 1M input and $64 / 1M output, far above OpenAI’s standard text models. That is expected: realtime audio needs low latency, speech processing, turn-taking, and live output.

Translation and transcription are easier to budget. The two minute-priced models are simpler for buyers: a support team can estimate monthly call minutes and multiply by $0.034 or $0.017.

Caching becomes more important. GPT-Realtime-2 cached audio input is listed at $0.40 / 1M tokens, a huge discount from $32 input. If your voice agent repeatedly uses the same instructions, product catalog context, compliance script, or workflow setup, caching strategy can materially change margins.
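To see the size of that discount, consider the same context sent on every turn of a multi-turn session, priced fresh versus cached. The rates come from the table above; the context size and turn count are illustrative assumptions.

```python
# Fresh vs. cached audio input cost for a repeated 50K-token context.
FRESH = 32.00 / 1_000_000    # $ per audio input token
CACHED = 0.40 / 1_000_000    # $ per cached audio input token

context_tokens = 50_000
turns = 20

fresh_cost = context_tokens * turns * FRESH   # pay the full rate every turn
# Pay full rate once, then the cached rate on the remaining turns.
cached_cost = context_tokens * FRESH + context_tokens * (turns - 1) * CACHED

print(f"fresh: ${fresh_cost:.2f}, cached: ${cached_cost:.2f}")
# → fresh: $32.00, cached: $1.98
```

Under these assumptions, caching cuts the context portion of the bill by roughly 16x; the actual ratio depends on how much of each session the cache actually covers.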

Compared with text-only models, GPT-Realtime-2 should be reserved for moments where speech is the product experience. If you only need a transcript, use Realtime-Whisper. If you only need a text answer after transcription, route to a cheaper text model like GPT-5.4 mini. If you need live spoken reasoning and actions, GPT-Realtime-2 is the premium path.

For broader model comparisons, see our AI API pricing comparison and best AI API for developers guides.

Who Benefits

Customer support teams benefit first. A voice agent that can call tools, recover gracefully, and explain what it is doing is much closer to production support than a simple speech wrapper around chat.

Travel, real estate, healthcare, and financial services apps also benefit. These are domains where users want to speak naturally, but the system still needs to manage structured actions: check eligibility, schedule appointments, verify details, search inventory, or update an account.

Multilingual businesses get a cleaner route to global support. Realtime translation at $2.04 per hour is not free, but it is cheap compared with staffing every language pair for every support queue.

Accessibility and education products get lower-latency transcription and translation building blocks that can work while a class, meeting, or event is happening.

Who Loses

Standalone transcription vendors face more pressure. At $0.017 per minute, OpenAI is making realtime transcription a commodity input for larger workflows.

Basic voice bot vendors also get squeezed. If a product’s main value is “chatbot plus speech,” GPT-Realtime-2 raises user expectations for reasoning, recovery, tool calls, and natural delivery.

Teams that use premium realtime voice for everything may overspend. Not every audio workflow needs speech-in, speech-out reasoning. Many should be split into transcription, cheaper text reasoning, and optional text-to-speech.

What Developers Should Do Now

1. Route by job, not by hype

Use GPT-Realtime-2 only when the user experience requires a live spoken agent. Use Realtime-Whisper for captions and transcripts. Use Realtime-Translate when the core problem is language transfer.
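The rule above can be sketched as a small router. The model names follow this article; the job labels and the text-model fallback are this example's own assumptions, not a documented API.

```python
# Minimal "route by job" sketch: pick the cheapest model that fits the job.
def pick_model(job: str) -> str:
    if job == "live_voice_agent":
        return "gpt-realtime-2"            # premium speech-in, speech-out path
    if job == "live_translation":
        return "gpt-realtime-translate"    # minute-priced language transfer
    if job in ("captions", "transcription"):
        return "gpt-realtime-whisper"      # minute-priced speech-to-text
    # Non-realtime text work: route to a cheaper text model instead.
    return "gpt-5.4-mini"

assert pick_model("captions") == "gpt-realtime-whisper"
assert pick_model("summarize_transcript") == "gpt-5.4-mini"
```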

2. Test reasoning effort settings

The default is low. For simple support flows, that may be enough and should reduce latency. For complex tool use or compliance-sensitive workflows, test medium, high, and xhigh against real call recordings before shipping.

3. Build cost guards into voice agents

Voice sessions can run longer than expected. Add maximum session lengths, escalation triggers, silence handling, and clear handoff rules. A voice agent that politely loops for 30 minutes can become expensive quickly.
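A minimal guard can combine a duration cap with a spend cap computed from the audio rates above. The thresholds here are illustrative, not recommendations.

```python
# Simple per-session cost guards: cap both duration and spend.
AUDIO_IN = 32.00 / 1_000_000    # $ per audio input token
AUDIO_OUT = 64.00 / 1_000_000   # $ per audio output token

MAX_MINUTES = 15     # illustrative session-length cap
MAX_SPEND = 2.50     # illustrative dollar cap per session

def should_end(elapsed_minutes: float, in_tokens: int, out_tokens: int) -> bool:
    """True when the session has hit either the time or the spend ceiling."""
    spend = in_tokens * AUDIO_IN + out_tokens * AUDIO_OUT
    return elapsed_minutes >= MAX_MINUTES or spend >= MAX_SPEND

# A politely looping 30-minute call trips the duration guard:
assert should_end(30, 10_000, 10_000)
# A short, cheap call does not:
assert not should_end(3, 10_000, 10_000)
```

In production the same check would also trigger escalation or handoff rather than simply ending the call.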

4. Cache stable context aggressively

System prompts, policy text, product metadata, and workflow instructions should be structured for reuse. The gap between $32 input and $0.40 cached input is too large to ignore.

Bottom Line

OpenAI’s May 7 voice launch is a major Realtime API upgrade. GPT-Realtime-2 brings stronger reasoning and tool use to live voice agents, while Realtime-Translate and Realtime-Whisper give developers simple minute-based pricing for translation and transcription.

The pricing message is clear: voice agents are premium, while transcription and translation are becoming utility layers.

If you are building customer support, travel, healthcare intake, education, accessibility, or multilingual commerce, this launch is worth testing immediately. If your workflow is not truly realtime, route carefully — the cheapest architecture may still be transcription plus a lower-cost text model, not a full speech-to-speech agent for every interaction.


Sources: OpenAI announcement and OpenAI API pricing.