AI Voice & TTS API Pricing

Compare developer pricing for text-to-speech, realtime voice, dubbing, translation, and speech APIs across ElevenLabs, Speechify, OpenAI, Google Cloud, and Amazon Polly. Last checked .

Quick answer: for simple narration, Amazon Polly Standard and Google Standard/WaveNet are cheapest at $4 per 1M characters. Speechify lists $10 per 1M characters. ElevenLabs is priced for higher-end AI voice quality at $0.05-$0.10 per 1K characters ($50-$100 per 1M characters). OpenAI is strongest when the job is realtime voice, translation, or transcription rather than static narration.

  • Character-based rates are normalized to 1M input characters. They do not include free tiers, taxes, committed-use discounts, or enterprise contracts.
  • Realtime, dubbing, and transcription products are not directly comparable to static TTS. OpenAI and ElevenLabs publish minute- or token-based rates for live audio products.
  • Speech duration estimates assume English narration at roughly 700-900 characters per finished minute. Actual minutes vary by language, pacing, punctuation, and SSML.

ElevenLabs API
Pricing unit: Per 1K characters for TTS; per minute or hour for other audio products
Best fit: High-quality voice generation, cloning, dubbing, and realtime agents
Published rate: $0.05-$0.10 per 1K TTS characters
  • Flash / Turbo TTS: $0.05 per 1K characters
  • Multilingual v2/v3 TTS: $0.10 per 1K characters
  • Speech Engine agents: $0.08 per minute, burst at $0.16 per minute
  • Dubbing: $0.33 per source minute with watermark, $0.50 without watermark

Speechify API
Pricing unit: Pay-as-you-go per character
Best fit: Readable narration, accessibility, e-learning, and product voice features
Published rate: $10 per 1M characters
  • Public API page advertises $10 per 1M characters
  • Free tier available for testing
  • Enterprise and on-premise deployment available by quote

OpenAI TTS / Realtime audio
Pricing unit: Audio tokens or per minute, depending on the audio product
Best fit: Static TTS, realtime voice agents, speech translation, and streaming transcription
Published rate: Realtime agents use audio tokens; translation starts at $0.034/min
  • GPT-Realtime-2 audio: $32 input / $64 output per 1M audio tokens
  • GPT-Realtime-2 cached audio input: $0.40 per 1M audio tokens
  • GPT-Realtime-Translate: $0.034 per minute
  • GPT-Realtime-Whisper: $0.017 per minute
  • Text side of GPT-Realtime-2: $4 input / $24 output per 1M text tokens

Google Cloud Text-to-Speech
Pricing unit: Per character, with some newer speech generation priced by token
Best fit: Cloud-native apps, large language coverage, and Google Cloud workloads
Published rate: $4-$160 per 1M characters
  • Standard and WaveNet voices: $4 per 1M characters
  • Neural2 voices: $16 per 1M characters
  • Chirp 3 HD voices: $30 per 1M characters
  • Studio voices: $160 per 1M characters
  • Instant custom voice: $60 per 1M characters
  • Gemini TTS models are token-priced, not character-priced

Amazon Polly
Pricing unit: Per 1M characters
Best fit: AWS workloads, low-cost standard voices, and speech marks
Published rate: $4-$100 per 1M characters
  • Standard voices: $4 per 1M characters
  • Neural voices: $16 per 1M characters
  • Generative voices: $30 per 1M characters
  • Long-Form voices: $100 per 1M characters

Cost per 1M characters

Character-based APIs are the easiest to compare directly. For a 1,000,000-character TTS workload, the published list prices are as follows, before free tiers, taxes, discounts, and enterprise commitments. Token-priced realtime audio products are excluded.

Provider          Tier                Cost per 1M characters
Speechify API     Published API rate  $10
ElevenLabs        Flash / Turbo       $50
ElevenLabs        Multilingual v2/v3  $100
Google Cloud TTS  Standard / WaveNet  $4
Google Cloud TTS  Neural2             $16
Google Cloud TTS  Chirp 3 HD          $30
Google Cloud TTS  Studio              $160
Amazon Polly      Standard            $4
Amazon Polly      Neural              $16
Amazon Polly      Generative          $30
Amazon Polly      Long-Form           $100
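
The normalization behind this comparison is plain arithmetic, sketched here in Python. The rates are the list prices quoted in this section. The function names (`per_1k_to_per_1m`, `workload_cost`, `rank_by_cost`) are illustrative and not part of any provider SDK, and the result ignores free tiers, taxes, and discounts.

```python
# List prices in USD per 1M characters, as quoted in this section.
RATES_PER_1M_CHARS = {
    ("Speechify API", "Published API rate"): 10.0,
    ("ElevenLabs", "Flash / Turbo"): 50.0,
    ("ElevenLabs", "Multilingual v2/v3"): 100.0,
    ("Google Cloud TTS", "Standard / WaveNet"): 4.0,
    ("Google Cloud TTS", "Neural2"): 16.0,
    ("Google Cloud TTS", "Chirp 3 HD"): 30.0,
    ("Google Cloud TTS", "Studio"): 160.0,
    ("Amazon Polly", "Standard"): 4.0,
    ("Amazon Polly", "Neural"): 16.0,
    ("Amazon Polly", "Generative"): 30.0,
    ("Amazon Polly", "Long-Form"): 100.0,
}


def per_1k_to_per_1m(rate_per_1k: float) -> float:
    """Convert a per-1K-character rate to a per-1M-character rate."""
    return rate_per_1k * 1000


def workload_cost(characters: int, rate_per_1m: float) -> float:
    """List-price cost in USD for a workload of `characters` input characters."""
    return characters / 1_000_000 * rate_per_1m


def rank_by_cost(characters: int) -> list[tuple[str, str, float]]:
    """Return (provider, tier, cost) rows sorted from cheapest to most expensive."""
    rows = [
        (provider, tier, workload_cost(characters, rate))
        for (provider, tier), rate in RATES_PER_1M_CHARS.items()
    ]
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    # Example workload size (hypothetical): 250,000 characters.
    for provider, tier, cost in rank_by_cost(250_000):
        print(f"{provider:17} {tier:19} ${cost:,.2f}")
```

Substituting your own monthly character count for the example figure gives a first-pass budget per tier.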

Per-minute voice rates

Live voice, translation, transcription, and dubbing products often price by audio minute instead of input characters. These rates are easier to compare by minute and by hour.

Provider    Product                             Per minute  Per hour
OpenAI      GPT-Realtime-Whisper transcription  $0.017      $1.02
OpenAI      GPT-Realtime-Translate              $0.034      $2.04
ElevenLabs  Speech Engine agents                $0.08       $4.80
ElevenLabs  Speech Engine burst                 $0.16       $9.60
ElevenLabs  Dubbing with watermark              $0.33       $19.80
ElevenLabs  Dubbing without watermark           $0.50       $30.00
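
The per-hour figures follow from multiplying each per-minute rate by 60. A minimal Python sketch that computes them to the cent is below. It uses `decimal.Decimal` to avoid floating-point rounding noise, and the helper name `per_hour` is illustrative.

```python
from decimal import Decimal

# Per-minute list prices in USD, as quoted in this section.
PER_MINUTE_RATES = {
    "OpenAI GPT-Realtime-Whisper transcription": Decimal("0.017"),
    "OpenAI GPT-Realtime-Translate": Decimal("0.034"),
    "ElevenLabs Speech Engine agents": Decimal("0.08"),
    "ElevenLabs Speech Engine burst": Decimal("0.16"),
    "ElevenLabs Dubbing with watermark": Decimal("0.33"),
    "ElevenLabs Dubbing without watermark": Decimal("0.50"),
}


def per_hour(rate_per_minute: Decimal) -> Decimal:
    """Convert a per-minute rate to a per-hour rate, rounded to the cent."""
    return (rate_per_minute * 60).quantize(Decimal("0.01"))


if __name__ == "__main__":
    for product, rate in PER_MINUTE_RATES.items():
        print(f"{product}: ${rate}/min = ${per_hour(rate)}/hour")
```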

How to choose

Pick Google Cloud Text-to-Speech or Amazon Polly when the main requirement is cheap, reliable narration at scale. Pick Speechify when you want a simple published API rate and a voice product tuned for readable narration. Pick ElevenLabs when voice quality, voice cloning, dubbing, agent voice, and expressiveness matter more than the absolute lowest character price.

OpenAI's pricing has a different shape. Static TTS is priced alongside generated audio output, while GPT-Realtime-2 is a live multimodal model with audio-token pricing. GPT-Realtime-Translate and GPT-Realtime-Whisper publish per-minute rates. Use OpenAI when the product is a realtime voice interface, live translation layer, or streaming transcription workflow, not just a batch TTS job.
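
Token-priced sessions can be estimated once you have token counts from your own usage logs. The sketch below is a rough Python estimator using the GPT-Realtime-2 list prices quoted on this page. The class and function names are hypothetical, and this page does not say how many audio tokens a minute of speech consumes, so the token counts in the example are placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RealtimeRates:
    """USD per 1M tokens, from the GPT-Realtime-2 rates quoted on this page."""
    audio_in: float = 32.0
    audio_out: float = 64.0
    cached_audio_in: float = 0.40
    text_in: float = 4.0
    text_out: float = 24.0


@dataclass(frozen=True)
class SessionUsage:
    """Token counts for one session; take these from your own usage data."""
    audio_in_tokens: int = 0
    audio_out_tokens: int = 0
    cached_audio_in_tokens: int = 0
    text_in_tokens: int = 0
    text_out_tokens: int = 0


def session_cost(usage: SessionUsage, rates: RealtimeRates = RealtimeRates()) -> float:
    """Sum each token bucket at its own rate and return the cost in USD."""
    total = (
        usage.audio_in_tokens * rates.audio_in
        + usage.audio_out_tokens * rates.audio_out
        + usage.cached_audio_in_tokens * rates.cached_audio_in
        + usage.text_in_tokens * rates.text_in
        + usage.text_out_tokens * rates.text_out
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Placeholder token counts for illustration only.
    example = SessionUsage(audio_in_tokens=20_000, audio_out_tokens=15_000,
                           text_in_tokens=5_000, text_out_tokens=2_000)
    print(f"Estimated session cost: ${session_cost(example):.4f}")
```

Keeping cached and uncached audio input in separate buckets matters because the quoted cached rate is far lower than the uncached input rate.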

For text-token model costs, use the AI API pricing table and token cost calculator. For voice workloads, estimate characters, audio minutes, and concurrency, and check caching rights and whether you need cloning or speech marks.

FAQ

Which TTS API is cheapest?

For basic cloud TTS, Google Cloud Standard/WaveNet and Amazon Polly Standard are both $4 per 1M characters. Speechify lists $10 per 1M characters. ElevenLabs starts higher at $50 per 1M characters for Flash/Turbo, but targets more expressive AI voice output.

Why is ElevenLabs more expensive than Polly or Google Standard voices?

ElevenLabs is optimized for expressive, low-latency AI voice generation, voice cloning, dubbing, and agent voice. Polly and Google Standard are cheaper for straightforward narration at scale.

How do character prices map to minutes of audio?

A rough English narration estimate is 700-900 characters per minute, depending on punctuation, speed, and language. One million characters often lands near 18-24 hours of finished speech.
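
The hours range can be reproduced from the 700-900 characters-per-minute estimate. A minimal Python sketch is below. The function name `narration_hours` is illustrative, and the pacing bounds are this page's rough English-narration assumption, not a measured value.

```python
def narration_hours(
    characters: int,
    chars_per_minute_low: float = 700,
    chars_per_minute_high: float = 900,
) -> tuple[float, float]:
    """Return (shortest, longest) estimated hours of finished speech.

    Faster pacing (more characters per minute) gives the shorter duration.
    """
    shortest = characters / chars_per_minute_high / 60
    longest = characters / chars_per_minute_low / 60
    return shortest, longest


if __name__ == "__main__":
    low, high = narration_hours(1_000_000)
    print(f"1M characters is roughly {low:.1f}-{high:.1f} hours of speech")
    # prints: 1M characters is roughly 18.5-23.8 hours of speech
```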

Should voice agents use per-character TTS or realtime audio pricing?

If the app is turn-based narration, per-character TTS is easier to forecast. If users interrupt, talk over the system, or need live translation/transcription, realtime per-minute or audio-token pricing is the better model.
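
To put the two pricing models on one scale, a per-character rate can be converted into an approximate cost per spoken minute. The Python sketch below assumes the 700-900 characters-per-minute pacing used on this page. The function name is illustrative, and the result covers synthesized speech only: it does not account for listening time, transcription, or language-model costs that a per-minute agent price may bundle.

```python
def cost_per_spoken_minute(rate_per_1m_chars: float, chars_per_minute: float) -> float:
    """Approximate USD cost of one minute of synthesized speech."""
    return rate_per_1m_chars * chars_per_minute / 1_000_000


if __name__ == "__main__":
    # Per-1M-character list prices quoted on this page.
    tiers = {
        "Amazon Polly Standard": 4.0,
        "Speechify API": 10.0,
        "ElevenLabs Flash / Turbo": 50.0,
        "ElevenLabs Multilingual v2/v3": 100.0,
    }
    for name, rate in tiers.items():
        slow = cost_per_spoken_minute(rate, 700)
        fast = cost_per_spoken_minute(rate, 900)
        print(f"{name}: ${slow:.4f}-${fast:.4f} per spoken minute")
```

Comparing these figures against a published per-minute rate gives only a first approximation, since the two products do not cover the same work.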
