New · 14 companies · First observed June 2024

Voice APIs Converge on Per-Minute Billing

Quick answer

Fourteen of fifteen voice and audio AI companies in the corpus bill on media-minutes as a primary unit. The shift from per-character (batch TTS era) to per-minute (agent call era) reflects the dominant use case moving from content generation to real-time conversational AI.

14 / 15 voice API vendors bill on media-minutes

What's happening — and why

What's happening: the standard billing unit for voice AI has converged on media-minutes, meaning audio or video time metered per second, per minute, or per hour, rather than characters or requests. Fourteen of fifteen voice companies in the corpus use it as a primary unit.
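Because vendors quote media time at different granularities, comparing them means converting each quote to one common unit. The sketch below normalizes to a per-minute rate. The vendor names and prices are hypothetical placeholders chosen to show the conversion, not figures from the corpus.

```python
# Seconds covered by one billing unit. These unit names are illustrative
# labels, not any vendor's API vocabulary.
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600}


def per_minute_rate(price: float, unit: str) -> float:
    """Convert a price quoted per second, minute, or hour to a per-minute rate."""
    if unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unsupported unit: {unit!r}")
    return price * 60 / SECONDS_PER_UNIT[unit]


# Hypothetical quotes, used only to demonstrate the normalization.
quotes = [
    ("vendor_a", 0.30, "hour"),
    ("vendor_b", 0.0001, "second"),
    ("vendor_c", 0.006, "minute"),
]

for name, price, unit in quotes:
    print(f"{name}: ${per_minute_rate(price, unit):.4f}/min")
```

Credit-based plans need one extra step first (credits to seconds), which depends on each vendor's published mapping and is left out here.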

Why: the dominant voice AI use case shifted. In the batch TTS era, producers turned text into audio files; the natural unit was the characters being spoken. In the agent call era, voice is deployed in real-time phone calls, voicebots, and conversational interfaces — where wall-clock time, not text volume, is the cost driver. Telephony and call-center buyers already think in minutes; per-minute aligns vendor pricing with buyer mental models.

ElevenLabs is the clearest example: it charges per-minute for Conversational AI (agent calls) and per-character for Studio TTS (batch). The two units coexist for the two use cases.

How it works

[Diagram: Batch TTS era (per character; text → audio file; batch, async; studio TTS; WellSaid, LMNT) shifts to Agent call era (per minute; real-time agent calls; conversational AI; phone bots, voicebots; Bland, Deepgram, ElevenLabs, Cartesia, Rev, Tavus).]
Voice pricing shifted from per-character (batch TTS) to per-minute (agent calls) as real-time use cases dominated.

Evidence over time

14 supporting · 1 counter

[Timeline chart: supporting evidence and counterexamples plotted by date, 2024 to 2026.]

Evidence

Company | Date | What happened
bland-ai | Jun 2024 | Billed entirely on media-minutes; phone-agent model fits per-minute naturally
elevenlabs | May 2026 | Conversational AI (agent calls) price cut to a per-minute rate; retains per-character for Studio TTS. Both units coexist in billing.
cartesia | Feb 2026 | Voice Agents GA at flat per-minute rate; prior API used credits/requests
deepgram | Jan 2025 | Transcription and TTS both per-minute; Nova-2 ASR $0.0043/min, Aura TTS $0.0150/min
tavus | Jan 2025 | Entire model is hybrid access fee + pay-as-you-go video minutes; per-minute is the only consumption unit
speechmatics | Jun 2025 | Per-hour STT, per-character TTS; both units present; moving toward per-minute for real-time
murf-ai | Jun 2026 | Murf API launched with per-character and per-minute lanes; Studio plans cap on minutes
rev-ai | Jan 2025 | Pure usage per-minute; transcription billed in 15-second increments
krisp | Jun 2025 | Call Center product bills on accent-minutes; per-agent seats plus minute consumption
synthesia | May 2026 | Video-minute credits drive all plan tiers; minutes are the primary consumption signal
twelve-labs | Jun 2025 | Video understanding billed per video-minute indexed; minutes are the primary query unit
wellsaid | Jun 2026 | Annual download quotas expressed as minutes per plan tier; per-seat + minutes model
hedra | Dec 2025 | Credits map to video/audio seconds; effectively per-minute billing abstracted through credits
fal-ai | Jun 2025 | Audio/video models billed per second of output; effectively per-minute at scale
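The rev-ai and fal-ai rows show that "per-minute" pricing still differs in granularity: the billing increment decides how much of a partial minute is charged. The sketch below compares increments for the same call. It assumes durations are rounded up to the next increment, which is a common convention but is not confirmed by the evidence rows; `billable_minutes` is a hypothetical helper.

```python
import math


def billable_minutes(duration_seconds: float, increment_seconds: int = 15) -> float:
    """Round a duration UP to the next billing increment and return minutes.

    Rounding up is an assumption of this sketch, not a documented vendor rule.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    if increment_seconds <= 0:
        raise ValueError("increment_seconds must be positive")
    increments = math.ceil(duration_seconds / increment_seconds)
    return increments * increment_seconds / 60


# The same 61-second call under three increment sizes.
for increment in (1, 15, 60):
    minutes = billable_minutes(61, increment)
    print(f"{increment:>2}s increments: {minutes:.4f} billable minutes")
```

For short, high-volume calls the increment can matter as much as the headline rate, so it is worth checking alongside the per-minute price.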

Counterexamples

  • lmnt · — — Charges per character for TTS only — no per-minute lane. Serves batch text-to-speech, not agent calls.
  • wellsaid · — — Per-seat + annual quota model dilutes the pure per-minute signal; enterprise customers are quota-capped, not metered.
  • descript · Jun 2025 — Media hours billed at tier level, not granularly per-minute; subscription model with hour pools

For buyers

Budget real-time voice workloads in minutes, not characters. For batch content generation, characters may still be the more efficient unit (WellSaid, LMNT). For agent calls and real-time voice, per-minute is the standard: model your cost on expected call durations and call volumes, not script length.
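A minimal sketch of that budgeting model follows: monthly spend is call volume times average call duration times the per-minute rate. The workload and the rate are hypothetical inputs; substitute your own forecast and vendor quote.

```python
def monthly_voice_cost(calls_per_month: int, avg_call_minutes: float,
                       rate_per_minute: float) -> float:
    """Expected monthly spend: calls x average duration x per-minute rate."""
    if min(calls_per_month, avg_call_minutes, rate_per_minute) < 0:
        raise ValueError("inputs must be non-negative")
    return calls_per_month * avg_call_minutes * rate_per_minute


# Hypothetical workload and rate, for illustration only.
CALLS_PER_MONTH = 10_000
RATE_PER_MINUTE = 0.09

# Vary average call length to see how sensitive the budget is to duration.
for avg_minutes in (2.0, 4.0, 6.0):
    cost = monthly_voice_cost(CALLS_PER_MONTH, avg_minutes, RATE_PER_MINUTE)
    print(f"{avg_minutes:.0f}-minute calls: ${cost:,.2f}/month")
```

Spend scales linearly with average call duration, so a forecast error in call length moves the budget by the same proportion.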

For vendors

If you are building a voice AI product, per-minute pricing aligns with the call-center and telephony mental model your buyers already use. If you serve both batch TTS and agent use cases, maintain both units (ElevenLabs' model): per-character for Studio, per-minute for Conversational.
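One way to keep both units side by side is to meter each product lane in its own unit and total them separately on the invoice. The sketch below is an assumption-laden illustration, not ElevenLabs' implementation: the lane names, the `UsageEvent` type, and both rates are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical rates, for illustration only.
RATE_PER_1K_CHARACTERS = 0.20  # batch TTS lane
RATE_PER_MINUTE = 0.10         # agent-call lane


@dataclass(frozen=True)
class UsageEvent:
    lane: str              # "batch_tts" or "agent_call" (hypothetical names)
    characters: int = 0    # metered for batch TTS
    seconds: float = 0.0   # metered for agent calls


def charge(event: UsageEvent) -> float:
    """Price one event in the unit that matches its lane."""
    if event.lane == "batch_tts":
        return event.characters / 1000 * RATE_PER_1K_CHARACTERS
    if event.lane == "agent_call":
        return event.seconds / 60 * RATE_PER_MINUTE
    raise ValueError(f"unknown lane: {event.lane!r}")


def invoice(events: list[UsageEvent]) -> dict[str, float]:
    """Total charges per lane so each billing unit stays visible to the buyer."""
    totals = {"batch_tts": 0.0, "agent_call": 0.0}
    for event in events:
        amount = charge(event)  # raises on unknown lanes before touching totals
        totals[event.lane] += amount
    return totals


events = [
    UsageEvent(lane="batch_tts", characters=5_000),
    UsageEvent(lane="agent_call", seconds=300),
]
print(invoice(events))
```

Keeping the lanes separate avoids converting characters into minutes, which would require an assumed speaking rate.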

Outlook — what to watch

As agent voice becomes the dominant voice AI use case, per-minute will likely further displace per-character. The holdout (LMNT, characters-only) is a batch-focused product. Watch for per-second granularity, already visible at fal-ai and in Hedra's credit mapping, spreading to cost-sensitive high-volume deployments.

Bottom line

Voice AI billing has converged on media-minutes. Fourteen of fifteen corpus companies use it as a primary unit, driven by the shift from batch TTS to real-time agent calls.

FAQ

How do voice AI APIs charge for usage?

Almost universally per-minute or per-hour of audio. Fourteen of fifteen voice companies in the corpus use media-minutes. LMNT is the exception, charging per character for batch TTS.

Why per-minute instead of per-character?

Real-time agent calls — the dominant voice use case — are bounded by wall-clock time, not text volume. Per-minute aligns with telephony buyer mental models and the actual cost driver.

Does ElevenLabs charge per minute or per character?

Both. Conversational AI (agent calls) is priced per minute; Studio TTS (batch text-to-speech) is priced per character. The two billing units coexist for the two use cases.
