Media-Minute Pricing: Examples & Companies

17 companies in the corpus
Definition

Media-Minute Pricing is a billing unit where customers are charged per minute of audio or video processed — used by speech, voice, and video AI vendors.

Also known as: Per-Minute Audio/Video, Audio Minute Billing

What is it

A media minute is the duration unit of AI that hears, speaks, or generates moving pictures. Where text models meter tokens and infrastructure meters GPU-hours, speech and video products meter the length of the media — a minute of audio transcribed, a minute of speech synthesized, a minute of conversational video rendered. The reason is structural: audio and video have no natural token boundary the buyer can count in advance, but they always have a runtime. A 12-minute support call, a 30-second ad, a one-hour podcast — each carries an obvious, estimable duration that maps closely to the compute required to process it.

The unit is shared across products that look very different on the surface. Deepgram, Speechmatics, and Rev AI all bill speech-to-text by the minute (or hour) of input audio. Bland AI and Parloa bill voice agents by the connected minute of a phone call. On the video side, Twelve Labs bills video understanding by the minute of source footage, while Tavus bills real-time conversational video by the minute of interaction. The same word — “minute” — covers transcription, synthesis, agents, and generation.

What makes the unit interesting is the spread. The cheapest minute in this corpus, machine transcription on Rev AI’s Reverb Turbo, costs roughly $0.0017; the most expensive, human transcription on the same platform, costs $1.99, more than a thousand times higher for the same sixty seconds of audio. Between them sit voice-agent minutes (Bland AI at $0.11–$0.14) and conversational-video minutes (Tavus at $0.37). The minute is one unit; the price is a function of what happens during that minute. See choosing the right usage metric for why duration is the natural fit here.


How it works

The core formula is simple: media cost equals the per-minute rate for the chosen model, multiplied by the minutes of audio or video processed (usually metered to the second and rounded up, often with a short minimum). The complexity lives in the dimensions wrapped around that minute — which task, which model, real-time versus batch, accuracy tier, and whether the vendor exposes the minute directly or hides it behind credits.

| Dimension | What it controls | Example from this corpus |
| --- | --- | --- |
| Task type | Transcription, synthesis, agents, or video each get their own meter | Deepgram: STT per minute, TTS per 1k characters, Voice Agent per minute |
| Model / accuracy tier | Faster or more accurate models cost more per minute | Speechmatics: standard $0.24/hr vs enhanced $0.40/hr (batch) |
| Real-time vs batch | Streaming carries a premium over pre-recorded | Speechmatics: real-time enhanced $0.56/hr vs batch enhanced $0.40/hr |
| Human vs machine | Human-in-the-loop is the expensive variant of the same API | Rev AI: Whisper $0.005/min vs human transcription $1.99/min |
| Minute vs credit packaging | Whether the buyer sees minutes or a converted credit | Synthesia: 1,200 credits = 10 video minutes/month |
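The core formula (per-minute rate times metered minutes) can be sketched in a few lines. This is a minimal sketch, assuming duration is rounded up to the next whole second and then raised to a minimum billable length; `media_cost`, `minimum_seconds`, and the 15-second minimum are hypothetical names and values chosen for illustration, not any vendor's documented rules. The $0.11/min rate is the Bland AI Scale figure quoted in this article.

```python
import math
from decimal import Decimal


def media_cost(duration_seconds: float, rate_per_minute: str,
               minimum_seconds: int = 0) -> Decimal:
    """Cost of one job: per-minute rate times metered minutes.

    Assumptions (illustrative only): the duration is rounded up to the
    next whole second, then raised to `minimum_seconds` if it is shorter.
    """
    billable_seconds = max(math.ceil(duration_seconds), minimum_seconds)
    billable_minutes = Decimal(billable_seconds) / Decimal(60)
    return billable_minutes * Decimal(rate_per_minute)


# A 12-minute support call at $0.11/min.
call_cost = media_cost(12 * 60, "0.11")

# A 4.2-second clip under a hypothetical 15-second minimum.
short_cost = media_cost(4.2, "0.11", minimum_seconds=15)

print(f"12-minute call: ${call_cost}")
print(f"4.2-second clip with 15 s minimum: ${short_cost}")
```

Passing the rate as a string keeps the arithmetic in `Decimal`, which avoids binary floating-point drift when many small per-second charges are summed.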

The display unit is frequently a presentation choice rather than the meter. Speechmatics and Twelve Labs both ship a toggle that re-expresses the identical rate as $/minute or $/hour, and Rev AI quotes its own Reverb models per hour but Whisper and human transcription per minute on the same card. Higher up the stack, Synthesia, Hedra, and ElevenLabs sell a credit pool that converts to minutes — the minute is the real value metric, but the buyer transacts in credits.
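Because the display unit is only a presentation choice, rates need to be normalized before they can be compared. The sketch below is illustrative, not any vendor's tooling: the helper names are hypothetical, and the inputs are the per-hour and credit figures quoted in this article.

```python
MINUTES_PER_HOUR = 60


def per_hour_to_per_minute(rate_per_hour: float) -> float:
    """Re-express a $/hour display rate as $/minute (same meter, new label)."""
    return rate_per_hour / MINUTES_PER_HOUR


def credits_per_minute(plan_credits: int, plan_minutes: int) -> float:
    """Implied credits consumed per media minute on a credit-packaged plan."""
    return plan_credits / plan_minutes


reverb_turbo = per_hour_to_per_minute(0.10)        # Rev AI Reverb Turbo, $0.10/hr
speechmatics_batch = per_hour_to_per_minute(0.40)  # Speechmatics enhanced batch
synthesia_ratio = credits_per_minute(1200, 10)     # 1,200 credits = 10 minutes

print(f"Reverb Turbo: ${reverb_turbo:.4f}/min")
print(f"Speechmatics enhanced batch: ${speechmatics_batch:.4f}/min")
print(f"Synthesia: {synthesia_ratio:.0f} credits per video minute")
```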

Unit math: Transcribing a 60-minute podcast on Deepgram’s Nova-3 streaming ($0.0048/min) costs 60 × $0.0048 = $0.29. The same hour on Speechmatics enhanced batch ($0.40/hr) is $0.40. Run a 1,000-minute/month outbound voice-agent campaign on Bland AI’s Scale plan ($0.11/min) and the usage line is 1,000 × $0.11 = $110 on top of the $499 subscription. A 30-minute month of conversational video on Tavus ($0.37/min CVI) is 30 × $0.37 = $11.10.

Because the meter tracks duration, the same lever — commitment and tier — discounts it across vendors. Speechmatics auto-discounts usage above 500 hours/month by 20%; Deepgram’s Growth tier prepays annual credits for up to ~20% off the per-minute rate; Tavus lowers the CVI rate from $0.37/min to $0.32/min as you move up tiers. This per-minute discounting is the substance of the voice-API minute-billing trend — see also the introduction to usage-based pricing for the broader frame.
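A volume discount of this kind can be sketched as follows. The article does not say whether the 20% applies to all usage or only to the hours above the threshold; this sketch assumes the marginal reading (only hours beyond 500 are discounted), and the function and parameter names are hypothetical.

```python
def discounted_monthly_cost(hours: float, rate_per_hour: float,
                            threshold_hours: float = 500.0,
                            discount: float = 0.20) -> float:
    """Monthly cost when hours above a threshold are discounted.

    Assumption: the discount is marginal, so the first `threshold_hours`
    are always billed at the full rate.
    """
    full_price_hours = min(hours, threshold_hours)
    discounted_hours = max(hours - threshold_hours, 0.0)
    return (full_price_hours * rate_per_hour
            + discounted_hours * rate_per_hour * (1.0 - discount))


below = discounted_monthly_cost(400, 0.40)  # under the threshold, no discount
above = discounted_monthly_cost(700, 0.40)  # 200 hours billed at 20% off

print(f"400 hours: ${below:.2f}")
print(f"700 hours: ${above:.2f}")
```

If a vendor instead discounts the whole bill once the threshold is crossed, the two branches collapse into a single multiplication, so it is worth confirming which reading applies before forecasting.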


Companies using this

Seventeen companies in the corpus meter media minutes. They cluster into four groups: transcription and speech-to-text APIs (Deepgram, Speechmatics, Rev AI), voice agents and contact-center AI (Bland AI, Parloa, Kustomer, Krisp), text-to-speech and dubbing (ElevenLabs, Murf AI, WellSaid Labs), and AI video generation and understanding (Synthesia, Tavus, Hedra, Creatify, Twelve Labs, Descript, Fal).


Patterns observed

  • The minute is one unit, but the price encodes the work inside it. Rev AI is the clearest demonstration: machine Whisper transcription at $0.005/min and human transcription at $1.99/min ride the same pay-as-you-go API, letting a buyer trade cost against accuracy per file. Speechmatics splits “standard” and “enhanced” accuracy into separate per-hour SKUs ($0.24 vs $0.40/hr batch), and Deepgram’s Voice Agent API runs from $0.075/min Standard to $0.163/min Advanced. The duration is constant; the per-minute rate is where the product differentiation lives.

  • Display unit and meter are often different things. Speechmatics and Twelve Labs both ship a per-minute / per-hour toggle over an identical underlying rate, and Rev AI mixes per-hour (Reverb), per-minute (Whisper), and per-10-words (Insights) meters on a single card. The “minute” a buyer sees is frequently a readability convention layered over per-second metering — Bland AI bills every connected call second and only quotes a per-minute headline.

  • Video vendors hide the minute behind credits more often than audio vendors. Synthesia sells credits that convert to video minutes (1,200 credits = 10 minutes/month), Hedra and Creatify bundle credit pools into subscription tiers, and ElevenLabs uses credits on its creative ladder but minutes on its agents ladder. Pure transcription APIs — Deepgram, Speechmatics, Rev AI — tend to quote the raw per-minute rate without a credit layer. The further from a developer API and the closer to a creative tool, the more likely the minute is wrapped.

  • Real-time and streaming carry a premium over batch. Speechmatics prices real-time enhanced accuracy at $0.56/hr versus $0.40/hr for batch enhanced, and Deepgram flags its streaming STT rates as distinct from pre-recorded. Live conversational products — Bland AI’s phone agents, Tavus’s real-time CVI — sit at the high end of the per-minute range precisely because the minute must be processed as it happens.

  • Free minutes are the standard on-ramp, with one notable holdout. Speechmatics gives 2,400 free STT minutes/month, Rev AI starts every account with credits worth 5 hours of Reverb ASR, Deepgram includes $200 free credit, and Murf AI refreshes $10 of API credit monthly. Bland AI is the exception — no free minutes on any plan, every connected second billed from the start, betting that voice-agent buyers are past the trial stage.
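The effect of a free-minute allowance on a monthly bill can be sketched as below. This is an illustration under stated assumptions: the allowance is deducted before any usage is billed, and the remainder is charged at a single rate. Pairing the 2,400 free minutes with the $0.24/hr standard batch rate is an assumption made for the example; the article does not state which rate applies after the allowance. All function names are hypothetical.

```python
def billable_minutes(used_minutes: float, free_minutes: float) -> float:
    """Minutes left to bill after the free allowance is used up."""
    return max(used_minutes - free_minutes, 0.0)


def monthly_bill(used_minutes: float, free_minutes: float,
                 rate_per_hour: float) -> float:
    """Bill for the month at a per-hour rate, after the free allowance."""
    return billable_minutes(used_minutes, free_minutes) / 60 * rate_per_hour


light = monthly_bill(1_500, 2_400, 0.24)   # fully inside the allowance
heavy = monthly_bill(10_000, 2_400, 0.24)  # 7,600 minutes remain billable

print(f"1,500 minutes: ${light:.2f}")
print(f"10,000 minutes: ${heavy:.2f}")
```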


Counterexamples & variants

The most common variant is the vendor that generates speech but bills it by the character, not the minute. Deepgram’s Aura TTS is $0.030/1k characters, Speechmatics TTS is $0.011/1k characters, and Murf AI’s API is $0.01–$0.03 per 1,000 characters. These companies meter media minutes for transcription or agents but switch to per-character billing for synthesis, because the input to synthesis is text of known length, where the input to transcription is audio of unknown word count. The same vendor runs two units side by side, and only one of them is the minute.
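Per-character synthesis pricing can be computed directly from the input text, which is why it needs no duration estimate. The sketch below is illustrative: it counts characters with `len(text)`, which is an assumption (vendors may count whitespace or markup differently), and the speaking rate used to derive a per-minute equivalent is a placeholder value, not a figure from this article.

```python
def tts_cost(text: str, rate_per_1k_chars: float) -> float:
    """Synthesis cost from text length at a per-1,000-character rate."""
    return len(text) / 1000 * rate_per_1k_chars


def implied_rate_per_minute(rate_per_1k_chars: float,
                            chars_per_minute: float) -> float:
    """Per-minute equivalent of a per-character rate, given a speaking rate."""
    return rate_per_1k_chars * chars_per_minute / 1000


script = "x" * 1500  # stand-in for a 1,500-character script
aura_cost = tts_cost(script, 0.030)  # Deepgram Aura, $0.030 per 1k characters

# Placeholder speaking rate; measure your own content before relying on it.
ASSUMED_CHARS_PER_MINUTE = 900
aura_per_minute = implied_rate_per_minute(0.030, ASSUMED_CHARS_PER_MINUTE)

print(f"1,500 characters on Aura: ${aura_cost:.3f}")
print(f"Implied rate at {ASSUMED_CHARS_PER_MINUTE} chars/min: ${aura_per_minute:.3f}/min")
```

The per-minute equivalent is only as good as the speaking-rate assumption, which is one reason the two meters are hard to compare on a single quote.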

Parloa is the variant that suggests the minute can be the meter without ever appearing on a price list. Parloa publishes no public pricing (its /pricing path 404s) and sells contact-center voice automation as a sales-led annual contract with an indicative floor around $300,000/year per third-party reviews. The connected minute is almost certainly the underlying cost driver, but the buyer never sees a per-minute rate; the price is set in enterprise negotiation. The minute is real and load-bearing, but it is invisible at the point of sale.

WellSaid Labs sits at the opposite extreme: it produces per-minute media (AI voiceover) but bills almost entirely by the seat — Creative at $50/user/month, Business at $160/user/month — rather than by minutes consumed. For a content team that generates voiceovers all day, a flat seat removes the per-minute anxiety entirely. Descript and Krisp lean the same way, leading with per-seat subscriptions and treating minutes as an allowance inside the plan rather than a line item. In these cases media minutes exist in the taxonomy but are not the unit the buyer transacts in.

Finally, Kustomer and Fal show the minute as a secondary meter. Kustomer is a seat-priced CRM whose Voice channel is a pay-as-you-go add-on from $0.02/minute — the minute rides alongside seats and per-resolution AI charges, not as the headline. Fal is a generative-media GPU platform that bills most models per output (per image, per video) but exposes per-second video rates ($0.05/s for Wan 2.5, $0.4/s for Veo 3) that resolve to a media-minute unit. The minute appears, but as one meter among several rather than the spine of the pricing.
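Per-second video rates resolve to a media-minute unit with a single multiplication. The sketch below uses the two Fal rates quoted above; the helper names and the 8-second clip length are hypothetical, chosen only to illustrate the conversion.

```python
from decimal import Decimal


def per_second_to_per_minute(rate_per_second: str) -> Decimal:
    """Re-express a $/second rate as $/minute."""
    return Decimal(rate_per_second) * 60


def clip_cost(seconds: int, rate_per_second: str) -> Decimal:
    """Cost of a clip of the given length at a $/second rate."""
    return Decimal(seconds) * Decimal(rate_per_second)


wan_per_minute = per_second_to_per_minute("0.05")  # Wan 2.5 at $0.05/s
veo_per_minute = per_second_to_per_minute("0.4")   # Veo 3 at $0.4/s
veo_clip = clip_cost(8, "0.4")                     # hypothetical 8-second clip

print(f"Wan 2.5: ${wan_per_minute}/min")
print(f"Veo 3: ${veo_per_minute}/min")
print(f"8-second Veo 3 clip: ${veo_clip}")
```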


What this means for buyers vs vendors

For buyers

Estimate your monthly minutes before you compare rates: your bill is dominated by volume, not by the headline number. A team transcribing 10,000 minutes/month sees a real difference between Deepgram Nova-3 at $0.0048/min ($48) and a $0.40/hr batch alternative (about $67), but a team doing 200 minutes/month will barely feel it (roughly $0.96 versus $1.33). Match the meter to the task: transcription is metered per minute of input audio and synthesis is usually metered per character of input text, so a “voice AI” quote that mixes both needs to be split before you can compare it. Check whether the minute is real or rounded: Bland AI bills per connected second under a per-minute headline, which behaves very differently from a vendor that rounds every short job up to a full minute. Watch for the credit layer: when Synthesia or Hedra sells you credits, convert them back to minutes (Synthesia’s Basic plan is 10 minutes for 1,200 credits) so you are comparing minutes to minutes. And if you generate media all day, price the seat-based variant, since WellSaid Labs and Descript may beat any per-minute meter for high-volume creative teams. See choosing the right usage metric and the introduction to usage-based pricing for the framing.
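The volume comparison above can be reproduced with a short script. This is a sketch for a buyer's own estimate, assuming flat rates with no free allowance, minimums, or volume discounts; `compare_at_volume` is a hypothetical helper, and the two rates are the ones quoted in this section.

```python
def compare_at_volume(minutes: float, rate_per_minute: float,
                      rate_per_hour: float) -> tuple[float, float]:
    """Monthly cost under a $/minute rate and under a $/hour rate."""
    per_minute_cost = minutes * rate_per_minute
    per_hour_cost = minutes / 60 * rate_per_hour
    return per_minute_cost, per_hour_cost


for monthly_minutes in (10_000, 200):
    nova, batch = compare_at_volume(monthly_minutes, 0.0048, 0.40)
    print(f"{monthly_minutes:>6} min/month: "
          f"${nova:.2f} at $0.0048/min vs ${batch:.2f} at $0.40/hr "
          f"(gap ${abs(batch - nova):.2f})")
```

The gap scales linearly with volume, which is why the same rate difference matters at 10,000 minutes and is negligible at 200.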

For vendors

The media minute is the most intuitive meter you can offer a speech or video buyer — they already think in call length and video duration — but it is also the most directly comparable, so your per-minute rate sits next to every competitor’s. Differentiate inside the minute rather than on it: split accuracy or speed tiers the way Speechmatics separates standard from enhanced, or stack a model ladder like Deepgram’s $0.075–$0.163/min Voice Agent range, so buyers self-select into the rate that matches their need. Decide deliberately whether to expose the minute or wrap it: a developer API wins on a transparent per-minute card (Rev AI), while a creative tool can escape rate-card comparison by selling credits that convert to minutes (Synthesia, ElevenLabs). Use a free-minute allowance as the on-ramp — Speechmatics’s 2,400 free minutes/month is a low-friction trial — unless your buyer is past the experimentation stage, as Bland AI bets. Whatever you choose, you need per-second attribution of media duration to a customer and a job, which is a heavier metering pipeline than counting requests; see tracking and metering usage events and billing cycles and invoicing.
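Per-second attribution of media duration to a customer and a job might look like the sketch below. It is a minimal in-memory illustration, not a production metering pipeline: `UsageEvent`, its fields, the customer and job identifiers, and the single flat rate are all hypothetical, and a real system would also need idempotency, late-arriving events, and durable storage.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class UsageEvent:
    customer_id: str
    job_id: str
    seconds: int  # metered media duration reported by this event


def seconds_by_customer_and_job(
        events: list[UsageEvent]) -> dict[tuple[str, str], int]:
    """Sum metered seconds for each (customer, job) pair."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    for event in events:
        totals[(event.customer_id, event.job_id)] += event.seconds
    return dict(totals)


def invoice_totals(events: list[UsageEvent],
                   rate_per_minute: str) -> dict[str, Decimal]:
    """Roll job-level seconds up to one charge per customer."""
    per_customer: dict[str, int] = defaultdict(int)
    for (customer_id, _job_id), seconds in seconds_by_customer_and_job(events).items():
        per_customer[customer_id] += seconds
    rate = Decimal(rate_per_minute)
    return {c: Decimal(s) / 60 * rate for c, s in per_customer.items()}


events = [
    UsageEvent("acme", "call-1", 90),
    UsageEvent("acme", "call-1", 30),
    UsageEvent("acme", "call-2", 60),
    UsageEvent("globex", "call-9", 45),
]
job_totals = seconds_by_customer_and_job(events)
invoices = invoice_totals(events, "0.11")

print(job_totals)
print(invoices)
```

Keeping the job-level totals separate from the invoice roll-up lets the same events support both a per-job usage report and the customer's bill.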


| Company | Product | Pricing model | Billing units | Free tier | Verified |
| --- | --- | --- | --- | --- | --- |
| Bland AI | AI phone call automation platform: inbound and outbound voice agents at scale | hybrid, pure-usage, subscription | api-calls, credits, media-minutes | Yes | 2026-05-29 |
| Creatify | AI ad-creative platform: turns a product URL into video and image ads | hybrid, freemium | credits, seats, media-minutes | Yes | 2026-06-08 |
| Deepgram | Usage-based speech-to-text, text-to-speech, and voice agent APIs | pure-usage, freemium | media-minutes, tokens, credits, +1 | Yes | 2026-05-31 |
| Descript | AI-powered audio and video editing | hybrid, freemium | seats, credits, media-minutes | Yes | 2026-05-31 |
| ElevenLabs | Voice AI platform across ElevenCreative, ElevenAgents, and ElevenAPI | subscription, pure-usage, hybrid | characters, credits, media-minutes, +1 | Yes | 2026-05-28 |
| Fal | Generative-media inference platform: serverless per-output model APIs plus dedicated GPU compute | pure-usage | gpu-hours, requests, media-minutes | No | 2026-06-01 |
| Hedra | AI video, avatar, image, and audio generation platform (Hedra Studio + API) | subscription, freemium | credits, media-minutes, characters, +1 | Yes | 2026-06-04 |
| Krisp | AI noise-cancellation, meeting transcription/notes, call-center voice AI, and a developer Voice AI SDK | seat-based | seats, storage-gb, media-minutes | Yes | 2026-06-04 |
| Kustomer | AI-first CRM and customer-service platform unifying omnichannel support, automation, and AI agents | hybrid, seat-based, outcome-based | seats, resolutions, media-minutes, +1 | No | 2026-06-07 |
| Murf AI | AI voice / text-to-speech platform (Murf Studio app + Murf API) | subscription, pure-usage, freemium | media-minutes, seats, credits | Yes | 2026-06-01 |
| Parloa | Enterprise AI Agent Management Platform (AMP) for contact-center voice and chat automation | pure-usage | media-minutes, resolutions | No | 2026-06-07 |
| Rev AI | Pay-as-you-go speech-to-text, transcription, and audio-intelligence APIs | pure-usage, freemium | media-minutes, credits, api-calls | Yes | 2026-06-04 |
| Speechmatics | Speech-to-text and text-to-speech APIs with per-hour usage pricing | pure-usage, freemium | media-minutes, characters | Yes | 2026-06-04 |
| Synthesia | Enterprise AI video generation | subscription, freemium | credits, media-minutes, seats | Yes | 2026-05-31 |
| Tavus | Conversational Video Interface (CVI) API for real-time AI humans / avatars, plus PALs consumer AI companions | hybrid, freemium | media-minutes | Yes | 2026-06-01 |
| Twelve Labs | Video understanding foundation models (Marengo for search/embeddings, Pegasus for analysis) delivered as a usage-metered API | pure-usage, freemium, commitment | media-minutes, tokens, requests | Yes | 2026-06-02 |
| WellSaid Labs | AI text-to-speech voiceover studio with 100+ voices for content teams | seat-based, freemium | seats, media-minutes | Yes | 2026-06-04 |

FAQ

What is media-minute pricing?

Media-minute pricing is a billing unit where customers are charged per minute of audio or video processed. It is the native meter for speech-to-text, text-to-speech, voice agents, and AI video, because the duration of the media maps directly to the compute cost of generating or transcribing it.

How much does it cost to transcribe a minute of audio?

Machine transcription is cheap and varies by model. In this corpus Rev AI's Reverb Turbo is $0.10/hr (about $0.0017/min) and Deepgram's Nova-3 streaming is $0.0048/min, while Speechmatics charges $0.24/hr for standard accuracy. Human transcription is far more expensive — Rev AI lists it at $1.99/min through the same API.

Why do speech and video vendors bill per minute instead of per token?

Audio and video have no natural token boundary, but they do have a duration. A minute of speech is a stable, intuitive unit that buyers can estimate from call logs or video length, and it tracks the underlying compute closely. Vendors like Twelve Labs and Tavus meter video by the minute for the same reason transcription vendors meter audio by the minute.

What is the difference between per-minute and per-character pricing for voice AI?

Speech-to-text (transcription) is metered by the minute of input audio, because you cannot know the word count in advance. Text-to-speech (synthesis) is usually metered per character of input text — Deepgram's Aura is $0.030/1k characters and Speechmatics TTS is $0.011/1k characters — because the text length is known and predicts the output. Many vendors run both meters side by side.

Do per-minute vendors offer free minutes?

Most do. Speechmatics gives every account 2,400 free STT minutes per month, Rev AI starts with credits worth 5 hours of Reverb ASR, and Deepgram includes $200 in free credit. Bland AI is a notable exception — it bills every connected call second from the first minute with no free allowance.

Which companies use media-minute pricing?

In this corpus 17 companies meter media minutes, including transcription APIs (Deepgram, Speechmatics, Rev AI), voice agents (Bland AI, Parloa), text-to-speech and dubbing (ElevenLabs, Murf AI, WellSaid Labs), AI video (Synthesia, Tavus, Hedra, Creatify, Twelve Labs), and platforms that fold minutes into a broader meter (Descript, Krisp, Fal, Kustomer).

Trivia

  • The same minute of audio spans just over three orders of magnitude in this corpus: Rev AI transcribes a minute of speech for about $0.0017 (Reverb Turbo at $0.10/hr) while its own human transcription costs $1.99/min, roughly 1,200x as much. A conversational video minute on Tavus runs $0.37/min, roughly 200x the cheapest machine transcription.

  • Several "per-minute" vendors do not actually publish a per-minute meter. Speechmatics and Twelve Labs both expose a per-minute / per-hour toggle that re-expresses the identical rate, while Synthesia and Hedra sell credits that silently convert to minutes — Synthesia's 1,200-credit Basic plan equals exactly 10 minutes of video per month.

  • Bland AI's per-minute rate is all-inclusive — LLM inference, speech-to-text, text-to-speech, and telephony are bundled into one $0.11-$0.14/min number — whereas Deepgram's Voice Agent API stacks the meter the opposite way, from $0.075/min Standard to $0.163/min Advanced with cheaper bring-your-own-LLM and bring-your-own-TTS variants in between.
