What is it
Media-Minute Pricing is a billing unit where customers are charged per minute of audio or video processed — used by speech, voice, and video AI vendors.
A media minute is the duration unit of AI that hears, speaks, or generates moving pictures. Where text models meter tokens and infrastructure meters GPU-hours, speech and video products meter the length of the media — a minute of audio transcribed, a minute of speech synthesized, a minute of conversational video rendered. The reason is structural: audio and video have no natural token boundary the buyer can count in advance, but they always have a runtime. A 12-minute support call, a 30-second ad, a one-hour podcast — each carries an obvious, estimable duration that maps closely to the compute required to process it.
The unit is shared across products that look very different on the surface. Deepgram, Speechmatics, and Rev AI all bill speech-to-text by the minute (or hour) of input audio. Bland AI and Parloa bill voice agents by the connected minute of a phone call. On the video side, Twelve Labs bills video understanding by the minute of source footage, while Tavus bills real-time conversational video by the minute of interaction. The same word — “minute” — covers transcription, synthesis, agents, and generation.
What makes the unit interesting is the spread. Among the per-minute rates quoted in this corpus, the cheapest, machine transcription on Rev AI’s Reverb Turbo, costs roughly $0.0017; the most expensive, human transcription on the same platform, costs $1.99, more than a thousand times higher for the same sixty seconds of audio. Between them sit voice-agent minutes (Bland AI at $0.11–$0.14) and conversational-video minutes (Tavus at $0.37). The minute is one unit; the price is a function of what happens during that minute. See choosing the right usage metric for why duration is the natural fit here.
How it works
The core formula is simple: media cost equals the per-minute rate for the chosen model, multiplied by the minutes of audio or video processed (usually metered to the second and rounded up, often with a short minimum). The complexity lives in the dimensions wrapped around that minute — which task, which model, real-time versus batch, accuracy tier, and whether the vendor exposes the minute directly or hides it behind credits.
| Dimension | What it controls | Example from this corpus |
|---|---|---|
| Task type | Transcription, synthesis, agents, or video each get their own meter | Deepgram: STT per minute, TTS per 1k characters, Voice Agent per minute |
| Model / accuracy tier | Faster or more accurate models cost more per minute | Speechmatics: standard $0.24/hr vs enhanced $0.40/hr (batch) |
| Real-time vs batch | Streaming carries a premium over pre-recorded | Speechmatics: real-time enhanced $0.56/hr vs batch enhanced $0.40/hr |
| Human vs machine | Human-in-the-loop is the expensive variant of the same API | Rev AI: Whisper $0.005/min vs human transcription $1.99/min |
| Minute vs credit packaging | Whether the buyer sees minutes or a converted credit | Synthesia: 1,200 credits = 10 video minutes/month |
The display unit is frequently a presentation choice rather than the meter. Speechmatics and Twelve Labs both ship a toggle that re-expresses the identical rate as $/minute or $/hour, and Rev AI quotes its own Reverb models per hour but Whisper and human transcription per minute on the same card. Higher up the stack, Synthesia, Hedra, and ElevenLabs sell a credit pool that converts to minutes — the minute is the real value metric, but the buyer transacts in credits.
Unit math: Transcribing a 60-minute podcast on Deepgram’s Nova-3 streaming ($0.0048/min) costs 60 × $0.0048 = $0.288, or about $0.29. The same hour on Speechmatics enhanced batch ($0.40/hr) is $0.40. Run a 1,000-minute/month outbound voice-agent campaign on Bland AI’s Scale plan ($0.11/min) and the usage line is 1,000 × $0.11 = $110 on top of the $499 subscription. A month with 30 minutes of conversational video on Tavus ($0.37/min CVI) is 30 × $0.37 = $11.10.
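The unit math above can be reproduced in a few lines. This is a minimal sketch, not any vendor's billing code: the function names `media_cost` and `to_cents` are made up for illustration, and the rates are the ones quoted in this section.

```python
from decimal import Decimal, ROUND_HALF_UP

def media_cost(minutes, rate_per_minute):
    """Usage cost in dollars: minutes processed times the per-minute rate."""
    return Decimal(str(minutes)) * Decimal(str(rate_per_minute))

def to_cents(amount):
    """Round a dollar amount to whole cents for display."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Rates as quoted in the text; names are illustrative only.
podcast = media_cost(60, "0.0048")    # Deepgram Nova-3 streaming, 60-minute podcast
campaign = media_cost(1000, "0.11")   # Bland AI Scale plan, usage line only
video = media_cost(30, "0.37")        # Tavus CVI, 30 minutes in a month

print(to_cents(podcast), to_cents(campaign), to_cents(video))
```

Decimal arithmetic is used so that sub-cent rates such as $0.0048 do not pick up floating-point noise before rounding.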
Because the meter tracks duration, the same levers, commitment and volume tier, discount it across vendors. Speechmatics auto-discounts usage above 500 hours/month by 20%; Deepgram’s Growth tier offers up to ~20% off the per-minute rate in exchange for prepaid annual credits; Tavus lowers the CVI rate from $0.37/min to $0.32/min as you move up tiers. This per-minute discounting is the substance of the voice-API minute-billing trend; see also the introduction to usage-based pricing for the broader frame.
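A volume discount like the Speechmatics one can be sketched as a two-band calculation. The sketch assumes the 20% applies only to hours above the 500-hour threshold (a marginal discount); the source does not say whether the vendor applies it that way, and the function name `tiered_cost` is hypothetical.

```python
from decimal import Decimal

def tiered_cost(hours, rate_per_hour,
                threshold_hours=Decimal("500"), discount=Decimal("0.20")):
    """Cost with a marginal discount on usage above the threshold.

    Assumption: only the hours beyond `threshold_hours` are discounted.
    """
    hours = Decimal(str(hours))
    rate = Decimal(str(rate_per_hour))
    base_hours = min(hours, threshold_hours)
    extra_hours = max(hours - threshold_hours, Decimal("0"))
    return base_hours * rate + extra_hours * rate * (Decimal("1") - discount)

# 800 hours at the $0.40/hr enhanced batch rate quoted above:
# 500 h at full rate plus 300 h at 20% off.
print(tiered_cost(800, "0.40"))
```

If a vendor instead discounts the whole bill once the threshold is crossed, the cost would jump down at 500 hours rather than bend, so the band logic is the part to check against the actual terms.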
Companies using this
Seventeen companies in the corpus meter media minutes. They cluster into four groups: transcription and speech-to-text APIs (Deepgram, Speechmatics, Rev AI), voice agents and contact-center AI (Bland AI, Parloa, Kustomer, Krisp), text-to-speech and dubbing (ElevenLabs, Murf AI, WellSaid Labs), and AI video generation and understanding (Synthesia, Tavus, Hedra, Creatify, Twelve Labs, Descript, Fal).
Patterns observed
- The minute is one unit, but the price encodes the work inside it. Rev AI is the clearest demonstration: machine Whisper transcription at $0.005/min and human transcription at $1.99/min ride the same pay-as-you-go API, letting a buyer trade cost against accuracy per file. Speechmatics splits “standard” and “enhanced” accuracy into separate per-hour SKUs ($0.24 vs $0.40/hr batch), and Deepgram’s Voice Agent API runs from $0.075/min Standard to $0.163/min Advanced. The duration is constant; the per-minute rate is where the product differentiation lives.
- Display unit and meter are often different things. Speechmatics and Twelve Labs both ship a per-minute / per-hour toggle over an identical underlying rate, and Rev AI mixes per-hour (Reverb), per-minute (Whisper), and per-10-words (Insights) meters on a single card. The “minute” a buyer sees is frequently a readability convention layered over per-second metering: Bland AI bills every connected call second and only quotes a per-minute headline.
- Video vendors hide the minute behind credits more often than audio vendors. Synthesia sells credits that convert to video minutes (1,200 credits = 10 minutes/month), Hedra and Creatify bundle credit pools into subscription tiers, and ElevenLabs uses credits on its creative ladder but minutes on its agents ladder. Pure transcription APIs (Deepgram, Speechmatics, Rev AI) tend to quote the raw per-minute rate without a credit layer. The further from a developer API and the closer to a creative tool, the more likely the minute is wrapped.
- Real-time and streaming carry a premium over batch. Speechmatics prices real-time enhanced accuracy at $0.56/hr versus $0.40/hr for batch enhanced, and Deepgram flags its streaming STT rates as distinct from pre-recorded. Live conversational products (Bland AI’s phone agents, Tavus’s real-time CVI) sit at the high end of the per-minute range precisely because the minute must be processed as it happens.
- Free minutes are the standard on-ramp, with one notable holdout. Speechmatics gives 2,400 free STT minutes/month, Rev AI starts every account with credits worth 5 hours of Reverb ASR, Deepgram includes $200 free credit, and Murf AI refreshes $10 of API credit monthly. Bland AI is the exception: no free minutes on any plan, every connected second billed from the start, betting that voice-agent buyers are past the trial stage.
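The gap between a per-minute headline and the actual meter (the second pattern above) shows up most on short jobs. The sketch below compares per-second metering with a hypothetical vendor that rounds each job up to a whole minute. The function names, the increment parameter, and the 90-second call are illustrative assumptions; only the $0.11/min rate comes from the text.

```python
import math
from decimal import Decimal

def billable_seconds(duration_seconds, increment_seconds=1, minimum_seconds=0):
    """Round an integer duration up to the billing increment, then apply a minimum."""
    rounded = math.ceil(duration_seconds / increment_seconds) * increment_seconds
    return max(rounded, minimum_seconds)

def call_cost(duration_seconds, rate_per_minute,
              increment_seconds=1, minimum_seconds=0):
    """Cost of one job at a per-minute headline rate under a given rounding policy."""
    seconds = billable_seconds(duration_seconds, increment_seconds, minimum_seconds)
    return Decimal(seconds) * Decimal(str(rate_per_minute)) / Decimal(60)

# A 90-second call at a $0.11/min headline rate.
per_second = call_cost(90, "0.11")                        # metered to the second
per_minute = call_cost(90, "0.11", increment_seconds=60)  # rounded up to 2 minutes

print(per_second, per_minute)
```

Here the rounded policy bills 120 seconds instead of 90, a third more for the same call; across thousands of short calls that difference can outweigh a small gap in headline rates.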
Counterexamples & variants
The most common variant is the vendor that generates speech but bills it by the character, not the minute. Deepgram’s Aura TTS is $0.030/1k characters, Speechmatics TTS is $0.011/1k characters, and Murf AI’s API is $0.01–$0.03 per 1,000 characters. These companies meter media minutes for transcription or agents but switch to per-character billing for synthesis, because the input to synthesis is text of known length, where the input to transcription is audio of unknown word count. The same vendor runs two units side by side, and only one of them is the minute.
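Per-character synthesis billing can be estimated before any audio exists, which is the point of the paragraph above. This is a minimal sketch using the two quoted rates; `tts_cost` is a hypothetical helper, and counting every character with `len` (including spaces and punctuation) is an assumption, since vendors may count differently.

```python
from decimal import Decimal

def tts_cost(text, rate_per_1k_chars):
    """Synthesis cost in dollars for a script billed per 1,000 characters.

    Assumption: every character in `text` is billable.
    """
    return Decimal(len(text)) * Decimal(str(rate_per_1k_chars)) / Decimal(1000)

script = "x" * 4500  # stand-in for a 4,500-character script

deepgram_aura = tts_cost(script, "0.030")     # $0.030 per 1k characters
speechmatics_tts = tts_cost(script, "0.011")  # $0.011 per 1k characters

print(deepgram_aura, speechmatics_tts)
```

No equivalent calculation is possible for transcription without first measuring the audio, which is why that side of the same vendor's price card stays on the minute.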
Parloa is the variant that proves the minute can be the meter without ever appearing on a price list. Parloa publishes no public pricing (its /pricing path 404s) and sells contact-center voice automation through sales-led annual contracts, with an indicative floor around $300,000/year per third-party reviews. The connected minute is almost certainly the underlying cost driver, but the buyer never sees a per-minute rate, only a negotiated contract total. The minute is real and load-bearing, but it is invisible at the point of sale.
WellSaid Labs sits at the opposite extreme: it produces per-minute media (AI voiceover) but bills almost entirely by the seat — Creative at $50/user/month, Business at $160/user/month — rather than by minutes consumed. For a content team that generates voiceovers all day, a flat seat removes the per-minute anxiety entirely. Descript and Krisp lean the same way, leading with per-seat subscriptions and treating minutes as an allowance inside the plan rather than a line item. In these cases media minutes exist in the taxonomy but are not the unit the buyer transacts in.
Finally, Kustomer and Fal show the minute as a secondary meter. Kustomer is a seat-priced CRM whose Voice channel is a pay-as-you-go add-on from $0.02/minute — the minute rides alongside seats and per-resolution AI charges, not as the headline. Fal is a generative-media GPU platform that bills most models per output (per image, per video) but exposes per-second video rates ($0.05/s for Wan 2.5, $0.4/s for Veo 3) that resolve to a media-minute unit. The minute appears, but as one meter among several rather than the spine of the pricing.
What this means for buyers vs vendors
For buyers
Estimate your monthly minutes before you compare rates: your bill is dominated by volume, not by the headline number. A team transcribing 10,000 minutes/month sees a real difference between Deepgram Nova-3 at $0.0048/min ($48) and a $0.40/hr batch alternative (about $67), but a team doing 200 minutes/month will barely feel it. Match the meter to the task: transcription is metered per minute of input audio, synthesis is usually per character of input text, so a “voice AI” quote that mixes both needs to be split before you can compare it. Check whether the minute is real or rounded: Bland AI bills per connected second under a per-minute headline, which behaves very differently from a vendor that rounds every short job up to a full minute. Watch for the credit layer: when Synthesia or Hedra sells you credits, convert them back to minutes (Synthesia’s Basic plan is 10 minutes for 1,200 credits) so you are comparing minutes to minutes. And if you generate media all day, price the seat-based variant, since WellSaid Labs and Descript may beat any per-minute meter for high-volume creative teams. See choosing the right usage metric and the introduction to usage-based pricing for the framing.
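Comparing quotes is easier once every rate is expressed per minute and every credit pool is converted back to minutes. The helpers below are a rough sketch under that idea; their names are invented for illustration, and the 120 credits-per-minute figure is simply the Synthesia ratio quoted above (1,200 credits for 10 minutes).

```python
from decimal import Decimal

SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600}

def rate_per_minute(rate, unit):
    """Re-express a rate quoted per second, minute, or hour as dollars per minute."""
    return Decimal(str(rate)) * Decimal(60) / Decimal(SECONDS_PER_UNIT[unit])

def monthly_bill(minutes, rate, unit="minute"):
    """Estimated usage bill for a month of `minutes` at the quoted rate."""
    return Decimal(minutes) * rate_per_minute(rate, unit)

def credits_to_minutes(credits, credits_per_minute):
    """Convert a credit pool back to media minutes."""
    return Decimal(credits) / Decimal(credits_per_minute)

cent = Decimal("0.01")
nova3 = monthly_bill(10_000, "0.0048").quantize(cent)        # per-minute quote
batch = monthly_bill(10_000, "0.40", "hour").quantize(cent)  # per-hour quote
synthesia_minutes = credits_to_minutes(1200, 120)

print(nova3, batch, synthesia_minutes)
```

Running the same two quotes at 200 minutes/month gives under a dollar and a little over a dollar, which is why low-volume teams can usually ignore the rate gap.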
For vendors
The media minute is the most intuitive meter you can offer a speech or video buyer — they already think in call length and video duration — but it is also the most directly comparable, so your per-minute rate sits next to every competitor’s. Differentiate inside the minute rather than on it: split accuracy or speed tiers the way Speechmatics separates standard from enhanced, or stack a model ladder like Deepgram’s $0.075–$0.163/min Voice Agent range, so buyers self-select into the rate that matches their need. Decide deliberately whether to expose the minute or wrap it: a developer API wins on a transparent per-minute card (Rev AI), while a creative tool can escape rate-card comparison by selling credits that convert to minutes (Synthesia, ElevenLabs). Use a free-minute allowance as the on-ramp — Speechmatics’s 2,400 free minutes/month is a low-friction trial — unless your buyer is past the experimentation stage, as Bland AI bets. Whatever you choose, you need per-second attribution of media duration to a customer and a job, which is a heavier metering pipeline than counting requests; see tracking and metering usage events and billing cycles and invoicing.
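The per-second attribution mentioned above can be pictured as a small aggregation step. This is a sketch under stated assumptions, not a production pipeline: the `MediaUsageEvent` shape, the field names, and the rule that each completed job emits exactly one event (so repeated job IDs are treated as duplicate deliveries) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class MediaUsageEvent:
    customer_id: str
    job_id: str
    duration_seconds: int  # media duration processed, in whole seconds

def seconds_by_customer(events):
    """Sum media seconds per customer, counting each (customer, job) once."""
    totals = defaultdict(int)
    seen = set()
    for event in events:
        key = (event.customer_id, event.job_id)
        if key in seen:  # duplicate delivery of the same job: skip it
            continue
        seen.add(key)
        totals[event.customer_id] += event.duration_seconds
    return dict(totals)

def invoice_amount(total_seconds, rate_per_minute):
    """Convert metered seconds to dollars at a per-minute rate."""
    return Decimal(total_seconds) * Decimal(str(rate_per_minute)) / Decimal(60)

events = [
    MediaUsageEvent("acme", "job-1", 720),
    MediaUsageEvent("acme", "job-2", 30),
    MediaUsageEvent("acme", "job-1", 720),   # retry of job-1, must not double-bill
    MediaUsageEvent("globex", "job-3", 3600),
]
totals = seconds_by_customer(events)
print(totals, invoice_amount(totals["acme"], "0.0048"))
```

Keeping the meter in seconds and converting to minutes only at invoice time leaves the rounding policy as a pricing decision rather than something baked into the stored usage data.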
| Company | Product | Pricing model | Billing units | Free tier | Verified |
|---|---|---|---|---|---|
| Bland AI | AI phone call automation platform: inbound and outbound voice agents at scale | hybrid, pure-usage, subscription | api-calls, credits, media-minutes | Yes | 2026-05-29 |
| Creatify | AI ad-creative platform: turns a product URL into video and image ads | hybrid, freemium | credits, seats, media-minutes | Yes | 2026-06-08 |
| Deepgram | Usage-based speech-to-text, text-to-speech, and voice agent APIs | pure-usage, freemium | media-minutes, tokens, credits, +1 more | Yes | 2026-05-31 |
| Descript | AI-powered audio and video editing | hybrid, freemium | seats, credits, media-minutes | Yes | 2026-05-31 |
| ElevenLabs | Voice AI platform across ElevenCreative, ElevenAgents, and ElevenAPI | subscription, pure-usage, hybrid | characters, credits, media-minutes, +1 more | Yes | 2026-05-28 |
| Fal | Generative-media inference platform: serverless per-output model APIs plus dedicated GPU compute | pure-usage | gpu-hours, requests, media-minutes | No | 2026-06-01 |
| Hedra | AI video, avatar, image, and audio generation platform (Hedra Studio + API) | subscription, freemium | credits, media-minutes, characters, +1 more | Yes | 2026-06-04 |
| Krisp | AI noise-cancellation, meeting transcription/notes, call-center voice AI, and a developer Voice AI SDK | seat-based | seats, storage-gb, media-minutes | Yes | 2026-06-04 |
| Kustomer | AI-first CRM and customer-service platform unifying omnichannel support, automation, and AI agents | hybrid, seat-based, outcome-based | seats, resolutions, media-minutes, +1 more | No | 2026-06-07 |
| Murf AI | AI voice / text-to-speech platform (Murf Studio app + Murf API) | subscription, pure-usage, freemium | media-minutes, seats, credits | Yes | 2026-06-01 |
| Parloa | Enterprise AI Agent Management Platform (AMP) for contact-center voice and chat automation | pure-usage | media-minutes, resolutions | No | 2026-06-07 |
| Rev AI | Pay-as-you-go speech-to-text, transcription, and audio-intelligence APIs | pure-usage, freemium | media-minutes, credits, api-calls | Yes | 2026-06-04 |
| Speechmatics | Speech-to-text and text-to-speech APIs with per-hour usage pricing | pure-usage, freemium | media-minutes, characters | Yes | 2026-06-04 |
| Synthesia | Enterprise AI video generation | subscription, freemium | credits, media-minutes, seats | Yes | 2026-05-31 |
| Tavus | Conversational Video Interface (CVI) API for real-time AI humans / avatars, plus PALs consumer AI companions | hybrid, freemium | media-minutes | Yes | 2026-06-01 |
| Twelve Labs | Video understanding foundation models (Marengo for search/embeddings, Pegasus for analysis) delivered as a usage-metered API | pure-usage, freemium, commitment | media-minutes, tokens, requests | Yes | 2026-06-02 |
| WellSaid Labs | AI text-to-speech voiceover studio with 100+ voices for content teams | seat-based, freemium | seats, media-minutes | Yes | 2026-06-04 |
FAQ
What is media-minute pricing?
Media-minute pricing is a billing unit where customers are charged per minute of audio or video processed. It is the native meter for speech-to-text, text-to-speech, voice agents, and AI video, because the duration of the media maps directly to the compute cost of generating or transcribing it.
How much does it cost to transcribe a minute of audio?
Machine transcription is cheap and varies by model. In this corpus Rev AI's Reverb Turbo is $0.10/hr (about $0.0017/min) and Deepgram's Nova-3 streaming is $0.0048/min, while Speechmatics charges $0.24/hr (about $0.004/min) for standard accuracy. Human transcription is far more expensive: Rev AI lists it at $1.99/min through the same API.
Why do speech and video vendors bill per minute instead of per token?
Audio and video have no natural token boundary, but they do have a duration. A minute of speech is a stable, intuitive unit that buyers can estimate from call logs or video length, and it tracks the underlying compute closely. Vendors like Twelve Labs and Tavus meter video by the minute for the same reason transcription vendors meter audio by the minute.
What is the difference between per-minute and per-character pricing for voice AI?
Speech-to-text (transcription) is metered by the minute of input audio, because you cannot know the word count in advance. Text-to-speech (synthesis) is usually metered per character of input text — Deepgram's Aura is $0.030/1k characters and Speechmatics TTS is $0.011/1k characters — because the text length is known and predicts the output. Many vendors run both meters side by side.
Do per-minute vendors offer free minutes?
Most do. Speechmatics gives every account 2,400 free STT minutes per month, Rev AI starts with credits worth 5 hours of Reverb ASR, and Deepgram includes $200 in free credit. Bland AI is a notable exception — it bills every connected call second from the first minute with no free allowance.
Which companies use media-minute pricing?
In this corpus 17 companies meter media minutes, including transcription APIs (Deepgram, Speechmatics, Rev AI), voice agents (Bland AI, Parloa), text-to-speech and dubbing (ElevenLabs, Murf AI, WellSaid Labs), AI video (Synthesia, Tavus, Hedra, Creatify, Twelve Labs), and platforms that fold minutes into a broader meter (Descript, Krisp, Fal, Kustomer).
Trivia
- The same minute of audio spans roughly three orders of magnitude in this corpus: Rev AI transcribes a minute of speech for about $0.0017 (Reverb Turbo at $0.10/hr) while its own human transcription costs $1.99/min, more than 1,000x as much. A conversational video minute on Tavus runs $0.37/min, roughly 200x the cheapest machine transcription.
- Several "per-minute" vendors do not actually publish a per-minute meter. Speechmatics and Twelve Labs both expose a per-minute / per-hour toggle that re-expresses the identical rate, while Synthesia and Hedra sell credits that silently convert to minutes. Synthesia's 1,200-credit Basic plan equals exactly 10 minutes of video per month.
- Bland AI's per-minute rate is all-inclusive (LLM inference, speech-to-text, text-to-speech, and telephony are bundled into one $0.11-$0.14/min number), whereas Deepgram's Voice Agent API stacks the meter the opposite way, from $0.075/min Standard to $0.163/min Advanced with cheaper bring-your-own-LLM and bring-your-own-TTS variants in between.
Related billing units
- Credit-Based Billing: A billing unit where customers pre-purchase or are allocated a pool of credits that deplete as they use the product, often at variable rates per feature.
- Token-Based Pricing: A billing unit common in LLM and AI products, where customers are charged per input and output token processed.
- Per-Seat Pricing: A billing unit where the vendor charges a fixed fee per named user, regardless of how much each user consumes.
- Per-Resolution Pricing: A billing unit unique to AI customer-support products, where the vendor charges only when an AI agent resolves a customer issue without escalation.
- Bandwidth-Based Pricing: A billing unit where customers are charged per gigabyte of data transferred out of the platform.
- Per-Function-Invocation Pricing: A billing unit where customers are charged per serverless function invocation, often combined with a separate compute-time charge.
- CPU-Hour Pricing: A billing unit where customers are charged for the CPU time their workloads consume, typically measured in vCPU-seconds or vCPU-hours.
- GB-Hour Pricing: A billing unit where customers are charged for the memory their workloads consume over time, measured in gigabyte-hours.
- GPU-Hour Pricing: A billing unit where customers are charged for GPU time consumed, typically measured per-second or per-hour by GPU type.
- Per-API-Call Pricing: A billing unit where customers are charged per API request, regardless of payload size or processing time.
- Per-GB Storage Pricing: A billing unit where customers are charged per gigabyte of data stored on the platform per month.