How much does AssemblyAI cost per hour of audio?

AssemblyAI bills async transcription by the hour: Universal-2 is $0.15/hr and the more accurate Universal-3.5 Pro is $0.21/hr. Real-time streaming ranges from $0.15/hr (Universal-Streaming) to $0.45/hr (Universal-3.5 Pro Realtime). A Voice Agent API is $4.50/hr ($0.075/min). Speech Understanding and Guardrails add-ons each carry an additional per-hour fee on top of the base rate.

Does AssemblyAI have a free tier?

Yes. You can create an account and start transcribing immediately with no credit card required. The free tier includes up to 185 hours of pre-recorded transcription and up to 333 hours of streaming. There is no monthly minimum — beyond the free tier you only pay for what you process.

What is the LLM Gateway and how is it priced?

The LLM Gateway is AssemblyAI's layer that lets developers run frontier LLMs from OpenAI, Anthropic, and Google directly against a transcript — summarization, Q&A, custom prompts. It is billed per million input and output tokens at each model's published rate, separately from transcription costs. It is the productized evolution of LeMUR.

What is the difference between AssemblyAI's Universal-2 and Universal-3.5 Pro models?

Universal-2 is the lower-cost async model at $0.15/hr, supporting 99 languages with excellent accuracy. Universal-3.5 Pro is the most accurate model at $0.21/hr, leading on multilingual word error rate. For real-time use, AssemblyAI offers Universal-Streaming and Universal-Streaming Multilingual (both $0.15/hr) and Universal-3.5 Pro Realtime ($0.45/hr).

Does AssemblyAI offer enterprise pricing?

Yes. Enterprise customers receive custom volume-discount pricing, dedicated support, and invoice billing, and AssemblyAI is available via the AWS Marketplace. Self-serve customers pay standard per-hour rates with no sales contact required.

How do Speech Understanding add-ons affect AssemblyAI's pricing?

Each Speech Understanding feature (speaker identification, entity detection, translation, and more) adds an incremental per-hour fee on top of base transcription, as does each Guardrails feature such as PII redaction. Enabling multiple features stacks those per-hour fees: Universal-2 with speaker identification and entity detection, for example, comes to about $0.25/hr against a $0.15/hr base.

AssemblyAI Pricing

AI Summary

AssemblyAI operates a pure usage-based pricing model billed per hour of audio: async transcription at $0.15/hr (Universal-2) or $0.21/hr (Universal-3.5 Pro), with no seat fees, no monthly minimums, and a free tier of up to 185 hours to test before committing.
Async transcription has two live models — Universal-2 ($0.15/hr) and the more accurate Universal-3.5 Pro ($0.21/hr); real-time streaming is published at $0.15/hr (Universal-Streaming and Universal-Streaming Multilingual) and $0.45/hr (Universal-3.5 Pro Realtime). A Voice Agent API is priced at $4.50/hr ($0.075/min).
Speech Understanding and Guardrails add-ons, including speaker identification, entity detection, translation, and PII redaction, each layer an additional per-hour fee on top of the base transcription rate.
The LLM Gateway lets developers run frontier LLMs (OpenAI, Anthropic, Google) against a transcript, billed per million input and output tokens separately from transcription — the productized evolution of LeMUR.
Enterprise customers get volume discounts, dedicated support, and custom contract pricing; the self-serve path is designed for developers who can start on the free tier and scale without a sales call.
AssemblyAI raised $50M in its Series C (January 2024) from Accel and Insight Partners, bringing total funding to ~$143M, and processes audio for hundreds of enterprise customers including dozens of Fortune 500 companies.

Pricing summary

AssemblyAI 2026 — Usage-based Speech AI pricing

No monthly minimums: async from $0.15/hr, streaming from $0.15/hr, add-ons per hour, LLM Gateway per token; Enterprise custom

Free Tier

Free

Developers evaluating the API

Universal-2

$0.15 /hr audio

Developers wanting accuracy at a lower price

Universal-3.5 Pro

$0.21 /hr audio

Teams needing the highest multilingual accuracy

Enterprise

Custom

Fortune 500 and high-volume platforms

Realtime Speech-to-Text

From $0.15 /hr

Live captions, voice agents, call centers

Voice Agent API

$4.50 /hr ($0.075/min)

Production voice agents built end-to-end

Async transcription billed per hour: Universal-2 $0.15/hr, Universal-3.5 Pro $0.21/hr. Realtime from $0.15/hr (Universal-3.5 Pro Realtime $0.45/hr). Voice Agent API $4.50/hr ($0.075/min). Speech Understanding, Guardrails, and add-on features add incremental per-hour fees; LLM Gateway billed per input/output token. Enterprise pricing requires a sales conversation for volume commitments.

About

AssemblyAI is a San Francisco-based AI company founded in 2017 that builds Speech AI APIs for developers and enterprises. The company’s core product is a suite of APIs that convert audio and video to text and extract structured intelligence from the resulting transcripts. Unlike general-purpose AI platforms, AssemblyAI is purpose-built for audio: every model, feature, and pricing dimension is designed around the economics of processing spoken language.

AssemblyAI’s customer base spans startups and Fortune 500 enterprises — the company has reported processing audio for hundreds of enterprise customers, including dozens in the Fortune 500. Customers include companies building products across call center analytics, meeting transcription, media captioning, voice agent platforms, and content intelligence.

The company raised a $50M Series C in January 2024 led by Accel with participation from Insight Partners, bringing total disclosed funding to approximately $143M. That round followed the November 2023 launch of Universal-1 — the company’s highest-accuracy English transcription model at the time — and positioned AssemblyAI as the leading independent speech AI infrastructure provider for developers, competing with Deepgram, Google Cloud Speech-to-Text, AWS Transcribe, and Microsoft Azure Speech.

AssemblyAI’s product suite now spans six priced surfaces: the Pre-recorded (async) Speech-to-Text API, the Realtime (streaming) Speech-to-Text API, a Voice Agent API, Speech Understanding (structured analysis of the transcript), Guardrails (safety/compliance filtering), and the LLM Gateway (an LLM reasoning layer applied directly to audio context, billed per token). Each surface is priced separately and composably, letting developers pay only for the capabilities they use.

Pricing summary : pure per-second billing, no monthly seat fees

AssemblyAI runs a pure usage-based pricing model. There are no subscription tiers, no seat fees, and no monthly minimums for self-serve customers. You pay for what you process: async transcription is billed per hour at $0.15/hr (Universal-2) or $0.21/hr (Universal-3.5 Pro); realtime streaming from $0.15/hr (up to $0.45/hr for Universal-3.5 Pro Realtime); a Voice Agent API at $4.50/hr ($0.075/min); incremental per-hour add-on fees for each Speech Understanding and Guardrails feature you enable; and separate per-token billing for the LLM Gateway.

This model mirrors how usage-based pricing works in cloud infrastructure — the bill expands proportionally with consumption, making it friendly for startups with variable audio volumes and potentially expensive for teams that underestimate usage. Unlike the flat-rate subscription models used by many AI tools (see the AI pricing shift away from per-user licenses), AssemblyAI’s pricing scales linearly with every hour of audio processed.

What makes this different: Most speech API vendors charge a single flat rate that hides model quality trade-offs. AssemblyAI separates model accuracy (Universal-2 vs. Universal-3.5 Pro) from feature add-ons (speaker ID, entities, translation, PII redaction, guardrails), giving developers granular control over the cost/capability trade-off. Stacking several Speech Understanding and Guardrails features can meaningfully raise the effective per-hour rate — a cost dynamic that is not obvious from the headline pricing.

Pricing by product

Pre-recorded Speech-to-Text API (Async)

Model	Price per hour	Key mechanics
Universal-3.5 Pro	$0.21	Most accurate async model; native code switching, works across 18 languages, most accurate speaker diarization
Universal-2	$0.15	Excellent accuracy at a lower price; supports 99 languages; trained on 12.5M+ hours

Custom rate limits, enhanced concurrency, and enterprise-grade flexibility are available via “Contact us” for high-volume workloads.

Pre-recorded add-on features (per-hour, on top of base)

Add-on	Universal-3.5 Pro	Universal-2
Keyterms Prompting	+$0.05/hr	Included
Prompting (beta)	+$0.05/hr	Not supported
Speaker Diarization	+$0.02/hr	+$0.02/hr
Medical Mode (new)	+$0.15/hr	+$0.15/hr

Realtime Speech-to-Text API (Streaming)

Model	Price per hour	Use case
Universal-3.5 Pro Realtime (new)	$0.45	Highest-accuracy real-time transcription
Universal-Streaming	$0.15	Cost-effective real-time, English-only
Universal-Streaming Multilingual	$0.15	Multilingual at the speed and cost of Universal-Streaming (English, Spanish, German, French, Portuguese, Italian)

Streaming is billed by WebSocket session duration. Concurrency scales automatically at no additional fee (pay-as-you-go starting limit is 100 sessions/min, auto-scaling up 10% whenever utilization hits 70%).
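The scaling rule described above can be modeled with a small function. This is only an illustrative sketch: the name `next_limit` is hypothetical, and it assumes the rule is evaluated once per observation and that each 10% increase compounds on the current limit, neither of which the pricing page spells out.

```python
from decimal import Decimal

SCALE_THRESHOLD = Decimal("0.7")  # utilization that triggers an increase
SCALE_FACTOR = Decimal("1.1")     # 10% increase per trigger


def next_limit(current_limit: Decimal, active_sessions: int) -> Decimal:
    """Return the session limit after one evaluation of the scaling rule."""
    if active_sessions >= current_limit * SCALE_THRESHOLD:
        return current_limit * SCALE_FACTOR
    return current_limit


limit = Decimal(100)  # pay-as-you-go starting limit from the text above
for observed in (50, 70, 80):  # hypothetical usage samples
    limit = next_limit(limit, observed)
```

Under these assumptions the limit stays at 100 for the first sample, rises to 110 at 70 sessions, and rises again to 121 at 80 sessions.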

Voice Agent API

Product	Pay-as-you-go	Notes
Voice Agent API	$4.50/hr ($0.075/min)	Proprietary end-to-end Voice AI stack built on the Realtime Speech-to-Text API; volume-based pricing on request

Speech Understanding (per-hour add-ons)

Feature	Add-on pricing	Notes
Key Phrases	+$0.01/hr	Labels significant words and phrases
Speaker Identification	+$0.02/hr	Replaces “Speaker A/B” labels with real names or roles
Sentiment Analysis	+$0.02/hr	Detects sentiment of each sentence spoken
Custom Formatting	+$0.03/hr	Standardize and format specific types of information
Summarization	+$0.03/hr	Generate a summary of audio files at scale
Translation	+$0.06/hr	Convert content from one language to another
Entity Detection	+$0.08/hr	Identify entities that are spoken (names, emails)
Auto Chapters	+$0.08/hr	Time-based summary over audio/video
Topic Detection	+$0.15/hr	Label topics spoken in standardized IAB taxonomy

Guardrails (per-hour add-ons)

Feature	Add-on pricing	Notes
Profanity Filtering	+$0.01/hr	Filter out profanity from transcripts
PII Audio Redaction	+$0.05/hr	Identify and remove PII from the audio file
PII Text Redaction	+$0.08/hr	Identify and remove PII from the transcription text
Content Moderation	+$0.15/hr	Detect sensitive content in audio and video files

LLM Gateway (LLM-over-audio layer, per 1M tokens)

Model	Input / 1M	Output / 1M
GPT-5.5	$5.00	$30.00
GPT-5.2	$1.75	$14.00
GPT-5.1	$1.25	$10.00
Claude 4.8 Opus	$5.00	$25.00
Claude 4.6 Sonnet	$3.00	$15.00

The LLM Gateway pricing tab also lists OpenAI, Anthropic, Google, and Other Providers sub-tabs; rates vary by model. English text averages roughly 1.3 tokens per word, and billing covers both the input and output tokens processed.
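A rough per-call estimate can be built from the token rates in the table above. The sketch below is a back-of-envelope model, not AssemblyAI's billing logic: `gateway_cost` is a hypothetical helper, the 1.3 figure is read as tokens per English word, and the estimate ignores any tokens in the prompt itself.

```python
from decimal import Decimal

# USD per 1M tokens as (input, output), copied from the table above.
TOKEN_RATES = {
    "GPT-5.5": (Decimal("5.00"), Decimal("30.00")),
    "GPT-5.2": (Decimal("1.75"), Decimal("14.00")),
    "GPT-5.1": (Decimal("1.25"), Decimal("10.00")),
    "Claude 4.8 Opus": (Decimal("5.00"), Decimal("25.00")),
    "Claude 4.6 Sonnet": (Decimal("3.00"), Decimal("15.00")),
}
TOKENS_PER_WORD = Decimal("1.3")  # rough English average (assumption)
MILLION = Decimal(1_000_000)


def gateway_cost(model: str, transcript_words: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one LLM call over a transcript."""
    input_rate, output_rate = TOKEN_RATES[model]
    input_tokens = transcript_words * TOKENS_PER_WORD
    return (input_tokens * input_rate + output_tokens * output_rate) / MILLION


# Hypothetical workload: a 9,000-word transcript and a 500-token summary.
estimate = gateway_cost("Claude 4.6 Sonnet", 9000, 500)
```

For that hypothetical workload the estimate comes to about $0.043 per call on Claude 4.6 Sonnet, and about $0.020 on GPT-5.1.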

Sales motions across products: PLG / self-serve for all API products via API key; sales-led for Enterprise volume contracts, invoice billing, and AWS Marketplace. Speech Understanding, Guardrails, and the LLM Gateway are fully self-serve — no sales call required to enable.

Hidden costs : what surprises buyers when the bill arrives

Archetype A: Developer building a meeting transcription app

A developer processing 100 hours/month of meeting recordings on Universal-2, adding speaker identification and entity detection:

Line item	Per-hour cost	Monthly (100 hrs)
Base async transcription (Universal-2)	approximately $0.15	approximately $15.00
Speaker identification add-on	approximately $0.02	approximately $2.00
Entity detection add-on	approximately $0.08	approximately $8.00
Estimated total	approximately $0.25/hr	approximately $25/mo

The base headline rate of $0.15/hr is only the starting point. Common add-ons raise the effective rate meaningfully. Developers often build against the base rate and are surprised when the enriched transcript bill arrives. (Choosing Universal-3.5 Pro at $0.21/hr instead of Universal-2 adds roughly 40% to the base line item.)
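The stacking arithmetic in the table above is easy to script. The following is a minimal sketch using rates copied from this article; the dictionary keys and function names are illustrative labels, not AssemblyAI API parameters, and LLM Gateway token spend is left out.

```python
from decimal import Decimal

# Per-hour rates in USD, copied from the tables in this article.
BASE_RATES = {
    "universal-2": Decimal("0.15"),
    "universal-3.5-pro": Decimal("0.21"),
}
ADDON_RATES = {
    "speaker_identification": Decimal("0.02"),
    "entity_detection": Decimal("0.08"),
    "translation": Decimal("0.06"),
    "pii_text_redaction": Decimal("0.08"),
}


def effective_rate(model: str, addons: list[str]) -> Decimal:
    """Base per-hour rate plus the per-hour fee of every enabled add-on."""
    return BASE_RATES[model] + sum((ADDON_RATES[a] for a in addons), Decimal("0"))


def monthly_cost(model: str, addons: list[str], hours: int) -> Decimal:
    """Estimated monthly bill, excluding LLM Gateway token spend."""
    return effective_rate(model, addons) * hours


addons = ["speaker_identification", "entity_detection"]
print(effective_rate("universal-2", addons), monthly_cost("universal-2", addons, 100))
# 0.25 25.00
```

Swapping in `"universal-3.5-pro"` with PII text redaction, entity detection, and translation at 2,000 hours reproduces the $860/month figure for the call center archetype.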

Archetype B: Call center analytics platform at scale

A B2B SaaS company processing 2,000 hours/month of customer support calls on Universal-3.5 Pro with PII redaction, entity detection, and translation:

Line item	Per-hour cost	Monthly (2,000 hrs)
Base async transcription (Universal-3.5 Pro)	approximately $0.21	approximately $420
PII text redaction	approximately $0.08	approximately $160
Entity detection	approximately $0.08	approximately $160
Translation	approximately $0.06	approximately $120
LLM Gateway summarization (per-token, varies by model)	varies	varies
Estimated total (excluding LLM Gateway tokens)	approximately $0.43/hr	approximately $860/mo

At higher volumes, it becomes worth asking for enterprise pricing instead of paying the self-serve rate. AssemblyAI’s sales team offers volume pricing to companies with high monthly usage.

Want to model your own AssemblyAI spend? Use the AssemblyAI pricing calculator to estimate costs based on your audio volume, model selection, and feature mix.

Pricing evolution : how AssemblyAI’s pricing has changed since 2021

Cadence

Quarter	Price changes	Product / SKU additions	Notes
2021 Q2	1	1	Public API launched with pay-per-second billing; Series A raised
2022 Q4	0	2	Series B raised; Audio Intelligence add-ons (sentiment, entity, IAB) launched
2023 Q2	0	1	LeMUR launched in beta — token-based LLM-over-audio pricing introduced
2023 Q4	0	1	Universal-1 launched — same pricing tier, higher accuracy model
2024 Q1	0	1	Series C raised; Universal-2 released with improved benchmarks
2025 Q1	0	2	Universal-3.5 Pro (async) and Universal-3.5 Pro Realtime streaming launched
2026 Q2	0	0	Pricing stable as of May 2026 research
2026 Q3	0	3	2026-07-06: Voice Agent API launched ($4.50/hr); pricing page split into six product tabs; Guardrails broken out as a priced family; Medical Mode and Keyterms/Prompting add-ons added; Whisper-Streaming retired for Universal-Streaming Multilingual; async rates unchanged

Tracked range: 2021 Q2–2026 Q3. Quarters not listed above were verified stable (0 price changes, 0 SKU additions). Historical pricing pre-2021 was invite-only and not publicly documented.

Notable changes

2021 Q2 — Public pay-per-second billing launched; first developer self-serve access. Pricing set at $0.00025/second for standard transcription.
2022 Q4 — Audio Intelligence feature suite launched: sentiment analysis, entity detection, IAB topic classification, and content safety added as incremental per-second fees on top of base transcription.
2023 Q2 — LeMUR announced at beta: the first commercially-available LLM-over-audio API. Token-based pricing (input + output) introduced as a third billing dimension separate from transcription and add-ons.
2023 Q4 — Universal-1 released as the new highest-accuracy English STT model. Positioned as the same pricing tier as prior models but with significantly lower word error rate.
2024 Q1 — Universal-2 released following $50M Series C. Universal-2 achieved further accuracy gains over Universal-1 on standard English benchmarks. No pricing increase; same per-second rate.
2025 Q1 — Universal-3.5 Pro (async) and Universal-3.5 Pro Realtime streaming launched. Streaming priced at a premium to async to reflect lower-latency infrastructure costs.
2026-07-06 — AssemblyAI restructured its pricing page into six product tabs (Pre-recorded STT, Realtime STT, Voice Agent, Speech Understanding, Guardrails, LLM Gateway) and launched a standalone Voice Agent API at $4.50/hr ($0.075/min) — its first packaged full-stack voice product and, at roughly 30× the base async rate, its most expensive priced surface. In the same move it broke Guardrails out as a distinct priced family (Profanity Filtering $0.01/hr through Content Moderation $0.15/hr), added Medical Mode ($0.15/hr) and Keyterms/Prompting ($0.05/hr) as pre-recorded add-ons, retired Whisper-Streaming in favor of Universal-Streaming Multilingual ($0.15/hr), and published explicit per-model token rates on the LLM Gateway (e.g. GPT-5.5 $5/$30, Claude 4.8 Opus $5/$25 per 1M). Base async rates were unchanged (Universal-2 $0.15/hr, Universal-3.5 Pro $0.21/hr).

The July 2026 pricing-transparency shift in detail

The most consequential thing about the July 6 restructure is not the new Voice Agent SKU — it is that AssemblyAI closed the exact transparency gaps this analysis previously flagged. Add-on and Guardrails rates, streaming rates, and per-model LLM Gateway token prices are now published on the public pricing page rather than buried in gated documentation. A buyer can now build a full total-cost model — base transcription plus every enrichment plus token spend — before creating an account or talking to sales. The move also reframes the product from “a transcription API with some add-ons” into six clearly-priced surfaces, and it pushes AssemblyAI up-stack: the Voice Agent API sells the finished voice-agent outcome (understand, reason, respond) rather than the raw transcript, which is where the pricing power in the audio-AI category is migrating.

What’s unique : differentiators in AssemblyAI’s pricing approach

1. Per-second granularity with no rounding penalty. AssemblyAI meters audio by the second. At the 2021 launch rate of $0.00025/second, a 90-second clip costs exactly $0.0225, not the $0.03 it would cost if rounded up to two full minutes. The mechanism is simple but commercially significant: early speech API providers (and even current cloud incumbents) round up to the nearest 15 seconds or full minute. For developers processing large volumes of short clips (voicemails, social posts, support snippets), per-second billing can reduce costs by 30–50% versus per-minute rounding; a 30-second clip billed as a full minute, for instance, costs twice as much. See how usage metric design affects developer costs for why this matters in tool selection.

2. Composable Audio Intelligence: pay for features, not tiers. Unlike SaaS tools that gate feature sets behind plan tiers, AssemblyAI lets developers enable any combination of Audio Intelligence features per-request. Sentiment analysis on one request, speaker diarization + entity detection on another — each billed independently. This composable feature billing allows developers to precisely control costs and avoids paying for analysis they don’t need. It mirrors how AWS charges for individual cloud services rather than bundled “plans.”

3. LeMUR: the first token-billed audio LLM API. When LeMUR launched in 2023, it introduced a novel billing layer to the speech category: token-based LLM pricing applied to audio context. Developers could ask natural-language questions about a transcript — “summarize the action items from this meeting” — and pay per token for the answer. This created a new cost dimension that no other speech API offered, and it mirrors the outcome-based pricing trend where customers pay for derived value (the answer) rather than raw processing (the transcript).

4. Free playground before payment commitment. AssemblyAI’s dashboard playground lets developers test every model, including the LLM Gateway (the successor to LeMUR), with real audio before entering any payment details. In a category where competitors often require API key purchase to begin testing, this friction-free evaluation path is a meaningful PLG differentiator. It reflects the shift toward product-led growth in AI infrastructure where developer trust is won at the keyboard before the wallet.

5. Model accuracy as a pricing anchor, not a pricing gate. AssemblyAI has added model generations (Universal-1 → Universal-2 → Universal-3.5 Pro) without raising existing rates: Universal-2 stays at $0.15/hr, and Universal-3.5 Pro is an optional step up to $0.21/hr. That 40% premium is small next to the tiered-model pricing used by OpenAI (GPT-4 costs more than GPT-3.5) or Anthropic (Opus costs more than Haiku), where the gap is a multiple. AssemblyAI’s approach bets that accuracy leadership drives adoption volume, and volume drives enterprise upsell: the premium is captured mainly at the contract level, not the per-unit level.

6. A single all-in Voice Agent price on top of the à-la-carte stack. With the July 2026 launch of the Voice Agent API at $4.50/hr ($0.075/min), AssemblyAI now sells two shapes of the same underlying capability: the composable, pay-for-what-you-enable stack (transcription + Speech Understanding + Guardrails + LLM Gateway tokens) for teams that want control, and one bundled per-hour rate for teams that just want a working voice agent without assembling the pipeline. That $4.50/hr headline is roughly 30× the $0.15/hr base transcription rate — a deliberate signal that the value is in the finished full-duplex agent, not the words. It mirrors the broader outcome-based pricing move toward charging for the delivered result rather than the raw processing step.

Strengths & weaknesses

Strengths	Weaknesses
Per-second billing with no rounding eliminates the penalty for short-clip processing	Multiple billing dimensions (transcription + per-feature add-ons + LLM Gateway tokens) make total cost hard to predict without a calculator
Public pricing (as of the July 2026 restructure) now lists every rate — Guardrails family, add-ons, streaming, and per-model LLM Gateway tokens — so buyers can self-model total cost before signing up	Six priced surfaces plus per-feature add-ons and per-model tokens make total cost hard to model by hand; a heavily-enriched transcript stacks many line items
Composable stack: enable exactly the transcription, Speech Understanding, Guardrails, and LLM Gateway calls you need per request	Voice Agent API at $4.50/hr is ~30× the base transcription rate — attractive convenience but a steep premium teams may not notice until the bill arrives
Successive models added without raising existing rates (Universal-2 stays at $0.15/hr; Universal-3.5 Pro is an optional $0.21/hr step up)	Enterprise volume pricing still opaque; sales process required for any discount
LLM Gateway uniquely enables LLM reasoning on audio without building a custom pipeline, now with published per-model token rates	No spend caps or budget alerts on self-serve accounts — surprise bills possible for high-volume batch jobs
Genuine accuracy leadership on English benchmarks vs. Whisper, Deepgram, Google STT	Pricing for non-English languages not prominently documented; accuracy benchmarks primarily cover English

Billing UX : developer experience with AssemblyAI’s billing controls

Monthly usage-based invoicing — bills are generated at the start of each month for the previous month’s actual usage; no minimum commitments, upfront fees, or contracts on the pay-as-you-go plan.
No-credit-card free start — create an account and start transcribing immediately; the free tier includes up to 185 hours of pre-recorded transcription and up to 333 hours of streaming transcription.
Automatic concurrency scaling — the free plan allows 5 new streaming connections/min; pay-as-you-go starts at 100 sessions/min and auto-increases 10% whenever utilization hits 70%, with no ceiling and no additional fee.
Per-channel multichannel metering — a 1-hour stereo (2-channel) file is billed as 2 hours; each channel is transcribed independently.
Playground — no payment required — The in-dashboard playground allows testing of all Voice AI models and the LLM Gateway with real audio before any billing is set up. This is the most developer-friendly evaluation UX in the category.
No spend caps on self-serve — AssemblyAI does not currently expose configurable spending limits for self-serve accounts. A batch job that processes unexpectedly large audio volumes bills without notification — a known friction point for cost-sensitive developers.
Billing granularity — Bills are itemized by API product (transcription, Speech Understanding feature, Guardrails feature, LLM Gateway tokens), showing the audio duration processed and features enabled per request.
AWS Marketplace billing — usage can be consolidated through an existing AWS account (contact sales to set up); Enterprise accounts add invoice billing, purchase orders, and custom contract payment terms.
Enterprise account management — Enterprise customers receive custom rate limits, enhanced concurrency, a dedicated account manager, and usage-forecasting support via the “Talk to our team” path on every product tab.
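The per-channel metering rule in the list above is worth modeling before uploading multichannel audio. A minimal sketch, assuming billing is simply duration multiplied by channel count; `billable_hours` is a hypothetical helper and the $0.15/hr figure is the Universal-2 rate quoted in this article.

```python
from decimal import Decimal

UNIVERSAL_2_RATE = Decimal("0.15")  # USD per billed hour, from this article


def billable_hours(duration_seconds: int, channels: int = 1) -> Decimal:
    """Each channel is transcribed, and therefore metered, independently."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    return Decimal(duration_seconds * channels) / Decimal(3600)


# A 1-hour stereo recording is billed as 2 hours of audio.
stereo_hours = billable_hours(3600, channels=2)
stereo_cost = stereo_hours * UNIVERSAL_2_RATE
```

Here the stereo file is billed as 2 hours and costs $0.30 on Universal-2, double the mono equivalent, so downmixing to mono before upload is a cost lever when per-speaker channels are not needed.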

Strategic wins : where AssemblyAI’s pricing decisions have paid off

1. Per-second billing made AssemblyAI the default choice for short-clip use cases

By charging at the per-second level rather than rounding to the minute, AssemblyAI structurally won the economics for developers processing short audio: voicemails (20–60 seconds), social media clips (15–60 seconds), and podcast excerpts (30–90 seconds). At the 2021 launch rate of $0.00025/second, a company processing 1 million 30-second clips per month pays AssemblyAI $7,500, versus $15,000 at a competitor that charges the same rate but rounds each clip up to a full minute: the same audio at 2× the cost. This per-second value metric is the single pricing decision that most clearly explains AssemblyAI’s developer adoption curve in media and social application categories.
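The comparison above can be reproduced in a few lines. This sketch assumes the $0.00025/second launch rate quoted in the Notable changes list and a hypothetical competitor that charges the same rate but rounds each clip up to the next full minute; the function names are illustrative.

```python
from decimal import Decimal

RATE_PER_SECOND = Decimal("0.00025")  # 2021 launch rate quoted in this article


def billed_seconds(duration: int, increment: int = 1) -> int:
    """Round a duration up to the next multiple of `increment` seconds."""
    return -(-duration // increment) * increment


def batch_cost(clips: int, clip_seconds: int, increment: int = 1) -> Decimal:
    """Cost of `clips` equal-length clips under a given rounding increment."""
    return clips * billed_seconds(clip_seconds, increment) * RATE_PER_SECOND


per_second = batch_cost(1_000_000, 30)                # metered by the second
per_minute = batch_cost(1_000_000, 30, increment=60)  # rounded up to the minute
print(per_second, per_minute)
```

The two results are $7,500 and $15,000. Changing `clip_seconds` shows how quickly the gap narrows as clips approach a whole number of minutes.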

2. Composable Audio Intelligence created a flywheel of feature adoption without tier lock-in

By pricing Audio Intelligence as per-request add-ons rather than plan tiers, AssemblyAI gave developers the freedom to start with base transcription and incrementally adopt higher-value features as product needs evolved. A developer who starts with basic transcription at $0.15/hr naturally discovers speaker diarization when their users ask “who said what?” and enables it for an extra $0.02/hr. This composable model drives organic feature expansion that would not happen if features were gated behind fixed tiers requiring a plan upgrade. The result: higher feature adoption rates than a tier-gated model would produce, and higher average revenue per customer over time.

3. LeMUR differentiated AssemblyAI beyond the “just transcription” category

The 2023 launch of LeMUR moved AssemblyAI from being a transcription API vendor to being an audio intelligence platform — a category with significantly higher defensibility and pricing power. Before LeMUR, the main competitive variables were accuracy and price per minute. After LeMUR, AssemblyAI offered a capability that no other speech API could replicate: LLM-quality reasoning applied directly to audio content, without requiring a customer to build their own transcript → LLM pipeline. This outcome-based value layer — “what does this meeting mean?” rather than “give me the words” — justified enterprise conversations that raw transcription pricing alone could not support.

4. Accuracy leadership at parity pricing created switching cost without raising rates

AssemblyAI’s model release cadence (Universal-1 → Universal-2 → Universal-3.5 Pro) with no increase to existing rates created a powerful retention mechanism: customers who switched to AssemblyAI for Universal-2’s accuracy gains had no price-driven reason to leave when Universal-3.5 Pro launched, because Universal-2 stayed at $0.15/hr and the new model was an optional step up to $0.21/hr. This is the inverse of the pricing strategy used by most AI model companies, where new model generations come with price increases. By keeping existing rates flat while improving accuracy, AssemblyAI accumulates a technical switching cost (the customer’s application is tuned to AssemblyAI’s output format, API behavior, and accuracy characteristics) without imposing a financial switching cost that might prompt re-evaluation. See how AI companies use model improvements as retention tools for the broader pattern.

Areas to improve : gaps and friction in AssemblyAI’s pricing approach

1. Six priced surfaces now need a first-party cost estimator, not just a rate card

The July 2026 restructure closed the transparency gap this analysis previously flagged: add-on rates, the Guardrails family, streaming rates, and per-model LLM Gateway token prices are now all published on the public pricing page, so a buyer can finally self-qualify without an account. The new problem is the flip side of that transparency — with six product tabs, per-feature add-ons, and per-model token pricing, a realistic voice-agent or call-analytics workload now spans many stacked line items that are tedious to total by hand. AssemblyAI publishes the rates but not a way to combine them. An interactive cost estimator on the pricing page — pick a model, toggle the add-ons and Guardrails you need, set expected volume — would let developers self-qualify their budget fit in one screen instead of spreadsheet arithmetic. Compare this against Perplexity AI’s fully public API pricing, which pairs transparent rates with a simpler mental model.

2. No spend caps create bill shock risk for self-serve developers

AssemblyAI’s self-serve accounts do not support configurable spending limits or threshold alerts. A developer who accidentally submits a batch of 10,000 long audio files will receive the bill without any real-time warning. AWS, Google Cloud, and Azure all offer budget alerts and spending caps as standard account management features — AssemblyAI’s absence of these controls is a category gap that creates anxiety for cost-sensitive development teams. Adding a simple “alert me when my monthly bill exceeds $X” setting would reduce developer anxiety and likely increase API adoption from teams that are currently cautious about unexpected charges.
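Until such a control exists, teams can approximate one on the client side. The sketch below is a hypothetical local guard, not an AssemblyAI feature: it estimates each file's cost from its duration and a per-hour rate, and refuses to queue work that would push the running estimate past a self-imposed cap. It tracks only what this process submits, so it is a safety net rather than an authoritative bill.

```python
from decimal import Decimal


class BudgetGuard:
    """Tracks estimated spend locally and rejects jobs that would exceed a cap."""

    def __init__(self, monthly_cap_usd: Decimal, rate_per_hour_usd: Decimal) -> None:
        self.cap = monthly_cap_usd
        self.rate = rate_per_hour_usd
        self.spent = Decimal("0")

    def estimate(self, duration_seconds: int) -> Decimal:
        """Estimated USD cost of one file at the configured per-hour rate."""
        return self.rate * duration_seconds / Decimal(3600)

    def try_reserve(self, duration_seconds: int) -> bool:
        """Reserve budget for a file; return False if it would breach the cap."""
        cost = self.estimate(duration_seconds)
        if self.spent + cost > self.cap:
            return False
        self.spent += cost
        return True


# Hypothetical usage: a $50 cap at an effective rate of $0.25/hr.
guard = BudgetGuard(Decimal("50"), Decimal("0.25"))
ok = guard.try_reserve(2 * 3600)  # a 2-hour file, estimated at $0.50
```

A batch submitter would call `try_reserve` before each upload and stop, or page someone, on the first `False`. The rate passed in should be the effective per-hour rate including any enabled add-ons.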

3. The Voice Agent premium needs a “build-vs-buy” cost comparison

With the July 2026 launch, streaming rates are now published (Universal-Streaming and Universal-Streaming Multilingual at $0.15/hr, Universal-3.5 Pro Realtime at $0.45/hr), so the old streaming-opacity gap is resolved. The new decision buyers face is the Voice Agent API at $4.50/hr versus assembling the same outcome themselves from Realtime STT + LLM Gateway + a text-to-speech layer. At roughly 30× the base transcription rate, the bundled convenience price is easy to under-appreciate until volume scales. AssemblyAI would reduce hesitation — and defend the premium — by publishing a side-by-side of the bundled Voice Agent rate against the à-la-carte component stack, so buyers can see exactly what the packaging is worth at their volume. See billing cycles and metering for usage-based APIs for why transparent per-workload cost modeling drives developer confidence.

Monetization stack & signals : how AssemblyAI builds & buys its revenue engine

Buys: 0 · Builds: 2 · Signal roles: 2

The read — where the monetization investment is going

AssemblyAI builds the meter and the gateway behind its usage pricing in-house — no third-party billing vendor surfaces. The signal to watch is the founding Enterprise AE hire below: a sales-led enterprise motion forming on top of the self-serve API.

Stack — build vs buy

Builds in-house · 2

LLM Gateway (in-house multi-provider API) · In-house build · Job post, Jan 2026

“This role is focused on building and maintaining our LLM gateway service—a unified API platform that connects customers to multiple LLM providers ... Build and maintain integrations with multiple LLM providers and AI services (OpenAI, Anthropic, Google Vertex, AWS Bedrock etc.)”
Metering · Metering inferred · Docs, Jun 2026

“With the current version of multi-project support, rate limiting is applied at the account level, not at the project level. The rate limit is the maximum number of transcription jobs that can actively process simultaneously.”

What the hiring reveals

View open roles

Founding Enterprise Account Executive · Growth · Jun 4, 2026

The "founding" enterprise AE owns strategic accounts and "positions value and pricing against alternatives" — the first dedicated sales hire layering a sales-led motion onto a self-serve, usage-priced API core.
Senior Software Engineer, Go - LLM Team · Billing engineering · seen Jan 12, 2026

Staffing the in-house LLM Gateway — "a unified API platform that connects customers to multiple LLM providers" (OpenAI, Anthropic, Vertex, Bedrock). The metered token layer behind AssemblyAI's per-token pricing is a build, not a bought billing platform.

1 more matched role — supporting evidence

Applied AI Engineer · Customer success · Jun 18, 2026

Signals reviewed Jun 2026 · derived from public job posts, product docs

Job postings fill and close over time — once a posting is filled we keep it as a dated citation (the quoted evidence remains); use View open roles for current listings.

Key takeaways

Per-second billing is a competitive moat in short-clip categories. AssemblyAI quotes rates per hour but meters by the second with no rounding, so developers processing sub-60-second audio pay for actual duration rather than a full minute. Against a competitor that rounds up to the minute at the same rate, that halves the cost of a 30-second clip, and the saving grows as clips get shorter. For any AI API, the choice of billing unit (per second, per minute, per request, per token) is a strategic decision that shapes which use cases become economically viable on your platform.
Composable feature pricing beats tier gating for developer adoption. By making Audio Intelligence add-ons opt-in per request rather than bundled into fixed tiers, AssemblyAI drives organic feature adoption as developer products mature. Developers discover features at the point of need, not at the point of plan selection — which is an earlier and lower-intent moment in the product lifecycle.
Model accuracy improvements at flat pricing create powerful retention without visible lock-in. Releasing Universal-2 and Universal-3.5 Pro at the same per-second rate as Universal-1 builds technical switching cost (tuned prompts, output parsing, latency expectations) without the financial switching cost that prompts re-evaluation. This is a sustainable retention mechanic that subscription-based tools with static feature sets cannot easily replicate.
A free playground with no payment commitment is the highest-leverage PLG investment for API products. AssemblyAI’s no-card-required playground gives every curious developer a zero-friction path to experience the product quality. In a category with strong alternatives (Google STT, AWS Transcribe, Deepgram), developer experience before purchase is often the decisive factor.
Adding LLM reasoning as a billing layer elevated the pricing conversation from commodity to platform. LeMUR, since productized as the LLM Gateway, turned AssemblyAI from a transcription API (priced by audio duration) into an audio intelligence platform (priced for the value of insights derived from audio). This layered value architecture (raw processing + structured analysis + LLM reasoning) is a model for how audio AI, and AI APIs broadly, will expand pricing power as capabilities mature.
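
The first takeaway's arithmetic is easy to check. The sketch below compares exact per-second billing with a hypothetical competitor that charges the same hourly rate but rounds every clip up to the next whole minute. The $0.15/hr default is the Universal-2 rate quoted in this article; the rounding competitor and the function names are assumptions for illustration.

```python
import math


def per_second_cost(duration_s, rate_per_hr):
    """Bill the exact duration: seconds times the per-second share of the hourly rate."""
    return duration_s * rate_per_hr / 3600


def per_minute_rounded_cost(duration_s, rate_per_hr):
    """Bill whole minutes, rounding any partial minute up."""
    return math.ceil(duration_s / 60) * rate_per_hr / 60


def rounding_markup(duration_s, rate_per_hr=0.15):
    """How many times larger the rounded bill is than the exact bill."""
    exact = per_second_cost(duration_s, rate_per_hr)
    return per_minute_rounded_cost(duration_s, rate_per_hr) / exact


if __name__ == "__main__":
    for seconds in (10, 30, 59, 90):
        print(seconds, round(rounding_markup(seconds), 2))
```

Under these assumptions a 30-second clip costs twice as much with minute rounding and a 10-second clip six times as much. The gap narrows as a clip approaches a whole number of minutes, so the advantage is concentrated in short-clip workloads.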

UBP implications

Multi-dimensional usage billing (transcription + features + tokens) is the emerging standard for audio AI. AssemblyAI’s three-layer billing model combines base transcription metered per second, per-feature Audio Intelligence add-ons, and per-token LLM Gateway usage (the successor to LeMUR), which puts it among the most granular usage billing models in the speech API category. As audio AI capabilities compound (transcription → analysis → reasoning), each new capability layer will carry its own billing dimension. Teams building usage aggregation systems for audio AI products need to account for multi-dimensional metering from the start, not just a single audio-duration meter.
Transparent public pricing accelerates developer self-qualification and shortens sales cycles. AssemblyAI’s July 2026 restructure — publishing every rate across six product tabs, from the Guardrails family to per-model LLM Gateway tokens — is a deliberate bet that full public pricing transparency reduces friction faster than a concierge sales motion. The UBP lesson: once your rate card is complex, the constraint shifts from disclosure to modelability — publishing the numbers is necessary but a self-serve cost estimator is what actually lets buyers self-qualify and convert without a sales call, at lower CAC.
Accuracy improvements at flat pricing are a usage-based growth strategy in disguise. When AssemblyAI improves accuracy without raising what customers already pay (async rates stayed at $0.15/hr for Universal-2 and $0.21/hr for Universal-3.5 Pro through the July 2026 restructure), existing customers don’t churn; they produce better outputs at the same cost, which makes their products better, which drives more audio volume through AssemblyAI’s infrastructure. Higher accuracy → better customer products → more usage volume → more revenue at unchanged per-unit price. This usage-led expansion mechanic is a UBP growth pattern that pure subscription models cannot replicate.
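
The three billing dimensions described above can be sketched as a small rating function. This is an illustration, not AssemblyAI's billing implementation: the base and add-on rates are the per-hour figures quoted in this article, while the token rates, the `UsageEvent` shape, and the function names are hypothetical.

```python
from dataclasses import dataclass

# Per-hour rates quoted in this article (USD).
BASE_PER_HR = {"universal-2": 0.15, "universal-3.5-pro": 0.21}
ADDON_PER_HR = {
    "profanity_filtering": 0.01,
    "content_moderation": 0.15,
    "medical_mode": 0.15,
    "keyterms_prompting": 0.05,
}
# Hypothetical token rates (USD per million tokens), NOT published prices.
TOKEN_PER_MILLION = {"input": 3.00, "output": 15.00}


@dataclass
class UsageEvent:
    """One metered request: audio duration, opted-in add-ons, and LLM tokens."""
    model: str
    audio_seconds: float
    addons: tuple = ()
    input_tokens: int = 0
    output_tokens: int = 0


def price_event(event):
    """Return one line item per billing dimension for a single event."""
    hours = event.audio_seconds / 3600
    lines = {"transcription": hours * BASE_PER_HR[event.model]}
    for addon in event.addons:
        # Each enabled add-on stacks its own per-hour fee on the same audio.
        lines["addon:" + addon] = hours * ADDON_PER_HR[addon]
    lines["tokens"] = (
        event.input_tokens * TOKEN_PER_MILLION["input"]
        + event.output_tokens * TOKEN_PER_MILLION["output"]
    ) / 1_000_000
    return lines


def invoice(events):
    """Aggregate line items across events, keyed by billing dimension."""
    totals = {}
    for event in events:
        for key, amount in price_event(event).items():
            totals[key] = totals.get(key, 0.0) + amount
    return totals
```

Keeping each dimension as its own line item, rather than collapsing to one total at ingestion time, is what lets a cost estimator or an invoice show buyers which layer is driving spend.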

Sources

AssemblyAI pricing page (accessed 2026-07-06)
AssemblyAI machine-readable pricing (pricing.md) (accessed 2026-07-06)
AssemblyAI official documentation (accessed 2026-05-29)
AssemblyAI Python SDK — GitHub repository (accessed 2026-05-29)
AssemblyAI Node.js SDK — GitHub repository (accessed 2026-05-29)
SourceForge — AssemblyAI product listing with per-second pricing (accessed 2026-05-29)
AssemblyAI company overview and Fortune 500 customer claims (accessed 2026-05-29)

Bottom line

AssemblyAI has built the most developer-friendly speech AI pricing structure in the market: pay-per-second with no rounding, composable Audio Intelligence add-ons, and a free playground that requires no payment commitment. Its model release cadence — Universal-1, Universal-2, Universal-3.5 Pro — at flat rates is a quiet retention machine that builds technical switching cost without triggering financial re-evaluation. The July 2026 restructure into six priced product tabs closed the old transparency gaps — every add-on, Guardrails, streaming, and per-model token rate is now public — and pushed the company up-stack with a bundled Voice Agent API at $4.50/hr that sells the finished outcome rather than the raw transcript. The remaining gaps — no spend caps and no first-party estimator for what is now a genuinely multi-line-item bill — are fixable and do not undermine the core economic model. With $143M raised, Fortune 500 enterprise penetration, and the LLM Gateway as a differentiated platform layer, AssemblyAI is the clear default choice for developers building audio intelligence into production products.

Compare AssemblyAI with other AI infrastructure providers in the full pricing blueprint.

Pricing timeline

Each milestone below corresponds to a public pricing change, product launch, or material adjustment.

Voice Agent API launched; pricing page restructured into six product tabs

Jul 2026

Pricing page restructured into six product tabs (Pre-recorded STT, Realtime STT, Voice Agent, Speech Understanding, Guardrails, LLM Gateway). New Voice Agent API listed at $4.50/hr ($0.075/min). Async unchanged ($0.15 Universal-2, $0.21 Universal-3.5 Pro). Realtime lineup swapped Whisper-Streaming for Universal-Streaming Multilingual ($0.15/hr) alongside Universal-3.5 Pro Realtime ($0.45/hr). Guardrails broken out as a priced family (Profanity Filtering $0.01/hr → Content Moderation $0.15/hr); Medical Mode ($0.15/hr) and Keyterms/Prompting ($0.05/hr) added as pre-recorded add-ons.

Pricing snapshot (May 2026)

May 2026

Pricing as of May 2026: async transcription at $0.15/hr (Universal-2) and $0.21/hr (Universal-3.5 Pro); streaming at $0.15/hr (Universal-Streaming), $0.30/hr (Whisper-Streaming), and $0.45/hr (Universal-3.5 Pro Streaming). Speech Understanding add-ons billed per hour; LLM Gateway billed per million tokens. Free tier of up to 185 pre-recorded hours.

Universal-3.5 Pro and Streaming Models Released

Jan 2025

Universal-3.5 Pro released as the most accurate async model; Universal-3.5 Pro Streaming joined Universal-Streaming and Whisper-Streaming for real-time use. Pricing moved to a per-hour structure across models.

Series C ($50M) — Universal-2 Released

Jan 2024

Series C funding ($50M) led by Accel, bringing total funding to ~$143M. Universal-2 released, achieving further accuracy gains. Company reports processing audio for hundreds of enterprise customers including dozens of Fortune 500 companies.

Universal-1 Model Released

Nov 2023

Universal-1 launched as AssemblyAI's flagship model for highest-accuracy English transcription, with accuracy benchmarks beating Whisper large-v3 and Google STT v2.

LeMUR Beta — LLM-over-Audio Layer

Jun 2023

LeMUR launched in beta — the first LLM-over-audio layer in the speech API category. LeMUR adds token-based billing on top of transcription costs.

Series B ($72M) — Audio Intelligence Add-ons

Oct 2022

Series B funding ($72M) led by Insight Partners. Usage-based API pricing at $0.00025/second confirmed for standard transcription. Audio Intelligence add-ons (sentiment, entities, IAB topics) launched as incremental per-second fees.

Series A ($28M) — Public API Launch

Jun 2021

Series A funding ($28M) from Insight Partners. Public API access expanded; pay-per-second billing model launched publicly.

AssemblyAI Founded

Jan 2017

AssemblyAI founded in San Francisco as a speech-to-text API startup. Early pricing was invite-only for select beta customers.

Trivia

· AssemblyAI bills transcription by the hour of audio processed — Universal-2 at $0.15/hr and the more accurate Universal-3.5 Pro at $0.21/hr — with no minimum commitment, upfront fee, or contract on the pay-as-you-go plan.
· AssemblyAI's LLM Gateway lets developers call frontier models (OpenAI, Anthropic, Google) directly against a transcript, billed per million input and output tokens — the evolution of what AssemblyAI first shipped as LeMUR, its 'LLM-over-audio' layer.
· AssemblyAI raised $50M in its Series C in January 2024, bringing total funding to approximately $143M. The round was led by Accel, with participation from Insight Partners, and came just two months after Universal-1 launched as the company's flagship accuracy benchmark.
