Small Models and Parallel AI Rewrite Billing
Parallel token generation and on-device SLMs are breaking token-based AI billing. Flat subscriptions cover local inference; usage overages handle cloud fallback.
When Cheaper Becomes Complicated
A product manager at a software company receives two technical proposals from her engineering team, both claiming to solve the same problem: AI costs are growing faster than revenue. The first proposal suggests deploying diffusion-based language models like Mercury Coder that generate tokens in parallel rather than sequentially, promising ten-times faster inference with proportional cost savings. The engineering team shows her benchmarks: Mercury generates code at over one thousand tokens per second on H100 GPUs versus roughly one hundred tokens per second from traditional models, ranks first for speed and second for quality on Copilot Arena, and maintains API compatibility with existing integrations. See the AI token pricing tracker to compare current per-token rates across models as efficiency gains translate into price cuts. The second proposal recommends migrating to small language models running directly on users’ devices, eliminating cloud costs entirely for eighty to ninety percent of queries. Both proposals are technically sound. Both would dramatically reduce costs. But they create completely different billing challenges.
The parallel generation models would require tracking generation steps and denoising iterations alongside tokens, creating multi-dimensional metering complexity that current billing systems aren’t designed to handle. Mercury has shown one path forward by keeping token-based pricing despite its different architecture, but adopting that approach requires understanding how coarse-to-fine parallel generation translates into costs. The small models would shift the business from usage-based cloud billing to seat-based licensing, since inference happens locally on devices customers control. Neither approach fits the token-based pricing model the company spent two years building and that customers finally understand.
This scenario is playing out across the AI industry as two distinct efficiency revolutions collide with established monetization frameworks. On one side, diffusion-based language models and parallel token generation techniques are changing how inference works, creating speedups that make the sequential token generation of traditional models look wasteful. On the other side, small language models are proving that most AI workloads don’t need the massive capabilities of frontier models and can run locally on everyday devices at a fraction of the cost. Both trends promise dramatic efficiency gains. Both will reshape the economics of AI-powered products. Both require rethinking billing infrastructure in ways that could unlock new business models or create chaos, depending on how companies handle the transition.
This article examines both technologies: what is actually happening with each, why they matter for AI pricing, what they mean for billing infrastructure, and how companies should prepare for a world where efficient AI looks completely different from the AI that today’s billing systems were built for.
Understanding Parallel Token Generation
The billing implications of parallel token generation depend on understanding what it actually is and why it represents such a departure from how language models have worked. This technical distinction matters because it changes what you’re billing for when customers use these models.
Traditional language models, from GPT-3 through the current generation, use autoregressive generation. These models produce output one token at a time in strict left-to-right sequence. The model predicts the first token, then uses that token to help predict the second, then uses both to predict the third, and so on. This sequential dependency limits how fast models can produce output. For a response that’s a thousand tokens long, you need to make a thousand sequential predictions. Generating a thousand-token response with GPT-4 might take ten to fifteen seconds — acceptable for many use cases but too slow for interactive applications where users expect near-instant responses.
Diffusion-based language models take a completely different approach inspired by diffusion models that transformed image generation. Instead of generating tokens one by one from left to right, these models start with a sequence where all positions are masked or filled with noise, then iteratively refine multiple tokens in parallel through a denoising process. In each iteration, the model unmasks or refines several tokens simultaneously based on the context provided by both the prompt and the partially unmasked sequence. Over multiple iterations, the complete output emerges.
The most visible production deployment of this technology is Mercury from Inception Labs. Mercury Coder Mini generates code at 1,109 tokens per second on NVIDIA H100 GPUs, while Mercury Coder Small achieves 737 tokens per second — approximately ten times faster than the fastest frontier autoregressive models optimized for speed. On Copilot Arena, Mercury ranks second for quality behind only Claude Sonnet 4 while ranking first for speed. Mercury’s approach, which they call coarse-to-fine parallel generation, works by first generating a rough draft of the entire output in parallel, then iteratively refining that draft through multiple passes. The company secured $50 million in Series A funding from Menlo Ventures in late 2025. Importantly, Mercury offers API-compatible drop-in replacements for OpenAI’s Codex models, meaning developers can switch without changing their integration code.
Beyond Mercury, research shows even more dramatic potential speedups. Adaptive Parallel Decoding achieved up to twenty-two-times speedup on benchmark tasks compared to autoregressive generation, and combined with optimizations like KV caching, the speedup reached fifty-seven-times.
The speedup creates billing complexity. Diffusion models don’t simply generate tokens faster — they generate tokens through a different process. The number of denoising steps required varies significantly based on output complexity and quality targets. A simple response might converge in five to ten steps, while a complex code generation task might require twenty to thirty steps. The relationship between computational cost and output tokens becomes non-linear and unpredictable. In autoregressive models, generating a thousand tokens requires roughly a thousand sequential prediction steps. In diffusion models, generating a thousand tokens might require anywhere from five to thirty denoising iterations. A response twice as long doesn’t necessarily cost twice as much to generate if both can be denoised in roughly the same number of iterations.
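A toy calculation makes the non-linearity concrete. The sketch below assumes, purely for illustration, that an autoregressive model needs one sequential forward pass per output token, while a diffusion model needs one pass per denoising iteration regardless of output length. The function names and the step counts are hypothetical, and real per-pass compute differs between the two architectures, so this compares sequential steps only, not dollars.

```python
def autoregressive_passes(output_tokens: int) -> int:
    """Sequential forward passes for left-to-right decoding: one per token."""
    return output_tokens


def diffusion_passes(denoising_iterations: int) -> int:
    """Sequential forward passes for parallel decoding: one per denoising
    iteration, each refining many token positions at once."""
    return denoising_iterations


def passes_per_token(passes: int, output_tokens: int) -> float:
    """Sequential steps amortized over the length of the output."""
    return passes / output_tokens


# Two responses, one twice as long as the other, both assumed to converge
# in 20 denoising iterations (an illustrative figure, not a measurement).
for tokens in (500, 1000):
    ar = autoregressive_passes(tokens)
    diff = diffusion_passes(20)
    print(tokens, ar, diff, passes_per_token(diff, tokens))
```

Under these assumptions the autoregressive step count doubles with output length while the diffusion step count stays flat, which is why a per-token price no longer tracks the vendor's underlying cost.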
Mercury sidestepped this billing challenge by keeping both an OpenAI-compatible API and the familiar per-token pricing structure, charging customers based on output tokens despite its different generation process. This prioritizes customer understanding and ease of adoption over accurate cost attribution. By keeping the billing unit as tokens, Mercury positions itself as a faster, cheaper drop-in replacement without requiring customers to learn new billing concepts. The speedup advantage means it can charge lower per-token rates while maintaining healthy margins, because its infrastructure costs per token are dramatically lower.
Diffusion models are still relatively immature compared to autoregressive models. The optimization techniques that make autoregressive models efficient, such as speculative decoding, prefix caching, and chunked prefill, don’t have direct equivalents yet for diffusion models. This means the cost structure is less predictable and more variable across different implementations. Companies deploying diffusion models are still learning what costs look like in production, making it premature to establish stable pricing beyond what early movers like Mercury have demonstrated.
If diffusion models become mainstream beyond latency-sensitive niches like coding assistance, billing infrastructure will need to support multi-dimensional usage metering that tracks not just output tokens but also generation complexity, iteration counts, and decoding strategies. The shift from linear token-based billing to multi-dimensional parallel generation billing represents a complexity increase comparable to the shift from seat-based to usage-based billing that SaaS went through over the past decade.
The Small Model Revolution: Edge Intelligence Changes Everything
The rise of small language models running directly on users’ devices represents a change in where AI computation happens — and therefore how it can be monetized.
“Small” in this context means relative to massive frontier models. While frontier models such as GPT-5 are widely believed to have on the order of a trillion parameters or more, small language models typically range from under a billion to around fourteen billion parameters. Models like Phi-4 with fourteen billion parameters, Llama 3.2 with one to three billion parameters, or Mistral 7B represent this category. These models are between ten and one hundred times smaller than frontier models, and they’re small enough to run on consumer hardware. A three billion parameter model can run on a modern smartphone. A seven billion parameter model runs comfortably on a laptop.
The economics are stark. Industry data shows that serving a seven billion parameter SLM costs ten to thirty times less than running a seventy to one hundred seventy-five billion parameter LLM for comparable workloads. Some reported deployments achieved a 99.98% cost reduction by migrating from GPT-4 API usage at $4.2 million annually to self-hosted SLMs at under $1,000 annually. Use the OpenAI pricing calculator to model what your current API volumes would cost under self-hosted SLM pricing.
The most significant economic shift isn’t that small models are cheaper to run — it’s that they can run on-device, eliminating cloud costs entirely for queries that don’t require server-side processing. When an AI assistant runs directly on your phone or laptop, the marginal cost to the software vendor for each additional query is zero once the model is deployed.
The performance of small models has improved dramatically. On domain-specific tasks after fine-tuning, SLMs often match or exceed LLM accuracy. A seven billion parameter legal SLM fine-tuned on contracts achieves 94% accuracy versus GPT-5’s 87% on the same task according to production deployment data. A three billion parameter model fine-tuned on insurance claims processes 2,000 documents hourly at 96% accuracy versus GPT-5’s 500 per hour at twenty times the cost.
Three forces are driving SLM adoption. Cost pressure is most acute among enterprises processing millions of AI queries monthly, where cloud API costs have become material line items. Privacy and compliance requirements in regulated industries like healthcare, finance, and government favor local inference, because sensitive data never needs to leave the organization’s control. The latency advantage of local inference (milliseconds versus hundreds of milliseconds for cloud round-trips) enables use cases like real-time voice interaction or instant code completion.
Current adoption patterns show enterprises taking hybrid approaches. The emerging architecture runs small models on-device or in private cloud for routine, high-frequency tasks and routes complex or critical queries to large cloud models. This handles 90–95% of queries with cheap local models while reserving expensive cloud models for the 5–10% that genuinely need advanced capabilities. Real-world deployments from companies like Commonwealth Bank of Australia running over 2,000 AI models in production demonstrate that this isn’t experimental — it’s becoming enterprise standard.
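The routing decision in this hybrid architecture can be sketched in a few lines. The example assumes a hypothetical integer complexity score from 0 to 100 attached to each query and a fixed threshold. Both are illustrative stand-ins for whatever classifier or heuristic a real deployment would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    text: str
    complexity: int  # hypothetical score, 0 (trivial) to 100 (hardest)


def route(query: Query, threshold: int = 90) -> str:
    """Send routine queries to the local small model and the rest to the cloud."""
    return "cloud" if query.complexity > threshold else "local"


def local_share(queries: list[Query], threshold: int = 90) -> float:
    """Fraction of queries that never leave the device."""
    local = sum(1 for q in queries if route(q, threshold) == "local")
    return local / len(queries)


# Synthetic workload with complexity scores spread evenly from 0 to 99.
workload = [Query(text=f"query {i}", complexity=i) for i in range(100)]
print(local_share(workload))
```

The threshold is the lever that trades cloud spend against answer quality, so it is also the number a billing team would want surfaced when forecasting cloud fallback volume.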
The billing challenge is that traditional usage-based pricing doesn’t work when most usage happens locally where the vendor can’t meter it. If 90% of your customers’ AI queries run on their devices using models they downloaded once, you can’t charge based on query volume or token consumption. This forces a return to seat-based licensing, capacity-based pricing, or subscription tiers based on capabilities rather than consumption.
Some companies are experimenting with hybrid billing models that combine subscription fees for local model access with usage-based charges for cloud fallback queries. Customers pay a monthly subscription that includes downloading and running small models on their devices, plus separate charges when they exceed local model capabilities and need cloud inference.
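As a rough sketch of how such a hybrid invoice might be computed, the function below combines a per-seat subscription with a per-token charge for cloud fallback. The prices, the optional included-token allowance, and all names are assumptions made for illustration, not any vendor's actual terms.

```python
from decimal import Decimal


def hybrid_invoice(
    seats: int,
    seat_price: Decimal,
    cloud_tokens: int,
    price_per_1k_cloud_tokens: Decimal,
    included_cloud_tokens: int = 0,  # hypothetical allowance bundled with the plan
) -> dict[str, Decimal]:
    """Subscription covers local inference; only cloud fallback is metered."""
    subscription = seat_price * seats
    billable_tokens = max(0, cloud_tokens - included_cloud_tokens)
    cloud_usage = Decimal(billable_tokens) / 1000 * price_per_1k_cloud_tokens
    return {
        "subscription": subscription,
        "cloud_usage": cloud_usage,
        "total": subscription + cloud_usage,
    }


invoice = hybrid_invoice(
    seats=10,
    seat_price=Decimal("20"),
    cloud_tokens=250_000,
    price_per_1k_cloud_tokens=Decimal("0.50"),
    included_cloud_tokens=100_000,
)
print(invoice)
```

Local queries never appear in the calculation, which reflects the point above: the vendor cannot meter them, so the subscription has to carry their value.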
The SLM market is valued at $930 million in 2025 and projected to reach $5.45 billion by 2032, representing 28.7% compound annual growth. By 2027, over two billion smartphones are expected to run local SLMs. This is becoming the default deployment model for many categories of AI applications, forcing a rethinking of how AI gets monetized when computation increasingly happens on devices vendors don’t control.
The Billing Infrastructure Split: Matrix Versus Device
For parallel generation and diffusion models, the core billing question is what unit of measure makes sense when token generation is no longer sequential and when cost is driven by factors beyond simple token count. The most straightforward approach, which vendors like Mercury have adopted, is to continue billing based on output tokens while adjusting prices to reflect the lower cost of parallel generation. Customers already understand token-based pricing and have mental models of what a thousand tokens costs. Keeping the unit of measure as tokens and adjusting the price maintains this understanding while capturing the efficiency benefit.
The alternative is introducing billing dimensions that more accurately reflect the underlying cost drivers — something like: cost = output_tokens × base_rate + iterations × iteration_rate. This formula more accurately reflects that costs scale with both output length and number of denoising steps. But it creates significant complexity for customers who now need to understand and predict two variables instead of one.
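The two-variable formula can be written out directly. The rates below are invented for illustration. The point of the sketch is only that doubling the output length does not double the charge when the iteration count stays the same.

```python
from decimal import Decimal


def generation_cost(
    output_tokens: int,
    iterations: int,
    base_rate: Decimal,       # price per output token (hypothetical)
    iteration_rate: Decimal,  # price per denoising iteration (hypothetical)
) -> Decimal:
    """cost = output_tokens * base_rate + iterations * iteration_rate"""
    return output_tokens * base_rate + iterations * iteration_rate


BASE_RATE = Decimal("0.000002")
ITERATION_RATE = Decimal("0.0001")

short = generation_cost(1000, 20, BASE_RATE, ITERATION_RATE)
long = generation_cost(2000, 20, BASE_RATE, ITERATION_RATE)
print(short, long)
```

A customer trying to forecast spend under this scheme has to estimate both the output length and the iteration count, which is the added burden the paragraph above describes.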
A third approach discussed but not yet widely implemented is matrix-based metering where billing is based on total matrix operations required to generate the output, regardless of whether those operations happened sequentially or in parallel. This abstraction could work across both autoregressive and parallel generation models by charging for the fundamental unit of computation rather than for the output artifact. Matrix operations as a billing unit are even more abstract than tokens, though, potentially creating friction that outweighs the theoretical elegance.
For small language models running on-device, the billing challenge is different. You can’t meter usage in real-time because the computation happens on devices you don’t control. You could require models to phone home and report usage, but this creates privacy concerns, adds latency, won’t work offline, and can be circumvented by determined users.
This forces a shift toward billing models that don’t depend on usage metering. Seat-based licensing — charging per user or device that has access to the model, regardless of how much they use it — is making a comeback in the AI era for on-device models. Seat-based pricing for AI creates tiering opportunities based on model capabilities rather than usage volumes. A Basic tier might include access to one billion parameter models. A Professional tier includes seven billion parameter models. An Enterprise tier includes fourteen billion parameter models plus custom fine-tuning.
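Capability-based tiers like these are straightforward to express as data. In the sketch below the tier names and model-size limits come from the example above, while the per-seat prices and helper names are hypothetical placeholders.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Tier:
    name: str
    max_model_params_b: int   # largest on-device model allowed, in billions
    price_per_seat: Decimal   # hypothetical monthly price
    custom_fine_tuning: bool = False


TIERS = {
    "basic": Tier("Basic", 1, Decimal("10")),
    "professional": Tier("Professional", 7, Decimal("30")),
    "enterprise": Tier("Enterprise", 14, Decimal("80"), custom_fine_tuning=True),
}


def can_run(tier_key: str, model_params_b: int) -> bool:
    """Entitlement check: is this model size included in the tier?"""
    return model_params_b <= TIERS[tier_key].max_model_params_b


def monthly_charge(tier_key: str, seats: int) -> Decimal:
    """Seat-based charge, independent of how many queries each seat runs."""
    return TIERS[tier_key].price_per_seat * seats


print(can_run("basic", 7), monthly_charge("professional", 25))
```

Usage never enters the charge. The entitlement check replaces the meter as the enforcement point.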
Hybrid models are emerging that try to capture both subscription value and usage value. A base subscription provides guaranteed local model access, and usage-based charges kick in for cloud services that require server-side processing.
Across both technologies, a pattern emerges: the efficiency gains that make these approaches attractive also complicate the billing models that made previous generations of AI simple to monetize. Token-based pricing worked well when every AI query went through centralized APIs that could meter consumption precisely. When queries are processed through parallel generation with variable iteration counts, or when they run locally on devices where metering is impractical, token-based billing breaks down. The billing infrastructure of the future needs to support multiple pricing paradigms simultaneously, flexible enough to handle usage-based, capacity-based, seat-based, and outcome-based models all within the same platform.
Looking Forward: The Efficiency-First Future
Efficiency will become the primary competitive battleground in AI over the next two to three years. The competition around which company builds the largest or most capable model will give way to competition around which company delivers comparable capabilities at the lowest cost and latency. As model capabilities converge toward “good enough” for most tasks, efficiency differentiates more than raw capability.
This focus will drive continued investment in both parallel generation techniques and small model optimization. Hybrid approaches will emerge that combine elements of both — perhaps using small models as draft generators that are refined through parallel diffusion processes. Companies that crack the code on delivering high-quality outputs at one-tenth the cost of current approaches will capture enormous market share.
Most AI products will use tiered architectures where simple, frequent tasks run on small local models and complex, occasional tasks route to large cloud models. For billing infrastructure, hybrid deployment creates pressure for unified pricing that abstracts away the complexity. Customers don’t want separate line items for local model licenses and cloud API consumption. They want single, predictable pricing that covers all AI capabilities.
New pricing models will emerge specifically designed for edge AI and local inference: subscription tiers based on which model sizes can run on-device, capacity licensing where you pay for the right to run up to a certain amount of inference locally, and outcome-based pricing that’s agnostic to where inference happens, charging for completed tasks whether they were handled locally or in the cloud.
Mercury’s approach of maintaining token-based pricing for customer simplicity while adjusting prices to reflect lower costs demonstrates one viable path forward. Vendors might keep familiar billing units while adjusting prices to reflect efficiency gains rather than introducing new metering dimensions that confuse customers. Or they might introduce simple multipliers where parallel generation is explicitly priced lower per token than sequential generation, creating tiering within the same fundamental unit of measure.
We’re entering an era where pricing models need to be as flexible and dynamic as the underlying technology. The days of establishing a pricing model and leaving it unchanged for years are over in AI. As new efficiency techniques emerge, as deployment patterns shift, as cost structures change, pricing needs to evolve continuously. This requires billing infrastructure that treats pricing as configuration data that can be updated without engineering changes.
Synthesis: Building Billing for the Efficiency Era
Building billing infrastructure that supports both parallel generation and small language models requires five capabilities.
The first is abstraction layers that decouple customer-facing pricing from underlying cost structure. Whether you’re using sequential token generation, parallel diffusion, large cloud models, or small on-device models shouldn’t matter to customers’ understanding of what they’re paying for. They should see pricing expressed in terms meaningful to them: completed tasks, API calls, query volumes, active users. Behind the scenes, your billing system converts their consumption into appropriate charges based on the actual infrastructure used. This requires mapping tables that connect customer-facing metrics to backend cost drivers, and the conversion rates in those tables need to be updatable as your infrastructure evolves.
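One minimal way to picture this decoupling is a pair of lookup tables: one that prices the customer-facing metric and one that records what each backend costs the vendor. Every name and number below is a hypothetical placeholder. The sketch only shows that the customer's charge stays the same regardless of which backend served the work, while the vendor's margin does not.

```python
from decimal import Decimal

# Customer-facing price list, expressed in units the customer understands.
CUSTOMER_PRICE = {"completed_task": Decimal("0.05")}

# Internal cost per unit by backend; updated as infrastructure evolves.
BACKEND_COST = {
    "local_slm": Decimal("0"),
    "cloud_diffusion": Decimal("0.004"),
    "cloud_autoregressive": Decimal("0.02"),
}


def charge(metric: str, quantity: int) -> Decimal:
    """What the customer pays: depends only on the customer-facing metric."""
    return CUSTOMER_PRICE[metric] * quantity


def margin(metric: str, usage_by_backend: dict[str, int]) -> Decimal:
    """What the vendor keeps: depends on which backends did the work."""
    revenue = charge(metric, sum(usage_by_backend.values()))
    cost = sum(
        (BACKEND_COST[backend] * count for backend, count in usage_by_backend.items()),
        Decimal("0"),
    )
    return revenue - cost


usage = {"local_slm": 900, "cloud_autoregressive": 100}
print(charge("completed_task", 1000), margin("completed_task", usage))
```

Swapping a backend then means editing one entry in the cost table, with no change to what the customer sees on the invoice.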
The second is hybrid metering systems that can track both usage-based consumption for cloud services and subscription-based access for on-device models. The guide to understanding prepaid credit models covers the credit normalization patterns that bridge these two billing modes. Your billing platform needs subscription management capabilities for handling recurring charges, entitlements, upgrades, and downgrades, plus usage metering for capturing consumption events from APIs, all aggregated into unified invoices that make sense to customers.
The third is flexible pricing engines that support experimentation and rapid iteration. Your pricing strategy needs to evolve with the market, which requires infrastructure that makes pricing changes easy rather than requiring months of development work. This means treating pricing rules as data stored in databases or configuration files, not as code that requires compilation and deployment. The pricing engine should support A/B testing where you can offer different pricing to different customer cohorts.
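A small sketch of pricing rules held as data might look like the following. The JSON schema, experiment name, prices, and the hash-based cohort assignment are all assumptions chosen for illustration. A production system would more likely read this from a database or configuration service.

```python
import hashlib
import json
from decimal import Decimal

# Hypothetical pricing rules stored as data rather than code.
PRICING_JSON = """
{
  "unit": "per_1k_output_tokens",
  "experiment": {
    "name": "parallel-discount-test",
    "treatment_percent": 20,
    "prices": {"control": "0.60", "treatment": "0.45"}
  }
}
"""


def cohort_for(customer_id: str, treatment_percent: int) -> str:
    """Deterministically assign a customer to a cohort by hashing their ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "treatment" if bucket < treatment_percent else "control"


def price_for(customer_id: str, pricing: dict) -> Decimal:
    """Look up the price that applies to this customer's cohort."""
    experiment = pricing["experiment"]
    cohort = cohort_for(customer_id, experiment["treatment_percent"])
    return Decimal(experiment["prices"][cohort])


pricing = json.loads(PRICING_JSON)
print(price_for("customer-123", pricing))
```

Changing a price or the size of the test cohort is then an edit to the stored document, and hashing the customer ID keeps each customer in the same cohort across billing runs.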
The fourth is customer-facing analytics that show value delivered alongside costs incurred. If a customer consumed one million queries through small local models this month, show them that the same queries would have cost $5,000 through cloud APIs but their subscription costs only $500. This quantified savings builds appreciation for the efficiency value you’re delivering. The analytics should also guide customers toward more efficient usage patterns.
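The savings figure in that example is simple arithmetic, sketched below. The per-query cloud price of $0.005 is an assumption derived from the $5,000 and one million queries in the example, and the function and field names are hypothetical.

```python
from decimal import Decimal


def savings_summary(
    local_queries: int,
    cloud_price_per_query: Decimal,  # assumed cloud-equivalent rate
    subscription_cost: Decimal,
) -> dict[str, Decimal]:
    """Compare what local queries would have cost in the cloud to the subscription."""
    cloud_equivalent = cloud_price_per_query * local_queries
    saved = cloud_equivalent - subscription_cost
    percent = saved / cloud_equivalent * 100 if cloud_equivalent else Decimal("0")
    return {
        "cloud_equivalent": cloud_equivalent,
        "saved": saved,
        "percent_saved": percent,
    }


summary = savings_summary(
    local_queries=1_000_000,
    cloud_price_per_query=Decimal("0.005"),
    subscription_cost=Decimal("500"),
)
print(summary)
```

Showing the cloud-equivalent number next to the invoice total is what turns an internal efficiency gain into something the customer can see.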
The fifth is dedicated pricing operations capability that monitors competitive dynamics, analyzes usage patterns, runs pricing experiments, and makes data-driven recommendations about pricing adjustments. This team should work closely with finance to ensure pricing changes maintain target margins, with product to ensure pricing aligns with roadmap and positioning, and with sales to ensure pricing helps close deals.
The parallel generation and small model revolutions represent a shift in AI economics. When models can be ten to fifty times faster or ten to one hundred times cheaper to run, the competitive dynamics change. The billing infrastructure you build today needs to support not just current token-based pricing but also the hybrid, seat-based, outcome-based, and usage-based models that these efficiency technologies will require. Companies that invest in this infrastructure now, treating billing flexibility as a strategic capability, will be positioned to capture the value from efficiency innovations as they emerge.
About This Series
The Future Ahead is a series exploring where the AI industry is heading and how it will fundamentally transform billing workflows, billing infrastructure, and pricing models.
Read Previous Articles:
- Part 1: The AI Billing Infrastructure Crisis
- Part 2: The Outcome-Based Pricing Revolution
- Part 3: The Token Cost Deflation Paradox
- Part 4: The Agentic AI Pricing Challenge
- Part 5: Multimodal Monoliths vs. Orchestrated Specialists
- Part 6: Beyond Agentic AI - Autonomous Services
- Part 7: Reasoning vs Inference Models
- Part 8: The Infrastructure Fork
- Part 9: The Margin Crisis
- Part 10: The Specialization Dilemma