Sharpens 9 companies · First observed August 2024 · Updated July 2026 Explore in the graph

Batch and cache discounts as a standard playbook

Quick answer

Inference vendors have converged on two standard discounts — roughly 50% off for latency-tolerant batch jobs and a reduced rate for cached input. If part of your workload is async or shares a stable prompt prefix, these are the highest-leverage cuts on a token bill.

~50% off for batch / async workloads

What's happening — and why

What's happening: token APIs now routinely offer two big discounts — roughly half price for 'batch' jobs you're willing to wait on, and a reduced rate for cached (repeated) input such as a fixed system prompt or RAG context.

Why: latency-tolerant and repetitive work is cheaper for the vendor to serve — it can be scheduled onto idle capacity or skip recomputation. Pricing it lower sorts that load onto cheaper infrastructure and rewards buyers for flexibility, all without touching the headline real-time rate.

How it works

Price falls as you trade latency: full rate → cached input → ~50%-off batch.

Evidence over time

11 supporting · 3 counter — hover or tap a point for detail, click to jump to the row.

supporting evidence counterexample

Evidence

Company	Date	What happened
OpenAI	May 2026	GPT-5.x line ships with Batch API (50% off) and prompt caching on the reduced cached-input rate — the standard pair on the flagship API.
Google	May 2025	Gemini 2.5 added Priority (1.8×) and Flex/Batch (0.5×) as named SKUs — explicit latency-tier price discrimination plus the 50% batch rate.
Fireworks AI	Mar 2025	Batch API at a flat 50% discount across all models.
Groq	May 2025	Batch API plus cached-input discounts launched together.
Anthropic	Aug 2024	Prompt caching cut input cost by up to 80%; batch API also offered at 50%.
Baseten	Feb 2026	Cached-input pricing added to multi-tenant Model APIs.
Fireworks AI	Nov 2024	Cached-input discount shipped alongside Turbo / Priority latency tiers.
Mistral AI	May 2026	Batch processing earns a 50% discount on per-token rates.
OpenRouter	Jun 2025	Multi-model routing marketplace that passes through upstream batch/cache discounts — extending the standard playbook to routing infrastructure without re-inventing the mechanics.
Anthropic	Jun 2026	Fast mode (research preview) prices the same model higher for lower latency — Opus 4.8 at $10/$50 vs standard $5/$25 (2× premium), Opus 4.6/4.7 at $30/$150 — the surcharge-for-speed half of the latency axis, alongside its existing 50% Batch and prompt-caching discounts. Not available on AWS or Batch.
DeepInfra	Jun 2026	Added a per-request Service Tier: Standard (1× base, default) or Priority (1.5× base, scheduled ahead of standard traffic for faster TTFT), set via service_tier: "priority". First self-serve per-request speed surcharge on a pure-usage per-token platform — the surcharge half of latency tiering reaching the multi-tenant inference layer.

Counterexamples

Suno · May 2026 — Consumer credit tiers — no batch or cache discount.
Midjourney · Feb 2025 — Uses fast vs relax compute modes instead of cache/batch — latency tiering by queue, not caching.
Lambda Labs · Jun 2026 — GPU cloud with no batch API or caching discount — the playbook is specific to token/inference APIs, not raw compute rental.

Trivia

Anthropic's August 2024 prompt-caching launch — cutting repeated-input costs by up to 80% — was the largest single-day effective price reduction in the corpus that was not accompanied by a new model release. It created a new cost category (cached vs uncached input) without changing the headline per-token rate, meaning vendors could simultaneously advertise stable pricing while offering dramatically lower effective costs to buyers with stable system prompts.
The ~50% batch discount has converged across at least 13 corpus vendors (Anthropic, OpenAI, Google, Mistral, Fireworks, Groq, Together, Replicate, and others) to the same approximate number — a rare case of industry-wide price coordination on a structural discount. The consistency suggests the discount reflects a genuine cost difference (asynchronous batching reduces per-token GPU utilisation variance) rather than arbitrary marketing.
Google's addition of a 1.8× Priority premium on Gemini 2.5 (May 2025) is the corpus's first example of latency tiering working in both directions simultaneously: a discount for tolerance (Flex/Batch at 0.5×) and a premium for impatience (Priority at 1.8×) on the same model. The bidirectional approach makes latency an explicit price axis rather than an implicit quality attribute — a structural move that no other corpus vendor has fully replicated.
Anthropic's June 2026 Fast mode inverts the usual generational price order: the newer Opus 4.8 costs $10/$50 in Fast mode while the older Opus 4.6/4.7 cost $30/$150 — 3× more — to run fast. Premium-speed pricing tracks inference efficiency, not recency, so the prior-generation flagship becomes the *expensive* one the moment latency is the SKU. It is the corpus's second bidirectional latency example after Google, and the first where the speed surcharge is steeper on the model you'd expect to be cheaper.
DeepInfra's June 2026 Priority tier (1.5× base) lands between Google's 1.8× and Anthropic's 2× speed premiums — the three bidirectional-latency vendors have independently converged on a 1.5×–2× surcharge band for priority scheduling, a tighter cluster than the discount side, where batch settled on a single ~50% number. The speed surcharge is the only price lever in the corpus that a buyer sets per individual request (service_tier: "priority") rather than per plan or per model.

See all pricing trivia

For buyers

If a meaningful share of your workload is asynchronous, batch roughly halves spend with no model change; if your prompts share a long stable prefix (system prompts, RAG context), caching compounds the saving. These are the highest-leverage, lowest-effort moves on a token bill.

For vendors

The playbook needs a batch queue with a relaxed SLA and a prompt-cache keyed on prefix hashes, each priced as its own line. The discounts are a segmentation tool — they sort latency-tolerant load onto cheaper infra without dropping your headline rate.

Outlook — what to watch

Expect these to become table stakes and to deepen: longer cache TTLs, automatic prompt-prefix caching, and tiered batch SLAs (1-hour vs 24-hour). The next frontier is priority/express pricing in the other direction — paying a premium for guaranteed low latency — turning latency into a full price axis.

Bottom line

Inference vendors have converged on ~50%-off batch and cached-input discounts as a de-facto standard. Anthropic, Mistral, Fireworks and Groq all land near the same numbers.

FAQ

How can I cut my LLM API bill without changing models?

Use batch processing for anything asynchronous (≈50% off) and prompt caching for repeated context like system prompts or RAG (a further large cut on input). Both are vendor-native and need no model change.

What is prompt caching?

A discount on input tokens that repeat across requests — the vendor caches a stable prefix (e.g. your system prompt) and charges a fraction of the normal rate to reuse it. Anthropic's cut input cost by up to 80%.

Which vendors offer batch and cache discounts?

It's now near-standard for token APIs — Anthropic, Mistral, Fireworks, Groq and Baseten all offer batch (~50%) and/or cached-input pricing. Consumer credit apps generally don't.

All trends