Batch and cache discounts as a standard playbook
Inference vendors have converged on two standard discounts — roughly 50% off for latency-tolerant batch jobs and a reduced rate for cached input. If part of your workload is async or shares a stable prompt prefix, these are the highest-leverage cuts on a token bill.
What's happening — and why
What's happening: token APIs now routinely offer two big discounts — roughly half price for 'batch' jobs you're willing to wait on, and a reduced rate for cached (repeated) input such as a fixed system prompt or RAG context.
Why: latency-tolerant and repetitive work is cheaper for the vendor to serve — it can be scheduled onto idle capacity or skip recomputation. Pricing it lower sorts that load onto cheaper infrastructure and rewards buyers for flexibility, all without touching the headline real-time rate.
How it works
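The idle-capacity argument above can be illustrated with a toy simulation. This is a minimal sketch, not any vendor's scheduler: the function name `simulate`, the capacity figure, and the demand numbers are all hypothetical. It assumes real-time requests are always served first and batch work only consumes whatever capacity is left in each tick.

```python
from collections import deque


def simulate(capacity, realtime_demand, batch_jobs):
    """Serve real-time demand first each tick, then fill idle capacity with batch work.

    capacity:        units of work the cluster can do per tick (hypothetical)
    realtime_demand: real-time work arriving in each tick
    batch_jobs:      sizes of batch jobs already queued at tick 0
    Returns (batch units served per tick, batch units still waiting).
    """
    backlog = deque(batch_jobs)
    batch_served = []
    for demand in realtime_demand:
        idle = max(capacity - demand, 0)  # capacity real-time traffic did not use
        served = 0
        while backlog and idle > 0:
            take = min(backlog[0], idle)
            backlog[0] -= take
            idle -= take
            served += take
            if backlog[0] == 0:
                backlog.popleft()  # job finished, move to the next one
        batch_served.append(served)
    return batch_served, sum(backlog)


per_tick, waiting = simulate(capacity=10, realtime_demand=[9, 4, 10, 2], batch_jobs=[5, 6])
print(per_tick, waiting)  # [1, 6, 0, 4] 0
```

In this toy run the batch jobs finish without adding any capacity; they simply wait through the busy ticks, which is the flexibility the discount pays for.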
Evidence over time
Chart: 6 supporting and 2 counter data points, plotted by date.
Evidence
| Company | Date | What happened |
|---|---|---|
| Fireworks AI | Mar 2025 | Batch API at a flat 50% discount across all models. |
| Groq | May 2025 | Batch API plus cached-input discounts launched together. |
| Anthropic | Aug 2024 | Prompt caching cut input cost by up to 80%; batch API also offered at 50%. |
| Baseten | Feb 2026 | Cached-input pricing added to multi-tenant Model APIs. |
| Fireworks AI | Nov 2024 | Cached-input discount shipped alongside Turbo / Priority latency tiers. |
| Mistral AI | May 2026 | Batch processing earns a 50% discount on per-token rates. |
Counterexamples
- Suno · May 2026 — Consumer credit tiers — no batch or cache discount.
- Midjourney · Feb 2025 — Uses fast vs relax compute modes instead of cache/batch — latency tiering by queue, not caching.
For buyers
If a meaningful share of your workload is asynchronous, batch roughly halves spend with no model change; if your prompts share a long stable prefix (system prompts, RAG context), caching compounds the saving. These are the highest-leverage, lowest-effort moves on a token bill.
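To see how the two discounts compound, here is a rough cost estimator. It is a sketch under stated assumptions: the `Workload` class, the `monthly_cost` function, and the $3 / $15 per-million-token prices are hypothetical; the 50% batch and 80% cache figures are the ones quoted in this article; and the discounts are assumed to stack multiplicatively, which you should confirm against your vendor's terms. Any cache-write surcharge is ignored.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    input_tokens: int     # input tokens per month
    output_tokens: int    # output tokens per month
    cached_share: float   # fraction of input tokens read from cache (0..1)
    batch_share: float    # fraction of traffic that can run as batch (0..1)


def monthly_cost(w, input_price, output_price, batch_discount=0.5, cache_discount=0.8):
    """Estimate monthly spend in dollars; prices are dollars per million tokens.

    Assumption: the cache discount applies to input tokens only, the batch
    discount applies to the whole bill, and the two stack multiplicatively.
    """
    input_rate = input_price * (1 - cache_discount * w.cached_share)
    base = (w.input_tokens * input_rate + w.output_tokens * output_price) / 1_000_000
    return base * (1 - batch_discount * w.batch_share)


# Hypothetical workload: 100M input and 20M output tokens per month.
baseline = Workload(100_000_000, 20_000_000, cached_share=0.0, batch_share=0.0)
tuned = Workload(100_000_000, 20_000_000, cached_share=0.5, batch_share=0.4)

print(f"{monthly_cost(baseline, 3.0, 15.0):.2f}")  # 600.00
print(f"{monthly_cost(tuned, 3.0, 15.0):.2f}")     # 384.00
```

With these illustrative numbers, caching half the input and batching 40% of traffic cuts the bill by about a third with no model change.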
For vendors
The playbook needs a batch queue with a relaxed SLA and a prompt-cache keyed on prefix hashes, each priced as its own line. The discounts are a segmentation tool — they sort latency-tolerant load onto cheaper infra without dropping your headline rate.
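A prompt cache keyed on prefix hashes might look roughly like the sketch below. It is illustrative only and does not describe any named vendor's implementation: `PrefixCache`, `prefix_keys`, `input_cost`, and the block size of 4 are hypothetical, and whitespace-split words stand in for real tokenizer output. It omits the cached model state, TTLs, and eviction that a real cache would need.

```python
import hashlib

BLOCK_TOKENS = 4  # hypothetical block size, kept tiny so the example is easy to trace


def prefix_keys(tokens, block=BLOCK_TOKENS):
    """Yield (prefix_length, key) for every block-aligned prefix of `tokens`.

    Each key hashes all tokens up to that point, so equal keys mean equal
    prefixes (barring hash collisions).
    """
    h = hashlib.sha256()
    for end in range(block, len(tokens) + 1, block):
        for tok in tokens[end - block:end]:
            h.update(tok.encode("utf-8") + b"\x00")  # separator avoids ambiguous joins
        yield end, h.hexdigest()


class PrefixCache:
    def __init__(self):
        self._keys = set()

    def lookup_and_store(self, tokens):
        """Return how many leading tokens were already cached, then cache this prompt."""
        cached = 0
        for end, key in prefix_keys(tokens):
            if key in self._keys:
                cached = end
            else:
                self._keys.add(key)
        return cached


def input_cost(total_tokens, cached_tokens, price, cached_price):
    """Bill cached and uncached input tokens as separate lines (prices per token)."""
    return cached_tokens * cached_price + (total_tokens - cached_tokens) * price


cache = PrefixCache()
system = "you are a helpful assistant answer briefly please".split()  # 8 tokens
first = system + "what is batch pricing".split()
second = system + "what is a cache hit".split()

print(cache.lookup_and_store(first))   # 0
print(cache.lookup_and_store(second))  # 8
```

The second request shares only the 8-token system prompt with the first, so those 8 tokens would be billed at the cached rate and the rest at the normal input rate.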
Outlook — what to watch
Expect these discounts to become table stakes and to deepen: longer cache TTLs, automatic prompt-prefix caching, and tiered batch SLAs (1-hour vs 24-hour). The next frontier is priority/express pricing in the other direction, a premium for guaranteed low latency, which would turn latency into a full price axis.
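Treating latency as a price axis can be pictured as a tier table that a buyer queries by deadline. The sketch below is purely illustrative: the `Tier` class, the `cheapest_tier` function, the tier names, and every multiplier except the 0.5 for 24-hour batch (which mirrors the ~50% figure in this article) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_wait_hours: float      # worst-case completion time the tier allows
    price_multiplier: float    # relative to the standard real-time rate
    guaranteed_latency: bool = False


# Hypothetical tiers; only the 0.5 multiplier reflects a figure from the text.
TIERS = [
    Tier("priority", 0.0, 1.5, guaranteed_latency=True),
    Tier("standard", 0.0, 1.0),
    Tier("batch-1h", 1.0, 0.7),
    Tier("batch-24h", 24.0, 0.5),
]


def cheapest_tier(deadline_hours, need_guarantee=False, tiers=TIERS):
    """Pick the cheapest tier that finishes within the deadline."""
    eligible = [
        t for t in tiers
        if t.max_wait_hours <= deadline_hours
        and (t.guaranteed_latency or not need_guarantee)
    ]
    if not eligible:
        raise ValueError("no tier satisfies the requirements")
    return min(eligible, key=lambda t: t.price_multiplier)


print(cheapest_tier(48).name)                       # batch-24h
print(cheapest_tier(2).name)                        # batch-1h
print(cheapest_tier(0.5, need_guarantee=True).name) # priority
```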
Bottom line
Inference vendors have converged on ~50%-off batch and cached-input discounts as a de-facto standard. Anthropic, Mistral, Fireworks and Groq all land near the same numbers.
FAQ
How can I cut my LLM API bill without changing models?
Use batch processing for anything asynchronous (≈50% off) and prompt caching for repeated context like system prompts or RAG (a further large cut on input). Both are vendor-native and need no model change.
What is prompt caching?
A discount on input tokens that repeat across requests: the vendor caches a stable prefix (e.g. your system prompt) and charges a fraction of the normal rate to reuse it. Anthropic's prompt caching cut input cost by up to 80%.
Which vendors offer batch and cache discounts?
These discounts are now near-standard for token APIs: Anthropic, Mistral, Fireworks, Groq and Baseten all offer batch (~50%) and/or cached-input pricing. Consumer credit apps generally don't.