New · 5 companies · First observed August 2024 · Updated June 2026

Batch and cache discounts as a standard playbook

Quick answer

Inference vendors have converged on two standard discounts — roughly 50% off for latency-tolerant batch jobs and a reduced rate for cached input. If part of your workload is async or shares a stable prompt prefix, these are the highest-leverage cuts on a token bill.

~50% off for batch / async workloads

What's happening — and why

What's happening: token APIs now routinely offer two big discounts — roughly half price for 'batch' jobs you're willing to wait on, and a reduced rate for cached (repeated) input such as a fixed system prompt or RAG context.

Why: latency-tolerant and repetitive work is cheaper for the vendor to serve — it can be scheduled onto idle capacity or skip recomputation. Pricing it lower sorts that load onto cheaper infrastructure and rewards buyers for flexibility, all without touching the headline real-time rate.

How it works

Real-time: 100% (full rate) → cached input: up to 80% off → batch (async): 50% off
Price falls as you trade latency: full rate → cached input → ~50%-off batch.
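
To make the ladder concrete, here is a minimal sketch of the arithmetic in Python. The base rate and token counts are made-up illustration values, and the 80% cache discount is the upper bound quoted above rather than a typical figure. The sketch assumes the cache discount applies only to the cached part of the input. Whether cache and batch discounts stack varies by vendor and is not modeled.

```python
# Illustrative only: rates and discounts are assumptions, not any vendor's price list.
BASE_INPUT_RATE = 3.00   # hypothetical $ per million input tokens, real-time
CACHE_DISCOUNT = 0.80    # "up to 80%" off cached input (upper bound from the text)
BATCH_DISCOUNT = 0.50    # ~50% off for batch / async jobs


def input_cost(tokens: int, cached_tokens: int = 0, batch: bool = False) -> float:
    """Dollar cost of the input side of one request.

    Assumes the cache discount applies only to `cached_tokens` and that
    batch pricing is a flat discount on every input token; stacking both
    discounts is deliberately not modeled.
    """
    if not 0 <= cached_tokens <= tokens:
        raise ValueError("cached_tokens must be between 0 and tokens")
    per_token = BASE_INPUT_RATE / 1_000_000
    if batch:
        return tokens * per_token * (1 - BATCH_DISCOUNT)
    fresh = tokens - cached_tokens
    return (fresh + cached_tokens * (1 - CACHE_DISCOUNT)) * per_token


real_time = input_cost(10_000)                        # full rate
with_cache = input_cost(10_000, cached_tokens=8_000)  # 8k-token stable prefix
as_batch = input_cost(10_000, batch=True)             # async job
```

With these assumed numbers, the same 10,000-token input costs less at each step down the ladder: real-time, then cached, then batch only if the cached share is small. With a large cached prefix, as here, caching can undercut batch.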

Evidence over time

6 supporting · 2 counter


Evidence

Company | Date | What happened
Fireworks AI | Mar 2025 | Batch API at a flat 50% discount across all models.
Groq | May 2025 | Batch API plus cached-input discounts launched together.
Anthropic | Aug 2024 | Prompt caching cut input cost by up to 80%; batch API also offered at 50%.
Baseten | Feb 2026 | Cached-input pricing added to multi-tenant Model APIs.
Fireworks AI | Nov 2024 | Cached-input discount shipped alongside Turbo / Priority latency tiers.
Mistral AI | May 2026 | Batch processing earns a 50% discount on per-token rates.

Counterexamples

  • Suno · May 2026 — Consumer credit tiers — no batch or cache discount.
  • Midjourney · Feb 2025 — Uses fast vs relax compute modes instead of cache/batch — latency tiering by queue, not caching.

For buyers

If a meaningful share of your workload is asynchronous, batch roughly halves spend on that share with no model change; if your prompts share a long stable prefix (system prompts, RAG context), caching compounds the saving. These are the highest-leverage, lowest-effort moves on a token bill.
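
One quick way to judge whether caching is worth pursuing is to measure how much of a typical prompt is a shared prefix. The sketch below is a rough estimate, not a vendor tool: it counts characters rather than tokens, and the function name and sample prompts are hypothetical.

```python
import os


def shared_prefix_ratio(prompts: list[str]) -> float:
    """Fraction of the average prompt length covered by the common prefix.

    Measured in characters as a rough proxy; vendors bill and cache in
    tokens, so treat the result as an estimate only.
    """
    if not prompts:
        return 0.0
    # commonprefix compares character by character, despite the module name.
    prefix = os.path.commonprefix(prompts)
    average_length = sum(len(p) for p in prompts) / len(prompts)
    return len(prefix) / average_length if average_length else 0.0


# Hypothetical sample: one fixed system prompt followed by varying questions.
SYSTEM = "You are a support assistant. Answer using the policy below.\n"
prompts = [
    SYSTEM + "Q: How do I reset my password?",
    SYSTEM + "Q: Where is my invoice?",
]
ratio = shared_prefix_ratio(prompts)
```

A ratio near 1 suggests most input is repeated and a cached-input rate would apply to most of the bill. A ratio near 0 suggests the variable content comes first, so reordering prompts to put stable text at the front may be needed before caching helps.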

For vendors

The playbook needs a batch queue with a relaxed SLA and a prompt-cache keyed on prefix hashes, each priced as its own line. The discounts are a segmentation tool — they sort latency-tolerant load onto cheaper infra without dropping your headline rate.
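
The cache half of that playbook can be sketched as a toy meter that keys on a hash of the prompt prefix and reports cached and uncached input as separate billable quantities. Everything here is an assumption for illustration: the class name, the fixed 64-character prefix window, and counting characters instead of tokens. A production cache would also need eviction, TTLs and tenant isolation, none of which are shown.

```python
import hashlib
from dataclasses import dataclass, field

PREFIX_CHARS = 64  # hypothetical cache granularity; real systems work on token blocks


@dataclass
class PrefixCacheMeter:
    """Toy meter that bills cached and uncached input as separate line items."""

    seen: set[str] = field(default_factory=set)

    def meter(self, prompt: str) -> dict[str, int]:
        prefix = prompt[:PREFIX_CHARS]
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        hit = key in self.seen      # the prefix was served before
        self.seen.add(key)          # remember it for later requests
        cached = len(prefix) if hit else 0
        return {"cached_input": cached, "uncached_input": len(prompt) - cached}


meter = PrefixCacheMeter()
system = "S" * 64  # stand-in for a stable system prompt
first = meter.meter(system + "question one")
second = meter.meter(system + "question two")
```

The first request is billed entirely as uncached input. The second reuses the stored prefix, so only its 12-character question is billed at the full rate, which keeps the headline rate untouched while the discount shows up on its own line.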

Outlook — what to watch

Expect these to become table stakes and to deepen: longer cache TTLs, automatic prompt-prefix caching, and tiered batch SLAs (1-hour vs 24-hour). The next frontier is priority/express pricing in the other direction — paying a premium for guaranteed low latency — turning latency into a full price axis.

Bottom line

Inference vendors have converged on ~50%-off batch and cached-input discounts as a de-facto standard. Anthropic, Mistral, Fireworks and Groq all land near the same numbers.

FAQ

How can I cut my LLM API bill without changing models?

Use batch processing for anything asynchronous (≈50% off) and prompt caching for repeated context like system prompts or RAG (a further cut on input, up to 80% in Anthropic's case). Both are vendor-native and need no model change.
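
In practice the first step is deciding which jobs can tolerate the batch wait. A minimal sketch of that split follows. The 24-hour turnaround is an assumed batch window and the job names are hypothetical, so check your vendor's actual SLA before relying on it.

```python
from datetime import timedelta

# Assumed batch SLA; replace with your vendor's documented window.
BATCH_TURNAROUND = timedelta(hours=24)


def choose_mode(deadline_in: timedelta) -> str:
    """Return "batch" when the result is not needed before the batch window ends."""
    return "batch" if deadline_in >= BATCH_TURNAROUND else "real-time"


# Hypothetical jobs mapped to how long the caller can wait for a result.
jobs = {
    "nightly-summaries": timedelta(hours=36),
    "chat-reply": timedelta(seconds=5),
}
plan = {name: choose_mode(slack) for name, slack in jobs.items()}
```

Here the nightly job goes to batch and the chat reply stays real-time.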

What is prompt caching?

A discount on input tokens that repeat across requests: the vendor caches a stable prefix (e.g. your system prompt) and charges a fraction of the normal rate to reuse it. Anthropic's prompt caching cut input cost by up to 80%.

Which vendors offer batch and cache discounts?

It's now near-standard for token APIs — Anthropic, Mistral, Fireworks, Groq and Baseten all offer batch (~50%) and/or cached-input pricing. Consumer credit apps generally don't.
