Prompt Caching as a Structural Pricing Feature
Nine AI API vendors now publish a discounted cached-input rate — 50-80% below the standard input price — as a structural pricing tier. Caching rewards workloads with stable, repeated system prompts and raises switching costs for RAG and agent-heavy applications.
What's happening — and why
What's happening: prompt and context caching has become a standard pricing feature across frontier LLM APIs. Nine corpus companies publish a distinct cached-input price that applies when input tokens match a previously stored prefix. Among vendors that publish exact rates, discounts range from 50% (OpenAI) to roughly 75% (Anthropic, Google, DeepSeek).
Why: caching is an efficiency win for both sides. For vendors, a cached prompt avoids re-encoding the same context, which lowers compute cost. For buyers, applications that reuse a stable system prompt, RAG document set, or codebase context can cut input costs dramatically. The vendor passes on the savings while keeping full per-token rates for fresh, diverse inputs.
The strategic implication is stickiness. A workload heavily invested in a vendor's caching structure — with optimal cache-key design and stored contexts — is expensive to migrate. The cached-input discount is both a cost reduction and a switching-cost mechanism.
How it works
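The prefix rule described above (the cached rate applies only to input tokens that match a previously stored prefix) can be illustrated with a minimal sketch. It assumes whitespace-split words as stand-in tokens and an exact token-by-token comparison. Real vendors use their own tokenizers, minimum cache sizes, and expiry windows, and the function names here are hypothetical.

```python
def cached_prefix_length(stored_prefix: list[str], request_tokens: list[str]) -> int:
    """Count the leading request tokens that match the stored prefix exactly."""
    matched = 0
    for stored, incoming in zip(stored_prefix, request_tokens):
        if stored != incoming:
            break  # the first mismatch ends the cacheable run
        matched += 1
    return matched


def split_input(stored_prefix: list[str], request_tokens: list[str]) -> tuple[int, int]:
    """Return (tokens billed at the cached rate, tokens billed at the full rate)."""
    cached = cached_prefix_length(stored_prefix, request_tokens)
    return cached, len(request_tokens) - cached


# Toy tokenization for illustration only.
stored = "You are a helpful assistant . Answer briefly .".split()
request = stored + "What is caching ?".split()
cached_tokens, fresh_tokens = split_input(stored, request)
```

In this toy model a single changed token near the start of the prompt makes everything after it count as fresh input, which is why prefix stability matters so much for the discount.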
Evidence over time
9 supporting · 1 counter
Evidence
| Company | Date | What happened |
|---|---|---|
| anthropic | Aug 2024 | Prompt Caching launched August 2024: $3.75/1M cached input (vs $15/1M full) for Claude 3.5 Sonnet — 75% discount on repeated context |
| openai | Oct 2024 | Prompt Caching launched October 2024: 50% discount on cached input tokens across GPT-4o and GPT-4o-mini |
| google-gemini | Jul 2024 | Context caching in Gemini API: $0.01875/1M cached input (vs $0.075) for Gemini 1.5 Flash — 75% discount; minimum 32k token cache |
| deepseek | Jan 2025 | DeepSeek V3 cache hits priced at $0.07/1M (vs $0.27 input) — 74% off for cache hits |
| together-ai | Jun 2025 | Prompt caching available across Llama and Qwen models; discount rates on repeat context |
| groq | Sep 2025 | On-demand context caching for qualifying models; cache storage free for limited windows |
| fireworks-ai | Sep 2025 | Context caching supported on Llama 3.x family; cache hits discounted vs fresh input |
| baseten | Feb 2026 | Cached-input pricing added to Model APIs — discounted rate for repeated prefill context in multi-tenant inference |
| replicate | Jun 2025 | Prediction warmup and model caching features reduce cold-start costs; effectively a caching discount for warm models |
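
The discount percentages in the table follow directly from the quoted price pairs. As a quick check, the sketch below recomputes them. The rates are copied from the rows above as quoted in this document, not independently verified, and the helper name `discount_pct` is an illustrative assumption.

```python
def discount_pct(full_rate: float, cached_rate: float) -> float:
    """Percentage discount of the cached rate relative to the full input rate."""
    return round((1 - cached_rate / full_rate) * 100, 1)


# (full rate, cached rate) in USD per 1M input tokens, as quoted in the table.
quoted_rates = {
    "anthropic": (15.00, 3.75),
    "google-gemini": (0.075, 0.01875),
    "deepseek": (0.27, 0.07),
}

discounts = {
    vendor: discount_pct(full, cached)
    for vendor, (full, cached) in quoted_rates.items()
}
```

Rows that state only a percentage (OpenAI) or no figure at all (Together AI, Groq, Fireworks AI, Baseten, Replicate) cannot be checked this way.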
Counterexamples
- mistral-ai · — — La Plateforme offers no published cached-input pricing; pure per-token with no caching discount tier
- cohere · — — No cached-input pricing — charges full input-token rate regardless of prompt reuse
- cerebras · May 2026 — Cerebras inference uses wafer-scale hardware where caching economics differ; no cache discount tier published
- groq · — — Cache offering is limited; not all models qualify and storage windows are constrained
For buyers
If your application has a large, stable system prompt or RAG document context, cached-input pricing can cut the cost of that repeated portion by 50-80%. Design your prompt architecture with caching in mind: keep the stable prefix at the front and the variable parts after it. But remember that cache investments are provider-specific, so they raise switching costs.
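The "stable prefix first" advice can be made concrete with a small sketch. It compares two prompt layouts and measures how much of the prompt two consecutive requests share from the start, using character-level comparison as a rough stand-in for token-level matching. The prompt text, request IDs, and function names are all hypothetical.

```python
import os

# Hypothetical stable content that is identical on every request.
SYSTEM_PROMPT = "You are a support assistant. Follow the policy below.\n"
POLICY_DOC = "Policy: refunds are accepted within 30 days.\n"


def cache_friendly(user_query: str, request_id: str) -> str:
    """Stable content first, per-request content last."""
    return SYSTEM_PROMPT + POLICY_DOC + f"[request {request_id}]\n" + user_query


def cache_hostile(user_query: str, request_id: str) -> str:
    """Per-request content first, so the shared prefix is broken immediately."""
    return f"[request {request_id}]\n" + SYSTEM_PROMPT + POLICY_DOC + user_query


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


friendly = shared_prefix_chars(
    cache_friendly("Where is my order?", "a1"),
    cache_friendly("Can I get a refund?", "b2"),
)
hostile = shared_prefix_chars(
    cache_hostile("Where is my order?", "a1"),
    cache_hostile("Can I get a refund?", "b2"),
)
```

With the friendly layout the whole system prompt and policy document fall inside the shared prefix. With the hostile layout only the literal `[request ` is shared, so nothing useful would be eligible for the cached rate.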
For vendors
Cached-input pricing rewards your most loyal, highest-usage customers: those running established production pipelines on stable system prompts. The discount is also a retention mechanism. Once a customer has optimised their prompt architecture for your caching system, migration is expensive.
Outlook — what to watch
Expect caching to spread from frontier labs to more inference platforms. Baseten's February 2026 addition extended it to multi-tenant serving. Vendors without caching compete on price alone for the stable-prompt segment — that will push adoption. Watch for caching SLA tiers (guaranteed cache hit rates) as a premium feature.
Bottom line
Nine corpus vendors now publish cached-input pricing at 50-80% off standard input rates. Caching is a structural pricing tier that rewards stable-prompt workloads and raises switching costs.
FAQ
What is cached-input pricing in AI APIs?
A discounted per-token rate that applies when your input tokens match a previously stored prefix (cached context). Anthropic charges 75% off, OpenAI 50% off, Google 75% off, DeepSeek 74% off.
How much can I save with prompt caching?
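If your application has a stable system prompt or document context, you can cut the cost of those repeated tokens by 50-80%. For example, at the Claude 3.5 Sonnet rates listed in the Evidence table ($15 vs $3.75 per 1M tokens), a 10k-token system prompt called 1,000 times saves about $112 versus uncached, assuming every call hits the cache.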
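The savings figure above can be reproduced with a few lines of arithmetic. This sketch uses the rates quoted in this document's Evidence table and makes a simplifying assumption that every call is a cache hit. It ignores the initial uncached call, any cache-write or storage charges, and cache expiry. The function name is hypothetical.

```python
def caching_savings(
    prefix_tokens: int,
    calls: int,
    full_rate_per_m: float,
    cached_rate_per_m: float,
) -> tuple[float, float, float]:
    """Return (uncached cost, cached cost, savings) in USD for a repeated prefix."""
    total_millions = prefix_tokens * calls / 1_000_000
    uncached_cost = total_millions * full_rate_per_m
    cached_cost = total_millions * cached_rate_per_m
    return uncached_cost, cached_cost, uncached_cost - cached_cost


# 10k-token system prompt, 1,000 calls, rates as quoted in the Evidence table.
uncached, cached, saved = caching_savings(10_000, 1_000, 15.00, 3.75)
```

Under these assumptions the 10M repeated tokens cost $150.00 uncached and $37.50 cached, a difference of $112.50.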
Does caching work with RAG?
Yes — if your RAG pipeline prepends a fixed set of documents to every prompt, that prefix can be cached. Variable query context after the fixed prefix is still charged at the full input rate.
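To see how the split between the fixed prefix and the variable query affects a single request, here is a minimal per-request cost sketch. The token counts are made up for illustration, the rates are the DeepSeek figures quoted in the Evidence table, and the function name and `cache_hit` flag are hypothetical.

```python
def rag_request_cost(
    fixed_prefix_tokens: int,
    variable_tokens: int,
    full_rate_per_m: float,
    cached_rate_per_m: float,
    cache_hit: bool = True,
) -> float:
    """Input cost in USD for one request: cached rate on the prefix only when it hits."""
    prefix_rate = cached_rate_per_m if cache_hit else full_rate_per_m
    prefix_cost = fixed_prefix_tokens * prefix_rate
    variable_cost = variable_tokens * full_rate_per_m  # always billed at the full rate
    return (prefix_cost + variable_cost) / 1_000_000


# Illustrative sizes: 40k tokens of fixed documents, 2k tokens of query context.
hit_cost = rag_request_cost(40_000, 2_000, 0.27, 0.07, cache_hit=True)
miss_cost = rag_request_cost(40_000, 2_000, 0.27, 0.07, cache_hit=False)
```

The larger the fixed document set is relative to the variable query, the closer the blended saving gets to the headline discount. A pipeline that retrieves different documents on every request gets little benefit.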