Holds 9 companies · First observed July 2024 · Updated June 2026 Explore in the graph

Prompt Caching as a Structural Pricing Feature

Quick answer

Nine AI API vendors now publish a discounted cached-input rate — 50-80% below the standard input price — as a structural pricing tier. Caching rewards workloads with stable, repeated system prompts and raises switching costs for RAG and agent-heavy applications.

9 vendors publish discounted cached-input pricing

What's happening — and why

What's happening: prompt and context caching has become a standard pricing feature across frontier LLM APIs. Nine corpus companies publish a distinct cached-input price that applies when input tokens match a previously stored prefix. Discounts range from 50% (OpenAI) to 75-80% (Anthropic, Google, DeepSeek).

Why: caching is an efficiency win for both sides. For vendors, a cached prompt avoids re-encoding the same context — lower compute cost. For buyers, applications that reuse a stable system prompt, RAG document set, or codebase context can cut input costs dramatically. The vendor passes the savings while keeping per-token rates for fresh, diverse inputs.

The strategic implication is stickiness. A workload heavily invested in a vendor's caching structure — with optimal cache-key design and stored contexts — is expensive to migrate. The cached-input discount is both a cost reduction and a switching-cost mechanism.

How it works

Nine vendors offer 50-80% off for cached input tokens — standard at frontier labs, spreading to inference platforms.

Evidence over time

9 supporting · 1 counter — hover or tap a point for detail, click to jump to the row.

supporting evidence counterexample

Evidence

Company	Date	What happened
anthropic	Aug 2024	Prompt Caching launched August 2024: $3.75/1M cached input (vs $15/1M full) for Claude 3.5 Sonnet — 75% discount on repeated context
openai	Oct 2024	Context caching launched October 2024: 50% discount on cached input tokens across GPT-4o and GPT-4o-mini
google-gemini	Jul 2024	Context caching in Gemini API: $0.01875/1M cached input (vs $0.075) for Gemini 1.5 Flash — 75% discount; minimum 32k token cache
deepseek	Jan 2025	DeepSeek V3 cache hits priced at $0.07/1M (vs $0.27 input) — 74% off for cache hits
together-ai	Jun 2025	Prompt caching available across Llama and Qwen models; discount rates on repeat context
groq	Sep 2025	On-demand context caching for qualifying models; cache storage free for limited windows
fireworks-ai	Sep 2025	Context caching supported on Llama 3.x family; cache hits discounted vs fresh input
baseten	Feb 2026	Cached-input pricing added to Model APIs — discounted rate for repeated prefill context in multi-tenant inference
replicate	Jun 2025	Prediction warmup and model caching features reduce cold-start costs; effectively a caching discount for warm models

Counterexamples

mistral-ai · — — La Plateforme offers no published cached-input pricing; pure per-token with no caching discount tier
cohere · — — No cached-input pricing — charges full input-token rate regardless of prompt reuse
cerebras · May 2026 — Cerebras inference uses wafer-scale hardware where caching economics differ; no cache discount tier published
groq · — — Cache offering is limited; not all models qualify and storage windows are constrained

For buyers

If your application has a large, stable system prompt or RAG document context, cached-input pricing can cut input costs by 50-80%. Design your prompt architecture with caching in mind — keep the stable prefix at the front, variable parts after. But remember: cache investments are provider-specific; they raise switching costs.

For vendors

Cached-input pricing rewards your most loyal, highest-usage customers — those with established production pipelines with stable system prompts. The discount is a retention mechanism: once a customer has optimised their prompt architecture for your caching system, migration is expensive.

Outlook — what to watch

Expect caching to spread from frontier labs to more inference platforms. Baseten's February 2026 addition extended it to multi-tenant serving. Vendors without caching compete on price alone for the stable-prompt segment — that will push adoption. Watch for caching SLA tiers (guaranteed cache hit rates) as a premium feature.

Bottom line

Nine corpus vendors now publish cached-input pricing at 50-80% off standard input rates. Caching is a structural pricing tier that rewards stable-prompt workloads and raises switching costs.

FAQ

What is cached-input pricing in AI APIs?

A discounted per-token rate that applies when your input tokens match a previously stored prefix (cached context). Anthropic charges 75% off, OpenAI 50% off, Google 75% off, DeepSeek 74% off.

How much can I save with prompt caching?

If your application has a stable system prompt or document context, you can cut input costs by 50-80%. A 10k-token system prompt on Anthropic Claude, called 1,000 times, saves about $112 vs uncached.

Does caching work with RAG?

Yes — if your RAG pipeline prepends a fixed set of documents to every prompt, that prefix can be cached. Variable query context after the fixed prefix is still charged at the full input rate.

All trends