How much does Replicate cost per month?

Replicate has no monthly subscription fee. You pay only for the per-second execution time of public and dedicated models, plus any per-output or per-token pricing for premium models. A typical FLUX Dev workload generating 1,000 images/month would cost $25 at the per-image rate ($0.025/image), or roughly $4–$6 if time-billed on an A100 at $0.0014/sec, assuming 3–4 seconds of inference per image.

What is the difference between Replicate public models and dedicated deployments?

Public models are multi-tenant — they share GPU capacity with other customers and billing is per-second of execution time on the underlying hardware. Dedicated deployments are single-tenant: your own GPU instance with guaranteed throughput, billed per-second only when active (no idle charges). Public models are cheaper for sporadic workloads; dedicated wins at sustained high QPS.

What are Replicate's per-second GPU rates?

Dedicated deployment per-second rates: CPU Small $0.000025, CPU $0.0001, T4 $0.000225, L40S $0.000975, A100 80GB $0.0014, H100 $0.001525. Translated to per-hour: A100 ~$5.04/hr, H100 ~$5.49/hr. Multi-GPU deployments available with committed spend contracts.

Does Replicate have per-output or per-token pricing?

Yes for select premium models. Per-image: FLUX 1.1 Pro $0.04, FLUX Dev $0.025, Ideogram v3 $0.09. Per-token: Claude 3.7 Sonnet $3.00 input / $15 output per 1M, DeepSeek R1 $3.75 input / $10 output per 1M. Most public models default to per-second time billing.

What is Cog and how does it relate to pricing?

Cog is Replicate's open-source model-packaging framework (created at Replicate, MIT-licensed). It packages PyTorch / TensorFlow models with their runtime dependencies for deployment to Replicate. Cog-packaged models bill at the per-second rate of the underlying hardware — no model markup, the GPU rate is the price.

Does Replicate offer a free tier?

Yes. New accounts receive free monthly credits for evaluation. The free tier covers light evaluation of public models and lightweight fine-tuning experiments; production workloads require a credit card on file. Free credits do not roll over from month to month.

Replicate Pricing

AI Summary

Replicate runs a hybrid pricing model: public models billed per-second of execution by underlying hardware (most flexible, lowest overhead), select premium models billed per-output (FLUX 1.1 Pro at $0.04/image, FLUX Dev at $0.025/image, Ideogram v3 at $0.09/image) or per-token (Claude 3.7 Sonnet at $3.00/$15 per 1M, DeepSeek R1 at $3.75/$10 per 1M).
Dedicated deployment GPU rates per second: CPU Small $0.000025, CPU $0.0001, T4 $0.000225, L40S $0.000975, A100 80GB $0.0014, H100 $0.001525 — billed only for active inference time with no idle charge.
Public model catalog includes 50,000+ community models packaged with Cog (Replicate's open-source model-packaging framework); private deployments support custom Cog-packaged models with the same per-second billing as public models.
Fine-tuning available on select base models (Llama variants, FLUX) with 'fast booting' billing — customers pay only for active training time. Inference on fine-tuned models bills at the standard per-second rate for the chosen hardware.
Enterprise tier offers volume discounts, dedicated multi-GPU committed contracts, custom SLAs, and dedicated support. Free tier provides limited monthly credits for evaluation.
Founded 2019 by Ben Firshman (co-creator of Docker Compose) and Andreas Jansson. Raised Series A ($17.8M) in 2022 led by a16z, Series B ($40M) in 2024 led by a16z and Sequoia at ~$350M valuation.

Pricing summary

Replicate 2026 — Per-second public + dedicated GPU + per-output premium

Public models time-billed; dedicated H100 $0.001525/sec; per-image premium models $0.003–$0.09; per-token Claude 3.7 Sonnet $3/$15 per 1M

Plan	Price	Notes
Free trial	$0 + monthly credits	Evaluating Replicate for proof-of-value
Pay-as-you-go	Per second (varies by hardware)	Production AI applications
Enterprise (annual commit)	Custom	Sustained workloads, multi-GPU training
Dedicated GPU	From $0.000025/sec (CPU Small)	Per-second single-tenant GPU rentals
Per-output / per-token premium	From $0.003/image (FLUX Schnell)	Premium models with simplified pricing

Per-second public-model billing pays only for active execution. Per-output image and per-token LLM pricing available for premium models as a forecasting-friendly alternative. Multi-GPU deployments via Enterprise commits.

About

Replicate is a San Francisco-based AI infrastructure company founded in September 2019 by Ben Firshman (co-creator of Docker Compose at Docker) and Andreas Jansson. The product is a cloud platform for running, fine-tuning, and deploying AI models via REST API — paired with Cog, the open-source model-packaging framework Replicate created and maintains. The platform hosts 50,000+ public community models (the largest public-model catalog in managed inference) and supports single-tenant dedicated deployments for production workloads. The canonical interface for running a model is replicate.run("owner/model") — a one-line API call that hides infrastructure entirely.

By 2026 Replicate serves Buzzfeed, Lex, Suno, Captions, Krea, and roughly 5,000 paying customers spanning AI-native startups (image generation, music synthesis, video tools), product engineering teams adding AI features, and enterprise customers running fine-tuned models in production. The company raised a $17.8M Series A in October 2022 led by Andreessen Horowitz, followed by a $40M Series B in March 2024 co-led by a16z and Sequoia Capital at ~$350M valuation. The Cog framework has become the de-facto standard for packaging ML models with PyTorch and TensorFlow runtimes — analogous to Baseten’s Truss.

Replicate competes with Modal, Baseten, RunPod, Fireworks AI, Together AI, and first-party model providers (OpenAI, Anthropic) for the managed-inference market. Its differentiation is the combination of the largest public-model catalog (a community discovery moat that competitors cannot easily replicate), Cog as the open-source packaging standard, per-second billing on both public and dedicated SKUs, and founder credibility (Ben Firshman’s Docker Compose authorship makes the developer-experience pitch credible).

Pricing summary : How Replicate’s per-second + per-output + per-token stack works

Replicate runs three parallel pricing surfaces. Public model time billing is the default: any of the 50,000+ public models in the catalog bills at the per-second rate of the underlying hardware while inference is active — no model markup, the GPU rate is the price. Per-output and per-token premium pricing applies to select hosted models where simplified forecasting matters more than raw cost: FLUX 1.1 Pro at $0.04/image, FLUX Dev at $0.025/image, Ideogram v3 at $0.09/image, Claude 3.7 Sonnet at $3.00 input / $15 output per 1M tokens, DeepSeek R1 at $3.75/$10 per 1M. Dedicated deployments are single-tenant per-second GPU rentals (CPU Small $0.000025, CPU $0.0001, T4 $0.000225, L40S $0.000975, A100 80GB $0.0014, H100 $0.001525) — billed only while active, with no idle charges.

The free tier provides limited monthly credits for evaluation; the Enterprise tier offers volume discounts, multi-GPU committed contracts, custom SLAs, and dedicated support. This three-SKU pure-usage architecture (per-second, per-output, and per-token) gives customers granular control over the trade-off between lowest cost and predictable forecasting. The pricing flexibility is unusual in AI infrastructure and reflects Replicate’s bet that different workloads want different billing dimensions.

What makes this different: Replicate’s 50,000+ public model catalog is the largest in the industry, and Cog is the open-source packaging standard that creates ecosystem lock-in similar to how Docker became infrastructure-standard. The combination — community catalog + open-source packaging + per-second billing — makes Replicate the canonical entry point for AI engineers asking “is there an open-source model for X.”

Pricing by product

Public models (per-second by hardware)

Hardware	Per-second rate	Use case
CPU Small	$0.000025	Lightweight inference
CPU	$0.0001	Standard CPU inference
Nvidia T4	$0.000225	Small model inference, embeddings
Nvidia L40S	$0.000975	Mid-range models, image generation
Nvidia A100 80GB	$0.0014	30B–70B inference, fine-tuning
Nvidia H100	$0.001525	Frontier model serving, low-latency
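
The per-hour figures quoted elsewhere in this blueprint follow directly from the per-second rates. A minimal Python sketch of that conversion, using only the rates in the table above (the dictionary and function names are illustrative and not part of any Replicate API):

```python
# Per-second rates copied from the hardware table above (USD per second).
RATES_PER_SECOND = {
    "CPU Small": 0.000025,
    "CPU": 0.0001,
    "Nvidia T4": 0.000225,
    "Nvidia L40S": 0.000975,
    "Nvidia A100 80GB": 0.0014,
    "Nvidia H100": 0.001525,
}

SECONDS_PER_HOUR = 3600


def hourly_rate(per_second: float) -> float:
    """Convert a per-second rate to its per-hour equivalent, rounded to cents."""
    return round(per_second * SECONDS_PER_HOUR, 2)


if __name__ == "__main__":
    for hardware, rate in RATES_PER_SECOND.items():
        print(f"{hardware:<18} ${rate:.6f}/sec  ~${hourly_rate(rate):.2f}/hr")
```

For the A100 80GB and H100 rows this reproduces the ~$5.04/hr and ~$5.49/hr figures used in the FAQ.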

Per-output premium models

Model	Rate	Notes
FLUX Schnell	$0.003	Per output image ($3.00 / 1,000)
FLUX Dev	$0.025	Per output image
FLUX 1.1 Pro	$0.04	Per output image
Recraft V3	$0.04	Per output image
Ideogram v3	$0.09	Per output image

Per-token LLM models

Model	Input ($/1M)	Output ($/1M)
Claude 3.7 Sonnet	$3.00	$15.00
DeepSeek R1	$3.75	$10.00
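
To see how the per-token rates translate into a per-request cost, the sketch below multiplies token counts by the rates in the table above. The `TokenPrice` class, the `PRICES` mapping, and the token counts in the example are illustrative assumptions, not values from Replicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """USD price per one million input and output tokens."""
    input_per_million: float
    output_per_million: float


# Rates copied from the per-token table above.
PRICES = {
    "Claude 3.7 Sonnet": TokenPrice(3.00, 15.00),
    "DeepSeek R1": TokenPrice(3.75, 10.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request with the given token counts."""
    price = PRICES[model]
    total = (input_tokens * price.input_per_million
             + output_tokens * price.output_per_million)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 2,000 input tokens and 500 output tokens.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 2_000, 500):.4f}")
```

Because the two models split their pricing differently between input and output, which one is cheaper depends on the input-to-output ratio of the workload.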

Dedicated deployments

SKU	Notes
Single-GPU	Same per-second rates as public models, single-tenant
Multi-GPU	Enterprise / committed-spend only
Fast booting	Pay only for active time; no idle charges

Sales motions across products: PLG / self-serve for public-model API, per-output, per-token, and single-GPU dedicated; sales-led for multi-GPU dedicated and Enterprise commits. All prices accessed 2026-05-30 from replicate.com/pricing.

Hidden costs : What Replicate customers actually pay beyond the rate card

Archetype A: AI-native image-generation startup running FLUX Dev

A startup serving ~10,000 FLUX Dev image generations/day at 3-second average inference time on A100:

Line item	Monthly cost
Per-output billing (10K/day × 30 × $0.025)	$7,500
Alternative: time billing (3 sec × 10K/day × 30 × $0.0014)	$1,260
Estimated total (time billing)	~$1,260/month

For high-volume image generation, the choice between per-output ($0.025) and per-second time billing ($0.0014 × 3 sec = $0.0042) drives a roughly 6× cost difference in this scenario ($7,500 vs. $1,260 per month). Per-output is simpler to forecast; per-second is dramatically cheaper if customers can predict inference latency. Most production teams switch to per-second once they understand the dynamics.
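
The Archetype A line items can be reproduced with a short calculation. This is a sketch under the scenario's own assumptions (10,000 images/day, 30-day month, 3 seconds per image on an A100); the function names are illustrative.

```python
def per_output_monthly(images_per_day: int, price_per_image: float,
                       days: int = 30) -> float:
    """Monthly cost when each image is billed at a flat per-output price."""
    return images_per_day * days * price_per_image


def time_billed_monthly(images_per_day: int, seconds_per_image: float,
                        rate_per_second: float, days: int = 30) -> float:
    """Monthly cost when each image is billed for its execution time."""
    return images_per_day * days * seconds_per_image * rate_per_second


if __name__ == "__main__":
    flat = per_output_monthly(10_000, 0.025)            # FLUX Dev per-image rate
    timed = time_billed_monthly(10_000, 3.0, 0.0014)    # A100 80GB per-second rate
    print(f"per-output:  ${flat:,.2f}/month")
    print(f"time-billed: ${timed:,.2f}/month")
    print(f"ratio:       {flat / timed:.2f}x")
```

Changing `seconds_per_image` shows how sensitive the time-billed figure is to inference latency, which is the forecasting risk the per-output SKU removes.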

Archetype B: Mid-market team running a fine-tuned Llama on dedicated H100

A team that fine-tuned Llama 3.3 70B and runs sustained inference on a dedicated H100 endpoint:

Line item	Monthly cost
H100 dedicated (8h/day × 30 × 3600 × $0.001525)	$1,318
Fine-tuning training (one-time, ~20 minutes on H100 at $0.001525/sec)	$1.83
Fast booting (idle time free; warm pool retained)	Included
Egress (large response payloads, not itemized)	Not on pricing page
Estimated total	~$1,320/month

The H100 dedicated rate at $0.001525/sec ($5.49/hour) is competitive with Fireworks ($7/hour) and Together ($6.49/hour on dedicated). Fast booting eliminates idle charges — a real cost advantage over per-minute platforms. The lack of itemized egress on the pricing page is the main forecasting gap.
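
The dedicated H100 line item and the hourly comparison can be checked with the sketch below. It assumes the same 8 active hours per day on every platform, which may not hold where idle time is billed; the function names are illustrative, and the competitor hourly rates are the ones quoted in the paragraph above.

```python
SECONDS_PER_HOUR = 3600


def dedicated_monthly_cost(active_hours_per_day: float, rate_per_second: float,
                           days: int = 30) -> float:
    """Monthly cost of a per-second-billed deployment, charged only while active."""
    active_seconds = active_hours_per_day * SECONDS_PER_HOUR * days
    return active_seconds * rate_per_second


def hourly_monthly_cost(active_hours_per_day: float, rate_per_hour: float,
                        days: int = 30) -> float:
    """Monthly cost of an hourly-billed deployment for the same active hours."""
    return active_hours_per_day * days * rate_per_hour


if __name__ == "__main__":
    print(f"Replicate H100: ${dedicated_monthly_cost(8, 0.001525):,.2f}")
    print(f"At $7.00/hour:  ${hourly_monthly_cost(8, 7.00):,.2f}")
    print(f"At $6.49/hour:  ${hourly_monthly_cost(8, 6.49):,.2f}")
```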

Pricing evolution : Replicate’s pricing history from Cog framework to multi-SKU platform

Cadence

Quarter	Product / SKU additions	Notes
2019 Q3	1	Replicate founded; Cog open-sourced
2021 Q4	1	Public model catalog launched; per-second time billing
2022 Q4	0	Series A ($17.8M) led by a16z
2023 Q3	1	Dedicated deployments + Cog production hardening
2024 Q1	0	Series B ($40M) at ~$350M valuation
2024 Q3	1	Per-output image pricing (FLUX, Ideogram, SD3)
2025 Q1	1	Per-token LLM pricing (Claude, DeepSeek)
2025 Q3	1	L40S + multi-GPU dedicated deployments
2026 Q1	1	Idle-time-free fine-tuning + fast booting

Tracked range: 2019 Q3–2026 Q1. Quarters not listed above were verified stable (0 price changes, 0 SKU additions).

Notable changes

2021-12-08 — Public model catalog launched with per-second time billing by hardware.
2023-07-28 — Dedicated deployments launched; A100 at $0.0014/sec, H100 at $0.001525/sec.
2024-08-12 — Per-output pricing introduced (FLUX 1.1 Pro $0.04, FLUX Dev $0.025, Ideogram v3 $0.09).
2025-02-26 — Per-token LLM pricing added for Claude 3.7 Sonnet and DeepSeek R1.
2025-09-25 — L40S and multi-GPU dedicated deployments launched.
2026-01-20 — Fast-booting fine-tuning launched with idle-time-free billing.

What’s unique : Replicate’s distinctive pricing mechanics

1. 50,000+ public model catalog is the largest in managed inference. No competitor approaches Replicate’s community catalog size. For AI engineers searching “is there an open-source model for X,” Replicate is the canonical entry point, and the catalog creates a community-driven discovery moat that competitors cannot replicate without years of community accumulation. This is a structural advantage rather than a pricing one, but it shapes what customers can buy.

2. Three pricing dimensions (time + output + token) on the same platform. Most platforms commit to one billing dimension; Replicate offers per-second time billing on most models, per-output image pricing on premium models, and per-token LLM pricing on select chat models. Customers can pick the billing dimension that matches their forecasting preference — a granularity unusual in the inference category.

3. Cog as open-source packaging standard. Cog (created at Replicate, MIT-licensed) is the de-facto framework for packaging PyTorch and TensorFlow models with their runtime dependencies. Cog-packaged models bill at the GPU rate — no model markup, the per-second rate is the price. This open-source-as-developer-moat strategy parallels Baseten’s Truss approach but at much larger ecosystem scale.

4. Idle-time-free fine-tuning with fast booting. Customers pay only for active training and inference time, not for cold starts or warm-pool retention. For sporadic fine-tuning workloads, this materially reduces total cost versus per-minute or per-hour competitors who bill idle time.

5. Founder credibility from Docker Compose authorship. Ben Firshman co-created Docker Compose — one of the most widely-deployed developer tools in containers. The implicit pitch “we know developer experience because we built the developer-experience standard” carries weight that competitors cannot replicate without similar founder pedigree.

Strengths & weaknesses

Strengths	Weaknesses
50,000+ public model catalog (largest in managed inference)	Per-output image pricing can be roughly 4–6× more expensive than per-second time billing
Three pricing dimensions (time + output + token) on same platform	Network egress not itemized on pricing page
Cog open-source packaging is the ecosystem standard	Free tier credits do not roll over month-to-month
Docker Compose founder credibility for developer experience	Multi-GPU dedicated deployments require Enterprise contract
Fast booting + idle-time-free fine-tuning	Per-second rates work out to ~$5.49/hour on H100, which is not the cheapest available
Cog-packaged models bill at GPU rate, no model markup	Lacks published serverless cached input or batch API discounts

Billing UX : Replicate’s account controls and payment experience

Self-serve signup — Sign up at replicate.com with GitHub or email; free trial credits applied automatically. Credit card required for production usage.
Per-prediction usage metadata — API responses include execution time, hardware used, and cost per prediction — letting developers compute and surface real-time cost.
Workspace and project organization — Workspace-level usage aggregation; per-environment separation supported via API tokens.
Spend alerts — Configurable email alerts at $X spend per period; no hard spend caps documented.
Payment methods — Credit card and ACH on self-serve; wire transfer, invoice billing, and AWS/GCP Marketplace on Enterprise.
Annual commit pricing — Enterprise customers receive volume discounts in exchange for annual usage commitments and dedicated capacity.
Public model directory — Browse 50,000+ community models with cost-per-prediction estimates per hardware tier.
Cog CLI — Local Cog development with cog predict for testing; push to Replicate with cog push.
Multi-region availability — US standard; EU and APAC regions on Enterprise via dedicated deployment.
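
The per-prediction usage metadata item above implies that a developer can derive cost from execution time and hardware. A minimal sketch of that calculation follows. The payload shape (a `hardware` string and a `metrics.predict_time` value in seconds) is an assumption made for illustration, so check the actual response fields before relying on it; the rates are the per-second rates from this blueprint.

```python
# Per-second hardware rates from this blueprint (USD per second).
HARDWARE_RATES = {
    "CPU": 0.0001,
    "Nvidia T4": 0.000225,
    "Nvidia L40S": 0.000975,
    "Nvidia A100 80GB": 0.0014,
    "Nvidia H100": 0.001525,
}


def estimate_prediction_cost(prediction: dict) -> float:
    """Estimate one prediction's cost as execution seconds times the hardware rate.

    Assumes a hypothetical payload with 'hardware' and 'metrics.predict_time' keys.
    """
    seconds = float(prediction["metrics"]["predict_time"])
    rate = HARDWARE_RATES[prediction["hardware"]]
    return seconds * rate


def total_cost(predictions: list[dict]) -> float:
    """Sum the estimated cost over a batch of prediction payloads."""
    return sum(estimate_prediction_cost(p) for p in predictions)


if __name__ == "__main__":
    sample = {"hardware": "Nvidia A100 80GB", "metrics": {"predict_time": 3.0}}
    print(f"estimated cost: ${estimate_prediction_cost(sample):.4f}")
```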

Strategic wins : Why Replicate’s pricing decisions worked

1. Cog as the open-source packaging standard built an ecosystem moat

By open-sourcing Cog in 2019, ahead of the commercial platform, Replicate seeded a packaging framework that became the de-facto standard for ML model deployment. The open-source-as-developer-moat strategy parallels Docker’s own playbook, and Replicate co-founder Ben Firshman co-created Docker Compose, which makes the analogy especially apt. Customers who package models with Cog for local development migrate naturally to Replicate for production.

2. 50,000+ public model catalog is a community discovery moat

The largest public-model catalog in managed inference makes Replicate the canonical entry point for AI engineers asking “is there an open-source model for X.” Community-driven discovery is hard to replicate — competitors with smaller catalogs face a search-and-discovery gap that compounds over time as community contributors keep adding to Replicate.

3. Three pricing dimensions (time + output + token) captured all workload preferences

By offering per-second time billing, per-output image pricing, and per-token LLM pricing on the same platform, Replicate captures customers regardless of their forecasting preference. Forecasting-sensitive teams use per-output / per-token; cost-sensitive teams use per-second time billing. The billing-dimension flexibility maximizes wallet share across workload archetypes.

4. Idle-time-free fine-tuning competed directly with serverless competitors

Fast-booting fine-tuning with no idle charges put Replicate on parity with Modal, Baseten, and other serverless-first platforms — closing a meaningful competitive gap for sporadic training workloads. The launch positioned Replicate as both “the model catalog” and “the production training and inference platform” simultaneously.

Areas to improve : Gaps in Replicate’s pricing approach

1. Per-output image pricing can be punitively expensive at scale

FLUX Dev per-output at $0.025/image equates to roughly $0.025 / 4 seconds = $0.00625/second, about 4.5× the $0.0014/sec A100 time-billed alternative. Customers who don’t realize this can pay several times more than necessary. Surfacing the per-second equivalent for each per-output model on the pricing page would prevent the bill-shock pattern.
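
The per-second equivalent suggested above is simple to compute. The sketch below derives the effective per-second price of a per-output model and the break-even latency at which per-output and time billing cost the same. Function names are illustrative, and the 4-second latency is the figure assumed in the paragraph above.

```python
def effective_per_second(price_per_output: float, seconds_per_output: float) -> float:
    """Per-second price implied by a flat per-output price at a given latency."""
    return price_per_output / seconds_per_output


def break_even_seconds(price_per_output: float, rate_per_second: float) -> float:
    """Latency at which time billing costs exactly the per-output price."""
    return price_per_output / rate_per_second


if __name__ == "__main__":
    flux_dev_price = 0.025   # USD per image
    a100_rate = 0.0014       # USD per second

    implied = effective_per_second(flux_dev_price, 4.0)
    print(f"implied rate:  ${implied:.5f}/sec")
    print(f"multiple:      {implied / a100_rate:.2f}x the A100 rate")
    print(f"break-even at: {break_even_seconds(flux_dev_price, a100_rate):.1f} sec/image")
```

Under these numbers, per-output pricing only wins if a generation takes longer than the break-even latency on the time-billed hardware.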

2. Network egress not itemized on pricing page

For high-volume customers serving large image, audio, or video payloads, egress can become a meaningful cost line. Replicate’s pricing page does not break out bandwidth pricing. Making egress pricing explicit (and ideally bundling a generous free egress allowance) would reduce a recurring source of surprise bills.

3. No published cached input or Batch API discounts

Fireworks, OpenAI, Anthropic, and Baseten all publish cached-input and Batch API discounts, commonly around 50%. Replicate does not, meaning RAG and agent-loop workloads with high prefix re-use cost more on Replicate than on competitors. Adding cached input discounts on per-token LLMs would close a meaningful competitive gap.

4. Multi-GPU dedicated requires Enterprise contract

Customers wanting multi-GPU deployments (training, large-context inference) must sign Enterprise contracts. Publishing a self-serve multi-GPU rate (even at limited availability) would let mid-market teams self-qualify and reduce sales-led friction for non-frontier multi-GPU workloads.

Monetization stack & signals : how Replicate builds & buys its revenue engine

The read — where the monetization investment is going

Replicate buys the spine behind its per-second pricing: Metronome is the meter + credit ledger, Stripe the processor. The tell is build→buy — it tore out a homegrown metering/fraud stack for Metronome and shipped prepaid credits in 14 days, choosing fraud durability over owning its own meter.
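
To make the "meter + credit ledger" idea concrete, here is a minimal sketch of a prepaid-credit ledger that debits per-second usage events as they arrive. It is a conceptual illustration only: the class and method names are hypothetical and do not reflect Metronome's or Replicate's actual APIs. `Decimal` is used to avoid floating-point drift in balances.

```python
from decimal import Decimal


class InsufficientCredit(Exception):
    """Raised when a usage event would exceed the prepaid balance."""


class CreditLedger:
    """Hypothetical prepaid-credit ledger that debits metered usage events."""

    def __init__(self, balance: Decimal) -> None:
        self.balance = balance
        self.events: list[tuple[str, Decimal]] = []

    def record_usage(self, event_id: str, seconds: Decimal,
                     rate_per_second: Decimal) -> Decimal:
        """Meter one usage event and debit its charge from the balance."""
        charge = seconds * rate_per_second
        if charge > self.balance:
            raise InsufficientCredit(
                f"event {event_id} needs ${charge}, balance is ${self.balance}"
            )
        self.balance -= charge
        self.events.append((event_id, charge))
        return charge


if __name__ == "__main__":
    ledger = CreditLedger(Decimal("10.00"))
    # A 4-second prediction on hardware billed at $0.0014/sec.
    ledger.record_usage("prediction-1", Decimal("4"), Decimal("0.0014"))
    print(f"remaining credit: ${ledger.balance}")
```

Rejecting an event when the balance is too low is one way prepaid credits limit fraud exposure, since usage cannot run ahead of payment.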

Stack — build vs buy

Builds in-house · 1

In-house metering & fraud checks (retired) · In-house build · Source: press, Sep 2025

“A homegrown system of fraud checks had become complicated and burdensome to maintain. The team needed a simpler, more durable approach.”

Buys (vendor) · 2

Metronome · Billing · Source: press, Sep 2025

“Metronome handles event metering and credit balances in real-time, so Replicate ensures that every customer is charged accurately the moment usage occurs.”

Stripe · Payments · Source: docs, Apr 2026

“Replicate may use Stripe, Inc. ("Stripe") as our Payment Processor.”

Signals reviewed Jun 2026 · derived from press & filings, product docs

Key takeaways

Open-source packaging as the structural ecosystem moat. Cog’s role as the de-facto packaging framework gives Replicate compounding adoption advantages. Infrastructure commercializations that lack an open-source packaging standard face a discoverability gap that pure-marketing positioning cannot close.
Multi-dimension pricing (time + output + token) captures forecasting-sensitive AND cost-sensitive buyers. Most platforms commit to one billing dimension; Replicate offers all three on the same platform. This value-metric flexibility is the canonical solution for inference platforms targeting diverse workload archetypes.
Community catalog size is a discovery moat that compounds over time. With 50,000+ public models, Replicate is the canonical “is there an open-source model for X” entry point. Newer competitors face a community-content gap that widens with each new model contribution.
Founder pedigree shapes which pitches land. Ben Firshman’s Docker Compose authorship makes Replicate’s developer-experience pitch credible in a way that pure-engineering teams cannot replicate. Infrastructure commercializations should treat founder open-source visibility as a structural advantage.
Idle-time-free billing is becoming table stakes for serverless inference. Modal, Baseten, and Replicate all ship some variant; competitors charging for idle time face an increasingly difficult positioning challenge. The pure-usage billing expectation has shifted decisively toward “pay only for active time.”

UBP implications

Open-source packaging as the next infrastructure moat. Truss (Baseten) and Cog (Replicate) demonstrate that open-source packaging frameworks can be more durable competitive advantages than runtime optimization. Future infrastructure platforms should seriously evaluate open-source packaging as a GTM strategy.
Multi-dimension pricing accommodates customer forecasting preferences. Forecasting-sensitive teams pick per-output / per-token; cost-sensitive teams pick per-second. Offering both on the same platform captures both segments without forcing a workload choice.
Community catalogs scale faster than per-vendor model curation. Replicate’s 50,000+ public models compound community contribution into a discovery moat that competitors cannot replicate quickly. Inference platforms should consider opening public catalogs as a long-term ecosystem strategy, not a short-term marketing tactic.

Sources

Replicate pricing page (accessed 2026-05-30)
Replicate docs — billing (accessed 2026-05-30)
Cog framework GitHub (accessed 2026-05-29)
Replicate blog — Series B announcement (accessed 2026-05-29)
Replicate public model catalog (accessed 2026-05-29)

Bottom line

Replicate priced its inference platform around three structural ideas: a 50,000+ public model catalog that makes Replicate the canonical entry point for AI engineers asking “is there an open-source model for X,” Cog as the open-source packaging framework that creates ecosystem lock-in similar to how Docker became infrastructure-standard, and three parallel pricing dimensions (per-second time billing for cost-sensitive teams, per-output image pricing for forecasting-sensitive teams, per-token LLM pricing for chat-style workloads) that capture customers regardless of billing preference. Ben Firshman’s Docker Compose co-authorship lends unusual founder credibility to the developer-experience pitch.

For AI engineering teams running diverse inference workloads — image generation, fine-tuned chat, multi-modal pipelines — Replicate’s combination of catalog breadth, packaging standard, and flexible billing dimensions makes it one of the most pragmatic commercial inference platforms. The remaining gaps (per-output pricing punitive at scale, egress not itemized, no cached input or Batch discounts, multi-GPU gated behind Enterprise) are competitive parity issues rather than structural pricing flaws.

Pricing timeline : Major events on a vertical axis

Each milestone below corresponds to a public pricing change, product launch, or material adjustment.

Current Live Pricing Snapshot

May 2026

Live capture of the pricing and billing-docs pages: per-second public-model and dedicated GPU rates (A100 80GB $0.0014/sec, H100 $0.001525/sec), per-output image pricing (FLUX 1.1 Pro $0.04/image, FLUX Dev $0.025/image), and per-token LLM pricing (DeepSeek R1 $3.75/$10 per 1M). Idle-time-free billing with no monthly subscription fee.

Current Gallery + Hardware Pricing Structure

Mar 2026

The latest archived snapshot retains the featured-models gallery (per-output image models such as FLUX shown prominently) above a detailed hardware-pricing table covering CPU through multi-GPU tiers for public and private deployments. Pricing remains pure-usage: per-second for hardware, per-image for select image models, and per-token for hosted LLMs, with no idle charge.

captured 2026-03-01

Featured-Models Gallery Layout

Sep 2025

The pricing page shifted to a featured-models gallery leading with per-output image models (FLUX 1.1, FLUX Schnell, Imagen and others shown with per-image prices), with the per-second hardware-pricing table moved below. The reorganization put per-output model pricing first while keeping the full per-second hardware lineup available for public and private/dedicated deployments.

captured 2025-09-01

Per-Output Image Pricing + Expanded Language Model Table

Aug 2024

The pricing page added an 'Image models' section with per-output (per-image) pricing for premium image models, alongside a substantially expanded 'Language models' per-token table. Combined with the existing hardware table, Replicate now presented three parallel billing units: per-second (public/dedicated hardware), per-image (image models), and per-token (LLMs).

captured 2024-08-01

Per-Token Language Model Pricing Added: Llama 2, Mistral

Mar 2024

A dedicated 'Language models' section appeared alongside the hardware table, introducing per-token pricing for select hosted LLMs including Llama 2 and Mistral. This marked Replicate's first move beyond pure per-second time billing toward a per-token unit for chat-style workloads, positioned as a simpler alternative to time-based billing.

captured 2024-03-01

Table Redesign + Per-Second Repricing and New GPU Tiers

Nov 2023

Replicate moved pricing into a structured hardware table and substantially repriced and expanded the GPU lineup: CPU $0.000100/sec, T4 $0.000225/sec (down from $0.00055), A40 $0.000575/sec, A40 (Large) $0.000725/sec, A100 (40GB) $0.001150/sec, A100 (80GB) $0.001400/sec (down from $0.0032), and 8x Nvidia A40 (Large) at $0.005800/sec. Most per-second rates dropped while new mid-range A40 and multi-GPU tiers were introduced.

captured 2023-11-01
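
The size of the November 2023 repricing can be computed from the before and after rates quoted in the Jul 2023 and Nov 2023 entries of this timeline. A short sketch, with illustrative names:

```python
# (old rate, new rate) in USD per second, from the Jul 2023 and Nov 2023 snapshots.
REPRICING = {
    "CPU": (0.0002, 0.0001),
    "Nvidia T4": (0.00055, 0.000225),
    "Nvidia A100 (40GB)": (0.0023, 0.00115),
    "Nvidia A100 (80GB)": (0.0032, 0.0014),
}


def percent_change(old: float, new: float) -> float:
    """Signed percentage change from the old rate to the new rate."""
    return (new - old) / old * 100


if __name__ == "__main__":
    for hardware, (old, new) in REPRICING.items():
        print(f"{hardware:<20} {percent_change(old, new):+.1f}%")
```

By this calculation the CPU and A100 (40GB) rates were halved, while the T4 and A100 (80GB) rates fell by roughly 59% and 56% respectively.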

A100 Split Into 40GB and 80GB Tiers

Jul 2023

The hardware lineup expanded from three to four tiers as the single A100 SKU split into Nvidia A100 (40GB) at $0.0023/sec and Nvidia A100 (80GB) at $0.0032/sec (144GB system RAM, 10x CPU). CPU ($0.0002/sec) and T4 ($0.00055/sec) rates held. This added a higher-memory option for larger models while keeping the same per-second time-billing structure.

captured 2023-07-01

Three-Tier Per-Second Pricing: CPU, T4, A100

Jul 2022

The Wayback snapshot shows Replicate's pricing built on three hardware tiers billed per-second: CPU at $0.0002/sec (8GB RAM), Nvidia T4 GPU at $0.00055/sec (16GB GPU RAM), and Nvidia A100 GPU at $0.0023/sec (40GB GPU RAM). The page framed it as 'GPUs are expensive, so why leave them on? Pay by the second.' Minimum billable time was 1 second, with no charge to sign up and no charge for canceled-before-start predictions.

captured 2022-07-01
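
The 2022 billing rules described above (a 1-second minimum and no charge for predictions canceled before they start) can be sketched as follows. This assumes the minimum acts as a simple floor on billable time and that no other rounding applies, which the snapshot does not confirm; the function name and parameters are illustrative.

```python
def billed_cost(seconds: float, rate_per_second: float, started: bool = True,
                minimum_seconds: float = 1.0) -> float:
    """Cost of one prediction under a minimum-billable-time rule.

    Predictions canceled before they start are free; otherwise the billable
    time is the execution time, floored at the minimum.
    """
    if not started:
        return 0.0
    return max(seconds, minimum_seconds) * rate_per_second


if __name__ == "__main__":
    t4_rate_2022 = 0.00055   # USD per second, Jul 2022 snapshot
    print(f"0.3 s on T4:      ${billed_cost(0.3, t4_rate_2022):.5f}")
    print(f"2.5 s on T4:      ${billed_cost(2.5, t4_rate_2022):.5f}")
    print(f"canceled early:   ${billed_cost(0.0, t4_rate_2022, started=False):.5f}")
```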

Trivia

· Replicate's per-second public-model billing means a 4-second FLUX Dev image generation on an A100 costs roughly $0.0056 — finer granularity than competitors' per-image flat rates, though the per-image SKU ($0.025 for FLUX Dev) is still published as a simpler alternative.
· Replicate was founded in 2019 by Ben Firshman (co-creator of Docker Compose at Docker) and Andreas Jansson, making it a rare AI infrastructure platform whose founder co-created one of the most widely used developer tools in containers.
· Cog, Replicate's open-source model-packaging framework, predates the company's commercial inference platform by two years — and remains the de-facto standard for packaging ML models with PyTorch and TensorFlow runtimes, similar to how Truss became Baseten's developer wedge.
