AI Summary
About
Replicate is a San Francisco-based AI infrastructure company founded in September 2019 by Ben Firshman (co-creator of Docker Compose at Docker) and Andreas Jansson. The product is a cloud platform for running, fine-tuning, and deploying AI models via REST API — paired with Cog, the open-source model-packaging framework Replicate created and maintains. The platform hosts 50,000+ public community models (the largest public-model catalog in managed inference) and supports single-tenant dedicated deployments for production workloads. The canonical interface for running a model is replicate.run("owner/model") — a one-line API call that hides infrastructure entirely.
By 2026 Replicate serves roughly 5,000 paying customers, including BuzzFeed, Lex, Suno, Captions, and Krea, spanning AI-native startups (image generation, music synthesis, video tools), product engineering teams adding AI features, and enterprise customers running fine-tuned models in production. The company raised a $17.8M Series A in October 2022 led by Andreessen Horowitz, followed by a $40M Series B in March 2024 co-led by a16z and Sequoia Capital at a ~$350M valuation. The Cog framework has become the de-facto standard for packaging ML models with PyTorch and TensorFlow runtimes, analogous to Baseten’s Truss.
Replicate competes with Modal, Baseten, RunPod, Fireworks AI, Together AI, and first-party model providers (OpenAI, Anthropic) for the managed-inference market. Its differentiation is the combination of the largest public-model catalog (a community discovery moat that competitors cannot easily replicate), Cog as the open-source packaging standard, per-second billing on both public and dedicated SKUs, and founder credibility (Ben Firshman’s Docker Compose authorship makes the developer-experience pitch credible).
Pricing summary : How Replicate’s per-second + per-output + per-token stack works
Replicate runs three parallel pricing surfaces. Public model time billing is the default: any of the 50,000+ public models in the catalog bills at the per-second rate of the underlying hardware while inference is active — no model markup, the GPU rate is the price. Per-output and per-token premium pricing applies to select hosted models where simplified forecasting matters more than raw cost: FLUX 1.1 Pro at $0.04/image, FLUX Dev at $0.025/image, Ideogram v3 at $0.09/image, Claude 3.7 Sonnet at $3.00 input / $15 output per 1M tokens, DeepSeek R1 at $3.75/$10 per 1M. Dedicated deployments are single-tenant per-second GPU rentals (CPU Small $0.000025, CPU $0.0001, T4 $0.000225, L40S $0.000975, A100 80GB $0.0014, H100 $0.001525) — billed only while active, with no idle charges.
The free tier provides limited monthly credits for evaluation; the Enterprise tier offers volume discounts, multi-GPU committed contracts, custom SLAs, and dedicated support. This three-SKU, pure-usage architecture (per-second, per-output, and per-token) gives customers granular control over the cost-versus-forecasting trade-off. That flexibility is unusual in AI infrastructure and reflects Replicate’s bet that different workloads want different billing dimensions.
What makes this different: Replicate’s 50,000+ public model catalog is the largest in the industry, and Cog is the open-source packaging standard that creates ecosystem lock-in similar to how Docker became infrastructure-standard. The combination — community catalog + open-source packaging + per-second billing — makes Replicate the canonical entry point for AI engineers asking “is there an open-source model for X.”
Pricing by product
Public models (per-second by hardware)
| Hardware | Per-second rate | Use case |
|---|---|---|
| CPU Small | $0.000025 | Lightweight inference |
| CPU | $0.0001 | Standard CPU inference |
| Nvidia T4 | $0.000225 | Small model inference, embeddings |
| Nvidia L40S | $0.000975 | Mid-range models, image generation |
| Nvidia A100 80GB | $0.0014 | 30B–70B inference, fine-tuning |
| Nvidia H100 | $0.001525 | Frontier model serving, low-latency |
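
Per-second rates are hard to compare against platforms that quote hourly prices. The sketch below converts the rates in the table above to hourly equivalents; the dictionary and function names are illustrative, and the only inputs are the published per-second figures.

```python
# Per-second rates copied from the hardware table above (USD per second).
RATES_PER_SECOND = {
    "CPU Small": 0.000025,
    "CPU": 0.0001,
    "Nvidia T4": 0.000225,
    "Nvidia L40S": 0.000975,
    "Nvidia A100 80GB": 0.0014,
    "Nvidia H100": 0.001525,
}

SECONDS_PER_HOUR = 3600


def hourly_rate(per_second: float) -> float:
    """Convert a per-second rate to an hourly rate, rounded to cents."""
    return round(per_second * SECONDS_PER_HOUR, 2)


for hardware, rate in RATES_PER_SECOND.items():
    print(f"{hardware:<18} ${hourly_rate(rate):.2f}/hour")
```

On these figures the A100 80GB works out to about $5.04/hour and the H100 to about $5.49/hour of active time.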
Per-output premium models
| Model | Rate | Notes |
|---|---|---|
| FLUX Schnell | $0.003 | Per output image ($3.00 / 1,000) |
| FLUX Dev | $0.025 | Per output image |
| FLUX 1.1 Pro | $0.04 | Per output image |
| Recraft V3 | $0.04 | Per output image |
| Ideogram v3 | $0.09 | Per output image |
Per-token LLM models
| Model | Input ($/1M) | Output ($/1M) |
|---|---|---|
| Claude 3.7 Sonnet | $3.00 | $15.00 |
| DeepSeek R1 | $3.75 | $10.00 |
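
Because input and output tokens are priced separately, the cheaper model depends on the prompt-to-completion ratio. The following is a minimal sketch of the per-request arithmetic using the two rows above; the token counts are a hypothetical request, not measured usage.

```python
# Per-1M-token rates copied from the table above (USD).
TOKEN_RATES = {
    "Claude 3.7 Sonnet": {"input": 3.00, "output": 15.00},
    "DeepSeek R1": {"input": 3.75, "output": 10.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, billing input and output tokens separately."""
    rates = TOKEN_RATES[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


# Hypothetical request: 2,000 prompt tokens and 500 completion tokens.
for model in TOKEN_RATES:
    print(f"{model}: ${request_cost(model, 2_000, 500):.4f}")
```

For this example request, Claude 3.7 Sonnet comes to $0.0135 and DeepSeek R1 to $0.0125; a more input-heavy request would narrow or reverse the gap because DeepSeek R1's input rate is higher.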
Dedicated deployments
| SKU | Notes |
|---|---|
| Single-GPU | Same per-second rates as public models, single-tenant |
| Multi-GPU | Enterprise / committed-spend only |
| Fast booting | Pay only for active time; no idle charges |
Sales motions across products: PLG / self-serve for public-model API, per-output, per-token, and single-GPU dedicated; sales-led for multi-GPU dedicated and Enterprise commits. All prices accessed 2026-05-30 from replicate.com/pricing.
Hidden costs : What Replicate customers actually pay beyond the rate card
Archetype A: AI-native image-generation startup running FLUX Dev
A startup serving ~10,000 FLUX Dev image generations/day at 3-second average inference time on A100:
| Line item | Monthly cost |
|---|---|
| Per-output billing (10K/day × 30 × $0.025) | $7,500 |
| Alternative: time billing (3 sec × 10K/day × 30 × $0.0014) | $1,260 |
| Estimated total (time billing) | ~$1,260/month |
For high-volume image generation, the choice between per-output billing ($0.025 per image) and per-second time billing ($0.0014 × 3 sec = $0.0042 per image) produces a roughly 6× cost difference ($7,500 versus $1,260 per month in this example). Per-output is simpler to forecast; per-second is dramatically cheaper when inference latency is short and predictable. Most production teams switch to per-second once they understand the dynamics.
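
The Archetype A comparison can be reproduced with a few lines of arithmetic. This is a sketch under the table's own assumptions (10,000 images/day, 30 days, 3 seconds per image on an A100); the function names are illustrative, and it assumes the same model is available under both billing modes.

```python
def monthly_per_output(images_per_day: int, price_per_image: float,
                       days: int = 30) -> float:
    """Monthly cost when every generated image is billed at a flat price."""
    return images_per_day * days * price_per_image


def monthly_time_billed(images_per_day: int, seconds_per_image: float,
                        rate_per_second: float, days: int = 30) -> float:
    """Monthly cost when only active inference seconds are billed."""
    return images_per_day * days * seconds_per_image * rate_per_second


# Archetype A inputs, taken from the table above.
per_output = monthly_per_output(10_000, 0.025)
time_billed = monthly_time_billed(10_000, 3.0, 0.0014)

print(f"per-output:  ${per_output:,.0f}/month")
print(f"time-billed: ${time_billed:,.0f}/month")
print(f"ratio:       {per_output / time_billed:.2f}x")
```

The ratio is sensitive to latency: at 3 seconds per image it is close to 6x, and at 4 seconds per image it falls to about 4.5x.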
Archetype B: Mid-market team running a fine-tuned Llama on dedicated H100
A team that fine-tuned Llama 3.3 70B and runs sustained inference on a dedicated H100 endpoint:
| Line item | Monthly cost |
|---|---|
| H100 dedicated (8h/day × 30 × 3600 × $0.001525) | $1,318 |
| Fine-tuning training (one-time, ~20 minutes on H100 at $0.001525/sec) | ~$2 |
| Fast booting (idle time free; warm pool retained) | Included |
| Egress (large response payloads, not itemized) | Not on pricing page |
| Estimated total | ~$1,320/month |
The H100 dedicated rate at $0.001525/sec ($5.49/hour) is competitive with Fireworks ($7/hour) and Together ($6.49/hour on dedicated). Fast booting eliminates idle charges — a real cost advantage over per-minute platforms. The lack of itemized egress on the pricing page is the main forecasting gap.
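
The dedicated-endpoint line item scales linearly with active hours, so it is easy to model. The sketch below recomputes the H100 figure from the table and sets it beside the hourly rates quoted in the paragraph above; the 8 active hours per day is the archetype's assumption, and the competitor figures are taken as quoted here rather than independently verified.

```python
SECONDS_PER_HOUR = 3600


def monthly_cost_per_second(rate_per_second: float, active_hours_per_day: float,
                            days: int = 30) -> float:
    """Monthly cost for a GPU billed per second of active time."""
    return rate_per_second * SECONDS_PER_HOUR * active_hours_per_day * days


def monthly_cost_hourly(rate_per_hour: float, active_hours_per_day: float,
                        days: int = 30) -> float:
    """Monthly cost for a GPU billed per hour of active time."""
    return rate_per_hour * active_hours_per_day * days


replicate_h100 = monthly_cost_per_second(0.001525, 8)

# Hourly comparison rates as quoted in the paragraph above.
quoted_hourly = {"Fireworks": 7.00, "Together": 6.49}

print(f"Replicate H100: ${replicate_h100:,.2f}/month")
for vendor, rate in quoted_hourly.items():
    print(f"{vendor}: ${monthly_cost_hourly(rate, 8):,.2f}/month")
```

This comparison only holds if all three platforms bill strictly for active time; any idle or minimum-commit charges on the hourly platforms would widen the gap.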
Pricing evolution : Replicate’s pricing history from Cog framework to multi-SKU platform
Cadence
| Quarter | Price changes | Product / SKU additions | Notes |
|---|---|---|---|
| 2019 Q3 | 0 | 1 | Replicate founded; Cog open-sourced |
| 2021 Q4 | 0 | 1 | Public model catalog launched; per-second time billing |
| 2022 Q4 | 0 | 0 | Series A ($17.8M) led by a16z |
| 2023 Q3 | 0 | 1 | Dedicated deployments + Cog production hardening |
| 2024 Q1 | 0 | 0 | Series B ($40M) at ~$350M valuation |
| 2024 Q3 | 0 | 1 | Per-output image pricing (FLUX, Ideogram, SD3) |
| 2025 Q1 | 0 | 1 | Per-token LLM pricing (Claude, DeepSeek) |
| 2025 Q3 | 0 | 1 | L40S + multi-GPU dedicated deployments |
| 2026 Q1 | 0 | 1 | Idle-time-free fine-tuning + fast booting |
Tracked range: 2019 Q3–2026 Q1. Quarters not listed above were verified stable (0 price changes, 0 SKU additions).
Notable changes
- 2021-12-08 — Public model catalog launched with per-second time billing by hardware.
- 2023-07-28 — Dedicated deployments launched; A100 at $0.0014/sec, H100 at $0.001525/sec.
- 2024-08-12 — Per-output pricing introduced (FLUX 1.1 Pro $0.04, FLUX Dev $0.025, Ideogram v3 $0.09).
- 2025-02-26 — Per-token LLM pricing added for Claude 3.7 Sonnet and DeepSeek R1.
- 2025-09-25 — L40S and multi-GPU dedicated deployments launched.
- 2026-01-20 — Fast-booting fine-tuning launched with idle-time-free billing.
What’s unique : Replicate’s distinctive pricing mechanics
1. 50,000+ public model catalog is the largest in managed inference. No competitor approaches Replicate’s community catalog size. For AI engineers searching “is there an open-source model for X,” Replicate is the canonical entry point, and the catalog creates a community-driven discovery moat that competitors cannot replicate without years of community accumulation. This is structural rather than a pricing mechanic, but it shapes what customers can buy.
2. Three pricing dimensions (time + output + token) on the same platform. Most platforms commit to one billing dimension; Replicate offers per-second time billing on most models, per-output image pricing on premium models, and per-token LLM pricing on select chat models. Customers can pick the billing dimension that matches their forecasting preference — a granularity unusual in the inference category.
3. Cog as open-source packaging standard. Cog (created at Replicate, MIT-licensed) is the de-facto framework for packaging PyTorch and TensorFlow models with their runtime dependencies. Cog-packaged models bill at the GPU rate — no model markup, the per-second rate is the price. This open-source-as-developer-moat strategy parallels Baseten’s Truss approach but at much larger ecosystem scale.
4. Idle-time-free fine-tuning with fast booting. Customers pay only for active training and inference time, not for cold starts or warm-pool retention. For sporadic fine-tuning workloads, this materially reduces total cost versus per-minute or per-hour competitors who bill idle time.
5. Founder credibility from Docker Compose authorship. Ben Firshman co-created Docker Compose — one of the most widely-deployed developer tools in containers. The implicit pitch “we know developer experience because we built the developer-experience standard” carries weight that competitors cannot replicate without similar founder pedigree.
Strengths & weaknesses
| Strengths | Weaknesses |
|---|---|
| 50,000+ public model catalog (largest in managed inference) | Per-output image pricing can be 5× more expensive than per-second time billing |
| Three pricing dimensions (time + output + token) on same platform | Network egress not itemized on pricing page |
| Cog open-source packaging is the ecosystem standard | Free tier credits do not roll over month-to-month |
| Docker Compose founder credibility for developer experience | Multi-GPU dedicated deployments require Enterprise contract |
| Fast booting + idle-time-free fine-tuning | Per-second public-model rates are not the cheapest when converted to hourly (H100 is about $5.49/hour) |
| Cog-packaged models bill at GPU rate, no model markup | Lacks published serverless cached input or batch API discounts |
Billing UX : Replicate’s account controls and payment experience
- Self-serve signup — Sign up at `replicate.com` with GitHub or email; free trial credits applied automatically. Credit card required for production usage.
- Per-prediction usage metadata — API responses include execution time, hardware used, and cost per prediction, letting developers compute and surface real-time cost.
- Workspace and project organization — Workspace-level usage aggregation; per-environment separation supported via API tokens.
- Spend alerts — Configurable email alerts at $X spend per period; no hard spend caps documented.
- Payment methods — Credit card and ACH on self-serve; wire transfer, invoice billing, and AWS/GCP Marketplace on Enterprise.
- Annual commit pricing — Enterprise customers receive volume discounts in exchange for annual usage commitments and dedicated capacity.
- Public model directory — Browse 50,000+ community models with cost-per-prediction estimates per hardware tier.
- Cog CLI — Local Cog development with `cog predict` for testing; push to Replicate with `cog push`.
- Multi-region availability — US standard; EU and APAC regions on Enterprise via dedicated deployment.
Strategic wins : Why Replicate’s pricing decisions worked
1. Cog as the open-source packaging standard built an ecosystem moat
By open-sourcing Cog in 2019 alongside the commercial platform, Replicate seeded a packaging framework that became the de-facto standard for ML model deployment. The open-source-as-developer-moat strategy parallels Docker’s own playbook — Replicate co-founder Ben Firshman literally co-created Docker Compose, which makes the analogy especially apt. Customers who package models with Cog for local development migrate naturally to Replicate for production.
2. 50,000+ public model catalog is a community discovery moat
The largest public-model catalog in managed inference makes Replicate the canonical entry point for AI engineers asking “is there an open-source model for X.” Community-driven discovery is hard to replicate — competitors with smaller catalogs face a search-and-discovery gap that compounds over time as community contributors keep adding to Replicate.
3. Three pricing dimensions (time + output + token) captured all workload preferences
By offering per-second time billing, per-output image pricing, and per-token LLM pricing on the same platform, Replicate captures customers regardless of their forecasting preference. Forecasting-sensitive teams use per-output / per-token; cost-sensitive teams use per-second time billing. The billing-dimension flexibility maximizes wallet share across workload archetypes.
4. Idle-time-free fine-tuning competed directly with serverless competitors
Fast-booting fine-tuning with no idle charges put Replicate on parity with Modal, Baseten, and other serverless-first platforms — closing a meaningful competitive gap for sporadic training workloads. The launch positioned Replicate as both “the model catalog” and “the production training and inference platform” simultaneously.
Areas to improve : Gaps in Replicate’s pricing approach
1. Per-output image pricing can be punitively expensive at scale
FLUX Dev per-output at $0.025/image, assuming about 4 seconds per image, equates to $0.025 / 4 seconds = $0.00625/second, roughly 4.5× the $0.0014/sec A100 time-billed alternative. Customers who don’t realize this can pay several times more than necessary. Surfacing the per-second equivalent for each per-output model on the pricing page would prevent the bill-shock pattern.
2. Network egress not itemized on pricing page
For high-volume customers serving large image, audio, or video payloads, egress can become a meaningful cost line. Replicate’s pricing page does not break out bandwidth pricing. Making egress pricing explicit (and ideally bundling a generous free egress allowance) would reduce a recurring source of surprise bills.
3. No published cached input or Batch API discounts
Fireworks, OpenAI, Anthropic, and Baseten all ship 50% cached input and Batch API discounts. Replicate does not — meaning RAG and agent-loop workloads with high prefix re-use cost more on Replicate than on competitors. Adding cached input discounts on per-token LLMs would close a meaningful competitive gap.
4. Multi-GPU dedicated requires Enterprise contract
Customers wanting multi-GPU deployments (training, large-context inference) must sign Enterprise contracts. Publishing a self-serve multi-GPU rate (even at limited availability) would let mid-market teams self-qualify and reduce sales-led friction for non-frontier multi-GPU workloads.
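
The per-second equivalent proposed in item 1 above is a one-line calculation, and so is the latency at which per-output and time billing cost the same. This is a minimal sketch using the FLUX Dev and A100 figures from that item; the 4-second latency is the text's assumption, and the helper names are illustrative.

```python
def effective_rate_per_second(price_per_image: float,
                              seconds_per_image: float) -> float:
    """Per-second rate implied by a flat per-image price at a given latency."""
    return price_per_image / seconds_per_image


def break_even_seconds(price_per_image: float, rate_per_second: float) -> float:
    """Latency at which time billing costs the same as the per-image price."""
    return price_per_image / rate_per_second


FLUX_DEV_PER_IMAGE = 0.025  # USD per image
A100_PER_SECOND = 0.0014    # USD per second
ASSUMED_LATENCY = 4.0       # seconds per image, as assumed in item 1

implied = effective_rate_per_second(FLUX_DEV_PER_IMAGE, ASSUMED_LATENCY)
break_even = break_even_seconds(FLUX_DEV_PER_IMAGE, A100_PER_SECOND)

print(f"implied rate:          ${implied:.5f}/sec")
print(f"multiple of A100 rate: {implied / A100_PER_SECOND:.2f}x")
print(f"break-even latency:    {break_even:.1f} sec")
```

On these numbers, per-output billing only becomes the cheaper option once a single image takes longer than roughly 18 seconds of A100 time.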
Key takeaways
- Open-source packaging as the structural ecosystem moat. Cog’s role as the de-facto packaging framework gives Replicate compounding adoption advantages. Infrastructure commercializations that lack an open-source packaging standard face a discoverability gap that pure-marketing positioning cannot close.
- Multi-dimension pricing (time + output + token) captures forecasting-sensitive AND cost-sensitive buyers. Most platforms commit to one billing dimension; Replicate offers all three on the same platform. This value-metric flexibility is the canonical solution for inference platforms targeting diverse workload archetypes.
- Community catalog size is a discovery moat that compounds over time. With 50,000+ public models, Replicate is the canonical “is there an open-source model for X” entry point. Newer competitors face a community-content gap that widens with each new model contribution.
- Founder pedigree shapes which pitches land. Ben Firshman’s Docker Compose authorship makes Replicate’s developer-experience pitch credible in a way that pure-engineering teams cannot replicate. Infrastructure commercializations should treat founder open-source visibility as a structural advantage.
- Idle-time-free billing is becoming table stakes for serverless inference. Modal, Baseten, and Replicate all ship some variant; competitors charging for idle time face an increasingly difficult positioning challenge. The pure-usage billing expectation has shifted decisively toward “pay only for active time.”
UBP implications
- Open-source packaging as the next infrastructure moat. Truss (Baseten) and Cog (Replicate) demonstrate that open-source packaging frameworks can be more durable competitive advantages than runtime optimization. Future infrastructure platforms should seriously evaluate open-source packaging as a GTM strategy.
- Multi-dimension pricing accommodates customer forecasting preferences. Forecasting-sensitive teams pick per-output / per-token; cost-sensitive teams pick per-second. Offering both on the same platform captures both segments without forcing a workload choice.
- Community catalogs scale faster than per-vendor model curation. Replicate’s 50,000+ public models compound community contribution into a discovery moat that competitors cannot replicate quickly. Inference platforms should consider opening public catalogs as a long-term ecosystem strategy, not a short-term marketing tactic.
Sources
- Replicate pricing page (accessed 2026-05-30)
- Replicate docs — billing (accessed 2026-05-30)
- Cog framework GitHub (accessed 2026-05-29)
- Replicate blog — Series B announcement (accessed 2026-05-29)
- Replicate public model catalog (accessed 2026-05-29)
- Related infra blueprint — Modal
- Related infra blueprint — Baseten
- Blueprint corpus index
Bottom line
Replicate priced its inference platform around three structural ideas: a 50,000+ public model catalog that makes Replicate the canonical entry point for AI engineers asking “is there an open-source model for X,” Cog as the open-source packaging framework that creates ecosystem lock-in similar to how Docker became infrastructure-standard, and three parallel pricing dimensions (per-second time billing for cost-sensitive teams, per-output image pricing for forecasting-sensitive teams, per-token LLM pricing for chat-style workloads) that capture customers regardless of billing preference. Ben Firshman’s Docker Compose co-authorship lends unusual founder credibility to the developer-experience pitch.
For AI engineering teams running diverse inference workloads — image generation, fine-tuned chat, multi-modal pipelines — Replicate’s combination of catalog breadth, packaging standard, and flexible billing dimensions makes it one of the most pragmatic commercial inference platforms. The remaining gaps (per-output pricing punitive at scale, egress not itemized, no cached input or Batch discounts, multi-GPU gated behind Enterprise) are competitive parity issues rather than structural pricing flaws.
Pricing timeline : Major events on a vertical axis
Each milestone below corresponds to a public pricing change, product launch, or material adjustment.
Current Live Pricing Snapshot
Live capture of the pricing and billing-docs pages: per-second public-model and dedicated GPU rates (A100 80GB $0.0014/sec, H100 $0.001525/sec), per-output image pricing (FLUX 1.1 Pro $0.04/image, FLUX Dev $0.025/image), and per-token LLM pricing (DeepSeek R1 $3.75/$10 per 1M). Idle-time-free billing with no monthly subscription fee.
Current Gallery + Hardware Pricing Structure
The latest archived snapshot retains the featured-models gallery (per-output image models such as FLUX shown prominently) above a detailed hardware-pricing table covering CPU through multi-GPU tiers for public and private deployments. Pricing remains pure-usage: per-second for hardware, per-image for select image models, and per-token for hosted LLMs, with no idle charge.
Featured-Models Gallery Layout
The pricing page shifted to a featured-models gallery leading with per-output image models (FLUX 1.1, FLUX Schnell, Imagen and others shown with per-image prices), with the per-second hardware-pricing table moved below. The reorganization put per-output model pricing first while keeping the full per-second hardware lineup available for public and private/dedicated deployments.
Per-Output Image Pricing + Expanded Language Model Table
The pricing page added an 'Image models' section with per-output (per-image) pricing for premium image models, alongside a substantially expanded 'Language models' per-token table. Combined with the existing hardware table, Replicate now presented three parallel billing units: per-second (public/dedicated hardware), per-image (image models), and per-token (LLMs).
Per-Token Language Model Pricing Added: Llama 2, Mistral
A dedicated 'Language models' section appeared alongside the hardware table, introducing per-token pricing for select hosted LLMs including Llama 2 and Mistral. This marked Replicate's first move beyond pure per-second time billing toward a per-token unit for chat-style workloads, positioned as a simpler alternative to time-based billing.
Table Redesign + Per-Second Repricing and New GPU Tiers
Replicate moved pricing into a structured hardware table and substantially repriced and expanded the GPU lineup: CPU $0.000100/sec, T4 $0.000225/sec (down from $0.00055), A40 $0.000575/sec, A40 (Large) $0.000725/sec, A100 (40GB) $0.001150/sec, A100 (80GB) $0.001400/sec (down from $0.0032), and 8x Nvidia A40 (Large) at $0.005800/sec. Most per-second rates dropped while new mid-range A40 and multi-GPU tiers were introduced.
A100 Split Into 40GB and 80GB Tiers
The hardware lineup expanded from three to four tiers as the single A100 SKU split into Nvidia A100 (40GB) at $0.0023/sec and Nvidia A100 (80GB) at $0.0032/sec (144GB system RAM, 10x CPU). CPU ($0.0002/sec) and T4 ($0.00055/sec) rates held. This added a higher-memory option for larger models while keeping the same per-second time-billing structure.
Three-Tier Per-Second Pricing: CPU, T4, A100
The Wayback snapshot shows Replicate's pricing built on three hardware tiers billed per-second: CPU at $0.0002/sec (8GB RAM), Nvidia T4 GPU at $0.00055/sec (16GB GPU RAM), and Nvidia A100 GPU at $0.0023/sec (40GB GPU RAM). The page framed it as 'GPUs are expensive, so why leave them on? Pay by the second.' Minimum billable time was 1 second, with no charge to sign up and no charge for canceled-before-start predictions.
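
The size of the repricing described in the snapshots above is easier to judge as a percentage. The sketch below compares the earlier per-second rates with the repriced table for the tiers that appear in both; the pairing of old and new rates is read from the snapshot text, and the names are illustrative.

```python
# (old rate, new rate) in USD per second, as quoted in the snapshots above.
REPRICING = {
    "CPU": (0.0002, 0.0001),
    "Nvidia T4": (0.00055, 0.000225),
    "Nvidia A100 (40GB)": (0.0023, 0.00115),
    "Nvidia A100 (80GB)": (0.0032, 0.0014),
}


def percent_drop(old: float, new: float) -> float:
    """Price reduction as a percentage of the old rate."""
    return (old - new) / old * 100


for hardware, (old, new) in REPRICING.items():
    print(f"{hardware:<20} {old:.6f} -> {new:.6f}  ({percent_drop(old, new):.1f}% lower)")
```

On these figures the cuts range from 50% (CPU and A100 40GB) to roughly 59% (T4), with the A100 80GB in between at about 56%.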
- Replicate's per-second public-model billing means a 4-second FLUX Dev image generation on an A100 costs roughly $0.0056. That is finer-grained than competitors' per-image flat rates, though the per-image SKU ($0.025 for FLUX Dev) is still published as a simpler alternative.
- Replicate was founded in 2019 by Ben Firshman (co-creator of Docker Compose at Docker) and Andreas Jansson, making it a rare AI infrastructure platform whose founder co-created one of the most widely used developer tools in containers.
- Cog, Replicate's open-source model-packaging framework, predates the company's commercial inference platform by two years and remains the de-facto standard for packaging ML models with PyTorch and TensorFlow runtimes, similar to how Truss became Baseten's developer wedge.
Questions & answers
- How much does Replicate cost per month?
- Replicate has no monthly subscription fee: you pay only for the per-second execution time of public and dedicated models, plus any per-output or per-token pricing for premium models. A typical FLUX Dev workload generating 1,000 images/month would cost $25 at $0.025/image, or roughly $5–$8 under per-second time billing, depending on inference latency.
- What is the difference between Replicate public models and dedicated deployments?
- Public models are multi-tenant — they share GPU capacity with other customers and billing is per-second of execution time on the underlying hardware. Dedicated deployments are single-tenant: your own GPU instance with guaranteed throughput, billed per-second only when active (no idle charges). Public models are cheaper for sporadic workloads; dedicated wins at sustained high QPS.
- What are Replicate's per-second GPU rates?
- Dedicated deployment per-second rates: CPU Small $0.000025, CPU $0.0001, T4 $0.000225, L40S $0.000975, A100 80GB $0.0014, H100 $0.001525. Translated to per-hour: A100 ~$5.04/hr, H100 ~$5.49/hr. Multi-GPU deployments available with committed spend contracts.
- Does Replicate have per-output or per-token pricing?
- Yes for select premium models. Per-image: FLUX 1.1 Pro $0.04, FLUX Dev $0.025, Ideogram v3 $0.09. Per-token: Claude 3.7 Sonnet $3.00 input / $15 output per 1M, DeepSeek R1 $3.75 input / $10 output per 1M. Most public models default to per-second time billing.
- What is Cog and how does it relate to pricing?
- Cog is Replicate's open-source model-packaging framework (created at Replicate, MIT-licensed). It packages PyTorch / TensorFlow models with their runtime dependencies for deployment to Replicate. Cog-packaged models bill at the per-second rate of the underlying hardware — no model markup, the GPU rate is the price.
- Does Replicate offer a free tier?
- Yes — new accounts receive free monthly credits for evaluation. The free tier covers light evaluation of public models and lightweight fine-tuning experiments; production workloads require credit card on file. Free credits do not roll over month-to-month.