Replicate pricing

replicate.com | Facts checked | Analysis reviewed
Quick summary
Product: Cloud platform for running, fine-tuning, and deploying AI models via REST API
Industry: Technology
Commits: Available (annual)
AI Summary
  • Replicate runs a hybrid pricing model: public models billed per-second of execution by underlying hardware (most flexible, lowest overhead), select premium models billed per-output (FLUX 1.1 Pro at $0.04/image, FLUX Dev at $0.025/image, Ideogram v3 at $0.09/image) or per-token (Claude 3.7 Sonnet at $3.00/$15 per 1M, DeepSeek R1 at $3.75/$10 per 1M).
  • Dedicated deployment GPU rates per second: CPU Small $0.000025, CPU $0.0001, T4 $0.000225, L40S $0.000975, A100 80GB $0.0014, H100 $0.001525 — billed only for active inference time with no idle charge.
  • Public model catalog includes 50,000+ community models packaged with Cog (Replicate's open-source model-packaging framework); private deployments support custom Cog-packaged models with the same per-second billing as public models.
  • Fine-tuning available on select base models (Llama variants, FLUX) with 'fast booting' billing — customers pay only for active training time. Inference on fine-tuned models bills at the standard per-second rate for the chosen hardware.
  • Enterprise tier offers volume discounts, dedicated multi-GPU committed contracts, custom SLAs, and dedicated support. Free tier provides limited monthly credits for evaluation.
  • Founded 2019 by Ben Firshman (co-creator of Docker Compose) and Andreas Jansson. Raised Series A ($17.8M) in 2022 led by a16z, Series B ($40M) in 2024 led by a16z and Sequoia at ~$350M valuation.
Pricing summary
Replicate 2026 — Per-second public + dedicated GPU + per-output premium
Public models time-billed; dedicated H100 $0.001525/sec; per-image FLUX $0.025–$0.09; per-token Claude $3/$15
| Plan | Starting price | Description |
| --- | --- | --- |
| Free trial | $0 + monthly credits | Evaluating Replicate for proof-of-value |
| Enterprise (annual commit) | Custom | Sustained workloads, multi-GPU training |
| Dedicated GPU | From $0.000025/sec (CPU Small) | Per-second single-tenant GPU rentals |
| Per-output / per-token premium | From $0.003/image (FLUX Schnell) | Premium models with simplified pricing |
Per-second public-model billing pays only for active execution. Per-output image and per-token LLM pricing available for premium models as a forecasting-friendly alternative. Multi-GPU deployments via Enterprise commits.

About

Replicate is a San Francisco-based AI infrastructure company founded in September 2019 by Ben Firshman (co-creator of Docker Compose at Docker) and Andreas Jansson. The product is a cloud platform for running, fine-tuning, and deploying AI models via REST API — paired with Cog, the open-source model-packaging framework Replicate created and maintains. The platform hosts 50,000+ public community models (the largest public-model catalog in managed inference) and supports single-tenant dedicated deployments for production workloads. The canonical interface for running a model is replicate.run("owner/model") — a one-line API call that hides infrastructure entirely.

By 2026 Replicate serves Buzzfeed, Lex, Suno, Captions, Krea, and roughly 5,000 paying customers spanning AI-native startups (image generation, music synthesis, video tools), product engineering teams adding AI features, and enterprise customers running fine-tuned models in production. The company raised a $17.8M Series A in October 2022 led by Andreessen Horowitz, followed by a $40M Series B in March 2024 co-led by a16z and Sequoia Capital at ~$350M valuation. The Cog framework has become the de-facto standard for packaging ML models with PyTorch and TensorFlow runtimes — analogous to Baseten’s Truss.

Replicate competes with Modal, Baseten, RunPod, Fireworks AI, Together AI, and first-party model providers (OpenAI, Anthropic) for the managed-inference market. Its differentiation is the combination of the largest public-model catalog (a community discovery moat that competitors cannot easily replicate), Cog as the open-source packaging standard, per-second billing on both public and dedicated SKUs, and founder credibility (Ben Firshman’s Docker Compose authorship makes the developer-experience pitch credible).


Pricing summary: How Replicate’s per-second + per-output + per-token stack works

Replicate runs three parallel pricing surfaces. Public model time billing is the default: any of the 50,000+ public models in the catalog bills at the per-second rate of the underlying hardware while inference is active — no model markup, the GPU rate is the price. Per-output and per-token premium pricing applies to select hosted models where simplified forecasting matters more than raw cost: FLUX 1.1 Pro at $0.04/image, FLUX Dev at $0.025/image, Ideogram v3 at $0.09/image, Claude 3.7 Sonnet at $3.00 input / $15 output per 1M tokens, DeepSeek R1 at $3.75/$10 per 1M. Dedicated deployments are single-tenant per-second GPU rentals (CPU Small $0.000025, CPU $0.0001, T4 $0.000225, L40S $0.000975, A100 80GB $0.0014, H100 $0.001525) — billed only while active, with no idle charges.

The free tier provides limited monthly credits for evaluation; the Enterprise tier offers volume discounts, multi-GPU committed contracts, custom SLAs, and dedicated support. This three-SKU pure-usage architecture (per-second, per-output, and per-token) lets customers choose between the lowest raw cost and the easiest forecasting. That flexibility is unusual in AI infrastructure and reflects Replicate’s bet that different workloads want different billing dimensions.

What makes this different: Replicate’s 50,000+ public model catalog is the largest in the industry, and Cog is the open-source packaging standard that creates ecosystem lock-in similar to how Docker became infrastructure-standard. The combination — community catalog + open-source packaging + per-second billing — makes Replicate the canonical entry point for AI engineers asking “is there an open-source model for X.”


Pricing by product

Public models (per-second by hardware)

| Hardware | Per-second rate | Use case |
| --- | --- | --- |
| CPU Small | $0.000025 | Lightweight inference |
| CPU | $0.0001 | Standard CPU inference |
| Nvidia T4 | $0.000225 | Small model inference, embeddings |
| Nvidia L40S | $0.000975 | Mid-range models, image generation |
| Nvidia A100 80GB | $0.0014 | 30B–70B inference, fine-tuning |
| Nvidia H100 | $0.001525 | Frontier model serving, low-latency |
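
Per-second rates are easier to compare with competitors' hourly quotes once converted. The following is a minimal Python sketch of that conversion using the rates in the table above; the dictionary keys and the `hourly_rate` helper are illustrative names chosen for this example, not official SKU identifiers or part of any Replicate SDK.

```python
# Per-second rates in USD, copied from the hardware table above.
# The keys are illustrative labels, not official SKU identifiers.
RATES_PER_SECOND = {
    "cpu-small": 0.000025,
    "cpu": 0.0001,
    "t4": 0.000225,
    "l40s": 0.000975,
    "a100-80gb": 0.0014,
    "h100": 0.001525,
}

SECONDS_PER_HOUR = 3600


def hourly_rate(hardware: str) -> float:
    """Return the USD cost of one hour of continuous execution."""
    return RATES_PER_SECOND[hardware] * SECONDS_PER_HOUR


# Print the hourly equivalent of every tier for side-by-side comparison.
for hardware in RATES_PER_SECOND:
    print(f"{hardware:<10} ${hourly_rate(hardware):.2f}/hour")
```

The H100 and A100 lines reproduce the $5.49/hour and $5.04/hour figures quoted elsewhere on this page.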

Per-output premium models

| Model | Rate | Notes |
| --- | --- | --- |
| FLUX Schnell | $0.003 | Per output image ($3.00 / 1,000) |
| FLUX Dev | $0.025 | Per output image |
| FLUX 1.1 Pro | $0.04 | Per output image |
| Recraft V3 | $0.04 | Per output image |
| Ideogram v3 | $0.09 | Per output image |

Per-token LLM models

| Model | Input ($/1M) | Output ($/1M) |
| --- | --- | --- |
| Claude 3.7 Sonnet | $3.00 | $15.00 |
| DeepSeek R1 | $3.75 | $10.00 |
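
To make the per-token table concrete, here is a small Python sketch that prices a single request. It assumes input and output tokens are billed independently at the listed per-1M rates; the model keys and the `token_cost` function are hypothetical names used only for illustration.

```python
# USD per 1M tokens as (input, output), copied from the table above.
# The keys are illustrative labels, not official model identifiers.
TOKEN_RATES = {
    "claude-3.7-sonnet": (3.00, 15.00),
    "deepseek-r1": (3.75, 10.00),
}


def token_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request under per-token billing."""
    input_rate, output_rate = TOKEN_RATES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Example: a 2,000-token prompt with a 500-token completion.
for model in TOKEN_RATES:
    print(f"{model}: ${token_cost(model, 2_000, 500):.4f}")
```

For this request shape DeepSeek R1 comes out slightly cheaper ($0.0125 vs. $0.0135) despite its higher input rate, because output tokens cost less.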

Dedicated deployments

| SKU | Notes |
| --- | --- |
| Single-GPU | Same per-second rates as public models, single-tenant |
| Multi-GPU | Enterprise / committed-spend only |
| Fast booting | Pay only for active time; no idle charges |

Sales motions across products: PLG / self-serve for public-model API, per-output, per-token, and single-GPU dedicated; sales-led for multi-GPU dedicated and Enterprise commits. All prices accessed 2026-05-30 from replicate.com/pricing.


Hidden costs: What Replicate customers actually pay beyond the rate card

Archetype A: AI-native image-generation startup running FLUX Dev

A startup serving ~10,000 FLUX Dev image generations/day at 3-second average inference time on A100:

| Line item | Monthly cost |
| --- | --- |
| Per-output billing (10K/day × 30 × $0.025) | $7,500 |
| Alternative: time billing (3 sec × 10K/day × 30 × $0.0014) | $1,260 |
| Estimated total (time billing) | ~$1,260/month |

For high-volume image generation, the choice between per-output ($0.025) and per-second time billing ($0.0014 × 3 sec = $0.0042) drives a roughly 6× cost difference ($7,500 vs. $1,260 per month in this example). Per-output is simpler to forecast; per-second is dramatically cheaper when inference latency is predictable. Most production teams switch to per-second once they understand the dynamics.
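
The Archetype A arithmetic can be reproduced with a short Python sketch. It assumes a 30-day month and a constant inference time per image; both function names are illustrative.

```python
def monthly_per_output(images_per_day: int, price_per_image: float,
                       days: int = 30) -> float:
    """Monthly USD cost when every image is billed at a flat price."""
    return images_per_day * days * price_per_image


def monthly_per_second(images_per_day: int, seconds_per_image: float,
                       rate_per_second: float, days: int = 30) -> float:
    """Monthly USD cost when only execution seconds are billed."""
    return images_per_day * days * seconds_per_image * rate_per_second


# Archetype A: 10,000 FLUX Dev images/day, 3 seconds each on an A100 80GB.
per_output = monthly_per_output(10_000, 0.025)
per_second = monthly_per_second(10_000, 3.0, 0.0014)

print(f"per-output: ${per_output:,.0f}")
print(f"per-second: ${per_second:,.0f}")
print(f"ratio:      {per_output / per_second:.2f}x")
```

Changing `seconds_per_image` shows how sensitive the gap is to latency: the slower the model, the smaller the advantage of time billing.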

Archetype B: Mid-market team running a fine-tuned Llama on dedicated H100

A team that fine-tuned Llama 3.3 70B and runs sustained inference on a dedicated H100 endpoint:

| Line item | Monthly cost |
| --- | --- |
| H100 dedicated (8h/day × 30 × 3600 × $0.001525) | $1,318 |
| Fine-tuning training (one-time, ~20 minutes on H100) | $18 |
| Fast booting (idle time free; warm pool retained) | Included |
| Egress (large response payloads, not itemized) | Not on pricing page |
| Estimated total | ~$1,336/month |

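The H100 dedicated rate of $0.001525/sec ($5.49/hour) is competitive with Fireworks ($7/hour) and Together ($6.49/hour on dedicated). Fast booting eliminates idle charges, a real cost advantage over platforms that bill by the minute. Two caveats apply to the total: the fine-tuning line is a one-time cost rather than a recurring one, and at the listed H100 rate a 20-minute run would come to about $1.83 (1,200 sec × $0.001525), so the $18 figure presumably reflects a longer or multi-GPU job. The lack of itemized egress on the pricing page is the main forecasting gap.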
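
As a rough illustration of why idle-free billing matters for this archetype, the Python sketch below compares the 8-hours-per-day bill with what the same endpoint would cost if every hour of the day were billed. It assumes a 30-day month; the 24-hour case is a hypothetical comparison, not a Replicate SKU.

```python
H100_RATE_PER_SECOND = 0.001525  # USD, from the rate card above


def dedicated_monthly_cost(active_hours_per_day: float, rate_per_second: float,
                           days: int = 30) -> float:
    """Monthly USD cost of a deployment billed only for active seconds."""
    active_seconds = active_hours_per_day * 3600 * days
    return active_seconds * rate_per_second


active_only = dedicated_monthly_cost(8, H100_RATE_PER_SECOND)
always_billed = dedicated_monthly_cost(24, H100_RATE_PER_SECOND)

print(f"billed for 8 active hours/day: ${active_only:,.2f}")
print(f"billed around the clock:       ${always_billed:,.2f}")
print(f"saved by not paying for idle:  ${always_billed - active_only:,.2f}")
```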


Pricing evolution: Replicate’s pricing history from Cog framework to multi-SKU platform

Cadence

| Quarter | Price changes | Product / SKU additions | Notes |
| --- | --- | --- | --- |
| 2019 Q3 | 0 | 1 | Replicate founded; Cog open-sourced |
| 2021 Q4 | 0 | 1 | Public model catalog launched; per-second time billing |
| 2022 Q4 | 0 | 0 | Series A ($17.8M) led by a16z |
| 2023 Q3 | 0 | 1 | Dedicated deployments + Cog production hardening |
| 2024 Q1 | 0 | 0 | Series B ($40M) at ~$350M valuation |
| 2024 Q3 | 0 | 1 | Per-output image pricing (FLUX, Ideogram, SD3) |
| 2025 Q1 | 0 | 1 | Per-token LLM pricing (Claude, DeepSeek) |
| 2025 Q3 | 0 | 1 | L40S + multi-GPU dedicated deployments |
| 2026 Q1 | 0 | 1 | Idle-time-free fine-tuning + fast booting |

Tracked range: 2019 Q3–2026 Q1. Quarters not listed above were verified stable (0 price changes, 0 SKU additions).

Notable changes

  • 2021-12-08 — Public model catalog launched with per-second time billing by hardware.
  • 2023-07-28 — Dedicated deployments launched; A100 at $0.0014/sec, H100 at $0.001525/sec.
  • 2024-08-12 — Per-output pricing introduced (FLUX 1.1 Pro $0.04, FLUX Dev $0.025, Ideogram v3 $0.09).
  • 2025-02-26 — Per-token LLM pricing added for Claude 3.7 Sonnet and DeepSeek R1.
  • 2025-09-25 — L40S and multi-GPU dedicated deployments launched.
  • 2026-01-20 — Fast-booting fine-tuning launched with idle-time-free billing.

What’s unique: Replicate’s distinctive pricing mechanics

1. 50,000+ public model catalog is the largest in managed inference. No competitor approaches Replicate’s community catalog size. For AI engineers searching “is there an open-source model for X,” Replicate is the canonical entry point, and the catalog creates a community-driven discovery moat that competitors cannot replicate without years of community accumulation. This advantage is structural rather than a pricing mechanic, but it shapes what customers can buy.

2. Three pricing dimensions (time + output + token) on the same platform. Most platforms commit to one billing dimension; Replicate offers per-second time billing on most models, per-output image pricing on premium models, and per-token LLM pricing on select chat models. Customers can pick the billing dimension that matches their forecasting preference — a granularity unusual in the inference category.

3. Cog as open-source packaging standard. Cog (created at Replicate, MIT-licensed) is the de-facto framework for packaging PyTorch and TensorFlow models with their runtime dependencies. Cog-packaged models bill at the GPU rate — no model markup, the per-second rate is the price. This open-source-as-developer-moat strategy parallels Baseten’s Truss approach but at much larger ecosystem scale.

4. Idle-time-free fine-tuning with fast booting. Customers pay only for active training and inference time, not for cold starts or warm-pool retention. For sporadic fine-tuning workloads, this materially reduces total cost versus per-minute or per-hour competitors who bill idle time.

5. Founder credibility from Docker Compose authorship. Ben Firshman co-created Docker Compose, one of the most widely deployed developer tools in containers. The implicit pitch “we know developer experience because we built the developer-experience standard” carries weight that competitors cannot replicate without similar founder pedigree.


Strengths & weaknesses

| Strengths | Weaknesses |
| --- | --- |
| 50,000+ public model catalog (largest in managed inference) | Per-output image pricing can be roughly 4–6× more expensive than per-second time billing |
| Three pricing dimensions (time + output + token) on same platform | Network egress not itemized on pricing page |
| Cog open-source packaging is the ecosystem standard | Free tier credits do not roll over month-to-month |
| Docker Compose founder credibility for developer experience | Multi-GPU dedicated deployments require Enterprise contract |
| Fast booting + idle-time-free fine-tuning | H100 per-second rate works out to $5.49/hour, which is not the cheapest available |
| Cog-packaged models bill at GPU rate, no model markup | No published cached-input or Batch API discounts |

Billing UX: Replicate’s account controls and payment experience

  • Self-serve signup — Sign up at replicate.com with GitHub or email; free trial credits applied automatically. Credit card required for production usage.
  • Per-prediction usage metadata — API responses include execution time, hardware used, and cost per prediction — letting developers compute and surface real-time cost.
  • Workspace and project organization — Workspace-level usage aggregation; per-environment separation supported via API tokens.
  • Spend alerts — Configurable email alerts at $X spend per period; no hard spend caps documented.
  • Payment methods — Credit card and ACH on self-serve; wire transfer, invoice billing, and AWS/GCP Marketplace on Enterprise.
  • Annual commit pricing — Enterprise customers receive volume discounts in exchange for annual usage commitments and dedicated capacity.
  • Public model directory — Browse 50,000+ community models with cost-per-prediction estimates per hardware tier.
  • Cog CLI — Local Cog development with cog predict for testing; push to Replicate with cog push.
  • Multi-region availability — US standard; EU and APAC regions on Enterprise via dedicated deployment.
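
The per-prediction usage metadata described in the list above can be turned into a running cost figure on the client side. The Python sketch below is one hedged way to do it: the JSON shape, including the `hardware` and `metrics.predict_time` field names, is an assumption made for this example and should be checked against the actual API response before use.

```python
import json

# Per-second rates in USD from the hardware table; keys are illustrative labels.
RATES_PER_SECOND = {
    "t4": 0.000225,
    "l40s": 0.000975,
    "a100-80gb": 0.0014,
    "h100": 0.001525,
}

# Hypothetical response body. The field names are assumptions for this sketch.
SAMPLE_RESPONSE = (
    '{"id": "example", "status": "succeeded", '
    '"hardware": "a100-80gb", "metrics": {"predict_time": 4.0}}'
)


def prediction_cost(response_body: str) -> float:
    """Estimate the USD cost of one prediction from its execution time."""
    payload = json.loads(response_body)
    seconds = payload["metrics"]["predict_time"]
    rate = RATES_PER_SECOND[payload["hardware"]]
    return seconds * rate


print(f"${prediction_cost(SAMPLE_RESPONSE):.4f}")
```

Summing this value across predictions gives a client-side spend estimate, which is useful given that no hard spend caps are documented.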

Strategic wins: Why Replicate’s pricing decisions worked

1. Cog as the open-source packaging standard built an ecosystem moat

By open-sourcing Cog in 2019, two years before the public model catalog launched, Replicate seeded a packaging framework that became the de-facto standard for ML model deployment. The open-source-as-developer-moat strategy parallels Docker’s own playbook, an apt analogy given that Replicate co-founder Ben Firshman co-created Docker Compose. Customers who package models with Cog for local development migrate naturally to Replicate for production.

2. 50,000+ public model catalog is a community discovery moat

The largest public-model catalog in managed inference makes Replicate the canonical entry point for AI engineers asking “is there an open-source model for X.” Community-driven discovery is hard to replicate — competitors with smaller catalogs face a search-and-discovery gap that compounds over time as community contributors keep adding to Replicate.

3. Three pricing dimensions (time + output + token) captured all workload preferences

By offering per-second time billing, per-output image pricing, and per-token LLM pricing on the same platform, Replicate captures customers regardless of their forecasting preference. Forecasting-sensitive teams use per-output / per-token; cost-sensitive teams use per-second time billing. The billing-dimension flexibility maximizes wallet share across workload archetypes.

4. Idle-time-free fine-tuning competed directly with serverless competitors

Fast-booting fine-tuning with no idle charges put Replicate on parity with Modal, Baseten, and other serverless-first platforms — closing a meaningful competitive gap for sporadic training workloads. The launch positioned Replicate as both “the model catalog” and “the production training and inference platform” simultaneously.


Areas to improve: Gaps in Replicate’s pricing approach

1. Per-output image pricing can be punitively expensive at scale

FLUX Dev per-output at $0.025/image works out to $0.025 / 4 seconds = $0.00625/second for a 4-second generation, about 4.5× the $0.0014/sec A100 time-billed alternative (and roughly 6× at 3 seconds per image). Customers who don’t realize this can pay several times more than necessary. Surfacing the per-second equivalent for each per-output model on the pricing page would prevent the bill-shock pattern.
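
The per-second equivalent suggested here is simple to compute. The following Python sketch derives both the implied per-second rate of a flat per-image price and the break-even inference time at which the two billing modes cost the same; the function names are illustrative, and the 4-second latency is the assumption used in this section.

```python
def effective_rate(price_per_output: float, seconds_per_output: float) -> float:
    """Per-second price implied by a flat per-output price."""
    return price_per_output / seconds_per_output


def break_even_seconds(price_per_output: float, rate_per_second: float) -> float:
    """Inference time at which per-second billing equals the per-output price."""
    return price_per_output / rate_per_second


A100_RATE = 0.0014  # USD per second, A100 80GB
FLUX_DEV = 0.025    # USD per image

implied = effective_rate(FLUX_DEV, 4.0)
print(f"implied rate at 4 s/image: ${implied:.5f}/sec")
print(f"multiple of A100 rate:     {implied / A100_RATE:.2f}x")
print(f"break-even latency:        {break_even_seconds(FLUX_DEV, A100_RATE):.1f} s")
```

Under these assumptions, per-output pricing only comes out ahead when an image needs more than roughly 18 seconds of A100 time.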

2. Network egress not itemized on pricing page

For high-volume customers serving large image, audio, or video payloads, egress can become a meaningful cost line. Replicate’s pricing page does not break out bandwidth pricing. Making egress pricing explicit (and ideally bundling a generous free egress allowance) would reduce a recurring source of surprise bills.

3. No published cached input or Batch API discounts

Fireworks, OpenAI, Anthropic, and Baseten all ship cached-input and Batch API discounts, commonly around 50%. Replicate does not, which means RAG and agent-loop workloads with high prefix re-use cost more on Replicate than on competitors. Adding cached-input discounts on per-token LLMs would close a meaningful competitive gap.

4. Multi-GPU dedicated requires Enterprise contract

Customers wanting multi-GPU deployments (training, large-context inference) must sign Enterprise contracts. Publishing a self-serve multi-GPU rate (even at limited availability) would let mid-market teams self-qualify and reduce sales-led friction for non-frontier multi-GPU workloads.


Key takeaways

  1. Open-source packaging as the structural ecosystem moat. Cog’s role as the de-facto packaging framework gives Replicate compounding adoption advantages. Infrastructure commercializations that lack an open-source packaging standard face a discoverability gap that pure-marketing positioning cannot close.

  2. Multi-dimension pricing (time + output + token) captures forecasting-sensitive AND cost-sensitive buyers. Most platforms commit to one billing dimension; Replicate offers all three on the same platform. This value-metric flexibility is the canonical solution for inference platforms targeting diverse workload archetypes.

  3. Community catalog size is a discovery moat that compounds over time. With 50,000+ public models, Replicate is the canonical “is there an open-source model for X” entry point. Newer competitors face a community-content gap that widens with each new model contribution.

  4. Founder pedigree shapes which pitches land. Ben Firshman’s Docker Compose authorship makes Replicate’s developer-experience pitch credible in a way that pure-engineering teams cannot replicate. Infrastructure commercializations should treat founder open-source visibility as a structural advantage.

  5. Idle-time-free billing is becoming table stakes for serverless inference. Modal, Baseten, and Replicate all ship some variant; competitors charging for idle time face an increasingly difficult positioning challenge. The pure-usage billing expectation has shifted decisively toward “pay only for active time.”


UBP implications

  1. Open-source packaging as the next infrastructure moat. Truss (Baseten) and Cog (Replicate) demonstrate that open-source packaging frameworks can be more durable competitive advantages than runtime optimization. Future infrastructure platforms should seriously evaluate open-source packaging as a GTM strategy.

  2. Multi-dimension pricing accommodates customer forecasting preferences. Forecasting-sensitive teams pick per-output / per-token; cost-sensitive teams pick per-second. Offering both on the same platform captures both segments without forcing a workload choice.

  3. Community catalogs scale faster than per-vendor model curation. Replicate’s 50,000+ public models compound community contribution into a discovery moat that competitors cannot replicate quickly. Inference platforms should consider opening public catalogs as a long-term ecosystem strategy, not a short-term marketing tactic.



Bottom line

Replicate priced its inference platform around three structural ideas: a 50,000+ public model catalog that makes Replicate the canonical entry point for AI engineers asking “is there an open-source model for X,” Cog as the open-source packaging framework that creates ecosystem lock-in similar to how Docker became infrastructure-standard, and three parallel pricing dimensions (per-second time billing for cost-sensitive teams, per-output image pricing for forecasting-sensitive teams, per-token LLM pricing for chat-style workloads) that capture customers regardless of billing preference. Ben Firshman’s Docker Compose co-authorship lends unusual founder credibility to the developer-experience pitch.

For AI engineering teams running diverse inference workloads — image generation, fine-tuned chat, multi-modal pipelines — Replicate’s combination of catalog breadth, packaging standard, and flexible billing dimensions makes it one of the most pragmatic commercial inference platforms. The remaining gaps (per-output pricing punitive at scale, egress not itemized, no cached input or Batch discounts, multi-GPU gated behind Enterprise) are competitive parity issues rather than structural pricing flaws.

Pricing timeline: Major events

Each milestone below corresponds to a public pricing change, product launch, or material adjustment, listed from newest to oldest.

Current Live Pricing Snapshot

Live capture of the pricing and billing-docs pages: per-second public-model and dedicated GPU rates (A100 80GB $0.0014/sec, H100 $0.001525/sec), per-output image pricing (FLUX 1.1 Pro $0.04/image, FLUX Dev $0.025/image), and per-token LLM pricing (DeepSeek R1 $3.75/$10 per 1M). Idle-time-free billing with no monthly subscription fee.


Current Gallery + Hardware Pricing Structure

The latest archived snapshot retains the featured-models gallery (per-output image models such as FLUX shown prominently) above a detailed hardware-pricing table covering CPU through multi-GPU tiers for public and private deployments. Pricing remains pure-usage: per-second for hardware, per-image for select image models, and per-token for hosted LLMs, with no idle charge.


Featured-Models Gallery Layout

The pricing page shifted to a featured-models gallery leading with per-output image models (FLUX 1.1, FLUX Schnell, Imagen and others shown with per-image prices), with the per-second hardware-pricing table moved below. The reorganization put per-output model pricing first while keeping the full per-second hardware lineup available for public and private/dedicated deployments.


Per-Output Image Pricing + Expanded Language Model Table

The pricing page added an 'Image models' section with per-output (per-image) pricing for premium image models, alongside a substantially expanded 'Language models' per-token table. Combined with the existing hardware table, Replicate now presented three parallel billing units: per-second (public/dedicated hardware), per-image (image models), and per-token (LLMs).


Per-Token Language Model Pricing Added: Llama 2, Mistral

A dedicated 'Language models' section appeared alongside the hardware table, introducing per-token pricing for select hosted LLMs including Llama 2 and Mistral. This marked Replicate's first move beyond pure per-second time billing toward a per-token unit for chat-style workloads, positioned as a simpler alternative to time-based billing.


Table Redesign + Per-Second Repricing and New GPU Tiers

Replicate moved pricing into a structured hardware table and substantially repriced and expanded the GPU lineup: CPU $0.000100/sec, T4 $0.000225/sec (down from $0.00055), A40 $0.000575/sec, A40 (Large) $0.000725/sec, A100 (40GB) $0.001150/sec, A100 (80GB) $0.001400/sec (down from $0.0032), and 8x Nvidia A40 (Large) at $0.005800/sec. Most per-second rates dropped while new mid-range A40 and multi-GPU tiers were introduced.
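
To quantify the size of this repricing, the Python sketch below computes the percentage change per tier. The "old" values are the rates from the earlier A100-split snapshot described in this timeline and the "new" values are from the redesigned table; pairing them this way is an assumption of the example.

```python
# (old, new) per-second rates in USD, taken from the timeline snapshots.
REPRICING = {
    "CPU": (0.0002, 0.0001),
    "T4": (0.00055, 0.000225),
    "A100 (40GB)": (0.0023, 0.00115),
    "A100 (80GB)": (0.0032, 0.0014),
}


def percent_change(old: float, new: float) -> float:
    """Signed percentage change from old to new (negative means cheaper)."""
    return (new - old) / old * 100


for tier, (old, new) in REPRICING.items():
    print(f"{tier:<12} {percent_change(old, new):+.1f}%")
```

Every tier that existed before the redesign became at least 50% cheaper, with the T4 seeing the largest cut at about 59%.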


A100 Split Into 40GB and 80GB Tiers

The hardware lineup expanded from three to four tiers as the single A100 SKU split into Nvidia A100 (40GB) at $0.0023/sec and Nvidia A100 (80GB) at $0.0032/sec (144GB system RAM, 10x CPU). CPU ($0.0002/sec) and T4 ($0.00055/sec) rates held. This added a higher-memory option for larger models while keeping the same per-second time-billing structure.


Three-Tier Per-Second Pricing: CPU, T4, A100

The Wayback snapshot shows Replicate's pricing built on three hardware tiers billed per-second: CPU at $0.0002/sec (8GB RAM), Nvidia T4 GPU at $0.00055/sec (16GB GPU RAM), and Nvidia A100 GPU at $0.0023/sec (40GB GPU RAM). The page framed it as 'GPUs are expensive, so why leave them on? Pay by the second.' Minimum billable time was 1 second, with no charge to sign up and no charge for canceled-before-start predictions.

Trivia
  • Replicate's per-second public-model billing means a 4-second FLUX Dev image generation on an A100 costs roughly $0.0056 (4 × $0.0014), finer granularity than competitors' per-image flat rates, though the per-image SKU ($0.025 for FLUX Dev) is still published as a simpler alternative.
  • Replicate was founded in 2019 by Ben Firshman (co-creator of Docker Compose at Docker) and Andreas Jansson, making it a rare AI infrastructure platform whose founder co-created one of the most widely used developer tools in containers.
  • Cog, Replicate's open-source model-packaging framework, predates the company's commercial inference platform by two years and remains the de-facto standard for packaging ML models with PyTorch and TensorFlow runtimes, similar to how Truss became Baseten's developer wedge.

Questions & answers

How much does Replicate cost per month?
Replicate has no monthly subscription fee: you pay only for the per-second execution time of public and dedicated models, plus any per-output or per-token pricing for premium models. A typical FLUX Dev workload generating 1,000 images/month would cost $25 at $0.025/image, or roughly $4–$6 under per-second billing on an A100 at 3–4 seconds per image.
What is the difference between Replicate public models and dedicated deployments?
Public models are multi-tenant — they share GPU capacity with other customers and billing is per-second of execution time on the underlying hardware. Dedicated deployments are single-tenant: your own GPU instance with guaranteed throughput, billed per-second only when active (no idle charges). Public models are cheaper for sporadic workloads; dedicated wins at sustained high QPS.
What are Replicate's per-second GPU rates?
Dedicated deployment per-second rates: CPU Small $0.000025, CPU $0.0001, T4 $0.000225, L40S $0.000975, A100 80GB $0.0014, H100 $0.001525. Translated to per-hour: A100 ~$5.04/hr, H100 ~$5.49/hr. Multi-GPU deployments available with committed spend contracts.
Does Replicate have per-output or per-token pricing?
Yes for select premium models. Per-image: FLUX 1.1 Pro $0.04, FLUX Dev $0.025, Ideogram v3 $0.09. Per-token: Claude 3.7 Sonnet $3.00 input / $15 output per 1M, DeepSeek R1 $3.75 input / $10 output per 1M. Most public models default to per-second time billing.
What is Cog and how does it relate to pricing?
Cog is Replicate's open-source model-packaging framework (created at Replicate, MIT-licensed). It packages PyTorch / TensorFlow models with their runtime dependencies for deployment to Replicate. Cog-packaged models bill at the per-second rate of the underlying hardware — no model markup, the GPU rate is the price.
Does Replicate offer a free tier?
Yes — new accounts receive free monthly credits for evaluation. The free tier covers light evaluation of public models and lightweight fine-tuning experiments; production workloads require credit card on file. Free credits do not roll over month-to-month.