GPU-Hour Pricing: Examples & Companies

What it is

GPU-hour pricing is a billing model in which customers are charged for the GPU time they hold, typically metered per second or per hour at a rate set by the GPU type.

A GPU-hour is the most literal unit in AI infrastructure: one GPU, running for one hour. The bill is the per-hour rate for a specific accelerator — an NVIDIA H100, A100, B200, L40S, RTX 4090 — multiplied by the time that GPU is reserved to your workload. Unlike per-token or per-request meters, which charge for the output of computation, GPU-hours charge for the machine itself, whether it is saturated or idle. That makes the unit the native cost driver of training, fine-tuning, and any inference workload where the customer holds a GPU full-time rather than sharing it.

The rate is never one number. It is a function of GPU model and, increasingly, of reliability, commitment, and packaging. RunPod publishes one of the widest single rate cards in the corpus, spanning an RTX A5000 at $0.27/hr through a B300 (288 GB HBM3e) at $7.39/hr on its Secure Cloud ladder. Vast.ai starts even lower — an RTX 5060 Ti at $0.194/hr — because it is a marketplace where individual hosts set the price, not the platform. At the other end, Together AI lists a dedicated HGX B200 endpoint at $11.95/hr, and Fireworks AI prices a B300 at $12.00/hr on-demand.

The spread on identical silicon is the headline. An H100 in this corpus runs about $1.50/GPU-hr on Hyperbolic’s marketplace, $1.70/GPU-hr on Novita AI’s bare-metal nodes, and $1.79/GPU-hr on DeepInfra — but $3.85 on-demand at Nebius, $3.99 at Lambda, $6.16 per GPU inside a CoreWeave HGX node, and $6.49–$7.00 on Together AI and Fireworks AI dedicated endpoints. Same chip, ~4.7x price range, driven almost entirely by how much managed software, reliability, and support wraps the raw GPU.
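The spread figure can be reproduced from the rates quoted above. The following is a minimal Python sketch that uses only the H100 per-GPU-hour figures in this section; the dictionary name and labels are illustrative and not part of any vendor's API.

```python
# H100 per-GPU-hour rates quoted in this section (USD).
# Labels are illustrative; the numbers come from the paragraph above.
h100_rates = {
    "Hyperbolic (marketplace)": 1.50,
    "Novita AI (bare metal)": 1.70,
    "DeepInfra": 1.79,
    "Nebius (on-demand)": 3.85,
    "Lambda (on-demand)": 3.99,
    "CoreWeave (per GPU in HGX node)": 6.16,
    "Together AI (dedicated endpoint)": 6.49,
    "Fireworks AI (dedicated endpoint)": 7.00,
}

cheapest = min(h100_rates, key=h100_rates.get)
priciest = max(h100_rates, key=h100_rates.get)

# Ratio of the highest quoted rate to the lowest quoted rate.
spread = h100_rates[priciest] / h100_rates[cheapest]

print(f"lowest:  {cheapest} at ${h100_rates[cheapest]:.2f}/hr")
print(f"highest: {priciest} at ${h100_rates[priciest]:.2f}/hr")
print(f"spread:  {spread:.1f}x")
```

The ratio depends entirely on which endpoints are included, so adding or dropping a vendor changes the headline number.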

Almost every platform that quotes an “hour” actually meters finer. Modal publishes its H100 at $0.001097 per second (~$3.95/hr) and bills containers down to the second; Replicate bills an A100 at $0.0014/sec and an H100 at $0.001525/sec; Baseten quotes per minute ($0.10833/min H100, ~$6.50/hr). The hourly figure is a presentation convention layered on top of fine-grained metering — which matters because real inference jobs often run for seconds, not hours, and a per-hour rounding rule would massively overcharge them.
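To make the granularity point concrete, the sketch below converts the per-second and per-minute quotes above into hourly equivalents, then prices a 90-second job under two billing increments. The function names are hypothetical, and the hour-rounded case is an assumed policy for comparison only, not one attributed to any vendor here.

```python
import math

SECONDS_PER_HOUR = 3600

def hourly_equivalent(rate: float, unit_seconds: int) -> float:
    """Convert a rate quoted per `unit_seconds` of GPU time into $/hr."""
    return rate * SECONDS_PER_HOUR / unit_seconds

def job_cost(hourly_rate: float, job_seconds: int, increment_seconds: int) -> float:
    """Cost of one job when usage is rounded up to the billing increment."""
    increments = math.ceil(job_seconds / increment_seconds)
    return hourly_rate * increments * increment_seconds / SECONDS_PER_HOUR

modal_h100 = hourly_equivalent(0.001097, 1)    # Modal, quoted per second
baseten_h100 = hourly_equivalent(0.10833, 60)  # Baseten, quoted per minute

# A 90-second inference job at the Modal-equivalent hourly rate.
billed_per_second = job_cost(modal_h100, 90, 1)
# Assumed hour-rounded policy, for comparison only.
billed_per_hour = job_cost(modal_h100, 90, SECONDS_PER_HOUR)

print(f"Modal H100:   ${modal_h100:.2f}/hr")
print(f"Baseten H100: ${baseten_h100:.2f}/hr")
print(f"90 s billed per second:  ${billed_per_second:.4f}")
print(f"90 s rounded to an hour: ${billed_per_hour:.4f}")
```

Under these assumptions the hour-rounded bill is 40 times the per-second bill, because 90 seconds is one fortieth of an hour.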

[Figure: One H100-hour, a ~4.7x span set by the software wrap]

How it works

The core formula is trivial: GPU cost equals the per-GPU-hour rate for the chosen GPU type, multiplied by the time the GPU is held (almost always metered per second and converted). The complexity lives in the dimensions wrapped around that formula — which GPU, which reliability/commitment tier, on-demand versus spot versus reserved, whether the rate is a single card or a market, and whether storage and bandwidth ride alongside.

Dimension	What it controls	Example from this corpus
GPU model	The base rate — frontier chips cost multiples of older ones	RunPod: RTX 4090 $0.69/hr → H100 PCIe $2.89/hr → B300 $7.39/hr
Reliability tier	SLA-backed capacity vs cheaper partner/spot capacity	RunPod splits Secure Cloud vs Community Cloud; CoreWeave lists spot at ~60% off on-demand
Commitment / spot	On-demand vs reserved vs preemptible	Together AI H100 $3.99 on-demand → $3.09/hr floor reserved; Nebius H100 $3.85 on-demand vs $2.15 preemptible
Marketplace vs list	Platform-set price vs host-set floating rate	RunPod & Fireworks publish fixed cards; Vast.ai & Hyperbolic float rates set by third-party supply
Metering granularity	Per-second vs per-minute vs per-hour	Modal & Replicate bill per second; Baseten quotes per minute ($0.10833/min H100)
Bundled meters	Storage and bandwidth charged alongside the GPU	Vast.ai adds $/GB/hr storage (billed even when stopped) + $/TB bandwidth on top of the GPU rate

The dimension that dominates the bill is the reliability/commitment tier: in this corpus a preemptible or spot rate runs roughly 45–60% below on-demand (Nebius at the low end, CoreWeave at the high end), and reserved commitments discount the on-demand rate in exchange for a term. The worked example below makes that trade explicit.

Unit math: A fine-tuning job on a single H100 for 3 hours costs 3 × $2.89 = $8.67 on RunPod’s Secure Cloud. The same 3 hours on Modal, billed per second, is 10,800 sec × $0.001097 = $11.85. Reserve a B200 on Together AI instead of paying on-demand and a 100-hour run drops from 100 × $8.19 = $819 to 100 × $7.99 = $799, while a preemptible H100 on Nebius costs 100 × $2.15 = $215 versus 100 × $3.85 = $385 on-demand — the reliability trade made explicit.
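The unit math above can be checked with a few lines of Python. This is a sketch under the rates quoted in this section; `gpu_cost` is a hypothetical helper that implements the core formula (rate multiplied by time held), not a vendor SDK call.

```python
def gpu_cost(rate_per_hour: float, hours: float, gpus: int = 1) -> float:
    """Core formula: per-GPU-hour rate x hours held x number of GPUs."""
    return rate_per_hour * hours * gpus

# 3-hour fine-tuning job on one H100.
runpod_secure = gpu_cost(2.89, 3)        # RunPod Secure Cloud, hourly rate
modal_per_second = 0.001097 * 3 * 3600   # Modal, billed per second

# 100-hour runs: on-demand vs reserved, and on-demand vs preemptible.
together_b200_on_demand = gpu_cost(8.19, 100)
together_b200_reserved = gpu_cost(7.99, 100)
nebius_h100_on_demand = gpu_cost(3.85, 100)
nebius_h100_preemptible = gpu_cost(2.15, 100)

print(f"RunPod, 3 h:            ${runpod_secure:.2f}")
print(f"Modal, 10,800 s:        ${modal_per_second:.2f}")
print(f"Together B200 reserved: saves ${together_b200_on_demand - together_b200_reserved:.2f}")
print(f"Nebius H100 preemptible: saves ${nebius_h100_on_demand - nebius_h100_preemptible:.2f}")
```

The reserved B200 saves $20 on the run while the preemptible H100 saves $170, which shows that tolerance for interruption is priced far more steeply than a term commitment in these two examples.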

The reserved-and-preemptible discounting is the same lever the broader infrastructure market is pulling — see the infrastructure commitment-discount trend. And because per-token inference keeps getting cheaper (the token-price-deflation trend), GPU-hour pricing increasingly competes against per-token serverless for the same inference workload: you rent the GPU only when keeping it busy beats paying per token. To model dedicated-GPU economics against usage rates, see the pricing calculator hub.
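The rent-versus-per-token decision reduces to a break-even utilization. The sketch below assumes a dedicated H100 at the $6.49/hr Together AI figure quoted in this article; the per-token price and the sustained throughput are hypothetical placeholders, not figures from this corpus, and should be replaced with measured values.

```python
def breakeven_utilization(
    gpu_rate_per_hour: float,
    price_per_million_tokens: float,
    tokens_per_second: float,
) -> float:
    """Fraction of each hour the GPU must be busy for a dedicated GPU
    to cost the same as paying per token for the same output."""
    # What an hour of fully busy output would cost on per-token pricing.
    per_token_cost_at_full_load = (
        price_per_million_tokens * tokens_per_second * 3600 / 1_000_000
    )
    return gpu_rate_per_hour / per_token_cost_at_full_load

u = breakeven_utilization(
    gpu_rate_per_hour=6.49,         # dedicated H100 rate quoted in this article
    price_per_million_tokens=0.90,  # HYPOTHETICAL serverless price
    tokens_per_second=2500,         # HYPOTHETICAL sustained throughput
)

if u >= 1:
    print("Dedicated never wins at this throughput and token price.")
else:
    print(f"Dedicated wins once the GPU is busy more than {u:.0%} of the time.")
```

A result above 1 means the GPU cannot produce enough tokens per hour to beat the per-token price even at full load.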

Companies using this

Twenty-seven companies in the corpus meter GPU-hours. They cluster into four groups: dedicated GPU clouds (CoreWeave, Lambda, Nebius, RunPod, Together AI, DeepInfra, Novita AI); marketplaces where hosts set the rate (Vast.ai, Hyperbolic); serverless compute platforms that bill GPU time per second (Modal, Replicate, Lightning AI, BentoML); and inference/application/data platforms that expose dedicated GPU rates alongside other meters (Baseten, Fireworks AI, Fal, Anyscale, Predibase, Databricks (Mosaic AI), Hugging Face, Cerebras, Inflection AI, Midjourney, Roboflow, Comet, LanceDB, Milvus).

Patterns observed

The rate card is the product positioning. GPU-hour pricing is one of the few meters where the price list directly signals the target customer. RunPod puts a $0.69/hr RTX 4090 next to a $7.39/hr B300 to capture hobbyists and frontier teams on one platform; Vast.ai starts at $0.194/hr by being a marketplace rather than an operator; Hyperbolic advertises an H100 from $1.50/hr off aggregated third-party supply. At the other extreme, Fireworks AI at $7.00/hr per H100 and Together AI at $6.49/hr dedicated price for teams buying managed inference, not bare metal — the higher sticker buys a runtime, not just silicon.
The cross-vendor H100 spread is ~4.7x on identical silicon. The full price ladder is in the definition above; the takeaway is that it falls into three tiers, each a different product built on the same chip: marketplaces and open-model clouds near hardware cost, transparent neoclouds adding tooling and reliability, and managed-inference platforms charging for the software runtime. The chip is a commodity; the wrapper is the product.
“Per hour” is a label; per second is the meter. Fal, Modal, and Replicate all meter at second or sub-second granularity and quote hourly only for readability, while Baseten splits the difference at per-minute. The finer the granularity, the better the unit fits short inference jobs, which is exactly the scale-to-zero workload these platforms court, and where an hour-rounded rate would massively overcharge.
Spot, preemptible, and reserved tiers are becoming table stakes. CoreWeave’s spot H100 node at $19.71/hr against $49.24 on-demand (~60% off) and Nebius’s preemptible H100 at $2.15 versus $3.85 (~45% off) both monetize the reliability trade at the compute level, while Together AI’s reserved cluster H100 falls to a $3.09/hr floor on a 91–180 day commit. This mirrors the infrastructure commitment-discount standard: the more you commit or the more interruption you tolerate, the lower the effective GPU-hour.
The same GPU is sold at multiple prices by packaging. Novita AI lists an H100 at $1.99/GPU-hr as a dedicated endpoint and $1.70/GPU-hr on an 8-GPU bare-metal node; Nebius exposes a self-serve Explorer Tier from $1.50/GPU-hr (capped at 1,000 GPU-hours/month) alongside its full on-demand card. The GPU-hour is one unit, but vendors slice it by product surface to serve hobbyists, teams, and enterprises off one fleet.
GPU-hours rarely travel alone at application platforms. Midjourney wraps GPU-hours inside fast/relax subscription bundles; Roboflow folds GPU inference into a unified credit meter; Cerebras leads with per-token inference and exposes wafer-scale system time only for dedicated capacity. Higher up the stack, the GPU-hour becomes an input the buyer rarely transacts in directly.
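The discount percentages cited in the patterns above follow directly from the quoted rate pairs. A short sketch, assuming only the figures in this article; `discount_pct` and the tier labels are illustrative names.

```python
def discount_pct(on_demand: float, discounted: float) -> float:
    """Percentage taken off the on-demand rate."""
    return (1 - discounted / on_demand) * 100

# (on-demand rate, discounted rate) pairs quoted in this article, in USD/hr.
tiers = {
    "CoreWeave H100 node, spot": (49.24, 19.71),
    "Nebius H100, preemptible": (3.85, 2.15),
    "Together AI H100 cluster, reserved floor": (3.99, 3.09),
    "Together AI B200, reserved": (8.19, 7.99),
}

discounts = {name: discount_pct(*pair) for name, pair in tiers.items()}

for name, pct in discounts.items():
    print(f"{name:<42} {pct:5.1f}% off")
```

Interruptible tiers carry the deepest cuts here, while the reserved rates discount less, so the two levers are not interchangeable when budgeting.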

Counterexamples & variants

The GPU-hour looks like a single commodity meter, but three structural variants stress the model in different directions: rates that are discovered rather than listed, rates that recede behind another meter, and rates that move up against the deflation narrative.

Vast.ai is the variant that breaks the “rate card” assumption entirely. It is a true marketplace with no vendor-set price list — every $/hr is set by the individual host machine and floats with supply and demand, so the same RTX 4090 can cost $0.40 on one host and $0.875 on another. A $0.194/hr GPU there is not a posted price the way a RunPod rate is; it is a market-clearing rate that can move, and interruptible-spot capacity can be reclaimed mid-job. Worse for buyers, storage keeps billing on stopped instances ($/GB/hr for every second the instance exists) and bandwidth is metered per TB but invisible in the displayed $/hr — Vast’s own FAQ admits users get “charged more per hour than expected.” Hyperbolic is the milder version of the same idea: it publishes starting rates (H100 from $1.50/hr) but refreshes them weekly from aggregated third-party supply, so the number is a floor, not a fixed contract.
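The gap between the displayed $/hr and the effective rate can be estimated once storage and bandwidth are folded in. The sketch below assumes the billing structure described above (storage accrues for every hour the instance exists, bandwidth is charged per TB). The storage and bandwidth rates and the usage figures are hypothetical placeholders, since actual rates vary by host.

```python
def effective_hourly_rate(
    gpu_rate: float,            # displayed $/hr while running
    running_hours: float,       # hours the GPU was actually running
    existing_hours: float,      # hours the instance existed, running or stopped
    storage_gb: float,
    storage_rate_gb_hr: float,  # $/GB/hr, billed even when stopped
    bandwidth_tb: float,
    bandwidth_rate_tb: float,   # $/TB transferred
) -> float:
    """Total bill divided by the hours the GPU was actually running."""
    gpu = gpu_rate * running_hours
    storage = storage_rate_gb_hr * storage_gb * existing_hours
    bandwidth = bandwidth_rate_tb * bandwidth_tb
    return (gpu + storage + bandwidth) / running_hours

rate = effective_hourly_rate(
    gpu_rate=0.40,              # RTX 4090 host price quoted above
    running_hours=10,
    existing_hours=48,          # instance left stopped for 38 more hours
    storage_gb=100,
    storage_rate_gb_hr=0.0002,  # HYPOTHETICAL host storage rate
    bandwidth_tb=0.5,
    bandwidth_rate_tb=5.0,      # HYPOTHETICAL host bandwidth rate
)
print(f"displayed: $0.40/hr, effective: ${rate:.3f}/hr")
```

Destroying an instance rather than stopping it removes the storage term, which is the main lever a buyer controls in this model.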

Midjourney, Roboflow, and Cerebras show where the GPU-hour recedes behind another meter entirely. Midjourney meters GPU-hours internally but never exposes a per-hour rate to the buyer — it sells subscription tiers (Basic $10 through Mega $120/mo) that bundle a pool of “fast GPU-hours” plus unlimited “relax” generation, so a buyer optimizing cost reasons about fast-hours consumed, not dollars per H100. Roboflow runs GPU inference and training but bills a unified credit — see credit-based billing — so the GPU-hour is an implementation detail under a credit abstraction. Cerebras leads with per-token inference and surfaces wafer-scale system time only for dedicated capacity. In all three cases GPU-hour billing exists, but it is not the unit the buyer transacts in.

Lambda is the counterexample to the assumption that GPU-hours only ever fall. Despite the industry deflation narrative (DeepInfra cut its A100 from $2.00 in 2024 to $0.89 in 2025, and Novita AI cut its RTX 4090 on-demand to $0.35/hr), Lambda raised its on-demand H100 SXM from $2.99 to $3.99/GPU-hr between 2025 and 2026 as Microsoft- and superintelligence-scale demand outran capacity. Nebius’s B200 on-demand rate similarly climbed as AI demand outstripped supply. GPU-hour pricing is not a one-way deflation story: for scarce frontier silicon, the meter can move up even as commodity older cards keep dropping.

Finally, CoreWeave is the reminder that a headline rate can mislead: its $49.24/hr HGX H100 sounds expensive until you realize it is eight GPUs (about $6.16 each), so per-GPU normalization is mandatory before any cross-vendor comparison.

What this means for buyers vs vendors

For buyers

Normalize every quote to the same GPU model and the same unit before comparing — a per-GPU-hr rate (Lambda $3.99, Nebius $3.85), a per-node rate (CoreWeave $49.24 for 8 GPUs = $6.16 each), and a per-second rate (Modal $0.001097/sec ≈ $3.95/hr) all describe an H100 but look nothing alike on a pricing page. The ~4.7x cross-vendor spread is real, but the cheapest sticker pays for the least managed software, so map it to what you actually need: a marketplace like Vast.ai or Hyperbolic wins on raw price if you can tolerate floating rates and reclaimable capacity, while a managed platform like Fireworks AI or Together AI charges more but hands you a runtime.
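The normalization step can be written as a small comparison helper. This sketch uses only the H100 quotes named above; the `Quote` class and its field names are hypothetical, chosen to capture the two things that differ between pricing pages: the billing unit and the number of GPUs covered by the price.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quote:
    vendor: str
    price: float       # USD per billing unit
    unit_seconds: int  # 1 = per second, 3600 = per hour
    gpus: int = 1      # GPUs covered by the quoted price

    def per_gpu_hour(self) -> float:
        """Normalize the quote to USD per GPU per hour."""
        return self.price * 3600 / self.unit_seconds / self.gpus

quotes = [
    Quote("Lambda", 3.99, 3600),
    Quote("Nebius", 3.85, 3600),
    Quote("CoreWeave (HGX node)", 49.24, 3600, gpus=8),
    Quote("Modal", 0.001097, 1),
]

ranked = sorted(quotes, key=Quote.per_gpu_hour)
for q in ranked:
    print(f"{q.vendor:<22} ${q.per_gpu_hour():.2f}/GPU-hr")
```

Normalization only aligns the unit. It does not account for the difference in managed software behind each rate, which still has to be weighed separately.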

Insist on the metering granularity: a “$2.89/hr” rate billed per second behaves very differently from one rounded to the hour for a 90-second inference call. If your workload is steady and long-running, price the reserved and preemptible tiers before paying on-demand (the reliability trade shown in the unit math above cuts the Nebius H100 rate by roughly 45%, and CoreWeave’s spot tier takes about 60% off) and weigh the GPU-hour against per-token serverless, since cheap dedicated GPUs only win when you keep them busy. Finally, account for the hidden meters: Vast.ai bills storage even on stopped instances and hides bandwidth outside the displayed $/hr, and an idle reserved GPU still bills full price. See choosing the right usage metric and the introduction to usage-based pricing for the framing.

For vendors

GPU-hour pricing is the most legible meter you can offer infrastructure buyers — it maps one-to-one to your own hardware cost — but legibility is also a margin trap, because customers can compare your H100 rate to everyone else’s in seconds. Several levers create differentiation without a pure price war. Granularity is the cheapest: per-second billing (like Modal and Replicate) and per-minute billing (like Baseten) win short-job workloads that per-hour competitors overcharge, and make scale-to-zero economics legible. Tiering is the second: a reliability split (like RunPod’s Secure/Community fleets or CoreWeave’s spot-vs-on-demand) and a preemptible rate (like Nebius’s ~45%-cheaper tier) let you serve both production and best-effort demand off one fleet. Packaging is the third: Novita AI sells the same H100 at instance, dedicated-endpoint, and bare-metal prices, and Nebius uses a self-serve Explorer Tier from $1.50/GPU-hr as a developer-acquisition wedge.

If you want to escape rate-card comparison entirely, move up the stack: wrap the GPU-hour in a credit (Roboflow) or a subscription bundle (Midjourney) so the buyer transacts in your unit, not the market’s. And read the demand curve before assuming rates only fall — Lambda raised its on-demand H100 from $2.99 to $3.99 as frontier demand outran supply, capturing margin from buyers who value reliability over the absolute lowest sticker. Whatever you choose, you need per-second attribution of reserved GPU time to a tenant — heavier than counting requests. For the commitment-discount mechanics buyers now expect, see billing cycles and invoicing.
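Per-second attribution can be sketched as a fold over reservation intervals. The example below is a simplified illustration, not a production metering design: the `Reservation` record, tenant names, and timestamps are hypothetical, the rates are the RunPod figures quoted earlier, and it ignores concerns such as clock skew, overlapping records, and partial-second rounding.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Reservation:
    tenant: str
    gpu_type: str
    start: int  # epoch seconds when the GPU was assigned
    end: int    # epoch seconds when the GPU was released

def bill_by_tenant(reservations, rates_per_hour):
    """Sum per-second charges for every second a GPU was held by a tenant."""
    totals = defaultdict(float)
    for r in reservations:
        held_seconds = max(0, r.end - r.start)
        totals[r.tenant] += held_seconds * rates_per_hour[r.gpu_type] / 3600
    return dict(totals)

rates = {"H100 PCIe": 2.89, "RTX 4090": 0.69}  # $/hr, from the RunPod examples

reservations = [
    Reservation("tenant-a", "H100 PCIe", start=0, end=5400),    # 90 minutes
    Reservation("tenant-a", "RTX 4090", start=100, end=3700),   # 60 minutes
    Reservation("tenant-b", "RTX 4090", start=200, end=290),    # 90 seconds
]

invoice = bill_by_tenant(reservations, rates)
for tenant, amount in sorted(invoice.items()):
    print(f"{tenant}: ${amount:.4f}")
```

The charge accrues for held time regardless of utilization, which matches the article's point that the GPU-hour bills the machine whether it is saturated or idle.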

Company	Product	Pricing model	Billing units	Free tier	Verified
Anyscale	Managed Ray platform for distributed AI training, inference, and batch processing (RayTurbo, Anyscale Compute Units)	pure-usage commitment hybrid	gpu-hours cpu-hours credits	Yes	2026-05-29
Baseten	ML inference infrastructure — dedicated GPU deployments, Model APIs, and Truss framework	pure-usage hybrid commitment	gpu-hours tokens requests	Yes	2026-05-29
BentoML	BentoCloud — managed model-serving & inference platform	pure-usage freemium commitment	gpu-hours cpu-hours	Yes	2026-06-15
Cerebras	Wafer-scale AI inference cloud and WSE hardware systems	pure-usage subscription commitment	tokens api-calls gpu-hours	Yes	2026-05-30
Comet	AI/ML observability and experiment-tracking platform — Opik (LLM/agent observability) and Comet MLOps (experiment tracking)	freemium seat-based hybrid	seats gpu-hours storage-gb	Yes	2026-06-02
CoreWeave	GPU cloud & AI compute infrastructure	pure-usage commitment	gpu-hours cpu-hours storage-gb	No	2026-06-15
Databricks (Mosaic AI)	Mosaic AI — enterprise GenAI & ML on the Data Intelligence Platform	pure-usage commitment	units tokens gpu-hours	Yes	2026-06-15
DeepInfra	Serverless inference cloud — per-token LLM/embedding APIs, per-image and per-minute media models, per-hour on-demand GPU containers, and reserved DeepCluster GPU clusters	pure-usage commitment	tokens gpu-hours requests	No	2026-07-14
Fal	Generative-media inference platform — serverless per-output model APIs plus dedicated GPU compute	pure-usage	gpu-hours requests media-minutes	No	2026-06-01
Fireworks AI	Generative AI inference platform — serverless per-token, on-demand GPU, fine-tuning, batch API	pure-usage hybrid commitment	tokens gpu-hours requests	Yes	2026-05-30
Hugging Face	AI model hub, inference endpoints & compute	hybrid seat-based pure-usage	seats gpu-hours cpu-hours	Yes	2026-06-15
Hyperbolic	GPU cloud marketplace & serverless AI inference	pure-usage commitment	gpu-hours tokens images	Yes	2026-06-15
Inflection AI	Enterprise foundation models (Inflection 3.0) + Pi assistant	pure-usage subscription	tokens gpu-hours seats	No	2026-06-11
Lambda	GPU cloud & AI compute infrastructure	pure-usage commitment	gpu-hours	No	2026-06-09
LanceDB	AI-native multimodal lakehouse	freemium pure-usage commitment	storage-gb vectors-indexed gpu-hours	Yes	2026-06-09
Lightning AI	Cloud GPU/CPU Studio compute platform for building, training, and serving AI models, billed by the second with a credit pool	hybrid freemium pure-usage	gpu-hours cpu-hours credits	Yes	2026-06-02
Midjourney	AI image and video generation via subscription with GPU-hour metering	subscription	gpu-hours credits	No	2026-05-29
Milvus	Vector database (OSS) + Zilliz Cloud (managed)	pure-usage freemium commitment	gpu-hours storage-gb vectors-indexed	Yes	2026-06-09
Modal	Serverless compute and GPU platform — per-second billing for Python functions, batch jobs, and model serving	pure-usage freemium subscription	gpu-hours cpu-hours gb-hours	Yes	2026-07-14
Nebius	AI cloud & GPU compute infrastructure	pure-usage commitment	gpu-hours cpu-hours storage-gb	No	2026-06-15
Novita AI	Pay-as-you-go AI cloud: 200+ model inference APIs, on-demand GPUs, and per-second agent sandboxes under one API	pure-usage freemium	tokens gpu-hours cpu-hours	Yes	2026-07-06
Predibase	Fine-tuning & serving platform for open-source LLMs	pure-usage freemium	tokens gpu-hours	Yes	2026-06-15
Replicate	Cloud platform for running, fine-tuning, and deploying AI models via REST API	pure-usage hybrid commitment	gpu-hours tokens requests	Yes	2026-05-30
Roboflow	Computer-vision platform (dataset management, model training, deployment)	hybrid freemium	credits seats gpu-hours	Yes	2026-07-14
RunPod	GPU cloud marketplace — Secure Cloud and Community Cloud Pods, Serverless endpoints, and persistent storage	pure-usage hybrid commitment	gpu-hours storage-gb	No	2026-07-14
Together AI	AI Acceleration Cloud — serverless inference, dedicated endpoints, GPU clusters, Code Sandbox, fine-tuning	pure-usage hybrid commitment	tokens gpu-hours cpu-hours	Yes	2026-07-14
Vast.ai	GPU rental marketplace — on-demand, interruptible (spot), and reserved cloud GPUs plus autoscaling serverless inference	pure-usage commitment	gpu-hours storage-gb bandwidth-gb	No	2026-07-14


FAQ

What is a GPU-hour in pricing?

A GPU-hour is one GPU running for one hour. The charge is the per-hour rate for a specific GPU type (such as an H100 or A100) multiplied by the time the GPU is reserved. Most platforms meter this per second and only express the rate in hourly terms for readability.

How much does an H100 cost per hour?

It varies widely by vendor and packaging. In this corpus an H100 ranges from about $1.50/GPU-hr on Hyperbolic's marketplace and $1.70/GPU-hr on Novita AI's bare-metal nodes, through $3.85 on-demand at Nebius and $3.99 at Lambda, up to $6.49–$7.00/hr on Together AI and Fireworks AI dedicated endpoints, a roughly 4.7x spread on identical silicon.

What is the difference between on-demand, spot, and reserved GPU pricing?

On-demand bills the full rate for capacity you can use immediately. Spot (or preemptible/interruptible) bills less but can be reclaimed by the provider — CoreWeave's H100 node is $19.71/hr on spot vs $49.24 on-demand, and Nebius runs a preemptible H100 at $2.15 vs $3.85. Reserved commits you to a term for a discount — Together AI lists a B200 at $8.19/hr on-demand versus $7.99/hr reserved, with H100 dropping to a $3.09/hr floor on a 91–180 day reservation.

Why are GPUs billed by the hour instead of by tokens or requests?

Per-GPU-hour pricing charges for the hardware itself rather than the work it produces. It suits training, fine-tuning, and dedicated inference where the customer controls the GPU full-time. Per-token pricing fits shared serverless inference where the provider amortizes the GPU across many tenants.

Which companies use GPU-hour pricing?

In this corpus 27 companies meter GPU-hours, including GPU clouds like CoreWeave, Lambda, Nebius, RunPod, and Together AI, marketplaces like Vast.ai and Hyperbolic, serverless platforms like Modal and Replicate, and inference platforms like Baseten, Fireworks AI, DeepInfra, and Fal.

Are GPU-hour rates going up or down?

Both, depending on the tier. Serverless and inference-cloud rates keep falling: DeepInfra cut its A100 from $2.00 (2024) to $0.89 (2025), and Novita AI cut its RTX 4090 to $0.35/hr. But scarce frontier capacity is rising: Lambda raised its on-demand H100 SXM from $2.99 to $3.99 between 2025 and 2026 as demand outran supply.

