What is it
GPU-Hour Pricing is a billing unit where customers are charged for GPU time consumed, typically measured per-second or per-hour by GPU type.
A GPU-hour is the most literal unit in AI infrastructure: one GPU, running for one hour. The bill is the per-hour rate for a specific accelerator — an NVIDIA H100, A100, L40, RTX 4090 — multiplied by the time that GPU is reserved to your workload. Unlike per-token or per-request meters, which charge for the output of computation, GPU-hours charge for the machine itself, whether it is saturated or idle. That makes the unit the native cost driver of training, fine-tuning, and any inference workload where the customer holds a GPU full-time rather than sharing it.
The rate is never one number. It is a function of GPU model and, increasingly, of reliability and commitment tier. RunPod publishes the widest single rate card in the corpus, spanning a hobbyist RTX 4090 at $0.69/hr through a frontier B200 at $5.89/hr, split across a Secure Cloud / Community Cloud reliability tier. Together AI lists a B200 at $11.95/hr on-demand against $4.99/hr reserved — the same chip, less than half the price, in exchange for a commitment. The spread between vendors is just as large: an H100 runs about $1.98/hr at DeepInfra and up to $7.00/hr on Fireworks AI’s dedicated deployments.
Almost every platform that quotes an “hour” actually meters per second. Modal publishes its H100 at $0.000694 per second and bills containers down to the second; Replicate bills an A100 at $0.001525 per second. The hourly figure is a presentation convention layered on top of fine-grained metering — which matters because real inference jobs often run for seconds, not hours, and a per-hour rounding rule would massively overcharge them.
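The effect of the rounding rule can be sketched in a few lines. This is a minimal illustration, not any vendor's billing logic: the `billed_cost` helper and the hourly-rounding scenario are assumptions made for comparison, and only the $0.000694/sec figure comes from Modal's published H100 rate.

```python
import math

def billed_cost(rate_per_hour: float, runtime_seconds: float, granularity_seconds: int) -> float:
    """Cost when runtime is rounded up to whole billing increments (hypothetical helper)."""
    increments = math.ceil(runtime_seconds / granularity_seconds)
    billable_seconds = increments * granularity_seconds
    return rate_per_hour * billable_seconds / 3600

# Modal's published per-second H100 rate, expressed per hour (about $2.50/hr).
H100_RATE = 0.000694 * 3600

# A 90-second inference call under two metering rules.
per_second = billed_cost(H100_RATE, 90, 1)     # metered to the second
per_hour = billed_cost(H100_RATE, 90, 3600)    # assumed rule: round up to a full hour

print(f"per-second metering: ${per_second:.4f}")
print(f"hourly rounding:     ${per_hour:.4f}")
print(f"overcharge factor:   {per_hour / per_second:.0f}x")
```

Under these assumptions the same 90-second call costs about six cents per second-metered and a full hour's rate when rounded up, which is why the granularity matters more than the headline hourly figure for short jobs.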
How it works
The core formula is trivial: GPU cost equals the per-GPU-hour rate for the chosen GPU type, multiplied by the time the GPU is held (almost always metered per second and converted). The complexity lives in the dimensions wrapped around that formula — which GPU, which reliability tier, on-demand versus spot versus reserved, and whether storage and bandwidth ride alongside.
| Dimension | What it controls | Example from this corpus |
|---|---|---|
| GPU model | The base rate — frontier chips cost multiples of older ones | RunPod: RTX 4090 $0.69/hr → H100 $2.89/hr → B200 $5.89/hr |
| Reliability tier | SLA-backed capacity vs cheaper partner/spot capacity | RunPod splits Secure Cloud vs Community Cloud; Vast.ai offers interruptible spot bids |
| Commitment | On-demand vs reserved/term pricing | Together AI: B200 $11.95/hr on-demand vs $4.99/hr reserved |
| Metering granularity | Per-second vs per-minute vs per-hour | Modal & Replicate bill per second; Baseten quotes per-minute ($0.10833/min H100) |
| Bundled meters | Storage and bandwidth charged alongside the GPU | Vast.ai adds $/GB/hr storage + $/TB bandwidth on top of the GPU rate |
The cross-vendor spread on identical silicon is the headline. An H100 in this corpus ranges from roughly $1.98/hr at DeepInfra and about $2.50/hr-equivalent on Modal’s per-second meter, up to $7.00/hr on Fireworks AI — a 3.5x difference driven almost entirely by how much managed inference software is wrapped around the raw GPU. Marketplaces sit at the bottom: Vast.ai lists GPUs from $0.194/hr because it matches buyers to third-party hosts and lets them bid for interruptible capacity.
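Comparing these quotes requires putting them on one time base first. The sketch below converts the H100 rates cited on this page to dollars per hour and computes the spread; the `to_hourly` helper and the dictionary layout are illustrative choices, not part of any vendor API.

```python
# Multipliers that convert a quoted rate to a per-hour rate.
UNITS_PER_HOUR = {"second": 3600, "minute": 60, "hour": 1}

def to_hourly(rate: float, unit: str) -> float:
    """Convert a per-second, per-minute, or per-hour quote to $/hr."""
    return rate * UNITS_PER_HOUR[unit]

# H100 quotes as cited on this page: (rate, unit the vendor quotes in).
h100_quotes = {
    "DeepInfra": (1.98, "hour"),
    "Modal": (0.000694, "second"),
    "RunPod": (2.89, "hour"),
    "Baseten": (0.10833, "minute"),
    "Fireworks AI": (7.00, "hour"),
}

hourly = {vendor: to_hourly(rate, unit) for vendor, (rate, unit) in h100_quotes.items()}

for vendor, rate in sorted(hourly.items(), key=lambda item: item[1]):
    print(f"{vendor:<14}${rate:.2f}/hr")

spread = max(hourly.values()) / min(hourly.values())
print(f"spread: {spread:.1f}x")
```

Normalizing only makes the sticker prices comparable; it says nothing about how much managed software each rate includes.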
Unit math: A fine-tuning job on a single H100 for 3 hours costs 3 × $2.89 = $8.67 on RunPod’s Secure Cloud. The same 3 hours on Modal, billed per second, is 10,800 sec × $0.000694 = $7.50. Reserve a B200 on Together AI instead of paying on-demand and a 100-hour run drops from 100 × $11.95 = $1,195 to 100 × $4.99 = $499 — the commitment trade made explicit.
The reserved-vs-on-demand gap is the same lever the broader infrastructure market is pulling — see the infrastructure commitment-discount trend. And because per-token inference keeps getting cheaper (the token-price-deflation trend), GPU-hour pricing increasingly competes against per-token serverless for the same inference workload: you rent the GPU only when keeping it busy beats paying per token. To model dedicated-GPU economics against usage rates, see the pricing calculator hub.
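The "keep it busy" condition can be made concrete as a break-even throughput. The sketch below is a rough illustration under loud assumptions: it pairs RunPod's $2.89/hr H100 rate with a $0.75 per million token price (borrowed from the Cerebras output-token price cited later on this page purely as a placeholder), ignores input tokens, and assumes the dedicated GPU could actually serve the same model at the required throughput. The function names are hypothetical.

```python
def breakeven_tokens_per_hour(gpu_rate_per_hour: float, price_per_million_tokens: float) -> float:
    """Tokens per hour at which a dedicated GPU costs the same as per-token billing."""
    return gpu_rate_per_hour / price_per_million_tokens * 1_000_000

def cheaper_option(tokens_per_hour: float, gpu_rate_per_hour: float,
                   price_per_million_tokens: float) -> str:
    """Name the cheaper meter for a given sustained token throughput."""
    per_token_cost = tokens_per_hour / 1_000_000 * price_per_million_tokens
    return "dedicated GPU" if gpu_rate_per_hour < per_token_cost else "per-token"

GPU_RATE = 2.89      # $/hr, RunPod H100 rate from this page
TOKEN_PRICE = 0.75   # $/million tokens, placeholder assumption

threshold = breakeven_tokens_per_hour(GPU_RATE, TOKEN_PRICE)
print(f"break-even: {threshold:,.0f} tokens/hr ({threshold / 3600:,.0f} tokens/sec)")
print(cheaper_option(1_000_000, GPU_RATE, TOKEN_PRICE))
print(cheaper_option(6_000_000, GPU_RATE, TOKEN_PRICE))
```

Below the threshold the per-token meter wins because the rented GPU sits partly idle; above it, the fixed hourly rate is spread over enough tokens to come out ahead.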
Companies using this
Sixteen companies in the corpus meter GPU-hours. They split into three groups: raw GPU clouds and marketplaces (RunPod, Vast.ai, Together AI, DeepInfra, Novita AI), serverless compute platforms that bill GPU time per second (Modal, Replicate, Lightning AI), and inference/application platforms that expose dedicated GPU rates alongside other meters (Baseten, Fireworks AI, Fal, Anyscale, Cerebras, Midjourney, Roboflow, Comet).
Patterns observed
- The rate card is the product positioning. GPU-hour pricing is one of the few meters where the price list directly signals the target customer. RunPod puts a $0.69/hr RTX 4090 next to a $5.89/hr B200 to capture hobbyists and frontier teams on one platform; Vast.ai starts at $0.194/hr by being a marketplace rather than an operator. Fireworks AI at $7.00/hr per H100 and Together AI at $6.49/hr for dedicated capacity are priced for teams buying managed inference, not bare metal.
- “Per hour” is a label; per second is the meter. Modal ($0.000694/sec H100), Replicate ($0.001525/sec A100), and Fal all meter at second or sub-second granularity and quote hourly rates only for readability. Baseten splits the difference with per-minute billing ($0.10833/min H100). The finer the granularity, the better the unit fits short inference jobs, which is exactly the workload these platforms court.
- The reliability tier is the second axis after GPU model. RunPod’s Secure Cloud (SLA-backed) versus Community Cloud (partner-operated, cheaper) and Vast.ai’s on-demand versus interruptible spot bids both monetize the same trade: production reliability costs more than best-effort capacity. The GPU type sets the floor; the reliability tier sets the multiplier.
- Commitment discounts are becoming table stakes. Together AI’s reserved B200 at $4.99/hr against $11.95/hr on-demand, and Anyscale’s cash-vs-committed-spend rates, both mirror the infrastructure commitment-discount standard: the bigger and longer the commit, the lower the effective GPU-hour.
- GPU-hours rarely travel alone at application platforms. Midjourney wraps GPU-hours inside fast/relax subscription bundles; Roboflow folds GPU inference into a unified credit meter; Cerebras leads with per-token inference and exposes GPU/system time only for dedicated capacity. Higher up the stack, the GPU-hour becomes an input the buyer rarely sees directly.
Counterexamples & variants
The most important caveat is who is missing from this page. The best-known GPU-hour brands (CoreWeave, Lambda Labs, Hyperbolic, Nebius) are tracked but not yet in the corpus, so they do not appear in the table even though they are canonical examples of the model. The 16 companies here are the corpus-verified set, not the entire universe of GPU clouds; treat the list as a representative sample of how the meter is structured, not a market census.
Midjourney is the clearest variant. It meters GPU-hours internally but never exposes a per-hour rate to the buyer. Instead it sells subscription tiers (Basic $10 through Mega $120/mo) that bundle a pool of “fast GPU-hours” plus unlimited “relax” generation. The GPU-hour is real and is the cost driver, but the pricing surface is a flat subscription with a dual-speed allowance — the opposite of RunPod’s transparent rate card. A buyer optimizing Midjourney cost reasons about fast-hours consumed, not dollars per H100.
Roboflow and Cerebras show where the GPU-hour recedes behind another meter entirely. Roboflow runs GPU inference and training but bills a unified credit — see credit-based billing — so the GPU-hour is an implementation detail under a credit abstraction. Cerebras leads with per-token inference (GPT-OSS-120B at $0.35/$0.75 per million tokens) and surfaces wafer-scale system time only for dedicated capacity. In both cases GPU-hour billing exists, but it is not the unit the buyer transacts in.
Vast.ai is the variant that stresses the model in the other direction: a true marketplace where the GPU-hour rate is dynamic, set by third-party hosts and adjustable via interruptible spot bids, rather than published by the platform. A $0.194/hr GPU there is not a posted price the way a RunPod rate is — it is a market-clearing rate that can move, and the cheapest capacity can be reclaimed mid-job. The GPU-hour is still the unit, but its price is discovered, not listed.
What this means for buyers vs vendors
For buyers
Normalize every quote to the same GPU model and the same granularity before comparing — the 3.5x H100 spread between DeepInfra and Fireworks AI is real, but it pays for different amounts of managed software, so the cheapest sticker is not always the cheapest total. Insist on the metering unit: a “$2.89/hr” rate billed per second behaves very differently from one rounded to the hour for a 90-second inference call. If your workload is steady and long-running, price the reserved tier — Together AI’s reserved B200 is less than half its on-demand rate — and weigh the GPU-hour against per-token serverless, since cheap dedicated GPUs only win when you keep them busy. Finally, account for the hidden meters: Vast.ai and other clouds add storage and bandwidth on top of the GPU line, and an idle reserved GPU still bills full price. See choosing the right usage metric and the introduction to usage-based pricing for the framing.
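The reserved-versus-on-demand decision reduces to a utilization threshold. The sketch below uses Together AI's B200 rates from this page and assumes, as a simplification, that a reserved GPU bills for every hour of the term while an on-demand GPU can be released and bills only for busy hours. The helper names are hypothetical, and storage and bandwidth are left out.

```python
def breakeven_utilization(reserved_rate: float, on_demand_rate: float) -> float:
    """Fraction of hours the GPU must be busy for reserved to match on-demand."""
    return reserved_rate / on_demand_rate

def effective_rate_per_busy_hour(rate: float, utilization: float) -> float:
    """Cost per hour of useful work when an always-billed GPU is only partly used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return rate / utilization

RESERVED_B200 = 4.99     # $/hr, Together AI reserved
ON_DEMAND_B200 = 11.95   # $/hr, Together AI on-demand

threshold = breakeven_utilization(RESERVED_B200, ON_DEMAND_B200)
print(f"reserved wins above {threshold:.1%} utilization")

for utilization in (0.30, 0.50, 0.90):
    effective = effective_rate_per_busy_hour(RESERVED_B200, utilization)
    verdict = "reserved" if effective < ON_DEMAND_B200 else "on-demand"
    print(f"{utilization:.0%} busy: ${effective:.2f} per busy hour, cheaper: {verdict}")
```

Under these assumptions the reserved tier only pays off if the GPU is busy for more than roughly two fifths of the term; below that, the idle hours cost more than the discount saves.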
For vendors
GPU-hour pricing is the most legible meter you can offer infrastructure buyers — it maps one-to-one to your own hardware cost — but legibility is also a margin trap, because customers can compare your H100 rate to everyone else’s in seconds. Two levers create differentiation without a price war: granularity and tiering. Per-second billing (like Modal and Replicate) wins short-job workloads that per-hour competitors overcharge; a reliability tier (like RunPod’s Secure/Community split) lets you serve both production and best-effort demand off one fleet. If you want to escape rate-card comparison entirely, move up the stack: wrap the GPU-hour in a credit (Roboflow) or a subscription bundle (Midjourney) so the buyer transacts in your unit, not the market’s. Whatever you choose, you need per-second attribution of reserved GPU time to a tenant — heavier than counting requests. For the commitment-discount mechanics buyers now expect, see billing cycles and invoicing.
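Per-second attribution can be sketched as an aggregation over reservation intervals. This is a minimal illustration, not a production metering pipeline: the `Reservation` record, the round-up-to-the-next-second rule, the tenant names, and the single-rate table are all assumptions, with only the $0.000694/sec H100 figure taken from Modal's published rate.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class Reservation:
    """One interval during which a GPU was held by a tenant (hypothetical record)."""
    tenant: str
    gpu_type: str
    start: datetime
    end: datetime

def billable_seconds(reservations):
    """Sum held time per (tenant, gpu_type), rounding each interval up to a whole second."""
    totals = defaultdict(int)
    for r in reservations:
        seconds = math.ceil((r.end - r.start).total_seconds())
        totals[(r.tenant, r.gpu_type)] += max(seconds, 0)
    return dict(totals)

RATE_PER_SECOND = {"H100": 0.000694}  # Modal's published H100 rate, reused as an example

t0 = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
reservations = [
    Reservation("acme", "H100", t0, t0 + timedelta(seconds=90)),
    Reservation("acme", "H100", t0 + timedelta(minutes=5), t0 + timedelta(minutes=5, seconds=30.2)),
    Reservation("globex", "H100", t0, t0 + timedelta(hours=3)),
]

usage = billable_seconds(reservations)

# Price each tenant's seconds by GPU type.
invoice = defaultdict(float)
for (tenant, gpu_type), seconds in usage.items():
    invoice[tenant] += seconds * RATE_PER_SECOND[gpu_type]

for tenant, amount in sorted(invoice.items()):
    print(f"{tenant}: ${amount:.4f}")
```

The point of the sketch is the data requirement: the meter needs a start and end timestamp per tenant per GPU, whether or not the GPU did useful work in between, which is a heavier record than a request count.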
| Company | Product | Pricing model | Billing units | Free tier | Verified |
|---|---|---|---|---|---|
| Anyscale | Managed Ray platform for distributed AI training, inference, and batch processing (RayTurbo, Anyscale Compute Units) | pure-usage, commitment, hybrid | gpu-hours, cpu-hours, credits | Yes | 2026-05-29 |
| Baseten | ML inference infrastructure: dedicated GPU deployments, Model APIs, and Truss framework | pure-usage, hybrid, commitment | gpu-hours, tokens, requests | Yes | 2026-05-29 |
| Cerebras | Wafer-scale AI inference cloud and WSE hardware systems | pure-usage, subscription, commitment | tokens, api-calls, gpu-hours | Yes | 2026-05-30 |
| Comet | AI/ML observability and experiment-tracking platform: Opik (LLM/agent observability) and Comet MLOps (experiment tracking) | freemium, seat-based, hybrid | seats, gpu-hours, storage-gb | Yes | 2026-06-02 |
| DeepInfra | Serverless inference cloud: per-token LLM/embedding APIs, per-image and per-minute media models, per-hour on-demand GPU containers, and reserved DeepCluster GPU clusters | pure-usage, commitment | tokens, gpu-hours, requests, +1 more | No | 2026-06-02 |
| Fal | Generative-media inference platform: serverless per-output model APIs plus dedicated GPU compute | pure-usage | gpu-hours, requests, media-minutes | No | 2026-06-01 |
| Fireworks AI | Generative AI inference platform: serverless per-token, on-demand GPU, fine-tuning, batch API | pure-usage, hybrid, commitment | tokens, gpu-hours, requests | Yes | 2026-05-30 |
| Lightning AI | Cloud GPU/CPU Studio compute platform for building, training, and serving AI models, billed by the second with a credit pool | hybrid, freemium, pure-usage | gpu-hours, cpu-hours, credits, +3 more | Yes | 2026-06-02 |
| Midjourney | AI image and video generation via subscription with GPU-hour metering | subscription | gpu-hours, credits | No | 2026-05-29 |
| Modal | Serverless compute and GPU platform: per-second billing for Python functions, batch jobs, and model serving | pure-usage, freemium, subscription, +1 more | gpu-hours, cpu-hours, gb-hours, +2 more | Yes | 2026-05-29 |
| Novita AI | Pay-as-you-go AI cloud: 200+ model inference APIs, on-demand GPUs, and per-second agent sandboxes under one API | pure-usage, freemium | tokens, gpu-hours, cpu-hours, +2 more | Yes | 2026-06-02 |
| Replicate | Cloud platform for running, fine-tuning, and deploying AI models via REST API | pure-usage, hybrid, commitment | gpu-hours, tokens, requests | Yes | 2026-05-30 |
| Roboflow | Computer-vision platform (dataset management, model training, deployment) | hybrid, freemium | credits, seats, gpu-hours | Yes | 2026-06-02 |
| RunPod | GPU cloud marketplace: Secure Cloud and Community Cloud Pods, Serverless endpoints, and persistent storage | pure-usage, hybrid, commitment | gpu-hours, storage-gb | No | 2026-05-30 |
| Together AI | AI Acceleration Cloud: serverless inference, dedicated endpoints, GPU clusters, Code Sandbox, fine-tuning | pure-usage, hybrid, commitment | tokens, gpu-hours, cpu-hours, +1 more | Yes | 2026-05-29 |
| Vast.ai | GPU rental marketplace: on-demand, interruptible (spot), and reserved cloud GPUs plus autoscaling serverless inference | pure-usage, commitment | gpu-hours, storage-gb, bandwidth-gb | No | 2026-06-02 |
FAQ
What is a GPU-hour in pricing?
A GPU-hour is one GPU running for one hour. The charge is the per-hour rate for a specific GPU type (such as an H100 or A100) multiplied by the time the GPU is reserved. Most platforms meter this per second and only express the rate in hourly terms for readability.
How much does an H100 cost per hour?
It varies widely by vendor and tier. In this corpus an H100 ranges from about $1.98/hr at DeepInfra and roughly $2.50/hr-equivalent on Modal's per-second meter up to $7.00/hr on Fireworks AI's dedicated deployments — a 3.5x spread driven by how much managed software wraps the raw GPU.
What is the difference between on-demand, spot, and reserved GPU pricing?
On-demand bills the full rate for capacity you can use immediately. Spot (or interruptible) bills less but can be reclaimed by the provider, as on the Vast.ai marketplace. Reserved commits you to a term in exchange for a discount — Together AI lists a B200 at $11.95/hr on-demand versus $4.99/hr reserved.
Why are GPUs billed by the hour instead of by tokens or requests?
Per-GPU-hour pricing charges for the hardware itself rather than the work it produces. It suits training, fine-tuning, and dedicated inference where the customer controls the GPU full-time. Per-token pricing fits shared serverless inference where the provider amortizes the GPU across many tenants.
Which companies use GPU-hour pricing?
In this corpus 16 companies meter GPU-hours, including GPU clouds and marketplaces like RunPod, Vast.ai, Together AI, and DeepInfra; serverless platforms like Modal and Replicate; and inference platforms like Baseten, Fireworks AI, and Fal.
Trivia
- The same H100 ranges from roughly $1.98/hr at DeepInfra and $2.50/hr-equivalent on Modal's per-second meter ($0.000694/sec) up to $7.00/hr on Fireworks AI's dedicated deployments: a 3.5x spread on identical silicon, driven almost entirely by how much managed inference software is wrapped around the GPU.
- Most "GPU-hour" rates are actually billed per second. Modal publishes its H100 at $0.000694 per second and Replicate bills an A100 at $0.001525 per second. The hourly figure is just a readability convention, which matters because inference jobs often run for seconds, not hours.
- RunPod publishes the widest single GPU rate card in the corpus, from a hobbyist RTX 4090 at $0.69/hr through a frontier B200 at $5.89/hr: an 8.5x range on one price list, split across a Secure Cloud / Community Cloud reliability tier.
Related billing units
- Credit-Based Billing: A billing unit where customers pre-purchase or are allocated a pool of credits that deplete as they use the product, often at variable rates per feature.
- Token-Based Pricing: A billing unit common in LLM and AI products, where customers are charged per input and output token processed.
- Per-Seat Pricing: A billing unit where the vendor charges a fixed fee per named user, regardless of how much each user consumes.
- Per-Resolution Pricing: A billing unit unique to AI customer-support products, where the vendor charges only when an AI agent resolves a customer issue without escalation.
- Bandwidth-Based Pricing: A billing unit where customers are charged per gigabyte of data transferred out of the platform.
- Per-Function-Invocation Pricing: A billing unit where customers are charged per serverless function invocation, often combined with a separate compute-time charge.
- CPU-Hour Pricing: A billing unit where customers are charged for the CPU time their workloads consume, typically measured in vCPU-seconds or vCPU-hours.
- GB-Hour Pricing: A billing unit where customers are charged for the memory their workloads consume over time, measured in gigabyte-hours.
- Per-API-Call Pricing: A billing unit where customers are charged per API request, regardless of payload size or processing time.
- Per-GB Storage Pricing: A billing unit where customers are charged per gigabyte of data stored on the platform per month.
- Media-Minute Pricing: A billing unit where customers are charged per minute of audio or video processed, used by speech, voice, and video AI vendors.