How much does Together AI cost per month?

Together has no monthly subscription fee — you pay only for the serverless tokens, dedicated GPU hours, fine-tuning training tokens, and Code Sandbox usage you consume. A small RAG application using Llama 3.3 70B at 30M input + 10M output tokens would cost ~$42/month on serverless; the same workload on a dedicated H100 ($5.49/hr) running 4h/day would cost ~$660/month.
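
The arithmetic behind those two figures can be sketched in a few lines of Python. The function names are illustrative rather than part of any Together SDK, the 30-day month is an assumption, and the rates are the ones quoted in this answer.

```python
# Illustrative helpers; the names are hypothetical, not a Together API.

def serverless_monthly_cost(input_mtok, output_mtok, input_rate, output_rate):
    """USD for a month of serverless usage.

    Token volumes are in millions; rates are USD per 1M tokens.
    """
    return input_mtok * input_rate + output_mtok * output_rate


def dedicated_monthly_cost(hourly_rate, hours_per_day, days=30, gpus=1):
    """USD for a dedicated endpoint billed per GPU per hour.

    Assumes a 30-day month and that the endpoint is stopped when idle.
    """
    return hourly_rate * hours_per_day * days * gpus


# Llama 3.3 70B at $1.04 / $1.04 per 1M tokens, 30M input + 10M output.
serverless = serverless_monthly_cost(30, 10, 1.04, 1.04)
# One on-demand HGX H100 at $5.49/hr, running 4 hours a day.
dedicated = dedicated_monthly_cost(5.49, 4)

print(f"serverless: ${serverless:,.2f}")  # about $42
print(f"dedicated:  ${dedicated:,.2f}")   # about $660
```

Swapping in a different model's input and output rates, or a different daily run time, gives a quick first-pass comparison before opening the pricing page.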

What are Together's serverless per-token rates?

Together publishes per-model rates inline on the pricing page. Sample rates per 1M tokens: Llama 3.3 70B at $1.04 input / $1.04 output; DeepSeek V4 Pro at $1.74 input / $3.48 output ($0.20 cached input); Qwen3.5 9B at $0.17 input / $0.25 output; GLM-5.1 at $1.40 input / $4.40 output. Image generation: FLUX.2 [dev] at $0.0154/image, FLUX.1 [schnell] at $0.0027/image, Stable Diffusion 3 at $0.0019/image.

What are Together's GPU rates for dedicated endpoints and clusters?

Dedicated inference (per GPU per hour, on-demand): HGX H100 at $5.49/hr, HGX B200 at $8.99/hr (H200/B300/GB200/GB300 quoted "Contact us"; all reserved dedicated capacity is "Contact sales"). On-demand GPU clusters: HGX H100 at $3.99/hr, HGX H200 at $5.99/hr, HGX B200 at $8.19/hr. Reserved clusters (7–30 day commits): H100 at $3.59/hr, H200 at $4.99/hr, B200 at $7.99/hr, dropping to $3.09/hr H100 on 91–180 day reservations — the reserved rates are among the lowest published in managed inference. Separately, a July 2026 Provisioned Throughput (PTU) SKU reserves capacity in throughput units at $0.05 per PTU-minute (MiniMax M3, GLM-5.2) for buyers who prefer a fixed tokens-per-minute envelope over per-GPU-hour rentals.

Does Together AI have a free tier?

New accounts can start without an upfront commitment, but Together does not publish a specific signup-credit dollar amount on its pricing page or quickstart docs. Calling paid serverless and image models requires a positive credit balance, and production usage requires a payment method on file. There is no permanent free tier.

How does Together's fine-tuning pricing work?

Fine-tuning is priced per 1M training tokens, by base-model size and method (SFT or DPO, LoRA or full-parameter). Standard tier, SFT LoRA / DPO LoRA: up to 16B at $0.48 / $1.20; 17–69B at $1.50 / $3.75; 70–100B at $2.90 / $7.25 (full-parameter runs slightly higher, e.g. $0.54 SFT full up to 16B). A specialized per-model tier covers frontier architectures (DeepSeek-R1 $10 SFT LoRA, GLM-5 $40, Qwen3.5-397B $8, Llama 4 Scout $3), ranging roughly $3–$40 SFT LoRA and up to $100 DPO LoRA per 1M tokens. Each standard-tier job carries a $4.00 minimum charge, while the specialized tier carries per-model minimum charges of $6–$60 (Qwen3-235B is the exception with no minimum).

What is Together's Code Sandbox and how is it priced?

Code Sandbox is a managed code-execution environment for agentic workflows, billed at $0.0446 per vCPU-hour and $0.0149 per GiB-hour for the sandbox runtime. Code Interpreter (a higher-level managed session API) bills at $0.03/session. Storage attached to sandbox sessions is $0.16/GiB-month.

Together AI Pricing

AI Summary

Together AI runs a multi-SKU pure-usage cloud: per-token serverless inference for popular open-weight models (Llama 3.3 70B at $1.04/$1.04, DeepSeek V4 Pro at $1.74/$3.48 with $0.20 cached input, Qwen3.5 9B at $0.17/$0.25, GLM-5.1 at $1.40/$4.40), per-image generation (FLUX.2 [dev] $0.0154, FLUX.1 [schnell] $0.0027, Stable Diffusion 3 $0.0019), and per-hour dedicated and cluster GPUs.
Dedicated inference endpoints (restructured to on-demand vs reserved, priced per GPU per hour) at $5.49/hr on-demand HGX H100 and $8.99/hr HGX B200 (down from $6.49 and $11.95), with H200/B300/GB200/GB300 lines quoted "Contact us" and all reserved capacity "Contact sales"; on-demand GPU clusters at $3.99/hr H100, $5.99/hr H200, and $8.19/hr B200; reserved cluster rates (7–30 day commits) at $3.59/hr H100 and $7.99/hr B200, dropping to $3.09/hr H100 on 91–180 day reservations, among the lowest published in the market.
Fine-tuning priced per 1M training tokens (SFT LoRA / DPO LoRA): up to 16B at $0.48 / $1.20; 17–69B at $1.50 / $3.75; 70–100B at $2.90 / $7.25, with a $4.00 per-job minimum on the standard tier. A specialized per-model tier (DeepSeek-R1, GLM-5, Qwen3.5, gpt-oss) runs SFT LoRA $3–$40 and DPO LoRA $7.50–$100 per 1M tokens with per-model minimum charges of $6–$60.
Batch API offers a flat 50% discount on most models; Code Sandbox at $0.0446/vCPU-hour and $0.0149/GiB-hour for agentic code execution; Code Interpreter at $0.03/session; storage at $0.16/GiB-month. A July 2026 Provisioned Throughput (PTU) SKU reserves dedicated capacity in throughput units billed per PTU-minute ($0.05/PTU-min on MiniMax M3 and GLM-5.2), sized via an on-page calculator that estimates cost vs. commercial-model list prices assuming 24/7 provisioning.
Founders include Stanford CRFM director Percy Liang and Stanford ML researcher Chris Re — making Together the rare commercial cloud with top-tier academic-lab architecture credibility on top of standard founder-CEO leadership.
Together raised a $305M Series B in February 2025 led by General Catalyst at $3.3B post-money; NVIDIA, Salesforce Ventures, and others participated. Series C reported in late 2025 at $5B+ valuation.

Pricing summary

Together AI 2026 — Multi-SKU AI Acceleration Cloud

Serverless tokens + dedicated endpoints + on-demand/reserved clusters + Code Sandbox; 50% Batch discount

Plan	Price	Best for
Free trial		Evaluating Together for proof-of-value
Pay-as-you-go	Per token (varies)	Variable-traffic AI applications
Enterprise (annual commit)	Custom	Sustained workloads, regulated industries
Dedicated endpoints	$5.49/hr (on-demand H100)	Single-tenant per-GPU-hour inference
GPU clusters	From $3.59/hr (reserved H100)	Training and large-batch inference
Provisioned Throughput (new)	$0.05/PTU-minute	Reserved capacity in throughput units (PTUs)

No monthly fee. Dedicated endpoints were restructured and cut on 2026-07-14 (on-demand H100 now $5.49/hr, B200 $8.99/hr) and a Provisioned Throughput (PTU) SKU launched at $0.05/PTU-min. Reserved cluster rates require 7–30 day commits (as low as $3.09/hr H100 on a 91–180 day reserve). Batch API 50% discount stacks with neither cached input nor reserved rates. Code Sandbox bills per vCPU-hour and GiB-hour separately.

About

Together AI is a San Francisco-based generative AI cloud company founded in June 2022 by Vipul Ved Prakash (ex-Topsy CEO and Cloudmark founder), Ce Zhang (then ETH Zurich systems professor, now at the University of Chicago), Chris Re (Stanford ML and Snorkel co-founder), and Percy Liang (Stanford CRFM director). The product is an AI Acceleration Cloud — a managed inference, training, and code-execution platform optimized for open-source models and customer-fine-tuned variants — combining per-token serverless inference, per-hour dedicated endpoints, per-hour GPU clusters (on-demand and reserved), Code Sandbox / Code Interpreter for agentic workflows, and a fine-tuning service. The runtime is built on Together’s proprietary Together Inference Engine with FlashAttention-3 kernels and speculative decoding pipelines.

By 2026 Together serves Salesforce, Zoom, Pika Labs, Hippocratic AI, Cartesia, Arc Institute, and roughly 1,500 other paying customers spanning enterprise AI infrastructure (RAG systems, multi-tenant fine-tunes, large-batch inference), academic research labs running open-source training, and AI-native startups serving production workloads. The company raised a $305M Series B in February 2025 led by General Catalyst at a $3.3B post-money valuation with NVIDIA, Salesforce Ventures, Coatue, and Kleiner Perkins participation; a Series C reported in late 2025 brought valuation past $5B.

Together competes with Fireworks AI, Baseten, Replicate, Anyscale, and Groq for the managed-inference market, plus hyperscaler offerings (AWS Bedrock, Vertex AI, Azure ML). Its differentiation is the combination of academic-lab founder credibility (Stanford CRFM + Stanford ML), one of the broadest open-source model catalogs in the industry, aggressive reserved cluster pricing (H100 at $3.59/hr is among the lowest published rates), and Code Sandbox as a non-token SKU that captures agentic code-execution workloads without forcing customers onto third-party sandbox providers.

Pricing summary : How Together’s multi-SKU AI Acceleration Cloud is priced

Together runs five parallel pricing surfaces on a unified credits balance. Serverless inference charges per million input/output tokens by model, with per-model rates published inline on the pricing page (rare among competitors who route to docs). Provisioned Throughput (PTU) — added July 2026 — reserves dedicated capacity in throughput units billed per PTU-minute ($0.05/PTU-min on MiniMax M3 and GLM-5.2), sized via an on-page calculator that estimates monthly cost against commercial-model list prices. Dedicated endpoints are single-tenant per-GPU-per-hour rentals at $5.49/hr on-demand H100 and $8.99/hr on-demand B200 (reserved capacity is “Contact sales”), optimized for sustained-QPS workloads. GPU clusters are multi-node per-hour rentals for training and large-batch inference at $3.99/hr on-demand H100 and $3.59/hr reserved H100 (7–30 day commit). Code Sandbox and Code Interpreter bill per vCPU-hour, GiB-hour, and per-session for agentic code execution.

A Batch API offers a flat 50% discount on most serverless models for asynchronous workloads. Fine-tuning is priced per 1M training tokens by model size and method, with a specialized tier ($3–$40 SFT LoRA, up to $100 DPO LoRA per 1M, plus $6–$60 per-model minimum charges) for frontier architectures like DeepSeek-R1 and GLM-5. Enterprise commitments unlock volume discounts on top of reserved cluster rates and enable VPC deployment, custom SLAs, and dedicated solutions engineering. This multi-SKU pure-usage architecture — token / PTU-minute / image / GPU-hour / vCPU-hour — is one of the most expansive usage-based rate cards in AI infrastructure.

What makes this different: Reserved cluster pricing at $3.59/hr H100 (7–30 day commit), and as low as $3.09/hr on a 91–180 day reservation, sits well below typical on-demand H100 rates from peers like Fireworks AI and Baseten. The customer takes on the utilization risk (the commit is billed for the full 7–30 days regardless of usage) in exchange for a lower per-hour cost, a structural choice that captures large-batch training and inference customers who can guarantee sustained utilization.

Pricing by product

Serverless inference (per-token, chat models)

Model	Input ($/1M)	Output ($/1M)	Cached input ($/1M)
DeepSeek V4 Pro	$1.74	$3.48	$0.20
GLM-5.2	$1.40	$4.40	$0.26
GLM-5.1	$1.40	$4.40	$0.26
Kimi K2.6	$1.20	$4.50	$0.20
Llama 3.3 70B	$1.04	$1.04	—
MiniMax M3	$0.30	$1.20	$0.06
Qwen3.5 9B	$0.17	$0.25	—
gpt-oss-120B	$0.15	$0.60	—

Image generation (per image)

Model	Rate
FLUX.2 [dev]	$0.0154
FLUX.1 [schnell]	$0.0027
Stable Diffusion 3	$0.0019

Provisioned Throughput (PTU) — reserved capacity in throughput units

New in July 2026. PTUs reserve dedicated capacity billed per PTU-minute; each PTU delivers a fixed, model-specific tokens-per-minute (TPM) rate that differs for input, cached, and output tokens. An on-page calculator (“Estimate your PTUs & cost”) sizes the PTUs required for a target traffic profile and estimates monthly cost and savings vs. a selected commercial model’s list price (assuming continuous 24/7 provisioning, ~43,800 min/mo).

Model	Input TPM/PTU	Cached TPM/PTU	Output TPM/PTU	Price ($/PTU-min)
MiniMax M3	138,840	694,200	23,140	$0.05
GLM-5.2	35,731	192,400	9,620	$0.05
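
As a rough illustration of how a traffic profile might map onto the table above, the sketch below sizes a reservation by adding up the fraction of a PTU that each token type consumes. That additive model is an assumption: the source does not state the formula the on-page calculator uses, so treat the result as an approximation. The function names are hypothetical; the TPM figures, the $0.05 rate, and the 43,800-minute month come from this section.

```python
import math

# Per-PTU throughput (tokens per minute) and price, from the table above.
PTU_SPECS = {
    "MiniMax M3": {"input": 138_840, "cached": 694_200, "output": 23_140, "price": 0.05},
    "GLM-5.2": {"input": 35_731, "cached": 192_400, "output": 9_620, "price": 0.05},
}
MINUTES_PER_MONTH = 43_800  # continuous 24/7 provisioning


def estimate_ptus(model, input_tpm, cached_tpm, output_tpm):
    """Estimate PTUs for a peak traffic profile.

    ASSUMPTION: capacity is additive, i.e. each token type uses a share of a
    PTU proportional to its TPM. The real calculator may size differently.
    """
    spec = PTU_SPECS[model]
    load = (
        input_tpm / spec["input"]
        + cached_tpm / spec["cached"]
        + output_tpm / spec["output"]
    )
    return max(1, math.ceil(load))


def ptu_monthly_cost(model, ptus):
    """USD per month for a reservation that is provisioned around the clock."""
    return ptus * PTU_SPECS[model]["price"] * MINUTES_PER_MONTH


ptus = estimate_ptus("MiniMax M3", input_tpm=200_000, cached_tpm=300_000, output_tpm=40_000)
print(ptus, f"${ptu_monthly_cost('MiniMax M3', ptus):,.0f}/month")
```

Under these assumptions a single PTU costs $2,190 per month, so sizing is driven mostly by output tokens, which have the lowest TPM per PTU.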

Dedicated endpoints (single-tenant, per GPU per hour)

Restructured 2026-07-14 into on-demand (pay-as-you-go) vs reserved (Contact sales) columns, with a per-GPU-per-hour cut on the priced lines (on-demand H100 now $5.49/hr, B200 $8.99/hr):

Hardware	On-demand (PAYG)	Reserved
NVIDIA HGX H100	$5.49	Contact sales
NVIDIA HGX H200	Contact us	Contact sales
NVIDIA HGX B200	$8.99	Contact sales
NVIDIA HGX B300	Contact us	Contact sales
NVIDIA GB200 NVL72	Contact us	Contact sales
NVIDIA GB300 NVL72	Contact us	Contact sales

GPU clusters (multi-node)

On-demand (per hour):

Hardware	Hourly rate
NVIDIA HGX H100	$3.99
NVIDIA HGX H200	$5.99
NVIDIA HGX B200	$8.19

Reserved, with the rate stepping down on longer reservations (minimum 7 days):

Hardware	7–30 days	31–90 days	91–180 days	181+ days
NVIDIA HGX H100	$3.59	$3.29	$3.09	Contact us
NVIDIA HGX H200	$4.99	$4.15	$3.99	Contact us
NVIDIA HGX B200	$7.99	$7.79	$6.79	Contact us
NVIDIA GB200 NVL72	Contact us	Contact us	Contact us	Contact us
NVIDIA GB300 NVL72	Contact us	Contact us	Contact us	Contact us
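
The reserved table lends itself to a small lookup plus a break-even check. The sketch below is illustrative: the names are hypothetical, and the break-even assumes on-demand capacity can be released whenever the cluster is idle, which the source does not guarantee. Because a reservation bills every hour in the window, it only beats on-demand once utilization exceeds the ratio of the two rates.

```python
# Hourly rates in USD, copied from the on-demand and reserved tables above.
ON_DEMAND = {"H100": 3.99, "H200": 5.99, "B200": 8.19}
RESERVED = {
    # (min days, max days, hourly rate)
    "H100": [(7, 30, 3.59), (31, 90, 3.29), (91, 180, 3.09)],
    "H200": [(7, 30, 4.99), (31, 90, 4.15), (91, 180, 3.99)],
    "B200": [(7, 30, 7.99), (31, 90, 7.79), (91, 180, 6.79)],
}


def reserved_rate(gpu, days):
    """Return the published reserved rate for a reservation length.

    Returns None when no rate is published: shorter than the shortest
    window, or 181+ days, which is quoted as "Contact us".
    """
    for lo, hi, rate in RESERVED[gpu]:
        if lo <= days <= hi:
            return rate
    return None


def breakeven_utilization(gpu, days):
    """Fraction of reserved hours that must be used for the reservation to
    cost less than paying on-demand only for the hours actually used."""
    rate = reserved_rate(gpu, days)
    if rate is None:
        return None
    return rate / ON_DEMAND[gpu]


for days in (14, 60, 120):
    share = breakeven_utilization("H100", days)
    print(f"H100, {days}-day reservation: break-even at {share:.0%} utilization")
```

For H100 the break-even works out to roughly 90%, 82%, and 77% across the three windows, which is why the reserved rates suit workloads that keep the cluster busy nearly all the time.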

Fine-tuning — standard tier (per 1M training tokens)

Base model size	SFT LoRA	SFT full	DPO LoRA	DPO full
Up to 16B	$0.48	$0.54	$1.20	$1.35
17B – 69B	$1.50	$1.65	$3.75	$4.12
70B – 100B	$2.90	$3.20	$7.25	$8.00

Each standard fine-tuning job is subject to a minimum charge of $4.00.
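
A job estimate on the standard tier can be sketched as a rate lookup with the minimum applied as a floor. Treating the $4.00 minimum as a floor on the job total is an assumption about how it is billed, and the function and key names are hypothetical; the rates are the ones in the table above.

```python
STANDARD_MIN_CHARGE = 4.00  # USD per job, from the note above

# (largest base-model size in billions of parameters, rates per 1M tokens)
STANDARD_TIERS = [
    (16, {"sft_lora": 0.48, "sft_full": 0.54, "dpo_lora": 1.20, "dpo_full": 1.35}),
    (69, {"sft_lora": 1.50, "sft_full": 1.65, "dpo_lora": 3.75, "dpo_full": 4.12}),
    (100, {"sft_lora": 2.90, "sft_full": 3.20, "dpo_lora": 7.25, "dpo_full": 8.00}),
]


def standard_finetune_cost(params_b, tokens_m, method):
    """Estimate USD for one standard-tier job.

    params_b: base-model size in billions of parameters.
    tokens_m: training tokens in millions.
    method:   one of sft_lora, sft_full, dpo_lora, dpo_full.
    ASSUMPTION: the per-job minimum acts as a floor on the total.
    """
    for max_b, rates in STANDARD_TIERS:
        if params_b <= max_b:
            return max(STANDARD_MIN_CHARGE, tokens_m * rates[method])
    raise ValueError("models above 100B are outside the standard tier")


print(standard_finetune_cost(8, 2, "sft_lora"))    # small job, floor applies
print(standard_finetune_cost(70, 10, "sft_full"))  # 10M tokens on a 70B model
```

The first call shows the floor in action: 2M tokens at $0.48 would be $0.96, so the job bills at the $4.00 minimum under this reading.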

Fine-tuning — specialized per-model tier (per 1M training tokens)

Frontier architectures are priced per model on a separate “Specialized” tab (SFT LoRA / DPO LoRA / per-job minimum charge shown):

Model	SFT LoRA	DPO LoRA	Minimum charge
Llama 4 Scout	$3.00	$7.50	$6.00
gpt-oss-120B	$5.00	$12.50	$6.00
Qwen3-235B-A22B	$6.00	$15.00	No min. price
Qwen3.5-397B-A17B	$8.00	$20.00	$22.00
GLM-4.6 / GLM-4.7	$9.00	$22.50	$27.00
DeepSeek-R1 / V3	$10.00	$25.00	$20.00
Kimi K2	$15.00	$37.50	$60.00
GLM-5 / GLM-5.1	$40.00	$100.00	$60.00

Specialized rates span roughly $3–$40 SFT LoRA and $7.50–$100 DPO LoRA per 1M tokens, with per-model minimum charges ($6–$60) on most frontier models (Qwen3-235B is the exception at “No min. price”).

Code Sandbox + Code Interpreter

Resource	Rate
Code Sandbox (vCPU)	$0.0446/vCPU-hour
Code Sandbox (memory)	$0.0149/GiB-hour
Code Interpreter (session)	$0.03/session
Storage (sandbox or model)	$0.16/GiB-month
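
Because the sandbox meters vCPU and memory separately, a runtime estimate needs both terms. The sketch below assumes billing follows the vCPUs and GiB allocated for the hours a sandbox runs; the source gives only the unit rates, and the function names are hypothetical.

```python
# Unit rates in USD, from the table above.
VCPU_HOUR = 0.0446
GIB_HOUR = 0.0149
SESSION = 0.03
STORAGE_GIB_MONTH = 0.16


def sandbox_runtime_cost(vcpus, memory_gib, hours):
    """Runtime cost of one sandbox shape kept up for a number of hours.

    ASSUMPTION: usage is metered on allocated vCPUs and memory.
    """
    return hours * (vcpus * VCPU_HOUR + memory_gib * GIB_HOUR)


def monthly_agent_cost(vcpus, memory_gib, hours, sessions=0, storage_gib=0):
    """Runtime plus Code Interpreter sessions plus attached storage."""
    return (
        sandbox_runtime_cost(vcpus, memory_gib, hours)
        + sessions * SESSION
        + storage_gib * STORAGE_GIB_MONTH
    )


runtime = sandbox_runtime_cost(vcpus=2, memory_gib=4, hours=100)
total = monthly_agent_cost(2, 4, 100, sessions=300, storage_gib=50)
print(f"runtime ${runtime:.2f}, total ${total:.2f}")
```

For a 2 vCPU, 4 GiB sandbox kept up for 100 hours, the runtime comes to about $14.88, of which roughly 40% is memory, so an estimate that counts only vCPU-hours will understate the bill.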

Sales motions across products: PLG / self-serve for serverless, on-demand clusters, Provisioned Throughput sizing, and Code Sandbox; sales-led for reserved dedicated/cluster capacity, Enterprise annual contracts, and VPC deployments.

Hidden costs : What Together AI customers actually pay beyond the rate card

Archetype A: AI-native startup running Llama 3.3 70B serverless with bursty traffic

A growth-stage AI assistant startup serving ~2,250 requests/day, average 1.5K input + 400 output tokens, with traffic concentrated in business hours:

Line item	Monthly cost
Input tokens (3.4M/day × 30 = 101M, Llama 70B at $1.04/1M)	$105
Output tokens (900K/day × 30 = 27M, Llama 70B at $1.04/1M)	$28
Batch API for nightly summarization workflows (10M tokens, -50%)	$5
Code Interpreter for occasional agent execution (300 sessions × $0.03)	$9
Estimated total	~$147/month

For bursty traffic without sustained QPS, serverless wins, and per-token cost makes up nearly the whole bill. Moving to a dedicated H100 endpoint ($5.49/hr on-demand, cut from $6.49 on 2026-07-14) would cost ~$4,000/month, which is only economical if sustained QPS rises above ~4 req/sec.
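
One way to see how far this archetype sits from the dedicated break-even is to convert the endpoint's monthly cost into an equivalent serverless token volume. The sketch below assumes a single always-on H100, a 30-day month, and one blended per-token rate (reasonable for Llama 3.3 70B, where input and output are both $1.04). It says nothing about whether one GPU could actually serve that volume, and the names are hypothetical.

```python
def breakeven_mtok_per_month(hourly_rate, rate_per_mtok, hours_per_day=24, days=30, gpus=1):
    """Millions of tokens per month at which serverless spend equals the
    cost of an always-on dedicated endpoint."""
    dedicated = hourly_rate * hours_per_day * days * gpus
    return dedicated / rate_per_mtok


# On-demand HGX H100 at $5.49/hr versus Llama 3.3 70B at $1.04 per 1M tokens.
breakeven = breakeven_mtok_per_month(5.49, 1.04)

# Archetype A volume from the table above: 101M input + 27M output.
archetype_a_mtok = 101 + 27

print(f"break-even: {breakeven:,.0f}M tokens/month")
print(f"archetype A is {breakeven / archetype_a_mtok:.1f}x below break-even")
```

Under these assumptions the break-even is about 3,800M tokens a month, close to 30 times the archetype's volume, which is consistent with serverless being the cheaper option here.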

Archetype B: Mid-market team running a Llama 70B fine-tune on reserved H100 cluster

A team that fine-tuned Llama 3.3 70B (full-parameter SFT, 25M training tokens) and runs sustained inference on a reserved H100 cluster:

Line item	Monthly cost
Initial fine-tuning (one-time, 25M tokens × $3.20)	$80
Reserved H100 (24h × 30 × $3.59/hr)	$2,585
Storage for model artifacts + sandbox (50GiB × $0.16)	$8
Code Sandbox for agent execution (200 vCPU-hours × $0.0446)	$9
Estimated total	~$2,600/month recurring (plus the one-time $80 fine-tune)

Reserved cluster pricing dominates the bill at sustained utilization — and the $3.59/hr H100 reserved rate (dropping to $3.09/hr on a 91–180 day reservation) makes Together one of the cheapest published managed-inference platforms for training and large-batch workloads. The trade-off is the reservation commit: even idle hours cost the customer.

Want to estimate your own Together AI bill? Use the Together AI pricing calculator to model serverless tokens, dedicated GPU hours, reserved cluster commits, and Code Sandbox costs.

Pricing evolution : Together’s pricing history from decentralized GPU pooling to AI Acceleration Cloud

Cadence

Quarter	Price changes	Product / SKU additions	Notes
2022 Q2	0	1	Together founded; decentralized GPU pooling product
2023 Q4	0	1	Inference Cloud GA + Series A ($102.5M)
2024 Q1	0	1	Dedicated endpoints + fine-tuning launched
2024 Q3	1	1	GPU Clusters launched at $5.49/hr on-demand H100
2025 Q1	0	0	Series B ($305M) at $3.3B valuation
2025 Q2	0	1	Batch API + Code Sandbox + Code Interpreter
2025 Q4	0	1	FLUX.2 + FLUX-schnell + Stable Diffusion 3 image SKUs
2026 Q1	1	0	Specialized fine-tuning tier (DeepSeek-R1, GLM-5) at $10–$100+
2026 Q2	2	1	2026-06-24 broad serverless re-pricing + GPU cluster rate cuts (on-demand H100 $5.49→$4.79, reserved 7–30d H100 $4.99→$4.19); cached-input rates first published; HGX H200 cluster line added; 2026-06-30 second GPU cluster cut in a week (on-demand H100 $4.79→$3.99, reserved 7–30d H100 $4.19→$3.59, 91–180d floor $3.29→$3.09); 1× H200 140GB dedicated line added; standard fine-tuning $4.00 per-job minimum stated
2026 Q3	2	1	2026-07-14 Provisioned Throughput (PTU) launched at $0.05/PTU-min (MiniMax M3, GLM-5.2) with an on-page sizing calculator; Dedicated Inference restructured from per-instance to a per-GPU-per-hour on-demand-vs-reserved grid and cut (on-demand H100 $6.49→$5.49, B200 $11.95→$8.99); HGX H200/B300 + GB200/GB300 NVL72 dedicated lines added (Contact us), all reserved dedicated capacity moved to Contact sales; Series C funding banner appeared

Tracked range: 2022 Q2–2026 Q3. Quarters not listed above were verified stable (0 price changes, 0 SKU additions).

Notable changes

2023-11-29 — Inference Cloud GA with per-token serverless API; established Together as a Cloudflare-for-LLM-inference contender.
2024-03-12 — Dedicated endpoints + fine-tuning launched; expanded from single-SKU per-token to multi-SKU platform.
2024-09-20 — GPU Clusters launched at $5.49/hr on-demand H100 and $4.99/hr reserved (7–30 day commit); some of the lowest published H100 rates in managed inference.
2025-06-18 — Batch API at 50% discount launched; Code Sandbox + Code Interpreter added non-token SKUs to the rate card.
2025-10-08 — FLUX.2 + FLUX-schnell + Stable Diffusion 3 image generation SKUs launched at per-image rates.
2026-01-15 — Specialized fine-tuning tier launched for DeepSeek-R1, GLM-5, and other large-context frontier architectures; reflected higher infrastructure cost of training on newer architectures.
2026-06-24 — Broad serverless re-pricing and GPU cluster rate cuts: on-demand cluster H100 $5.49→$4.79/hr and B200 $9.95→$8.19/hr; reserved 7–30 day H100 $4.99→$4.19/hr and B200 $9.65→$7.99/hr (H100 as low as $3.29/hr on a 91–180 day reserve); an HGX H200 line was added at $5.99/hr on-demand. On serverless, DeepSeek V4 Pro fell to $1.74/$3.48 while Qwen3.5 9B and Llama 3.3 70B rose; cached-input rates (e.g. GLM-5.1/5.2 $0.26, DeepSeek V4 Pro / Kimi K2.6 $0.20) were published for the first time, and the specialized fine-tuning tier began carrying $6–$60 per-model minimum charges.
2026-06-30 — A second GPU cluster rate cut within the same week: on-demand HGX H100 $4.79→$3.99/hr and reserved H100 stepping down across every window (7–30 day $4.19→$3.59, 31–90 day $3.45→$3.29, 91–180 day $3.29→$3.09), making $3.09/hr the new published reserved-H100 floor. On-demand and reserved H200/B200 rates were unchanged. A new 1× H200 140GB dedicated-endpoint line appeared (priced “Contact us”), and the standard fine-tuning tier began stating a $4.00 per-job minimum charge.
2026-07-14 — Provisioned Throughput (PTU) launched — a fifth pricing surface that reserves dedicated capacity in throughput units at $0.05/PTU-minute (MiniMax M3, GLM-5.2), sized via an on-page calculator that estimates monthly cost and savings vs. commercial-model list prices. In the same update Dedicated Inference was restructured from a per-instance table to a per-GPU-per-hour on-demand-vs-reserved grid and cut: on-demand HGX H100 $6.49→$5.49/hr (−15%) and HGX B200 $11.95→$8.99/hr (−25%), with HGX H200/B300 and GB200/GB300 NVL72 lines added (quoted “Contact us”) and all reserved dedicated capacity moved to “Contact sales”. Serverless per-token and GPU cluster rates were unchanged. A Series C funding banner also appeared.

The June 2026 repricing in detail

The two June 2026 moves — 2026-06-24 and 2026-06-30, a week apart — are the first broad rate-card changes since GPU Clusters launched in 2024, and together they sharpen Together’s cost-leadership position rather than reversing it. Across the two cuts the reserved 7–30 day H100 rate fell from $4.99 to $3.59/hr and the 91–180 day floor from “$3.29 (post-launch)” down to $3.09/hr, while on-demand H100 dropped from $5.49 to $3.99/hr — pushing Together’s managed-Hopper rate further below typical peer on-demand pricing, so the structural advantage this page has tracked since launch only widened. The serverless re-pricing on the 24th mixed cuts and raises: DeepSeek V4 Pro got materially cheaper ($2.10/$4.40 → $1.74/$3.48) while the small-model floor (Qwen3.5 9B, Llama 3.3 70B) rose modestly, a re-rating toward the heavier-traffic reasoning models.

Three effects of these moves revise claims this page previously made. First, cached-input pricing is now published on much of the catalog (GLM-5.1/5.2 $0.26, DeepSeek V4 Pro and Kimi K2.6 $0.20, MiniMax M3 $0.06), so any prior claim that Together shipped “no cached-input discount” is no longer true; the remaining gap is coverage (Llama 3.3 70B and gpt-oss still carry no cached rate), not absence. Second, the specialized fine-tuning tier now states $6–$60 per-model minimum charges, and the 06-30 update added a $4.00 per-job minimum on the standard fine-tuning tier. That reverses the earlier “no per-job minimum” claim across both tiers, so even routine fine-tunes now carry a small floor finance teams must model. Third, the two back-to-back GPU cuts signal that Together is racing the managed-Hopper rate down rather than holding it: $3.09/hr reserved H100 is roughly a 38% cut from the $4.99/hr 2024 launch reserve in under two years.

The July 2026 PTU launch in detail

The 2026-07-14 update is the first structural rate-card change since 2024 rather than another rate cut — it adds a fifth billing dimension (the PTU-minute) and re-shapes how dedicated capacity is sold. Provisioned Throughput answers the one gap Together’s pure-usage model left open for high-volume production buyers: per-token serverless bills bounce with traffic, and per-GPU-hour dedicated forces the customer to reason in hardware rather than throughput. A PTU abstracts both away — the customer reserves a fixed tokens-per-minute envelope ($0.05/PTU-minute on MiniMax M3 and GLM-5.2) and the on-page calculator translates a traffic profile into PTUs and a monthly cost, benchmarked against a commercial model’s list price. That framing — reserved throughput sold against the list price of a closed frontier model — is a direct play for buyers weighing an open-weight deployment versus an OpenAI/Anthropic API bill, and it is the clearest example yet of Together packaging its cost advantage as a budgeting story rather than a raw rate.

The simultaneous Dedicated Inference restructure sharpens rather than reverses the cost-leadership thesis this page has tracked: on-demand HGX H100 fell 15% ($6.49→$5.49/hr) and HGX B200 25% ($11.95→$8.99/hr), and the per-instance table became an on-demand-vs-reserved grid that pushes every serious commitment (“Contact sales”) and every next-gen part (H200, B300, GB200/GB300 NVL72 — “Contact us”) into a sales conversation. So the transparency this page praises on serverless thins on the newest dedicated hardware: the headline H100/B200 rates stay public, but Blackwell-Ultra and Grace-Blackwell capacity is now quote-only — a deliberate trade of published-rate breadth for sales-qualified enterprise capacity as the newest GPUs come online.

What’s unique : Together AI’s distinctive pricing mechanics

1. Per-model serverless rates published inline on the pricing page. Most inference middleware (Fireworks, Baseten) lists discount mechanics on the pricing page but routes to docs for per-model rates. Together’s inline display lets self-serve buyers compare model economics side-by-side without context switching — a pricing transparency UX advantage that materially reduces evaluation friction.

2. Reserved cluster pricing ($3.59/hr H100, dropping to a $3.09/hr floor) for short commits. Most platforms offer either on-demand (high price, no commit) or annual commits (lowest price, year-long lock-in). Together’s 7–30 day reserved tier creates a middle path: customers commit a week to a month at $3.59/hr, about 10% below the $3.99/hr on-demand rate, and step down to a $3.09/hr floor (about 23% below on-demand) on a 91–180 day reserve, all without annual lock-in. This commitment flexibility captures large-batch training customers who would balk at annual commits.

3. Code Sandbox as a non-token agentic SKU. Code Sandbox bills per vCPU-hour and GiB-hour — a fundamentally different metric than tokens — and Code Interpreter bills per session. Adding non-token SKUs to a token-dominated rate card lets Together capture agentic code-execution workloads without forcing customers onto third-party sandbox providers (E2B, Modal). The unified billing reduces vendor count for AI-native teams building autonomous agents.

4. Academic-lab founder credibility (Stanford CRFM + Stanford ML). Together’s co-founders include Stanford CRFM director Percy Liang and Stanford ML researcher Chris Re, which makes the platform’s optimization claims and model curation credible in a way that pure-engineering teams cannot replicate. The CRFM HELM leaderboard, the academic stewardship of open-source models, and the Together rate card share a knowledge base.

5. Multi-mode cluster pricing (on-demand + reserved) at different commit windows. Most clusters force a single mode choice (on-demand-only or reserved-only); Together’s three-tier structure (on-demand, 7–30 day reserved, Enterprise annual commit) lets customers match commit duration to workload predictability. This granular commitment design accommodates training cycles (weeks) and steady production (months) without forcing one model to fit both.

6. Provisioned Throughput (PTU) — reserved capacity sold in throughput units, benchmarked against commercial-model list prices. Launched July 2026, PTUs bill per PTU-minute ($0.05 on MiniMax M3 and GLM-5.2) for a fixed, model-specific tokens-per-minute envelope, and an on-page calculator sizes the reservation and estimates savings against a selected commercial model’s list price. Applying the Azure-OpenAI-style provisioned-throughput unit to open-weight models gives high-volume buyers a predictable-cost alternative to per-token variability — and frames the pitch as open-weight-vs-closed-frontier economics rather than raw GPU-hours.

Strengths & weaknesses

Strengths	Weaknesses
Per-model serverless rates published inline — best transparency in the category	Per-model rates require reading a long inline table; no comparison filter
Reserved cluster H100 at $3.59/hr (down to a $3.09/hr floor) is among the lowest published in managed inference	Reserved commits require a 7–30 day duration — idle hours still billed
Academic-lab founder credibility (CRFM + Stanford ML)	Code Sandbox vCPU-hour rates need separate cost-modeling alongside token spend
Code Sandbox + Code Interpreter capture agentic code-execution without third-party tools	Cached-input rates published on many but not all serverless models (Llama 3.3 70B, gpt-oss have none)
FLUX.2 + FLUX-schnell + SD3 image SKUs unified in rate card	A100 not prominently listed — A100 capacity available but not on the headline rate card
Multi-mode cluster pricing (on-demand, 7–30 day reserved, annual) accommodates many workload types	Specialized fine-tuning tier (up to $40 SFT LoRA / $100 DPO LoRA per 1M) prices frontier-model tuning well above the standard tier
Provisioned Throughput (PTU) gives steady high-volume workloads predictable per-PTU-minute cost, with a calculator that quantifies savings vs commercial-model list prices	PTU launched on only two models (MiniMax M3, GLM-5.2), and its savings estimate assumes 24/7 provisioning — idle throughput is still billed; next-gen dedicated GPUs (H200, B300, GB200/GB300) are now quote-only

Billing UX : Together AI’s account controls and payment experience

Self-serve signup — Sign up at api.together.ai with email; trial credits applied automatically. Credit card required for production usage.
Unified credits balance — Serverless tokens, dedicated GPU hours, GPU cluster hours, fine-tuning training tokens, and Code Sandbox usage all bill against the same workspace credits balance.
Per-request usage metadata — API responses include input tokens, output tokens, and per-request cost so client applications can compute and surface real-time cost.
Per-model rate visibility — Pricing page displays per-model rates inline; dashboard shows live consumption per model and per SKU.
Spend alerts — Configurable email and webhook alerts at a user-set spend threshold per period.
Payment methods — Credit card and ACH on self-serve; wire transfer, invoice billing, and AWS/GCP Marketplace on Enterprise.
PTU sizing calculator — The pricing page’s “Estimate your PTUs & cost” tool takes model, comparison model, peak requests/sec, cache-hit rate, and input/output tokens per request, then returns PTUs required, estimated monthly cost, and estimated monthly savings vs. the selected commercial model’s list price (24/7 provisioning assumed).
Cluster reservation booking — 7–30 day GPU cluster reservations bookable directly via the dashboard with confirmed start dates; cancellation policies vary by SKU.
Audit logging + RBAC — Workspace-level RBAC on Pro+; SOC 2 audit-log exports on Enterprise via S3 or webhook.
Multi-region availability — US and EU regions standard for serverless; reserved clusters available in additional regions on Enterprise commitments.

Strategic wins : Why Together AI’s pricing decisions worked

1. Inline per-model rate publication removed evaluation friction

By publishing per-model rates on the pricing page rather than routing to docs, Together let self-serve buyers compare model economics in a single context. This transparency converts more self-serve customers and reduces sales-led overhead for low-value-deal segments. Most competitors’ docs-routing UX loses cost-sensitive evaluators who never get to the rate card before churning.

2. 7–30 day reserved cluster tier captured the middle of the commit-duration spectrum

Annual commits lock too much for many training and large-batch customers; on-demand is too expensive for sustained workloads. Together’s 7–30 day reserved tier created a middle option that converts customers who would otherwise self-build on raw cloud. The $3.59/hr H100 reserved rate — and a $3.09/hr floor on a 91–180 day reserve — is low enough to compete with hyperscaler EDP-discounted rates without forcing year-long commitments.

3. Code Sandbox as a non-token SKU expanded TAM beyond token-only buyers

Adding Code Sandbox ($0.0446/vCPU-hour) and Code Interpreter ($0.03/session) gave Together a SKU that captures agentic code-execution workloads that would otherwise go to hosted sandboxes such as E2B or Modal, or to a self-managed runtime such as Pyodide. The unified billing balance reduces vendor count for AI-native teams building autonomous agents, locking in wallet share.
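
To make the non-token meters concrete, here is a rough sketch of how a sandbox bill might be estimated from the published rates. The $0.0149/GiB-hour figure comes from the rate card cited elsewhere on this page. Treating the GiB-hour meter as allocated memory, and the function names themselves, are assumptions for illustration.

```python
VCPU_HOUR = 0.0446            # USD per vCPU-hour (from the text)
GIB_HOUR = 0.0149             # USD per GiB-hour (rate card; assumed to meter memory)
INTERPRETER_SESSION = 0.03    # USD per Code Interpreter session


def sandbox_cost(vcpus: int, mem_gib: float, hours: float) -> float:
    """Estimated Code Sandbox runtime cost for one sandbox shape."""
    return hours * (vcpus * VCPU_HOUR + mem_gib * GIB_HOUR)


def interpreter_cost(sessions: int) -> float:
    """Estimated Code Interpreter cost, billed per session."""
    return sessions * INTERPRETER_SESSION


if __name__ == "__main__":
    # Hypothetical agent fleet: 2 vCPU / 4 GiB sandboxes, 100 runtime hours,
    # plus 1,000 interpreter sessions in the same month.
    total = sandbox_cost(2, 4, 100) + interpreter_cost(1000)
    print(f"estimated monthly cost: ${total:.2f}")
```

Because both meters draw on the same balance as token spend, a single function like this can sit next to a token estimator in a team's budget model.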

4. Academic-lab founder credibility as the platform-runtime trust anchor

Stanford CRFM director Percy Liang and Stanford ML researcher Chris Re as co-founders give Together unusual academic-lab credibility that customers extend to model curation, optimization claims, and platform design. For enterprise procurement leaders evaluating inference middleware, this founder profile distinguishes Together from pure-engineering teams in a way that is hard to replicate.

Areas to improve : Gaps in Together’s pricing approach

1. Cached-input discount is published on some, but not all, serverless models

As of June 2026 Together publishes cached-input rates on many serverless models (e.g. DeepSeek V4 Pro $0.20, GLM-5.1 $0.26, MiniMax M3 $0.06) — closing a gap it previously had versus Fireworks, OpenAI, Anthropic, and Baseten. But popular models like Llama 3.3 70B and gpt-oss still show no cached-input rate, so RAG and agent-loop workloads with high prefix re-use only benefit on a subset of the catalog. Extending cached-input discounting across the full model list would remove the remaining inconsistency.
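
The practical effect of a cached-input rate depends on how much of the prompt is re-used. A minimal sketch of the blended input price follows, assuming cached and uncached input tokens are simply billed at their respective rates. `cache_hit_rate` is a hypothetical workload parameter, not a figure Together publishes.

```python
def blended_input_rate(input_rate: float, cached_rate: float,
                       cache_hit_rate: float) -> float:
    """Effective USD per 1M input tokens at a given cache-hit fraction."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return cache_hit_rate * cached_rate + (1.0 - cache_hit_rate) * input_rate


if __name__ == "__main__":
    # DeepSeek V4 Pro: $1.74 input, $0.20 cached input (from the text).
    print(blended_input_rate(1.74, 0.20, 0.70))
    # Llama 3.3 70B has no cached rate, so cached tokens pay the full $1.04.
    print(blended_input_rate(1.04, 1.04, 0.70))
```

At a 70% hit rate the DeepSeek V4 Pro input price falls from $1.74 to about $0.66 per 1M tokens, while a model without a cached rate sees no change at any hit rate.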

2. Per-model rate comparison needs better filtering

The inline per-model rate table is comprehensive but long. Customers comparing 5–10 models must scroll and scan rather than filter. Adding a per-model filter / sort / search UI on the pricing page would convert more evaluation traffic into pilots without requiring API exploration.

3. Specialized fine-tuning tier (up to $40 SFT LoRA / $100 DPO LoRA per 1M) creates budget uncertainty

The wide per-model spread (from $3 SFT LoRA on Llama 4 Scout to $40 on GLM-5, and up to $100 DPO LoRA) makes it hard for finance teams to forecast fine-tuning budgets on DeepSeek-R1 or GLM-5 without reading the per-model table. The rates are published, but only per model on a separate “Specialized” tab. A per-model price calculator would reduce friction for frontier fine-tuning workloads that currently go to first-party providers.
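
The forecasting problem can be illustrated with a small estimator. This is a sketch under stated assumptions: that billed training tokens equal dataset tokens times epochs, and that a per-job minimum acts as a floor on the charge. The `job_minimum` value is left as a caller-supplied parameter because the per-model minimums are not listed here, and the function name is hypothetical.

```python
# SFT LoRA rates per 1M training tokens quoted on this page.
SFT_LORA_PER_1M = {
    "Llama 4 Scout": 3.0,
    "DeepSeek-R1": 10.0,
    "GLM-5": 40.0,
}


def fine_tune_cost(model: str, dataset_tokens: int, epochs: int = 1,
                   job_minimum: float = 0.0) -> float:
    """Estimated SFT LoRA cost in USD.

    Assumptions: billed tokens = dataset_tokens * epochs, and the
    per-job minimum is a floor on the total charge.
    """
    training_tokens = dataset_tokens * epochs
    cost = training_tokens / 1_000_000 * SFT_LORA_PER_1M[model]
    return max(cost, job_minimum)


if __name__ == "__main__":
    # Hypothetical job: 50M-token dataset, 3 epochs.
    for model in SFT_LORA_PER_1M:
        print(f"{model}: ${fine_tune_cost(model, 50_000_000, epochs=3):,.2f}")
```

The same 150M training tokens cost $450 on Llama 4 Scout and $6,000 on GLM-5, which is the spread a finance team has to discover from the per-model table today.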

4. A100 not on the headline rate card

The on-demand rate card lists H100 and B200 prominently but not A100. For non-frontier workloads that fit comfortably on A100, customers may compare Together’s headline H100 rate to a competitor’s published A100 rate and conclude Together is more expensive. Publishing an A100 rate (even with a “limited availability” disclaimer) would prevent that unfavorable comparison.

5. Provisioned Throughput launched on only two models, and its calculator assumes 24/7 provisioning

The July 2026 PTU SKU covers only MiniMax M3 and GLM-5.2 at launch, so buyers who standardized on Llama, DeepSeek, or Qwen cannot yet reserve throughput. The sizing calculator also estimates savings assuming continuous 24/7 provisioning (~43,800 min/mo), which overstates the benefit for bursty or business-hours-only traffic where reserved throughput sits idle overnight. Extending PTU model coverage and adding a duty-cycle input to the calculator would make the savings estimate honest for non-continuous workloads and broaden PTU’s addressable base beyond always-on production endpoints.
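
The duty-cycle concern can be made concrete with a short sketch. It assumes PTUs are billed for every provisioned minute whether or not traffic arrives, and that the comparison API's cost scales linearly with traffic. `baseline_cost_24x7` (what the comparison model would cost if peak traffic ran around the clock) and the example figures are hypothetical inputs, not values from Together's calculator.

```python
PTU_MINUTE_RATE = 0.05        # USD per PTU-minute (from the text)
MINUTES_PER_MONTH = 43_800    # 730 hours * 60, the 24/7 figure cited above


def ptu_monthly_cost(ptus: int) -> float:
    """Cost of holding `ptus` provisioned for a full month."""
    return ptus * PTU_MINUTE_RATE * MINUTES_PER_MONTH


def duty_adjusted_savings(ptus: int, baseline_cost_24x7: float,
                          duty_cycle: float) -> float:
    """Savings vs. a pay-per-token baseline when traffic runs only part of the time.

    The baseline shrinks with the duty cycle; the PTU bill does not.
    """
    return baseline_cost_24x7 * duty_cycle - ptu_monthly_cost(ptus)


def break_even_duty_cycle(ptus: int, baseline_cost_24x7: float) -> float:
    """Duty cycle below which the reservation costs more than the baseline."""
    return ptu_monthly_cost(ptus) / baseline_cost_24x7


if __name__ == "__main__":
    # Hypothetical: 10 PTUs vs. a baseline that would cost $40,000 at 24/7 peak.
    print(duty_adjusted_savings(10, 40_000, duty_cycle=1.0))
    print(duty_adjusted_savings(10, 40_000, duty_cycle=8 / 24))
    print(break_even_duty_cycle(10, 40_000))
```

Ten PTUs cost $21,900 per month at the published rate, so in this hypothetical the reservation saves money at continuous load but loses money once traffic drops to business hours only.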

6. Next-gen dedicated GPUs moved behind “Contact us” in the 2026-07-14 restructure

The Dedicated Inference restructure kept public rates on HGX H100 ($5.49/hr) and HGX B200 ($8.99/hr) but pushed HGX H200, HGX B300, and the GB200/GB300 NVL72 lines to “Contact us,” with all reserved dedicated capacity now “Contact sales.” That thins the published-rate transparency this page otherwise praises exactly where buyers are evaluating the newest Blackwell-Ultra and Grace-Blackwell hardware. Publishing at least indicative on-demand rates for the next-gen parts would preserve the self-serve legibility that differentiates Together on the rest of the rate card.

Monetization stack & signals : how Together AI builds & buys its revenue engine

Buys: 5 · Builds: 0 · Open roles: 10

The read — where the monetization investment is going

Together AI runs a bought monetization stack, not an in-house metering build: no engineering-blog disclosure of a home-grown billing/metering service surfaced, and the finance/data org instead names third-party tooling. A RevOps posting describes "our integrated technology stack, including Salesforce" (CRM, stated in-use, double-sourced), and two data-warehouse engineering roles own building and maintaining "dbt transformation projects" on the analytics warehouse (data platform, stated in-use across two live reqs). An Infrastructure Accounting Manager role names NetSuite as the in-use ERP ("integrations between ERP (NetSuite), procurement, and asset tracking"), corroborating a separate Sr. Revenue Accountant req that lists NetSuite as preferred experience, so rev-rec on NetSuite is a stated, double-sourced signal. The same Sr. Revenue Accountant req lists Metronome (usage-based billing) and Stripe (payments) as preferred experience. These are strong but inferred signals (preferred-qualification framing, not an in-use disclosure) that usage-metered revenue runs through bought metering and payments rather than a custom meter. Of the 10 listed open roles, half are in customer success and solutions for GPU-cluster and inference accounts (5 roles), alongside a finance and billing-data-platform build-out (2 billing-eng, 2 data-platform) and one RevOps role, reflecting a self-serve-plus-sales-led GPU cloud scaling its revenue-data and post-sale support functions.

Stack — build vs buy

Buys (vendor) · 5

Salesforce · CRM · Job post 1, Job post 2 · Jun 2026

“Oversee and optimize our integrated technology stack, including Salesforce, marketing automation (e.g., HubSpot), and sales engagement tools (e.g., MixMax)”

dbt · Data platform · Job post 1, Job post 2 · Jun 2026

“Build and maintain Airflow orchestrated pipelines and dbt transformation projects (modular, tested, documented)”

Metronome · Metering (inferred) · Job post · Jun 2026

“Experience with Metronome (usage-based billing) and Stripe (payments infrastructure)”

Stripe · Payments (inferred) · Job post · Jun 2026

“Experience with Metronome (usage-based billing) and Stripe (payments infrastructure)”

NetSuite · Revenue recognition · Job post 1, Job post 2 · Jun 2026

“Evaluate and enhance systems and integrations between ERP (NetSuite), procurement, and asset tracking tools to support rapid growth”

Open roles in the revenue & lifecycle org — 10

View open roles

Sr. Revenue Accountant · Billing engineering, RevOps · Jun 17, 2026
Senior Data Engineer · Billing engineering · Jun 17, 2026
Sales and Marketing Operations Manager · RevOps · Jun 17, 2026
Analytics Engineer, Data Warehouse · Data platform · Jun 17, 2026
Data Warehouse Engineer · Data platform · Jun 17, 2026
Customer Support Engineer (GPU Cluster) · Customer success · Jun 17, 2026
Customer Support Engineer (Inference) · Customer success · Jun 17, 2026
Technical Account Manager (TAM), GPU Cluster · Customer success · Jun 17, 2026
Solutions Architect (Inference) · Customer success · Jun 17, 2026
Forward Deployed Engineer (Inference & Post-Training) · Customer success · Jun 17, 2026
+5 more matched roles

Signals reviewed Jun 2026 · derived from public job posts

Job postings fill and close over time — once a posting is filled we keep it as a dated citation (the quoted evidence remains); use View open roles for current listings.

Key takeaways

Inline per-model rate publication beats docs-routing for self-serve conversion. Together’s pricing page transparency converts evaluators that competitors lose to context switching. Self-serve usage-based platforms should display per-SKU per-model rates directly on the pricing page rather than route to documentation.
Commitment-based capacity pricing (multi-window reserved clusters + PTU throughput reservation) captures more buyers than on-demand-or-annual binaries. The 7–30 day reserved cluster tier converts customers with training cycles that don’t fit either extreme — Together’s $3.59/hr reserved H100 (with a $3.09/hr floor on longer reserves) is among the lowest published in managed inference — and the July 2026 Provisioned Throughput SKU adds a second commitment axis, letting steady high-volume buyers reserve a fixed per-PTU-minute throughput envelope instead of reasoning in GPU-hours.
Non-token SKUs (Code Sandbox, Code Interpreter) expand TAM beyond token-only inference buyers. As agentic workflows scale, code-execution sandbox SKUs are becoming table stakes for inference platforms targeting AI-native teams.
Academic-lab founder credibility is a defensible trust anchor. Stanford CRFM and ML lab credentials extend customer trust from founder vision to model curation, optimization claims, and platform design — a trust multiplier competitors cannot replicate without acquiring similar talent.
Cached input discount is becoming table stakes for serverless inference. Together now publishes cached-input rates on many models (DeepSeek V4 Pro $0.20, GLM-5.1 $0.26) — closing a gap it previously had versus Fireworks, OpenAI, and Anthropic — though coverage is still partial (Llama 3.3 70B and gpt-oss carry no cached rate).

UBP implications

Pricing page transparency converts more self-serve revenue than docs-routing. Usage-based platforms should default to inline per-SKU per-model rate display. Docs-routing is acceptable for advanced SKUs (specialized fine-tuning, custom enterprise terms) but should not be the default for the top-traffic surfaces.
Multi-window commitment pricing — plus throughput-unit reservation — captures buyer segments that on-demand-or-annual binaries miss. The 7–30 day reserved tier is the canonical structure for training and large-batch workloads where annual commits over-allocate and on-demand under-allocates; Together’s July 2026 PTU-minute SKU extends the same logic to steady inference, letting buyers reserve a fixed throughput envelope benchmarked against a closed-frontier API’s list price. The design’s honesty depends on the sizing calculator modeling real duty cycles, not only continuous 24/7 provisioning.
Non-token usage SKUs (vCPU-hour, GiB-hour, per-session) are becoming necessary for inference platforms targeting agentic workflows. Token-only rate cards leave code-execution and sandbox workloads on the table for third-party vendors that customers prefer to consolidate.

Sources

Together AI pricing page (accessed 2026-07-14)
Together AI docs — serverless models (accessed 2026-07-14)
Together AI GPU Clusters pricing (accessed 2026-06-30)
Together AI docs — dedicated inference pricing (accessed 2026-07-14)
Together AI docs — fine-tuning pricing (accessed 2026-06-24)
Together blog — Series B announcement (accessed 2026-05-29)
Together blog — Code Sandbox launch (accessed 2026-05-29)
Together AI model catalog (accessed 2026-05-29)
Related infra blueprint — Fireworks AI
Related infra blueprint — Baseten
Blueprint corpus index

Bottom line

Together AI priced its AI Acceleration Cloud around four structural ideas: inline per-model rate publication on the pricing page (best transparency in the category), aggressive reserved cluster pricing at $3.59/hr H100 on a 7–30 day commit window, down to a $3.09/hr floor on 91–180 day reserves (among the lowest published in managed inference), non-token SKUs (Code Sandbox, Code Interpreter) that capture agentic workloads without third-party vendors, and Stanford CRFM + Stanford ML founder credibility that distinguishes Together from pure-engineering platforms. The multi-mode cluster pricing (on-demand, 7–30 day reserved, annual commit) and five-SKU rate card (token / PTU-minute / image / GPU-hour / vCPU-hour, after the July 2026 Provisioned Throughput launch) make Together one of the most expansive usage-based platforms in AI infrastructure.

For AI engineering teams running training cycles, large-batch inference, and agentic code execution at scale, Together is the most legible commercial platform — and the $3.59/hr reserved H100 rate (cut twice within a week in June 2026, down to a $3.09/hr floor on longer reserves) is itself a structural cost advantage. The remaining gaps (partial cached-input coverage on serverless, no A100 on the headline rate card, specialized fine-tuning tier budget uncertainty) are competitive parity issues rather than structural pricing flaws.

Compare with peers via the blueprint corpus, or model your own spend with the Together AI pricing calculator.

Pricing timeline : Major events on a vertical axis

Each milestone below corresponds to a public pricing change, product launch, or material adjustment. Major events use a filled marker; minor adjustments use a faded one.

Provisioned Throughput (PTU) launch + Dedicated Inference restructure & price cuts

Jul 2026

Together launched Provisioned Throughput — a new SKU that reserves dedicated capacity in throughput units (PTUs) billed per PTU-minute ($0.05/PTU-min on MiniMax M3 and GLM-5.2), with an on-page calculator that sizes PTUs and estimates monthly cost vs. commercial-model list prices. In the same update, Dedicated Inference was restructured to a per-GPU-per-hour table split into on-demand (pay-as-you-go) vs reserved (Contact sales) columns: on-demand HGX H100 fell to $5.49/hr (from $6.49) and HGX B200 to $8.99/hr (from $11.95), and NVIDIA HGX H200, HGX B300, GB200 NVL72, and GB300 NVL72 lines were added (quoted "Contact us"). GPU Cluster and serverless rates were unchanged. The pricing page also carries a new Series C funding banner.

captured 2026-07-14

Reserved + on-demand GPU cluster rate cut (H100 down 5–17%)

Jun 2026

Together cut its GPU Cluster rates again. On-demand HGX H100 fell to $3.99/hr (from $4.79); reserved 7–30 day H100 fell to $3.59/hr (from $4.19), 31–90 day to $3.29 (from $3.45), and 91–180 day to $3.09/hr (from $3.29) — making the reserved H100 floor $3.09/hr. On-demand H200/B200 and reserved H200/B200 rates were unchanged. A 1× H200 140GB dedicated-endpoint line was added (priced "Contact us"). The standard fine-tuning tier now states a $4.00 per-job minimum charge.

captured 2026-06-30
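
The size of the H100 cuts in the entry above can be checked directly from the before and after rates it lists. A minimal sketch, using only those figures (the dictionary and function names are illustrative):

```python
# (old, new) USD per GPU-hour for HGX H100, from the timeline entry above.
H100_CUTS = {
    "on-demand": (4.79, 3.99),
    "reserved 7-30 day": (4.19, 3.59),
    "reserved 31-90 day": (3.45, 3.29),
    "reserved 91-180 day": (3.29, 3.09),
}


def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new (negative means a price cut)."""
    return (new - old) / old * 100


if __name__ == "__main__":
    for tier, (old, new) in H100_CUTS.items():
        print(f"{tier}: {pct_change(old, new):+.1f}%")
```

The deepest cuts land on the on-demand and 7–30 day tiers, with the longer reservation windows moving by only a few percent.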

Serverless re-pricing + GPU cluster rate cuts + cached input

Jun 2026

Together repriced its serverless rate card and cut GPU cluster rates. DeepSeek V4 Pro dropped to $1.74/$3.48 (from $2.10/$4.40) and now shows a $0.20 cached-input rate; Qwen3.5 9B rose to $0.17/$0.25 (from $0.10/$0.15); Llama 3.3 70B rose to $1.04/$1.04 (from $0.88/$0.88). On-demand cluster H100 fell to $4.79/hr (from $5.49) and B200 to $8.19/hr (from $9.95); reserved 7–30 day H100 fell to $4.19/hr (from $4.99) and B200 to $7.99/hr (from $9.65), with H100 as low as $3.29/hr on a 91–180 day reservation. Cached-input pricing is now published on many, though not all, serverless models.

captured 2026-06-24

Specialized Model Fine-Tuning Tier

Jan 2026

Together added a specialized model fine-tuning tier for DeepSeek-R1, GLM-5, and other large-context models at $10–$100+ per 1M training tokens with $20–$60 per-job minimum charges. The pricing reflected the higher infrastructure cost of training on the latest frontier architectures.

FLUX.2 Image Generation at $0.0154/image

Oct 2025

Together added FLUX.2 [dev] image generation at $0.0154/image, FLUX.1 [schnell] at $0.0027/image, and Stable Diffusion 3 at $0.0019/image. Per-image pricing was positioned alongside per-token text inference as a unified rate card.

Batch API + Code Sandbox Launched

Jun 2025

Together launched Batch API (50% discount on most models) for asynchronous inference workloads, and Code Sandbox ($0.0446/vCPU-hour, $0.0149/GiB-hour) for agentic code execution. Code Interpreter at $0.03/session added a session-billed SKU to the rate card.

Series B ($305M) at $3.3B Valuation

Feb 2025

Together raised a $305M Series B led by General Catalyst at a $3.3B post-money valuation. NVIDIA, Salesforce Ventures, Coatue, and others participated. The round funded Code Sandbox, FLUX image generation, and the Together AI Acceleration Cloud rebrand.

GPU Clusters at $5.49/hr H100 On-Demand

Sep 2024

Together launched GPU Clusters — on-demand multi-GPU rentals for training and large-batch inference. Pricing at $5.49/hr H100 on-demand and $9.95/hr B200 undercut Fireworks' and Baseten's dedicated rates substantially. Reserved 7–30 day commitments dropped the H100 rate to $4.99/hr.

Dedicated Endpoints + Fine-Tuning Launched

Mar 2024

Together added dedicated single-tenant endpoints (per-hour H100, A100) and a fine-tuning service. This established the multi-SKU architecture (serverless per-token, dedicated per-hour, and fine-tuning per training-token) that remains the canonical structure today.

Series A ($102.5M) + Inference Cloud GA

Nov 2023

Together raised a $102.5M Series A led by Kleiner Perkins with NEA, Lux, and others. Inference Cloud went GA with per-token serverless API for Llama 2, Falcon, Code Llama, and Stable Diffusion. Initial pricing was a flat per-million-token rate by model class.

Together Founded

Jun 2022

Vipul Ved Prakash (ex-Topsy, Cloudmark) co-founded Together with Ce Zhang (ETH Zurich), Chris Re (Stanford), and Percy Liang (Stanford CRFM director). Initial product was decentralized GPU pooling for open-source model training, evolving rapidly into a managed inference cloud through 2023.

Trivia

· Together AI's $3.59/hr H100 reserved rate (7–30 day reservation, dropping to $3.09/hr on a 91–180 day commit) is one of the lowest published rates for any managed Hopper-class GPU — and the $7.99/hr reserved B200 sets a similar floor for Blackwell, both undercutting Fireworks' on-demand rates.
· Together was co-founded by Vipul Ved Prakash (ex-Cloudmark, Topsy founder), Ce Zhang (ETH Zurich systems professor), Chris Re (Stanford ML/Snorkel), and Percy Liang (Stanford CRFM director) — making it the rare commercial product where two top academic ML labs are co-architects of the platform.
· Together's serverless rate card publishes per-model pricing inline on the pricing page (rare among competitors like Fireworks which route to docs), making per-model side-by-side comparison friction-free.

Questions & answers

How much does Together AI cost per month?: Together has no monthly subscription fee — you pay only for the serverless tokens, dedicated GPU hours, fine-tuning training tokens, and Code Sandbox usage you consume. A small RAG application using Llama 3.3 70B at 30M input + 10M output tokens would cost ~$42/month on serverless; the same workload on a dedicated H100 ($5.49/hr) running 4h/day would cost ~$660/month.
What are Together's serverless per-token rates?: Together publishes per-model rates inline on the pricing page. Sample rates per 1M tokens: Llama 3.3 70B at $1.04 input / $1.04 output; DeepSeek V4 Pro at $1.74 input / $3.48 output ($0.20 cached input); Qwen3.5 9B at $0.17 input / $0.25 output; GLM-5.1 at $1.40 input / $4.40 output. Image generation: FLUX.2 [dev] at $0.0154/image, FLUX.1 [schnell] at $0.0027/image, Stable Diffusion 3 at $0.0019/image.
What are Together's GPU rates for dedicated endpoints and clusters?: Dedicated inference (per GPU per hour, on-demand): HGX H100 at $5.49/hr, HGX B200 at $8.99/hr (H200/B300/GB200/GB300 quoted "Contact us"; all reserved dedicated capacity is "Contact sales"). On-demand GPU clusters: HGX H100 at $3.99/hr, HGX H200 at $5.99/hr, HGX B200 at $8.19/hr. Reserved clusters (7–30 day commits): H100 at $3.59/hr, H200 at $4.99/hr, B200 at $7.99/hr, dropping to $3.09/hr H100 on 91–180 day reservations — the reserved rates are among the lowest published in managed inference. Separately, a July 2026 Provisioned Throughput (PTU) SKU reserves capacity in throughput units at $0.05 per PTU-minute (MiniMax M3, GLM-5.2) for buyers who prefer a fixed tokens-per-minute envelope over per-GPU-hour rentals.
Does Together AI have a free tier?: New accounts can start without an upfront commitment, but Together does not publish a specific signup-credit dollar amount on its pricing page or quickstart docs. Calling paid serverless and image models requires a positive credit balance, and production usage requires a payment method on file. There is no permanent free tier.
How does Together's fine-tuning pricing work?: Fine-tuning is priced per 1M training tokens (LoRA vs full-parameter). Standard tier: up to 16B at $0.48 SFT LoRA / $1.20 full; 17–69B at $1.50 / $3.75; 70–100B at $2.90 / $7.25. A specialized per-model tier covers frontier architectures (DeepSeek-R1 $10 SFT LoRA, GLM-5 $40, Qwen3.5-397B $8, Llama 4 Scout $3), ranging roughly $3–$40 SFT LoRA and up to $100 DPO LoRA per 1M tokens. The standard tier states a $4.00 per-job minimum charge as of the June 2026 update, while the specialized tier carries per-model minimum charges of $6–$60 (Qwen3-235B is the exception, with no minimum).
What is Together's Code Sandbox and how is it priced?: Code Sandbox is a managed code-execution environment for agentic workflows. The sandbox runtime is billed at $0.0446 per vCPU-hour and $0.0149 per GiB-hour. Code Interpreter (a higher-level managed session API) bills at $0.03/session. Storage attached to sandbox sessions is $0.16/GiB-month.
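
The arithmetic behind the monthly-cost answer at the top of this Q&A can be reproduced in a few lines. This is a rough sketch assuming a 30-day month and a dedicated endpoint that is billed only while it is running. The function names are illustrative, not part of any Together SDK.

```python
def serverless_monthly(input_mtok: float, output_mtok: float,
                       input_rate: float, output_rate: float) -> float:
    """Serverless cost in USD; token volumes in millions, rates per 1M tokens."""
    return input_mtok * input_rate + output_mtok * output_rate


def dedicated_monthly(hourly_rate: float, hours_per_day: float,
                      days: int = 30) -> float:
    """Dedicated GPU cost in USD, assuming billing only for running hours."""
    return hourly_rate * hours_per_day * days


if __name__ == "__main__":
    # Llama 3.3 70B at $1.04 / $1.04, 30M input + 10M output tokens.
    serverless = serverless_monthly(30, 10, 1.04, 1.04)
    # One on-demand dedicated H100 at $5.49/hr, 4 hours per day.
    dedicated = dedicated_monthly(5.49, 4)
    print(f"serverless: ${serverless:.2f}  dedicated: ${dedicated:.2f}")

    # Monthly token volume (in millions) at which serverless spend at
    # $1.04 per 1M matches the dedicated bill.
    print(f"break-even: {dedicated / 1.04:.0f}M tokens")
```

The two figures come out at $41.60 and $658.80, matching the rounded ~$42 and ~$660 in the answer. On price alone, the dedicated H100 only competes once monthly volume approaches several hundred million tokens.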