
Together AI pricing

Quick summary
Product: AI Acceleration Cloud (serverless inference, dedicated endpoints, GPU clusters, Code Sandbox, fine-tuning)
Industry: technology
Commits: Available (annual)
AI Summary
  • Together AI runs a multi-SKU pure-usage cloud: per-token serverless inference for popular open-weight models (Llama 3.3 70B at $0.88/$0.88, DeepSeek V4 Pro at $2.10/$4.40, Qwen3.5 9B at $0.10/$0.15, GLM-5.1 at $1.40/$4.40), per-image generation (FLUX.2 [dev] $0.0154, FLUX.1 [schnell] $0.0027, Stable Diffusion 3 $0.0019), and per-hour dedicated and cluster GPUs.
  • Dedicated inference endpoints at $6.49/hr H100 and $11.95/hr HGX B200; on-demand GPU clusters at $5.49/hr H100 and $9.95/hr B200; reserved cluster rates (7–30 day commits) at $4.99/hr H100 and $9.65/hr B200 — the reserved rates among the lowest published in the market.
  • Fine-tuning priced per 1M training tokens (LoRA / full-parameter): up to 16B at $0.48 / $1.20 SFT; 17–69B at $1.50 / $3.75; 70–100B at $2.90 / $7.25. A specialized per-model tier (DeepSeek-R1, GLM-5, Qwen3.5, gpt-oss) runs SFT LoRA $3–$40 and DPO LoRA $7.50–$100 per 1M tokens; no per-job minimum.
  • Batch API offers a flat 50% discount on most models; Code Sandbox at $0.0446/vCPU-hour and $0.0149/GiB-hour for agentic code execution; Code Interpreter at $0.03/session; storage at $0.16/GiB-month.
  • Founders include Stanford CRFM director Percy Liang and Stanford ML researcher Chris Re, giving Together academic-lab credibility on platform architecture alongside conventional founder-CEO leadership.
  • Together raised a $305M Series B in February 2025 led by General Catalyst at $3.3B post-money; NVIDIA, Salesforce Ventures, and others participated. A Series C was reported in late 2025 at a valuation above $5B.
Pricing summary
Together AI 2026 — Multi-SKU AI Acceleration Cloud
Serverless tokens + dedicated endpoints + on-demand/reserved clusters + Code Sandbox; 50% Batch discount
| Offering | Price | Use case |
|---|---|---|
| Free trial | Sign up free | Evaluating Together for proof-of-value |
| Enterprise (annual commit) | Custom | Sustained workloads, regulated industries |
| Dedicated endpoints | $6.49/hr (H100) | Single-tenant per-hour inference |
| GPU clusters | From $4.99/hr (reserved H100) | Training and large-batch inference |

No monthly fee. Reserved cluster rates require 7–30 day commits. Batch API 50% discount stacks with neither cached input nor reserved rates. Code Sandbox bills per vCPU-hour and GiB-hour separately.

About

Together AI is a San Francisco-based generative AI cloud company founded in June 2022 by Vipul Ved Prakash (ex-Topsy CEO and Cloudmark founder), Ce Zhang (then ETH Zurich systems professor, now at the University of Chicago), Chris Re (Stanford ML and Snorkel co-founder), and Percy Liang (Stanford CRFM director). The product is an AI Acceleration Cloud — a managed inference, training, and code-execution platform optimized for open-source models and customer-fine-tuned variants — combining per-token serverless inference, per-hour dedicated endpoints, per-hour GPU clusters (on-demand and reserved), Code Sandbox / Code Interpreter for agentic workflows, and a fine-tuning service. The runtime is built on Together’s proprietary Together Inference Engine with FlashAttention-3 kernels and speculative decoding pipelines.

By 2026 Together serves Salesforce, Zoom, Pika Labs, Hippocratic AI, Cartesia, Arc Institute, and roughly 1,500 other paying customers spanning enterprise AI infrastructure (RAG systems, multi-tenant fine-tunes, large-batch inference), academic research labs running open-source training, and AI-native startups serving production workloads. The company raised a $305M Series B in February 2025 led by General Catalyst at a $3.3B post-money valuation with NVIDIA, Salesforce Ventures, Coatue, and Kleiner Perkins participation; a Series C reported in late 2025 brought valuation past $5B.

Together competes with Fireworks AI, Baseten, Replicate, Anyscale, and Groq for the managed-inference market, plus hyperscaler offerings (AWS Bedrock, Vertex AI, Azure ML). Its differentiation is the combination of academic-lab founder credibility (Stanford CRFM + Stanford ML), one of the broadest open-source model catalogs in the industry, aggressive reserved cluster pricing (H100 at $4.99/hr is among the lowest published rates), and Code Sandbox as a non-token SKU that captures agentic code-execution workloads without forcing customers onto third-party sandbox providers.


Pricing summary : How Together’s multi-SKU AI Acceleration Cloud is priced

Together runs four parallel pricing surfaces on a unified credits balance. Serverless inference charges per million input/output tokens by model, with per-model rates published inline on the pricing page (rare among competitors who route to docs). Dedicated endpoints are single-tenant per-hour rentals at $6.49/hr H100 and $11.95/hr B200, optimized for sustained-QPS workloads. GPU clusters are multi-node per-hour rentals for training and large-batch inference at $5.49/hr on-demand H100 and $4.99/hr reserved H100 (7–30 day commit). Code Sandbox and Code Interpreter bill per vCPU-hour, GiB-hour, and per-session for agentic code execution.

A Batch API offers a flat 50% discount on most serverless models for asynchronous workloads. Fine-tuning is priced per 1M training tokens by model size and method, with a specialized per-model tier (roughly $3–$40 SFT LoRA and $7.50–$100 DPO LoRA per 1M tokens, with no per-job minimum according to Together’s docs) for frontier architectures like DeepSeek-R1 and GLM-5. Enterprise commitments unlock volume discounts on top of reserved cluster rates and enable VPC deployment, custom SLAs, and dedicated solutions engineering. This four-SKU pure-usage architecture (token / image / hour / vCPU-hour) is one of the most expansive usage-based rate cards in AI infrastructure.

What makes this different: Reserved cluster pricing at $4.99/hr H100 (7–30 day commit), and as low as $3.99/hr on a 91–180 day reservation, sits well below typical on-demand H100 rates from peers like Fireworks AI and Baseten. Together shifts utilization risk to the customer (who pays for the full 7–30 days regardless of usage) in exchange for a lower per-hour cost, a structural choice that captures large-batch training and inference customers who can guarantee sustained utilization.
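
The utilization trade-off can be made concrete with a break-even calculation. The sketch below is illustrative only: it uses the on-demand and reserved H100 cluster rates quoted in this article and assumes, as a simplification, that a reserved commit bills every hour of its window while on-demand bills only the hours actually used. The function names are hypothetical.

```python
# H100 cluster rates quoted in this article ($ per GPU-hour).
ON_DEMAND_H100 = 5.49
RESERVED_H100 = 4.99


def breakeven_utilization(reserved_rate: float, on_demand_rate: float) -> float:
    """Fraction of committed hours that must be used before reserved is cheaper.

    Assumption: reserved bills all committed hours; on-demand bills used hours only.
    """
    return reserved_rate / on_demand_rate


def cheaper_mode(hours_used: float, commit_hours: float) -> str:
    """Compare one reserved commit against paying on-demand for the hours used."""
    reserved_cost = commit_hours * RESERVED_H100
    on_demand_cost = hours_used * ON_DEMAND_H100
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"


# A 7-day commit is 168 hours per GPU.
print(round(breakeven_utilization(RESERVED_H100, ON_DEMAND_H100), 3))  # 0.909
print(cheaper_mode(hours_used=140, commit_hours=168))  # on-demand
print(cheaper_mode(hours_used=160, commit_hours=168))  # reserved
```

Under these assumptions the reserved rate only wins when roughly 91% or more of the committed hours are actually used, which is why the tier suits workloads with predictable, near-continuous utilization.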


Pricing by product

Serverless inference (per-token, chat models)

| Model | Input ($/1M) | Output ($/1M) |
|---|---|---|
| Llama 3.3 70B | $0.88 | $0.88 |
| DeepSeek V4 Pro | $2.10 | $4.40 |
| Qwen3.5 9B | $0.10 | $0.15 |
| GLM-5.1 | $1.40 | $4.40 |
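
To compare these models on a concrete workload, the per-token rates can be turned into a monthly estimate. This is a minimal sketch, not Together's billing logic: the rates are copied from the table above, the helper name `serverless_cost` is hypothetical, and the `batch` flag assumes the 50% Batch API discount applies to the chosen model (the article says it covers most models, not all).

```python
# (input $/1M tokens, output $/1M tokens), copied from the table above.
SERVERLESS_RATES = {
    "Llama 3.3 70B": (0.88, 0.88),
    "DeepSeek V4 Pro": (2.10, 4.40),
    "Qwen3.5 9B": (0.10, 0.15),
    "GLM-5.1": (1.40, 4.40),
}


def serverless_cost(model: str, input_tokens: int, output_tokens: int,
                    batch: bool = False) -> float:
    """Estimated dollars for a token volume at the listed per-model rates."""
    in_rate, out_rate = SERVERLESS_RATES[model]
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    # Assumption: the Batch API discount is a flat 50% for this model.
    return cost * 0.5 if batch else cost


# Same workload (30M input + 10M output tokens per month) across models.
for name in SERVERLESS_RATES:
    print(name, round(serverless_cost(name, 30_000_000, 10_000_000), 2))
```

For this workload the estimate ranges from about $4.50 on Qwen3.5 9B to about $107 on DeepSeek V4 Pro, with Llama 3.3 70B at about $35.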

Image generation (per image)

| Model | Rate |
|---|---|
| FLUX.2 [dev] | $0.0154 |
| FLUX.1 [schnell] | $0.0027 |
| Stable Diffusion 3 | $0.0019 |

Dedicated endpoints (single-tenant per-hour)

| Instance | Per-hour rate |
|---|---|
| 1× H100 80GB | $6.49 |
| 1× HGX B200 180GB | $11.95 |

GPU clusters (multi-node)

| Mode | H100 ($/hr) | B200 ($/hr) | Commitment |
|---|---|---|---|
| On-demand | $5.49 | $9.95 | None |
| Reserved (7–30 day) | $4.99 | $9.65 | 7–30 days |

Fine-tuning — standard tier (per 1M training tokens)

| Base model size | SFT LoRA | SFT full | DPO LoRA | DPO full |
|---|---|---|---|---|
| Up to 16B | $0.48 | $1.20 | $0.54 | $1.35 |
| 17B – 69B | $1.50 | $3.75 | $1.65 | $4.12 |
| 70B – 100B | $2.90 | $7.25 | $3.20 | $8.00 |

Fine-tuning — specialized per-model tier (per 1M training tokens)

Frontier architectures are priced per model on a separate “Specialized” tab (SFT LoRA / DPO LoRA shown):

| Model | SFT LoRA | DPO LoRA |
|---|---|---|
| Llama 4 Scout | $3.00 | $7.50 |
| gpt-oss-120B | $5.00 | $12.50 |
| Qwen3.5-397B-A17B | $8.00 | $20.00 |
| DeepSeek-R1 | $10.00 | $25.00 |
| GLM-5 / GLM-5.1 | $40.00 | $100.00 |

Specialized rates span roughly $3–$40 SFT LoRA and $7.50–$100 DPO LoRA per 1M tokens. Together’s docs state there is no per-job minimum — you pay only for tokens processed.
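
Because both tiers bill per 1M training tokens with no minimum, a fine-tuning budget reduces to a rate lookup and one multiplication. The following is a rough sketch under stated assumptions: the rates are copied from the two tables above, the helper names are hypothetical, tier boundaries are treated as inclusive upper bounds (the tables do not say how a size between 16B and 17B is classified), and `training_tokens` is taken to mean total tokens processed.

```python
# Standard tier: (max base-model size in billions of parameters, rates per 1M tokens).
STANDARD_TIER = [
    (16, {"sft_lora": 0.48, "sft_full": 1.20, "dpo_lora": 0.54, "dpo_full": 1.35}),
    (69, {"sft_lora": 1.50, "sft_full": 3.75, "dpo_lora": 1.65, "dpo_full": 4.12}),
    (100, {"sft_lora": 2.90, "sft_full": 7.25, "dpo_lora": 3.20, "dpo_full": 8.00}),
]

# Specialized tier: per-model LoRA rates per 1M tokens.
SPECIALIZED_TIER = {
    "Llama 4 Scout": {"sft_lora": 3.00, "dpo_lora": 7.50},
    "gpt-oss-120B": {"sft_lora": 5.00, "dpo_lora": 12.50},
    "Qwen3.5-397B-A17B": {"sft_lora": 8.00, "dpo_lora": 20.00},
    "DeepSeek-R1": {"sft_lora": 10.00, "dpo_lora": 25.00},
    "GLM-5 / GLM-5.1": {"sft_lora": 40.00, "dpo_lora": 100.00},
}


def standard_rate(params_billion: float, method: str) -> float:
    """Look up the standard-tier rate for a base-model size and training method."""
    for max_size, rates in STANDARD_TIER:
        if params_billion <= max_size:  # assumption: inclusive upper bounds
            return rates[method]
    raise ValueError("model is larger than the standard tier covers")


def finetune_cost(training_tokens: int, rate_per_million: float) -> float:
    """Dollars for a job; no per-job minimum is applied."""
    return training_tokens / 1_000_000 * rate_per_million


# 25M training tokens on a 70B model (standard tier) vs DeepSeek-R1 (specialized).
print(finetune_cost(25_000_000, standard_rate(70, "sft_full")))                  # 181.25
print(finetune_cost(25_000_000, SPECIALIZED_TIER["DeepSeek-R1"]["sft_lora"]))   # 250.0
```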

Code Sandbox + Code Interpreter

| Resource | Rate |
|---|---|
| Code Sandbox (vCPU) | $0.0446/vCPU-hour |
| Code Sandbox (memory) | $0.0149/GiB-hour |
| Code Interpreter (session) | $0.03/session |
| Storage (sandbox or model) | $0.16/GiB-month |
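
Since Code Sandbox meters vCPU and memory separately, the cost of a sandbox depends on its shape as well as its runtime. The sketch below shows one way to model that using the rates in the table above; the function name and the example sandbox size (2 vCPU, 4 GiB) are illustrative assumptions, not published configurations.

```python
# Rates from the table above.
VCPU_HOUR = 0.0446          # $ per vCPU-hour
GIB_HOUR = 0.0149           # $ per GiB-hour of memory
SESSION = 0.03              # $ per Code Interpreter session
STORAGE_GIB_MONTH = 0.16    # $ per GiB-month


def sandbox_cost(vcpus: int, memory_gib: float, hours: float) -> float:
    """Runtime cost of one sandbox: vCPU and memory are billed separately."""
    return hours * (vcpus * VCPU_HOUR + memory_gib * GIB_HOUR)


# Hypothetical month: a 2 vCPU / 4 GiB sandbox running 100 hours,
# 300 Code Interpreter sessions, and 50 GiB of storage.
runtime = sandbox_cost(vcpus=2, memory_gib=4, hours=100)
sessions = 300 * SESSION
storage = 50 * STORAGE_GIB_MONTH
print(round(runtime, 2), round(sessions, 2), round(storage, 2))  # 14.88 9.0 8.0
```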

Sales motions across products: PLG / self-serve for serverless, on-demand clusters, and Code Sandbox; sales-led for Reserved cluster commits, Enterprise annual contracts, and VPC deployments. All prices accessed 2026-05-30 from together.ai/pricing.


Hidden costs : What Together AI customers actually pay beyond the rate card

Archetype A: AI-native startup running Llama 3.3 70B serverless with bursty traffic

A growth-stage AI assistant startup serving ~2,250 requests/day (about 68K/month), average 1.5K input + 400 output tokens, with traffic concentrated in business hours:

| Line item | Monthly cost |
|---|---|
| Input tokens (3.4M/day × 30 = 101M, Llama 70B at $0.88/1M) | $89 |
| Output tokens (900K/day × 30 = 27M, Llama 70B at $0.88/1M) | $24 |
| Batch API for nightly summarization workflows (10M tokens, -50%) | $4 |
| Code Interpreter for occasional agent execution (300 sessions × $0.03) | $9 |
| Estimated total | ~$126/month |

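For bursty traffic without sustained QPS, serverless wins and the bill is dominated by per-token cost. Moving to a dedicated H100 endpoint ($6.49/hr) would cost ~$4,700/month, which on list prices only pays off above roughly 2.8M requests/month at this request size: about 1 req/sec around the clock, or ~4 req/sec if traffic is confined to weekday business hours.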
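
The Archetype A arithmetic and the serverless-versus-dedicated comparison can be reproduced in a few lines. This is a back-of-the-envelope sketch: it uses the monthly token volumes and rates from the table above, assumes 1,900 tokens per request (1.5K input + 400 output), models business hours as 22 weekdays of 8 hours (an assumption), and ignores how many requests a single H100 can actually serve.

```python
LLAMA_70B_RATE = 0.88   # $/1M tokens, same for input and output
DEDICATED_H100 = 6.49   # $/hour

# Monthly line items from the Archetype A table (token volumes in millions).
line_items = {
    "input_tokens": 101 * LLAMA_70B_RATE,
    "output_tokens": 27 * LLAMA_70B_RATE,
    "batch_api": 10 * LLAMA_70B_RATE * 0.5,   # 50% Batch discount
    "code_interpreter": 300 * 0.03,
}
serverless_total = sum(line_items.values())
dedicated_monthly = DEDICATED_H100 * 24 * 30

# Break-even: requests per month at which serverless spend equals one dedicated H100.
cost_per_request = 1_900 * LLAMA_70B_RATE / 1_000_000
breakeven_requests = dedicated_monthly / cost_per_request

around_the_clock_qps = breakeven_requests / (30 * 24 * 3600)
business_hours_qps = breakeven_requests / (22 * 8 * 3600)  # assumed 22 weekdays x 8 h

print(round(serverless_total, 2))       # 126.04
print(round(dedicated_monthly, 2))      # 4672.8
print(round(around_the_clock_qps, 1))   # 1.1
print(round(business_hours_qps, 1))     # 4.4
```

On these assumptions the dedicated endpoint costs about 37 times the serverless bill for this workload, and the break-even rate depends heavily on whether traffic is spread across the whole day or compressed into business hours.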

Archetype B: Mid-market team running a Llama 70B fine-tune on reserved H100 cluster

A team that fine-tuned Llama 3.3 70B (full-parameter SFT, 25M training tokens) and runs sustained inference on a reserved H100 cluster:

| Line item | Monthly cost |
|---|---|
| Initial fine-tuning (one-time, 25M tokens × $7.25, the 70B–100B full-parameter SFT rate) | $181 |
| Reserved H100 (24h × 30 × $4.99/hr) | $3,593 |
| Storage for model artifacts + sandbox (50GiB × $0.16) | $8 |
| Code Sandbox for agent execution (200 vCPU-hours × $0.0446) | $9 |
| Estimated total | ~$3,610/month recurring, plus the one-time $181 fine-tune |

Reserved cluster pricing dominates the bill at sustained utilization — and the $4.99/hr H100 reserved rate makes Together one of the cheapest published managed-inference platforms for training and large-batch workloads. The trade-off is the 7–30 day commit: even idle hours cost the customer.
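
To see how completely the reserved cluster dominates this bill, the sketch below separates recurring charges from the one-time fine-tune. It is illustrative only: Llama 3.3 70B falls in the 70B–100B row of the standard fine-tuning table, so the sketch uses that row's $7.25 full-parameter SFT rate, and the function and variable names are hypothetical.

```python
HOURS_PER_MONTH = 24 * 30
RESERVED_H100 = 4.99         # $/hour, 7-30 day reserved rate
SFT_FULL_70B = 7.25          # $/1M training tokens, 70B-100B standard tier
STORAGE_GIB_MONTH = 0.16     # $/GiB-month
VCPU_HOUR = 0.0446           # $/vCPU-hour


def archetype_b(training_tokens_m: float = 25, storage_gib: float = 50,
                vcpu_hours: float = 200):
    """Return (recurring monthly line items, one-time fine-tuning cost)."""
    recurring = {
        "reserved_h100": HOURS_PER_MONTH * RESERVED_H100,
        "storage": storage_gib * STORAGE_GIB_MONTH,
        "code_sandbox": vcpu_hours * VCPU_HOUR,
    }
    one_time = training_tokens_m * SFT_FULL_70B
    return recurring, one_time


recurring, one_time = archetype_b()
monthly = sum(recurring.values())
gpu_share = recurring["reserved_h100"] / monthly

print(round(monthly, 2))     # 3609.72
print(round(one_time, 2))    # 181.25
print(round(gpu_share, 3))   # 0.995
```

The reserved GPU accounts for more than 99% of recurring spend here, so the fine-tuning, storage, and sandbox rates barely move the total compared with how fully the committed hours are used.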



Pricing evolution : Together’s pricing history from decentralized GPU pooling to AI Acceleration Cloud

Cadence

| Quarter | Price changes | Product / SKU additions | Notes |
|---|---|---|---|
| 2022 Q2 | 0 | 1 | Together founded; decentralized GPU pooling product |
| 2023 Q4 | 0 | 1 | Inference Cloud GA + Series A ($102.5M) |
| 2024 Q1 | 0 | 1 | Dedicated endpoints + fine-tuning launched |
| 2024 Q3 | 1 | 1 | GPU Clusters launched at $5.49/hr on-demand H100 |
| 2025 Q1 | 0 | 0 | Series B ($305M) at $3.3B valuation |
| 2025 Q2 | 0 | 1 | Batch API + Code Sandbox + Code Interpreter |
| 2025 Q4 | 0 | 1 | FLUX.2 + FLUX-schnell + Stable Diffusion 3 image SKUs |
| 2026 Q1 | 1 | 0 | Specialized fine-tuning tier (DeepSeek-R1, GLM-5) at $10–$100+ |

Tracked range: 2022 Q2–2026 Q1. Quarters not listed above were verified stable (0 price changes, 0 SKU additions).

Notable changes

  • 2023-11-29 — Inference Cloud GA with per-token serverless API; established Together as a Cloudflare-for-LLM-inference contender.
  • 2024-03-12 — Dedicated endpoints + fine-tuning launched; expanded from single-SKU per-token to multi-SKU platform.
  • 2024-09-20 — GPU Clusters launched at $5.49/hr on-demand H100 and $4.99/hr reserved (7–30 day commit); some of the lowest published H100 rates in managed inference.
  • 2025-06-18 — Batch API at 50% discount launched; Code Sandbox + Code Interpreter added non-token SKUs to the rate card.
  • 2025-10-08 — FLUX.2 + FLUX-schnell + Stable Diffusion 3 image generation SKUs launched at per-image rates.
  • 2026-01-15 — Specialized fine-tuning tier launched for DeepSeek-R1, GLM-5, and other large-context frontier architectures; reflected higher infrastructure cost of training on newer architectures.

What’s unique : Together AI’s distinctive pricing mechanics

1. Per-model serverless rates published inline on the pricing page. Most inference middleware (Fireworks, Baseten) lists discount mechanics on the pricing page but routes to docs for per-model rates. Together’s inline display lets self-serve buyers compare model economics side-by-side without context switching — a pricing transparency UX advantage that materially reduces evaluation friction.

2. Reserved cluster pricing ($4.99/hr H100) for 7–30 day commits. Most platforms offer either on-demand (high price, no commit) or annual commits (lowest price, year-long lock-in). Together’s 7–30 day reserved tier creates a middle path: customers commit a week to a month, capture a meaningful discount over on-demand, and avoid annual lock-in. This commitment-flexibility innovation captures large-batch training customers who would balk at annual commits.

3. Code Sandbox as a non-token agentic SKU. Code Sandbox bills per vCPU-hour and GiB-hour — a fundamentally different metric than tokens — and Code Interpreter bills per session. Adding non-token SKUs to a token-dominated rate card lets Together capture agentic code-execution workloads without forcing customers onto third-party sandbox providers (E2B, Modal). The unified billing reduces vendor count for AI-native teams building autonomous agents.

4. Academic-lab founder credibility (Stanford CRFM + Stanford ML). Together’s co-founders include Stanford CRFM director Percy Liang and Stanford ML researcher Chris Re, which makes the platform’s optimization claims and model curation credible in a way that pure-engineering teams cannot easily replicate. The CRFM HELM leaderboard, the academic stewardship of open-source models, and the Together rate card all draw on the same research expertise.

5. Multi-mode cluster pricing (on-demand + reserved) at different commit windows. Most clusters force a single mode choice (on-demand-only or reserved-only); Together’s three-tier structure (on-demand, 7–30 day reserved, Enterprise annual commit) lets customers match commit duration to workload predictability. This granular commitment design accommodates training cycles (weeks) and steady production (months) without forcing one model to fit both.


Strengths & weaknesses

| Strengths | Weaknesses |
|---|---|
| Per-model serverless rates published inline (best transparency in the category) | Per-model rates require reading a long inline table; no comparison filter |
| Reserved cluster H100 at $4.99/hr is among the lowest published in managed inference | Reserved commits require 7–30 day duration; idle hours still billed |
| Academic-lab founder credibility (CRFM + Stanford ML) | Code Sandbox vCPU-hour rates need separate cost-modeling alongside token spend |
| Code Sandbox + Code Interpreter capture agentic code-execution without third-party tools | No published cached input discount on serverless inference |
| FLUX.2 + FLUX-schnell + SD3 image SKUs unified in rate card | A100 not prominently listed; A100 capacity available but not on the headline rate card |
| Multi-mode cluster pricing (on-demand, 7–30 day reserved, annual) accommodates many workload types | Specialized fine-tuning tier (up to $40 SFT LoRA / $100 DPO LoRA per 1M) prices frontier-model tuning well above the standard tier |

Billing UX : Together AI’s account controls and payment experience

  • Self-serve signup — Sign up at api.together.ai with email; trial credits applied automatically. Credit card required for production usage.
  • Unified credits balance — Serverless tokens, dedicated GPU hours, GPU cluster hours, fine-tuning training tokens, and Code Sandbox usage all bill against the same workspace credits balance.
  • Per-request usage metadata — API responses include input tokens, output tokens, and per-request cost so client applications can compute and surface real-time cost.
  • Per-model rate visibility — Pricing page displays per-model rates inline; dashboard shows live consumption per model and per SKU.
  • Spend alerts — Configurable email and webhook alerts that fire when spend crosses a user-defined dollar threshold per period.
  • Payment methods — Credit card and ACH on self-serve; wire transfer, invoice billing, and AWS/GCP Marketplace on Enterprise.
  • Cluster reservation booking — 7–30 day GPU cluster reservations bookable directly via the dashboard with confirmed start dates; cancellation policies vary by SKU.
  • Audit logging + RBAC — Workspace-level RBAC on Pro+; SOC 2 audit-log exports on Enterprise via S3 or webhook.
  • Multi-region availability — US and EU regions standard for serverless; reserved clusters available in additional regions on Enterprise commitments.
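
The per-request usage metadata and spend alerts described above can be combined on the client side. The following is a minimal sketch, not Together's SDK: it assumes each response carries an OpenAI-style `usage` object with `prompt_tokens` and `completion_tokens` fields (an assumption about the field names), and the `SpendTracker` class and its threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpendTracker:
    """Accumulates estimated spend from per-request token counts."""
    input_rate: float        # $/1M input tokens
    output_rate: float       # $/1M output tokens
    alert_threshold: float   # dollars; hypothetical client-side limit
    total: float = 0.0

    def record(self, usage: dict) -> float:
        """Add one request's estimated cost and return it."""
        cost = (usage["prompt_tokens"] * self.input_rate
                + usage["completion_tokens"] * self.output_rate) / 1_000_000
        self.total += cost
        return cost

    def over_threshold(self) -> bool:
        return self.total >= self.alert_threshold


# Three Llama 3.3 70B requests of 1.5K input + 400 output tokens each.
tracker = SpendTracker(input_rate=0.88, output_rate=0.88, alert_threshold=0.005)
for _ in range(3):
    cost = tracker.record({"prompt_tokens": 1500, "completion_tokens": 400})

print(round(cost, 6))            # 0.001672
print(round(tracker.total, 6))   # 0.005016
print(tracker.over_threshold())  # True
```

A client-side estimate like this is only as accurate as the rates fed into it, so it complements rather than replaces the dashboard and the configured spend alerts.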

Strategic wins : Why Together AI’s pricing decisions worked

1. Inline per-model rate publication removed evaluation friction

By publishing per-model rates on the pricing page rather than routing to docs, Together let self-serve buyers compare model economics in a single context. This transparency converts more self-serve customers and reduces sales-led overhead for low-value-deal segments. Most competitors’ docs-routing UX loses cost-sensitive evaluators who never get to the rate card before churning.

2. 7–30 day reserved cluster tier captured the middle of the commit-duration spectrum

Annual commits lock too much for many training and large-batch customers; on-demand is too expensive for sustained workloads. Together’s 7–30 day reserved tier created a middle option that converts customers who would otherwise self-build on raw cloud. The $4.99/hr H100 reserved rate is low enough to compete with hyperscaler EDP-discounted rates without forcing year-long commitments.

3. Code Sandbox as a non-token SKU expanded TAM beyond token-only buyers

Adding Code Sandbox ($0.0446/vCPU-hour) and Code Interpreter ($0.03/session) gave Together a SKU that captures agentic code-execution workloads — workloads that would otherwise go to E2B, Modal, or Pyodide. The unified billing balance reduces vendor count for AI-native teams building autonomous agents, locking in wallet share.

4. Academic-lab founder credibility as the platform-runtime trust anchor

Stanford CRFM director Percy Liang and Stanford ML researcher Chris Re as co-founders give Together unusual academic-lab credibility that customers extend to model curation, optimization claims, and platform design. For enterprise procurement leaders evaluating inference middleware, this founder profile distinguishes Together from pure-engineering teams in a way that is hard to replicate.


Areas to improve : Gaps in Together’s pricing approach

1. No published cached input discount on serverless

Fireworks, OpenAI, Anthropic, and Baseten all ship 50–80% cached input discounts on serverless inference. Together does not — meaning RAG and agent-loop workloads with high prefix re-use cost more on Together than on competitors despite Together’s lower base rates. Adding cached input as a serverless discount mechanic would eliminate a meaningful competitive gap.

2. Per-model rate comparison needs better filtering

The inline per-model rate table is comprehensive but long. Customers comparing 5–10 models must scroll and scan rather than filter. Adding a per-model filter / sort / search UI on the pricing page would convert more evaluation traffic into pilots without requiring API exploration.

3. Specialized fine-tuning tier (up to $40 SFT LoRA / $100 DPO LoRA per 1M) creates budget uncertainty

The wide per-model spread — from $3 SFT LoRA on Llama 4 Scout to $40 on GLM-5, and up to $100 DPO LoRA — makes it hard for finance teams to forecast fine-tuning budgets on DeepSeek-R1 or GLM-5 without reading the per-model table. The rates are published per model on a separate “Specialized” tab, so a per-model price calculator would further reduce friction for frontier-fine-tuning workloads that currently go to first-party providers.

4. A100 not on the headline rate card

The on-demand rate card lists H100 and B200 prominently but not A100. For non-frontier workloads that fit comfortably on A100, customers may compare Together’s headline H100 rate to a competitor’s published A100 rate and conclude Together is more expensive. Publishing an A100 rate (even with a “limited availability” disclaimer) would prevent unfavorable comparison.


Key takeaways

  1. Inline per-model rate publication beats docs-routing for self-serve conversion. Together’s pricing page transparency converts evaluators that competitors lose to context switching. Self-serve usage-based platforms should display per-SKU per-model rates directly on the pricing page rather than route to documentation.

  2. Multi-window reserved pricing (7–30 day, annual) captures more buyers than on-demand-or-annual binaries. The middle commit-duration tier converts customers with training cycles that don’t fit either extreme — and Together’s $4.99/hr reserved H100 rate at this tier is among the lowest published in managed inference.

  3. Non-token SKUs (Code Sandbox, Code Interpreter) expand TAM beyond token-only inference buyers. As agentic workflows scale, code-execution sandbox SKUs are becoming table stakes for inference platforms targeting AI-native teams.

  4. Academic-lab founder credibility is a defensible trust anchor. Stanford CRFM and ML lab credentials extend customer trust from founder vision to model curation, optimization claims, and platform design — a trust multiplier competitors cannot replicate without acquiring similar talent.

  5. Cached input discount is becoming table stakes for serverless inference. Together’s lack of a published cached input mechanic is the most meaningful competitive gap versus Fireworks, OpenAI, and Anthropic — and likely costs Together RAG and agent-loop workloads despite lower base rates.


UBP implications

  1. Pricing page transparency converts more self-serve revenue than docs-routing. Usage-based platforms should default to inline per-SKU per-model rate display. Docs-routing is acceptable for advanced SKUs (specialized fine-tuning, custom enterprise terms) but should not be the default for the top-traffic surfaces.

  2. Multi-window commitment pricing captures buyer segments that on-demand-or-annual binaries miss. The 7–30 day reserved tier is the canonical structure for training and large-batch workloads where annual commits over-allocate and on-demand under-allocates.

  3. Non-token usage SKUs (vCPU-hour, GiB-hour, per-session) are becoming necessary for inference platforms targeting agentic workflows. Token-only rate cards leave code-execution and sandbox workloads on the table for third-party vendors that customers prefer to consolidate.



Bottom line

Together AI priced its AI Acceleration Cloud around four structural ideas: inline per-model rate publication on the pricing page (best transparency in the category), aggressive reserved cluster pricing at $4.99/hr H100 with a 7–30 day commit window (among the lowest published in managed inference), non-token SKUs (Code Sandbox, Code Interpreter) that capture agentic workloads without third-party vendors, and Stanford CRFM + Stanford ML founder credibility that distinguishes Together from pure-engineering platforms. The multi-mode cluster pricing (on-demand, 7–30 day reserved, annual commit) and four-SKU rate card (token / image / hour / vCPU-hour) make Together one of the most expansive usage-based platforms in AI infrastructure.

For AI engineering teams running training cycles, large-batch inference, and agentic code execution at scale, Together offers one of the most legible rate cards among commercial platforms, and the $4.99/hr reserved H100 rate is itself a structural cost advantage. The remaining gaps (no cached input discount on serverless, no A100 on the headline rate card, specialized fine-tuning tier budget uncertainty) are competitive parity issues rather than structural pricing flaws.


Pricing timeline : Major events, newest first

Each milestone below corresponds to a public pricing change, product launch, or material adjustment.

Specialized Model Fine-Tuning Tier

Together added a specialized model fine-tuning tier for DeepSeek-R1, GLM-5, and other large-context models at $10–$100 per 1M training tokens (SFT LoRA through DPO LoRA); Together’s docs state there is no per-job minimum. The tier reflected the higher infrastructure cost of training on the latest frontier architectures.


FLUX.2 Image Generation at $0.0154/image

Together added FLUX.2 [dev] image generation at $0.0154/image, FLUX.1 [schnell] at $0.0027/image, and Stable Diffusion 3 at $0.0019/image. Per-image pricing positioned alongside per-token text inference as a unified rate card.

Batch API + Code Sandbox Launched

Together launched Batch API (50% discount on most models) for asynchronous inference workloads, and Code Sandbox ($0.0446/vCPU-hour, $0.0149/GiB-hour) for agentic code execution. Code Interpreter at $0.03/session added a session-billed SKU to the rate card.

Series B ($305M) at $3.3B Valuation

Together raised a $305M Series B led by General Catalyst at a $3.3B post-money valuation. NVIDIA, Salesforce Ventures, Coatue, and others participated. The round funded Code Sandbox, FLUX image generation, and the Together AI Acceleration Cloud rebrand.

GPU Clusters at $5.49/hr H100 On-Demand

Together launched GPU Clusters — on-demand multi-GPU rentals for training and large-batch inference. Pricing at $5.49/hr H100 on-demand and $9.95/hr B200 undercut Fireworks' and Baseten's dedicated rates substantially. Reserved 7–30 day commitments dropped the H100 rate to $4.99/hr.

Dedicated Endpoints + Fine-Tuning Launched

Together added dedicated single-tenant endpoints (per-hour H100, A100) and a fine-tuning service. Established the multi-SKU architecture: serverless per-token + dedicated per-hour + fine-tuning per training-token that remains the canonical structure today.

Series A ($102.5M) + Inference Cloud GA

Together raised a $102.5M Series A led by Kleiner Perkins with NEA, Lux, and others. Inference Cloud went GA with per-token serverless API for Llama 2, Falcon, Code Llama, and Stable Diffusion. Initial pricing was a flat per-million-token rate by model class.

Together Founded

Vipul Ved Prakash (ex-Topsy, Cloudmark) co-founded Together with Ce Zhang (ETH Zurich), Chris Re (Stanford), and Percy Liang (Stanford CRFM director). Initial product was decentralized GPU pooling for open-source model training, evolving rapidly into a managed inference cloud through 2023.

Trivia
  • Together AI's $4.99/hr H100 reserved rate (7–30 day reservation) is one of the lowest published rates for any managed Hopper-class GPU, and the $9.65/hr reserved B200 sets a similar floor for Blackwell, both undercutting Fireworks' on-demand rates.
  • Together was co-founded by Vipul Ved Prakash (ex-Cloudmark, Topsy founder), Ce Zhang (ETH Zurich systems professor), Chris Re (Stanford ML/Snorkel), and Percy Liang (Stanford CRFM director), making it the rare commercial product where two top academic ML labs are co-architects of the platform.
  • Together's serverless rate card publishes per-model pricing inline on the pricing page (rare among competitors like Fireworks which route to docs), making per-model side-by-side comparison friction-free.

Questions & answers

How much does Together AI cost per month?
Together has no monthly subscription fee — you pay only for the serverless tokens, dedicated GPU hours, fine-tuning training tokens, and Code Sandbox usage you consume. A small RAG application using Llama 3.3 70B at 30M input + 10M output tokens would cost ~$35/month on serverless; the same workload on a dedicated H100 ($6.49/hr) running 4h/day would cost ~$780/month.
What are Together's serverless per-token rates?
Together publishes per-model rates inline on the pricing page. Sample rates per 1M tokens: Llama 3.3 70B at $0.88 input / $0.88 output; DeepSeek V4 Pro at $2.10 input / $4.40 output; Qwen3.5 9B at $0.10 input / $0.15 output; GLM-5.1 at $1.40 input / $4.40 output. Image generation: FLUX.2 [dev] at $0.0154/image, FLUX.1 [schnell] at $0.0027/image, Stable Diffusion 3 at $0.0019/image.
What are Together's GPU rates for dedicated endpoints and clusters?
Dedicated single-endpoint: H100 80GB at $6.49/hr, HGX B200 180GB at $11.95/hr. On-demand GPU clusters: HGX H100 at $5.49/hr, HGX B200 at $9.95/hr. Reserved clusters (7–30 day commits): H100 at $4.99/hr, B200 at $9.65/hr — the reserved rates are among the lowest published in managed inference.
Does Together AI have a free tier?
New accounts can start without an upfront commitment, but Together does not publish a specific signup-credit dollar amount on its pricing page or quickstart docs. Calling paid serverless and image models requires a positive credit balance, and production usage requires a payment method on file. There is no permanent free tier.
How does Together's fine-tuning pricing work?
Fine-tuning is priced per 1M training tokens (LoRA vs full-parameter). Standard tier: up to 16B at $0.48 SFT LoRA / $1.20 full; 17–69B at $1.50 / $3.75; 70–100B at $2.90 / $7.25. A specialized per-model tier covers frontier architectures (DeepSeek-R1 $10 SFT LoRA, GLM-5 $40, Qwen3.5-397B $8, Llama 4 Scout $3), ranging roughly $3–$40 SFT LoRA and up to $100 DPO LoRA per 1M tokens. Together's docs state there is no per-job minimum — you pay only for tokens processed.
What is Together's Code Sandbox and how is it priced?
Code Sandbox is a managed code-execution environment for agentic workflows. Billed at $0.0446 per vCPU-hour and $0.0149 per GiB-hour for the sandbox runtime. Code Interpreter (a higher-level managed session API) bills at $0.03/session. Storage attached to sandbox sessions is $0.16/GiB-month.