AI Summary
About
Cerebras Systems is a Santa Clara-based AI hardware and cloud inference company founded in 2016 by Andrew Feldman (CEO) and Gary Lauterbach (CTO). The company’s central innovation is the Wafer Scale Engine (WSE): a single silicon die occupying an entire semiconductor wafer, containing up to 4 trillion transistors and 900,000 AI-optimized cores. By eliminating the inter-chip communication bottleneck that limits GPU clusters, the WSE achieves dramatically higher throughput for large model inference at lower latency.
Cerebras operates two distinct product lines. The first is Cerebras Inference: a public cloud API that delivers hosted LLM inference on open-source models (Llama, Qwen, GPT-OSS) at speeds 10–20× faster than GPU-cloud alternatives, billed per million tokens with a free developer tier. The second is the CS-3 compute system: a rack-scale appliance housing the WSE-3 chip, sold under enterprise and government contracts for on-premises or managed deployment by research institutions, national laboratories, healthcare systems, and large enterprises.
The company raised approximately $750M in total venture funding prior to its 2024 IPO attempt. Cerebras filed its S-1 in August 2024 targeting an ~$8 billion valuation, but the IPO was withdrawn in November 2024 after CFIUS opened a national-security review related to G42, the UAE-based AI conglomerate that had been Cerebras’s largest customer and which had prior ties to Huawei. Cerebras subsequently raised a private funding round at a comparable valuation to continue operations while the regulatory situation resolved. By 2026 the company reports annual recurring revenue in the $100M–$500M range, driven primarily by enterprise hardware contracts and growing inference API revenue.
Pricing summary : Tiered API access, per-token rates, and fixed coding plans
Cerebras Inference is sold through three access tiers plus a separate coding subscription. The free tier gives rate-limited access to every public model with Discord support. The self-serve Developer tier starts at just $10 and unlocks 10x higher rate limits and higher-priority processing. The Enterprise tier (contact sales) adds custom model weights, dedicated queue priority, and guaranteed uptime. Underlying consumption is billed per million input and output tokens against the published rate card.
The public per-token rate card lists two models: GPT-OSS-120B at $0.35 input/$0.75 output per million tokens (a production model running at ~3,000 tokens/second) and ZAI-GLM-4.7 at $2.25/$2.75 per million tokens — explicitly a Preview model intended for evaluation only, not production use. A much wider catalog (Llama 3.3 70B, Llama 4, Qwen3, Mistral, DeepSeek, Kimi K2.x) is available through Dedicated Endpoints on reserved-capacity custom pricing rather than the public rate card.
Separately, Cerebras Code is a fixed-price coding subscription: Pro at $50/month (up to 24M tokens/day) and Max at $200/month (up to 120M tokens/day). Both tiers were sold out at the time of writing. The CS-3 hardware product remains an entirely separate commercial motion: enterprise contracts negotiated by a direct sales team, with per-unit pricing not publicly disclosed.
What makes this different: Cerebras charges for speed — but doesn’t charge a speed premium. Every token processed through Cerebras Inference arrives 10–20× faster than GPU-cloud equivalents at pricing that matches or undercuts those slower alternatives. This inversion of the traditional cost/performance tradeoff is the core commercial proposition. See how AI inference providers are restructuring their pricing models for context on why this matters.
Pricing by product
Cerebras Inference — access tiers
| Tier | Price | What you get | Sales motion |
|---|---|---|---|
| Free | Free | All public models, rate-limited, Discord support | Self-serve |
| Developer | From $10 (self-serve) | 10x higher rate limits, higher-priority processing | Self-serve / PLG |
| Enterprise | Contact sales | Custom weights, guaranteed uptime, dedicated queue priority, fine-tuning | Sales-led |
Cerebras Inference — public per-token rate card
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Speed (est.) | Status |
|---|---|---|---|---|
| GPT-OSS-120B | $0.35 | $0.75 | ~3,000 tok/sec | Production |
| ZAI-GLM-4.7 | $2.25 | $2.75 | ~1,000 tok/sec | Preview (evaluation only) |
Additional model families — Llama 3.3 70B, Llama 4 Maverick/Scout, Qwen3-32B and Qwen3-235B, Qwen3-Coder, Mistral, DeepSeek, and Kimi K2.x — are available through Dedicated Endpoints on reserved-capacity custom pricing (contact sales), not on the public rate card.
Cerebras Code — coding subscriptions
| Plan | Price | Daily token allowance | Target | Availability |
|---|---|---|---|---|
| Pro | $50/month | Up to 24M tokens/day | Indie devs, simple agentic workflows | Sold out |
| Max | $200/month | Up to 120M tokens/day | Full-time dev, IDE integrations, multi-agent | Sold out |
Cerebras CS-3 Compute Systems (enterprise hardware)
| Product | Use case | Pricing model | Notes |
|---|---|---|---|
| CS-3 on-premises | In-house training + inference | Enterprise contract | Multi-year; pricing not public |
| CS-3 managed cloud | Cloud-connected dedicated cluster | Enterprise contract | Includes managed ops |
| Cerebras Model Studio | Hosted enterprise inference on private models | Negotiated usage | Custom deployment on CS-3 |
Sales motions across products: PLG / self-serve for the Inference API (free → $10 Developer tier) and Cerebras Code subscriptions; sales-led for Dedicated Endpoints, CS-3 hardware systems, and enterprise managed deployments. Per-token rates and tier details sourced from cerebras.ai/pricing and inference-docs.cerebras.ai, accessed 2026-05-30.
Hidden costs : What developers actually pay beyond the base rate
Archetype A: Startup building a customer-facing chatbot on GPT-OSS-120B
A startup routing 10 million input tokens and 3 million output tokens per day through Cerebras for a customer-facing support chatbot, on the public per-token rate card (GPT-OSS-120B, $0.35 in / $0.75 out):
| Line item | Monthly cost |
|---|---|
| Input tokens: 300M × $0.35/1M | approximately $105.00 |
| Output tokens: 90M × $0.75/1M | approximately $67.50 |
| Rate-limit overages / retry overhead (~5%) | approximately $9.00 |
| Estimated total | approximately $181/month |
A team writing or iterating on code rather than serving chat traffic might instead reach for a Cerebras Code Pro subscription at $50/month (up to 24M tokens/day), which converts heavy daily coding usage into a fixed, predictable bill rather than metered per-token spend — though as of writing both Cerebras Code tiers were sold out, pushing those users back to the metered Developer tier.
Archetype B: Research team running long-context batch processing on GPT-OSS-120B
A research team running nightly batch jobs: 1 billion input tokens, 200 million output tokens per month, using GPT-OSS-120B for structured reasoning tasks:
| Line item | Monthly cost |
|---|---|
| Input tokens: 1B × $0.35/1M | $350.00 |
| Output tokens: 200M × $0.75/1M | $150.00 |
| Context overhead (system prompts per call, ~10%) | ~$50.00 |
| Estimated total | ~$550/month |
Note: GPT-OSS-120B has a 32K max output window (versus 128K for Llama models). Teams generating very long outputs will need to chain calls, increasing input token costs on subsequent turns by feeding prior output as context.
Use the Cerebras pricing calculator to model your own monthly cost based on model selection, token volume, and input/output ratio.
Pricing evolution : From wafer-scale hardware vendor to inference API competitor
Cadence
| Quarter | Price changes | Product / SKU additions | Notes |
|---|---|---|---|
| 2024 Q3 | 0 | 2 | Cerebras Inference public beta launched; Llama 3.1 8B and 70B added at initial rate card |
| 2024 Q4 | 0 | 0 | IPO blocked by CFIUS; inference API remained stable; Llama 3.3 70B pricing not yet published |
| 2025 Q1 | 1 | 1 | Llama 3.3 70B added at $0.85/$1.20 (premium over Llama 3.1 70B at $0.60/$0.60); first asymmetric input/output pricing |
| 2025 Q2 | 0 | 2 | GPT-OSS-120B ($0.35/$0.75) and Qwen-3-32B ($0.40/$0.80) added |
| 2025 Q3 | 0 | 2 | ZAI-GLM-4.6 ($2.25/$2.75) and ZAI-GLM-4.7 ($2.25/$2.75) added — first premium-priced models |
| 2026 Q1 | 0 | 0 | ZAI-GLM-4.6 deprecated January 2026; ZAI-GLM-4.7 remains active |
| 2026 Q2 | — | 2 | Cerebras Code Pro ($50/mo) and Max ($200/mo) coding subscriptions launched; public per-token rate card narrowed to GPT-OSS-120B + ZAI-GLM-4.7, with Llama/Qwen3 moved to Dedicated Endpoints |
Tracked range: 2024 Q3–2026 Q2. Per-token rates and tier details sourced from cerebras.ai/pricing and inference-docs.cerebras.ai. Quarters not listed above were verified stable.
Notable changes
- 2024-08-29 — Cerebras Inference launched in public beta with Llama 3.1 8B ($0.10/$0.10) and Llama 3.1 70B ($0.60/$0.60). Both models offered at symmetrical input/output pricing, a simplification common in early-stage inference APIs.
- 2024-11-26 — IPO withdrawal announced following CFIUS review. No pricing changes accompanied the event; the company continued to operate Cerebras Inference unchanged.
- 2025 Q1 — Llama 3.3 70B introduced at $0.85 input / $1.20 output — the first asymmetric price pair in the Cerebras catalog, reflecting the standard industry shift toward differentially priced output tokens as generation is more compute-intensive than prefill.
- 2025-05 — GPT-OSS-120B added at $0.35/$0.75; notably cheaper input pricing than Llama 3.3 70B despite being a larger model, reflecting Cerebras’s strategic interest in establishing itself as the fastest platform for OpenAI’s open-weight model.
- 2025 Q3 — ZAI-GLM-4.6 and 4.7 (Zhipu AI multilingual models) added at $2.25/$2.75 — a 2.6–2.3× premium over Llama 3.3 70B, representing the first specialized-model premium on the platform.
- 2026-01-20 — ZAI-GLM-4.6 deprecated; ZAI-GLM-4.7 continues as the sole ZAI model.
- 2026 Q2 — Cerebras restructured its commercial model: three access tiers (Free, a self-serve Developer tier from $10, and Enterprise), a public per-token rate card narrowed to GPT-OSS-120B ($0.35/$0.75) and the Preview-only ZAI-GLM-4.7 ($2.25/$2.75), Llama and Qwen3 families relocated to Dedicated Endpoints on custom pricing, and two new fixed-price Cerebras Code coding subscriptions — Pro at $50/month and Max at $200/month — both of which sold out at launch.
What’s unique : Speed-first pricing that inverts the inference cost/performance curve
1. The price-speed inversion: faster costs less than slower. On GPU-based inference platforms (Together AI, Fireworks AI, Replicate), higher throughput typically requires reserved capacity or higher pricing tiers. Cerebras Inference delivers 10–20× faster token generation at the same or lower per-token price as GPU alternatives. This is not a promotional rate — it reflects the WSE’s architectural efficiency advantage: on-chip SRAM eliminates the HBM memory bandwidth bottleneck that forces GPU inference to batch tokens slowly. For developers building interactive applications where latency is a product feature, Cerebras offers a genuinely different cost/performance profile. See choosing the right usage metric for how latency shapes value metric selection.
2. Symmetric vs. asymmetric pricing as a maturity signal. Early Cerebras models (Llama 3.1 8B, Llama 3.1 70B) carried identical input and output token prices — a simplification that underprices output generation relative to actual compute cost. Newer models (Llama 3.3 70B, GPT-OSS-120B, Qwen-3-32B) have adopted the industry-standard asymmetric structure where output costs 1.5–3× more than input. This evolution mirrors the broader shift in AI pricing models as inference operators get a better handle on their actual per-token compute cost curves.
3. Free tier as a speed demonstration, not an acquisition gimmick. Cerebras’s free tier is rate-limited but not model-limited — every public model is accessible with only Discord support. The strategic logic is that speed sells itself: a developer who runs GPT-OSS-120B at ~3,000 tokens/second on Cerebras’s free tier (versus a fraction of that on a GPU competitor’s free tier) experiences the value proposition directly. The free tier functions as a continuous interactive product demo, not merely a lead-gen form. This aligns with PLG strategies in AI infrastructure.
4. Vertical integration from chip to API, no third-party silicon dependency. Unlike every other LLM inference API (OpenAI, Anthropic, Groq, Together AI — all running on Nvidia GPUs or TPUs), Cerebras controls the full stack from silicon to API. This vertical integration provides pricing stability independent of Nvidia supply chain constraints and licensing costs. It also enables custom optimization at the hardware-software interface that no GPU-based operator can replicate. For enterprise buyers evaluating infrastructure lock-in risk, this is a meaningful architectural differentiator.
5. Dual product line creating two distinct commercial motions. Cerebras operates two fundamentally different businesses under one brand: a consumption-API business (Cerebras Inference) targeting developers with PLG acquisition, and a hardware enterprise business (CS-3 systems) targeting research institutions with multi-year sales cycles. This dual structure creates distinct revenue streams — recurring inference API revenue plus lumpy hardware contract revenue — a combination that complicates financial modeling but reduces customer concentration risk over time.
Strengths & weaknesses
| Strengths | Weaknesses |
|---|---|
| Fastest publicly available LLM inference by a significant margin (10–20× vs. GPU clouds) | Narrow model catalog limited to open-source models; no access to proprietary models (GPT-4o, Claude, Gemini) |
| Competitive per-token pricing — matches or undercuts GPU-based competitors at equivalent quality | G42 / CFIUS situation created revenue concentration risk and blocked a liquidity event for investors |
| Free tier exposes all models, enabling genuine speed evaluation before any payment commitment | Hardware (CS-3) pricing is opaque — enterprises cannot self-serve a cost estimate |
| No Nvidia dependency — pricing independent of GPU supply chain fluctuations | Rate limits on free tier can be a friction point for larger prototypes |
| Vertical integration from silicon to API enables proprietary speed optimizations | Model catalog turnover (ZAI-GLM-4.6 deprecated after ~6 months) suggests curation is ongoing and unannounced |
| OpenAI-compatible API allows drop-in replacement for applications already using OpenAI SDK | Limited enterprise features in the API layer: no organization-level access management, audit logs, or spend alerts documented |
Billing UX : Self-serve API keys with token-level usage metering
- Account creation — Sign up at cloud.cerebras.ai with email or Google OAuth; API key available immediately with no credit card required for the free tier.
- API compatibility — Cerebras Inference uses an OpenAI-compatible REST API (
/v1/chat/completions); developers can switch from OpenAI or other providers by changing the base URL and API key, with no code changes required for standard requests. - Billing activation — Moving from the free tier to the self-serve Developer tier (which starts at just $10) requires adding a payment method through the dashboard; no sales call is required. The Enterprise tier requires contacting sales.
- Usage metering — Token consumption is logged per API key in the cloud.cerebras.ai developer portal; breakdowns by model and time period are available.
- Spend alerts — Basic email notifications are available when credits run low; configurable hard spend caps are not publicly documented as of the last verified date.
- Payment methods — Credit card and wire transfer supported; enterprise contracts use invoiced net-30 payment terms.
- Rate limits — Free tier rate limits are enforced per API key; paid tier limits are higher but not publicly published in detail (operators report 60 requests/minute as a common paid baseline).
- Enterprise CS-3 billing — Hardware contracts are managed through a separate enterprise relationship; usage-based inference on managed CS-3 clusters is metered similarly to the cloud API but billed through a custom contract.
Strategic wins : Why Cerebras’s pricing decisions have worked
1. Positioning speed as the value metric, not capability
When Cerebras launched Cerebras Inference in August 2024, the inference API market was already crowded with providers offering access to the same Llama models. Rather than competing on model selection (where every GPU-cloud provider had the same catalog) or price (where margins are thin across the board), Cerebras competed on a dimension its hardware uniquely owned: speed. By demonstrating 2,100 tokens/second on Llama 3.1 8B — more than 20× faster than GPU alternatives — Cerebras created a product moment that no competitor could immediately replicate. The decision to charge at parity with slower competitors reinforced the narrative: “same price, radically faster.” This value-metric differentiation generated significant developer attention and organic distribution at launch.
2. Free tier with no credit card requirement lowered the friction barrier to zero
Cerebras made a deliberate choice to expose all public models on the free tier without requiring payment information. In a market where most inference providers require a credit card even for free-tier access (a practice that reduces developer friction but also reduces trust-building speed), Cerebras’s frictionless onboarding allowed any developer to experience the speed advantage in minutes. The rate limiting on the free tier is strict enough to push production workloads to paid, but loose enough that developers can genuinely evaluate latency in their specific application context. This PLG-first onboarding drove organic adoption through developer communities.
3. OpenAI API compatibility eliminated switching cost
By building Cerebras Inference as an OpenAI-compatible API, the company ensured that any developer already using the OpenAI SDK, LangChain, LlamaIndex, or other OpenAI-compatible frameworks could switch to Cerebras by changing exactly two lines of code: the base URL and the API key. This technical decision eliminated the switching cost objection entirely — a developer who wants to test Cerebras can do so against their existing application without any refactoring. The strategy mirrors how Groq and Together AI grew initial developer adoption and is now table stakes for new inference providers.
4. Dual product line de-risked the business during the IPO setback
Cerebras’s hardware business (CS-3 sales to national labs and research institutions) provided a stable revenue base that allowed the company to continue investing in Cerebras Inference even when the IPO was blocked in late 2024. Without the hardware contracts — particularly the multi-year government and research institution engagements — the company would have faced more acute pressure to monetize the inference API faster, potentially forcing premature pricing moves. The dual-track structure gave management time to build inference API momentum while hardware revenue sustained operations. This separation of revenue streams with different predictability profiles proved to be a structural advantage.
Areas to improve : Gaps in Cerebras’s pricing and platform approach
1. No enterprise-grade access control or spend management on the API
Cerebras Inference lacks organization-level features that enterprise buyers expect: team-based API key management, role-based access control, spend caps per team or project, and audit logs. A company deploying Cerebras Inference across multiple engineering teams has no mechanism to prevent a single team from consuming the full month’s budget through an errant batch job. This cost unpredictability gap is a meaningful barrier to enterprise API adoption. Competitors like Anthropic and OpenAI offer organization dashboards with per-project spend limits and usage visibility. Until Cerebras adds these controls, it is effectively limited to developer-grade and small-team deployment patterns in the API tier.
2. Model catalog is too narrow for multi-model use cases
Cerebras’s public per-token rate card has narrowed to just two models as of mid-2026 — GPT-OSS-120B and the Preview-only ZAI-GLM-4.7 — both open-source. A wider catalog (Llama, Qwen3, Mistral, DeepSeek, Kimi K2.x) exists only behind Dedicated Endpoints on custom pricing, so it is not self-serve. There is no access to Claude, GPT-4o, Gemini, or other frontier proprietary models through the Cerebras API. For organizations that want to consolidate their AI spend on a single inference platform, Cerebras cannot serve as a full-stack provider — it is a speed-optimized complement to, not a replacement for, platforms like Perplexity Sonar API or Fireworks AI. The narrow catalog also concentrates revenue risk on a small number of model relationships. Expanding the catalog — including through partnerships with model providers who would benefit from Cerebras’s speed for their open-weight models — is a strategic priority that has not yet materialized at scale.
3. Hardware pricing opacity creates a two-class developer ecosystem
The price opacity on CS-3 hardware creates an asymmetric developer experience: API users have a fully public rate card and can self-serve at any scale, while hardware prospects must engage sales before understanding costs. This bifurcation risks alienating the enterprise segment that Cerebras most needs to penetrate for long-term revenue growth. Publishing at least a floor price range or a “price per WSE-3 month” benchmark — even as a starting point for enterprise conversations — would help enterprises build preliminary business cases without requiring early sales engagement. Designing transparent enterprise pricing tiers is a solvable problem that Cerebras has not yet addressed.
Key takeaways
-
Speed can be a standalone value metric in a commoditized market. When every provider offers the same models at similar prices, a 10–20× throughput advantage is a genuine product differentiator that justifies category-level attention. Pricing teams competing in commoditized markets should ask what hardware or architectural advantages their infrastructure gives them that could be surfaced as a distinct pricing dimension.
-
Price-speed parity signals confidence; price premiums for speed signal weakness. Cerebras chose to price at parity with slower GPU-cloud competitors rather than charging a premium for faster throughput. This decision communicates confidence that speed adoption will be self-reinforcing — and it was. A speed premium would have created a price-performance comparison that competitors could close; price parity made the comparison entirely one-sided.
-
Free tiers should demonstrate the actual product, not a degraded version. Cerebras’s free tier exposes the same models at the same speed as the paid tier, just rate-limited. Developers experience the core value proposition — not a synthetic demo — before committing. For usage-based pricing products, free tiers that degrade quality (slower models, older versions) are less effective conversion drivers than volume-limited access to the full product.
-
Dual hardware-and-API business models create resilience but complicate focus. Cerebras’s hardware revenue sustained the company through the IPO setback, but the hardware sales cycle (multi-year government contracts) and the API sales cycle (self-serve minutes) require fundamentally different GTM motions, pricing architectures, and customer success approaches. Companies with dual B2B models should be deliberate about keeping these tracks operationally separate.
-
Revenue concentration in a single international customer is an IPO-blocking risk. The G42 relationship — which represented a significant fraction of Cerebras’s hardware revenue — triggered the CFIUS review that blocked the IPO. For any AI infrastructure company serving international customers, the concentration risk from a single large non-US customer is not just a financial risk but a regulatory and liquidity risk that must be disclosed and managed from early stages.
UBP implications
-
Per-token pricing at high throughput creates a new cost efficiency frontier for real-time applications. Cerebras demonstrates that usage-based pricing in AI inference does not have to accept the latency/cost tradeoff as fixed. When a provider can deliver faster inference at the same price, it unlocks new use cases (real-time voice, streaming code generation, interactive document analysis) that were previously uneconomic on GPU clouds. UBP practitioners building AI products should evaluate whether speed-gated features are worth pricing separately — Cerebras’s experience suggests that speed, at parity pricing, drives adoption without requiring a premium.
-
Symmetric vs. asymmetric token pricing reflects maturity in cost modeling. Cerebras’s evolution from symmetric ($0.10/$0.10 for Llama 3.1 8B) to asymmetric pricing (Llama 3.3 70B at $0.85/$1.20, GPT-OSS-120B at $0.35/$0.75) maps directly to better understanding of actual prefill versus decode compute costs on their hardware. Choosing the right usage metric for inference APIs means understanding that output generation is fundamentally more compute-intensive than input processing — a nuance that takes real usage data to optimize in a rate card.
-
The free-to-paid conversion cliff is rate limits, not model access. By offering the full model catalog on the free tier and differentiating only on throughput/rate limits, Cerebras has effectively made rate limits its value metric for the free-to-paid transition. This is a strong design choice for AI inference: rate limits are a natural usage signal (anyone hitting rate limits has a real workload), whereas model-gating would filter based on curiosity rather than genuine deployment intent. Operators designing free tiers for usage-based AI products should consider rate limits as the primary conversion gate rather than feature or model restrictions.
Sources
- Cerebras pricing page (tiers, per-token rate card, Cerebras Code plans) (accessed 2026-05-30)
- Cerebras Inference pricing documentation (accessed 2026-05-30)
- Cerebras Dedicated Endpoints — supported models (accessed 2026-05-30)
- Cerebras Inference models overview (accessed 2026-05-30)
- Cerebras cloud pricing page (accessed 2026-05-30)
- Cerebras blog — GPT-OSS-120B runs fastest on Cerebras (accessed 2026-05-29)
- Cerebras Systems GitHub — Cloud SDK Python (accessed 2026-05-29)
- Cerebras Systems S-1 filing (August 2024) (accessed 2026-05-29)
Bottom line
Cerebras has built the fastest publicly available LLM inference platform on the market by removing Nvidia GPUs from the equation entirely — and has priced it at parity with far slower GPU-cloud competitors. The result is a genuinely novel value proposition: the same open-source models you can run anywhere else, at 10–20× the throughput, at the same or lower cost. The inference API is clean, OpenAI-compatible, and frictionless to adopt. The gaps are real but fixable — the model catalog is narrow, enterprise controls are immature, and the hardware pricing opacity limits self-serve enterprise evaluation. Cerebras is a compelling speed-optimized inference tier for organizations running high-throughput workloads on open-source models; it is not yet a full-stack AI platform.
Browse the full pricing blueprint to compare Cerebras against other AI infrastructure providers.
Pricing timeline : Major events on a vertical axis
Each milestone below corresponds to a public pricing change, product launch, or material adjustment. Major events use a filled marker; minor adjustments use a faded one.
Public rate card narrows; Cerebras Code subscriptions launch
By mid-2026 the public per-token rate card lists just two models — GPT-OSS-120B ($0.35/$0.75, production) and ZAI-GLM-4.7 ($2.25/$2.75, labeled a Preview/evaluation model). Llama and Qwen3 families moved to Dedicated Endpoints on reserved-capacity custom pricing. Access is now tiered (Free, a self-serve Developer tier from $10, and Enterprise), and Cerebras introduced fixed-price Cerebras Code coding plans — Pro at $50/month (24M tokens/day) and Max at $200/month (120M tokens/day), both sold out at launch.
Qwen-3-32B and ZAI-GLM-4.x Models Added
Cerebras expanded its model catalog with Alibaba's Qwen-3-32B (priced at $0.40/$0.80 per million tokens) and the ZAI-GLM-4.6 and 4.7 models from Zhipu AI (priced at $2.25/$2.75 per million tokens). ZAI-GLM-4.6 was subsequently deprecated in January 2026.
GPT-OSS-120B Added — Fastest Open Reasoning Model
Cerebras added OpenAI's open-source GPT-OSS-120B (Apache 2.0 license) to its inference cloud at $0.35 input/$0.75 output per million tokens, claiming the fastest inference speed for a 120B-class reasoning model. The model supports a 131K context window.
IPO Blocked by CFIUS National-Security Review
Cerebras's planned IPO (S-1 filed August 2024, targeting ~$8B valuation) was blocked when CFIUS opened a national-security review of the company's relationship with UAE-based G42, which held a significant revenue concentration and had prior ties to Huawei. The company withdrew the IPO registration.
Cerebras Inference Launched in Public Beta — 2,100 Tokens/Second
Cerebras launched Cerebras Inference as a public beta cloud API, delivering Llama 3.1 8B and 70B models at speeds of 2,100 and 450 tokens/second respectively — exceeding GPU-cloud alternatives by 20×. The launch included a free developer tier and usage-based pay-per-token pricing.
WSE-3 and CS-3 Announced — 4 Trillion Transistors
Cerebras announced the third-generation Wafer Scale Engine (WSE-3) with 4 trillion transistors and 900,000 cores, and the CS-3 compute system built around it. CS-3 is positioned for both training and inference at scale. Pricing remains enterprise-contract only.
Cerebras Model Studio — First Cloud API
Cerebras launched Cerebras Model Studio, an early cloud-based API giving customers access to GPT-J and other open-source models running on WSE hardware. This was the company's first foray into cloud inference, initially available only to existing hardware customers.
CS-2 System Launched with WSE-2
Cerebras launched the CS-2 compute system powered by the WSE-2 chip (2.6 trillion transistors, 850,000 cores, 40 GB SRAM). CS-2 was sold to national labs, healthcare systems, and enterprises for large-scale model training.
WSE-1 Unveiled at Hot Chips — First Wafer-Scale AI Chip
Cerebras unveiled the Wafer Scale Engine (WSE-1) at Hot Chips 2019: 1.2 trillion transistors, 400,000 AI-optimized cores, 18 GB on-chip SRAM on a 46,225 mm² die. The chip was sold as part of the CS-1 compute system for on-premises deep learning training.
Cerebras Systems Founded
Andrew Feldman and Gary Lauterbach founded Cerebras Systems in Los Altos, California, to build a purpose-built AI chip that would break through GPU memory bottlenecks by placing all SRAM on a single wafer-scale die.
- · Cerebras's Wafer Scale Engine 3 (WSE-3) contains 4 trillion transistors on a single silicon wafer — roughly 57× more transistors than Nvidia's H100 GPU — making it the largest chip ever manufactured as of 2024.
- · Cerebras filed for an IPO in August 2024 valuing the company at approximately $8 billion, but the IPO was blocked in November 2024 when the Committee on Foreign Investment in the United States (CFIUS) opened a national-security review related to the company's largest customer, G42 of the UAE, which had previously had ties to Huawei.
- · At launch in August 2024, Cerebras Inference ran Llama 3.1 70B at 2,100 tokens per second — more than 20× faster than GPU-based competitors like Together AI or Fireworks AI at the time, a speed record that attracted significant developer attention.
Questions & answers
- How much does Cerebras Inference cost per million tokens?
- The public per-token rate card lists two models: GPT-OSS-120B at $0.35 input/$0.75 output per million tokens (production) and ZAI-GLM-4.7 at $2.25 input/$2.75 output per million tokens (a Preview model, intended for evaluation only, not production). Other models such as Llama 3.3 70B and Qwen3-32B are available via Dedicated Endpoints on custom pricing rather than the public rate card.
- Does Cerebras offer a free tier for the inference API?
- Yes. Cerebras provides a free tier with rate-limited access to all public models and Discord support. To raise limits, the self-serve Developer tier starts at just $10 and offers 10x higher rate limits and higher-priority processing; the Enterprise tier (contact sales) adds custom weights, dedicated queue priority, and guaranteed uptime.
- How fast is Cerebras inference compared to GPU-based providers?
- Cerebras Inference delivers 1,000–2,100 tokens per second on Llama 3.1 70B-class models, compared to 40–80 tokens/second on GPU-based providers like Together AI or Fireworks AI. The speed advantage comes from the on-chip SRAM of the WSE eliminating GPU memory bandwidth bottlenecks.
- What models are available on Cerebras Inference?
- As of mid-2026, the public per-token rate card lists GPT-OSS-120B (production) and ZAI-GLM-4.7 (a Preview/evaluation model). A much wider catalog — including Llama 3.3 70B, Llama 4 Maverick and Scout, Qwen3-32B and Qwen3-235B, Qwen3-Coder, Mistral, DeepSeek, and Kimi K2.x — is available through Dedicated Endpoints on reserved-capacity custom pricing. The catalog focuses on open-source models that benefit most from Cerebras's speed advantage.
- How does Cerebras hardware (CS-3) pricing work?
- The CS-3 compute system is sold under enterprise contracts via a direct sales process. Pricing is not publicly listed and varies based on deployment configuration (on-premises, cloud-connected, or managed), support tiers, and commitment length. Typical CS-3 deployments involve multi-year contracts at research institutions and national labs.
- Is Cerebras an alternative to Nvidia GPUs?
- For inference workloads on supported open-source models, yes. Cerebras Inference runs without Nvidia GPUs, delivering faster throughput at competitive per-token cost. For training arbitrary model architectures or running proprietary models, GPU-based infrastructure remains more flexible. Cerebras's hardware is optimized for dense transformer inference.