About
OctoAI, originally OctoML, was a generative-AI inference platform founded in 2019 as a University of Washington spinout commercializing the Apache TVM machine-learning compiler. Led by CEO Luis Ceze, the company first sold model-optimization and deployment tooling, then in late 2023 rebranded to OctoAI and pivoted into a hosted inference cloud: developers could call open models (Llama, Mixtral, Mistral, Stable Diffusion) through a usage-metered API at prices and speeds the hyperscalers were not yet matching. The pitch was “run any model, any hardware, fastest and cheapest,” and the company raised roughly $132M in total, reaching a reported ~$900M valuation with its 2021 Series C.
The story ends as a post-mortem. On September 25, 2024, NVIDIA acquired OctoAI — reportedly for about $165M (up to ~$250M with retention incentives), roughly 18 cents on the dollar against that ~$900M peak. Within weeks, OctoAI emailed customers that its commercial services would wind down effective October 31, 2024, giving developers about five weeks to migrate off the public text-gen and media-gen APIs. There was no successor “powered by OctoAI” product; CEO Luis Ceze moved to NVIDIA as VP of AI Systems Software, and the asset NVIDIA appeared to value most was OctoStack, OctoAI’s hardware-agnostic private-deployment layer. Today octo.ai redirects to NVIDIA.
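The two headline figures in this paragraph are simple arithmetic. A minimal sketch, using only the dates and reported dollar amounts quoted above; the variable names are illustrative, and the window is measured from the acquisition date, so the real notice period depends on when the shutdown email actually arrived:

```python
from datetime import date

# Reported figures quoted in the text (approximate, in millions of USD).
BASE_PRICE_M = 165        # reported base acquisition price
PEAK_VALUATION_M = 900    # reported Series C valuation

# Dates quoted in the text.
ACQUIRED = date(2024, 9, 25)
SHUTDOWN = date(2024, 10, 31)

# Share of the peak valuation recovered, expressed in cents per dollar.
cents_on_dollar = BASE_PRICE_M / PEAK_VALUATION_M * 100

# Calendar days between the acquisition and the API cutoff.
window_days = (SHUTDOWN - ACQUIRED).days

print(f"{cents_on_dollar:.1f} cents on the dollar")
print(f"{window_days} days ({window_days / 7:.1f} weeks) from acquisition to shutdown")
```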
Because the standalone platform no longer exists, everything below is historical — reconstructed from contemporaneous reporting and third-party pricing comparisons. None of it is purchasable today; for current options you would contact NVIDIA.
Pricing summary : How OctoAI’s pricing model worked
OctoAI was, while it operated, a pure usage-based inference platform — you paid for what you generated, not a per-seat subscription. There were three monetization surfaces:
- Text Gen (per token): open LLM endpoints (Llama, Mixtral, Mistral, Code Llama) billed per 1M tokens, with a flat input/output rate on the Llama models and split rates on Mixtral and Mistral. Self-serve, with new accounts getting free signup credit.
- Media Gen (per image / per compute-second): Stable Diffusion XL, SD 1.5, and Stable Video Diffusion endpoints, usage-metered by image generated and underlying GPU compute rather than tokens. Unlike text-gen, it supported customer fine-tunes.
- OctoStack / dedicated compute: a private, self-hosted inference stack (and custom dedicated capacity) sold as sales-quoted enterprise contracts, with no public rate card.
What makes this different: the afterlife is the story. OctoAI is now a sales-only, NVIDIA-internal asset — the public self-serve rate card was retired in October 2024, and the platform was acquired and sunset in roughly five weeks. We classify it sales-only and treat all dollar figures as historical, because there is no live price to quote and presenting old rates as current would be misleading.
Pricing by product
These are historical (2023-2024) list rates, reconstructed from third-party reporting — not current prices. Text generation was billed per 1M tokens:
| Model (text-gen) | Input / 1M tokens | Output / 1M tokens | Notes |
|---|---|---|---|
| Llama 3 8B Instruct | $0.15 | $0.15 | Flat input/output rate |
| Llama 3 70B Instruct | ~$0.90 | ~$0.90 | Also reported at $0.765 each |
| Mixtral 8x7B Instruct | $0.30 | $0.50 | Split input/output |
| Mistral 7B Instruct | $0.10 | $0.25 | Split input/output |
| Text embeddings (GTE-Large) | $0.05 | — | Per 1M tokens |
Media generation (SDXL, SD 1.5, SVD) was usage-metered per image and/or per second of GPU compute rather than per token; OctoStack and dedicated compute were sales-quoted. New self-serve accounts received $10 in free credit.
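To make the per-token table concrete, here is a minimal sketch that prices a workload against the historical rates above. The rates come from the table; the `RATES` dictionary keys, the `text_gen_cost` helper, and the sample workload are illustrative assumptions, not an OctoAI API.

```python
# Historical (2023-2024) rates from the table above, in USD per 1M tokens.
# Each entry is (input_rate, output_rate). The Llama 3 70B figure is approximate.
# The model keys are made-up labels, not real endpoint identifiers.
RATES = {
    "llama-3-8b-instruct": (0.15, 0.15),
    "llama-3-70b-instruct": (0.90, 0.90),
    "mixtral-8x7b-instruct": (0.30, 0.50),
    "mistral-7b-instruct": (0.10, 0.25),
}

def text_gen_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the historical per-1M-token rates."""
    input_rate, output_rate = RATES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Hypothetical monthly workload: 40M input tokens and 10M output tokens.
for model in RATES:
    print(f"{model}: ${text_gen_cost(model, 40_000_000, 10_000_000):.2f}")
```

Because the rates are historical, the output is only useful as a baseline for comparing against a live provider's card.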
Sales motions across products: historically self-serve/PLG for the token and image APIs (free credit, no sales call) with sales-led OctoStack and dedicated-compute contracts on top. Post-acquisition the entire platform is sales-only and folded into NVIDIA — there is no longer a self-serve motion to buy OctoAI standalone.
Hidden costs : What OctoAI users actually paid (and the real cost of the shutdown)
For a discontinued platform, the largest “hidden cost” is not a line item — it is migration risk. When OctoAI gave customers roughly five weeks to move off the API before the Oct 31, 2024 cutoff, teams that had hardcoded OctoAI endpoints and pricing into production bore the full re-platforming cost: re-pointing to a new provider, re-validating outputs, and absorbing whatever rate delta the replacement charged.
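One way to limit that exposure is to keep the provider's endpoint and rates in configuration rather than in application code, so a forced migration becomes a config change plus re-validation. The sketch below is hypothetical throughout: the `Provider` class, the provider names, and the `example.invalid` URLs are placeholders, not real endpoints or any vendor's SDK, and the replacement vendor's rates are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provider:
    """Everything vendor-specific lives here instead of in call sites."""
    name: str
    base_url: str        # placeholder URL, not a real endpoint
    input_rate: float    # USD per 1M input tokens
    output_rate: float   # USD per 1M output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * self.input_rate
                + output_tokens * self.output_rate) / 1_000_000

# Hypothetical registry. The "old" rates mirror the historical Mixtral 8x7B
# figures quoted in this profile; the "new" rates are invented for illustration.
PROVIDERS = {
    "old": Provider("old-vendor", "https://old.example.invalid/v1", 0.30, 0.50),
    "new": Provider("new-vendor", "https://new.example.invalid/v1", 0.45, 0.70),
}

def rate_delta(old: Provider, new: Provider,
               input_tokens: int, output_tokens: int) -> float:
    """Extra USD the same workload costs after moving from old to new."""
    return new.cost(input_tokens, output_tokens) - old.cost(input_tokens, output_tokens)

ACTIVE = "new"  # switching vendors is a one-line change here
provider = PROVIDERS[ACTIVE]
delta = rate_delta(PROVIDERS["old"], provider, 40_000_000, 10_000_000)
print(f"Calling {provider.base_url}; monthly rate delta: ${delta:+.2f}")
```

This does not remove the need to re-validate outputs on the new provider; it only keeps the endpoint and pricing assumptions in one place.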
The historical metered costs that drove real bills were:
| Line item (historical) | How it was billed |
|---|---|
| Text-gen tokens | Per 1M tokens (e.g. Mixtral 8x7B at $0.30 input / $0.50 output) |
| Media-gen images | Per image / per second of GPU compute (SDXL, SD1.5, SVD) |
| Embeddings | $0.05 per 1M tokens |
| Free credit offset | $10 signup credit, then pay-as-you-go |
| OctoStack / dedicated | Sales-quoted contract (no public rate) |
Output tokens cost more than input on the split-rate models (Mixtral 8x7B and Mistral 7B), so chat workloads with long generations skewed toward the output rate. This is the usual asymmetry that surprises teams modeling only the headline input price.
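The effect is easy to quantify as a blended rate. A minimal sketch, using the historical Mixtral 8x7B figures and an assumed share of output tokens (the `blended_rate` helper and the sample shares are illustrative):

```python
def blended_rate(input_rate: float, output_rate: float, output_share: float) -> float:
    """Effective USD per 1M tokens when `output_share` of all tokens are output."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return input_rate * (1.0 - output_share) + output_rate * output_share

# Mixtral 8x7B historical rates: $0.30 input / $0.50 output per 1M tokens.
for share in (0.1, 0.5, 0.9):
    print(f"{share:.0%} output tokens -> ${blended_rate(0.30, 0.50, share):.2f} per 1M")
```

An input-heavy job such as summarization pays close to the $0.30 headline, while a long-generation chat workload pays close to $0.50.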
Want to estimate inference costs the way OctoAI customers had to? Use the OctoAI pricing calculator to model token and image spend, then compare against a live provider before you commit.
Pricing evolution : OctoAI pricing history and changes
Cadence
| Period | Price changes | Product / SKU additions | Notes |
|---|---|---|---|
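| 2023 H2 | Per-token rate card published | Text Gen Solution; rebrand OctoML to OctoAI | Launched with Llama 2 Chat, Code Llama, and Mistral; $10 free credit. The $0.15 (Llama 3 8B) to ~$0.90 (70B) range was added in 2024, after Llama 3's release |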
| 2024 H1 | — | OctoStack private deployment | Enterprise sales-quoted; hardware-agnostic |
| 2024 H2 | Rate card retired entirely | Platform sunset | NVIDIA acquired (Sep 25); APIs off Oct 31 |
Tracked range: 2023-2024 (the platform’s full commercial life). All prices historical; reconstructed from contemporaneous reporting — see 2026-06-15-main-validated.txt.
Notable changes
- November 2023: OctoML rebrands to OctoAI and launches the per-token Text Gen Solution (initially Llama 2 Chat, Code Llama, and Mistral) alongside its existing Media Gen (SDXL/SD1.5/SVD) endpoints. Self-serve, usage-based, $10 free credit. Across the reconstructed 2023-2024 rate card, headline text rates ran from $0.15 per 1M tokens (Llama 3 8B, which can only date from after that model's April 2024 release) up to about $0.90 (70B-class), with Mixtral 8x7B at $0.30 input / $0.50 output and Mistral 7B at $0.10 / $0.25.
- April 2024: OctoStack launches as a self-hosted/private inference stack across NVIDIA, AMD, and AWS Inferentia hardware, claiming roughly 4x better GPU utilization. This shifted OctoAI’s enterprise story from “call our API” to “run our stack in your environment.”
- September-October 2024: NVIDIA acquires OctoAI (~$165M-$250M reported) and winds the commercial platform down by Oct 31, 2024. The public rate card disappears; pricing becomes irrelevant because the product is no longer sold.
The trajectory is the lesson: a transparent, aggressively-cheap usage-based rate card was not enough to sustain an independent inference cloud once frontier-model economics and hyperscaler/NVIDIA gravity set in — the company was absorbed and its self-serve pricing erased within weeks.
What’s unique : OctoAI’s distinctive pricing mechanics
1. Two metering models under one platform. OctoAI ran both a per-token text-gen meter and a per-image / per-compute-second media-gen meter — pricing each modality on the unit that actually mapped to its cost, rather than forcing images into a token abstraction.
2. Hardware-agnostic enterprise pricing via OctoStack. Instead of only renting its own cloud by the unit, OctoAI sold a private deployment layer that ran across NVIDIA, AMD, and Inferentia — a sales-quoted contract whose value was utilization (the ~4x claim), not a published rate.
3. A rate card with a hard expiry. The most distinctive “mechanic” in hindsight is that the entire pricing surface was switched off on a fixed date after acquisition — a reminder that with a venture-backed inference startup, the rate card is only as durable as the company’s independence.
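The first point, one invoice fed by two meters, can be sketched as follows. This profile does not cite OctoAI's media-gen rates, so the per-image price below is a purely hypothetical placeholder, as are the class and field names; only the Mixtral 8x7B token rates come from the historical rate card.

```python
from dataclasses import dataclass

@dataclass
class TokenUsage:
    input_tokens: int
    output_tokens: int
    input_rate: float    # USD per 1M input tokens
    output_rate: float   # USD per 1M output tokens

    def charge(self) -> float:
        return (self.input_tokens * self.input_rate
                + self.output_tokens * self.output_rate) / 1_000_000

@dataclass
class ImageUsage:
    images: int
    rate_per_image: float  # HYPOTHETICAL: no public media rate is cited here

    def charge(self) -> float:
        return self.images * self.rate_per_image

def invoice_total(line_items) -> float:
    """Sum charges across meters; each item only needs a charge() method."""
    return sum(item.charge() for item in line_items)

usage = [
    TokenUsage(20_000_000, 5_000_000, input_rate=0.30, output_rate=0.50),
    ImageUsage(images=1_000, rate_per_image=0.004),  # placeholder price
]
print(f"Total: ${invoice_total(usage):.2f}")
```

Each meter keeps its own unit, and the invoice only needs a common `charge()` interface, so a compute-second meter could be added without touching the token logic.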
Strengths & weaknesses
| Strengths | Weaknesses |
|---|---|
| Transparent per-1M-token rates undercutting hyperscalers | Platform no longer exists — acquired and sunset |
| Free $10 credit lowered self-serve onboarding friction | Only ~5 weeks’ notice before the API went dark |
| Per-modality metering (tokens vs images) | Heavy migration cost dumped on production users |
| OctoStack: hardware-agnostic private deployment | Exited at ~18 cents on the dollar vs peak valuation |
| Fast SDXL endpoint with fine-tune support | No durable, independent pricing to rely on |
Billing UX : OctoAI billing controls and transparency
- Billing controls — Historically pay-as-you-go on metered usage (tokens / images), with a $10 free credit to start; OctoStack and dedicated compute were invoiced enterprise contracts. Today there are no self-serve billing controls because the standalone product is discontinued.
- Usage visibility — While live, the OctoAI console exposed per-model token and image usage; that dashboard is gone post-sunset.
- Payment options — Self-serve card billing for the metered APIs and sales-led invoicing for OctoStack/enterprise — now superseded by NVIDIA’s enterprise procurement, since OctoAI is sales-only and internal to NVIDIA.
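The credit-then-metered flow in the first bullet amounts to drawing down the signup credit against usage before anything is billed. A minimal sketch, assuming the credit is applied to each period's charge in order until it runs out; the function name and the sample usage figures are illustrative, and only the $10 credit comes from this profile.

```python
FREE_CREDIT_USD = 10.00  # signup credit cited in this profile

def amount_due(usage_charges, credit: float = FREE_CREDIT_USD):
    """Apply credit to each period's charge in order.

    Returns a list of (charge, billed, remaining_credit) tuples.
    """
    rows = []
    for charge in usage_charges:
        applied = min(credit, charge)   # never apply more credit than is owed
        credit -= applied
        rows.append((charge, round(charge - applied, 2), round(credit, 2)))
    return rows

# Hypothetical metered charges for three consecutive months, in USD.
for charge, billed, left in amount_due([4.00, 7.50, 12.00]):
    print(f"usage ${charge:.2f} -> billed ${billed:.2f}, credit left ${left:.2f}")
```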
Strategic wins : Why OctoAI’s pricing decisions worked (while they lasted)
1. Transparent, cheap per-token pricing as a wedge
By publishing flat per-1M-token rates (Llama 3 8B at $0.15) and handing out free credit, OctoAI made it trivial for developers to try open models without a sales call — the classic usage-based onboarding wedge. See how AI companies structure pricing.
2. Metering each modality on its real cost driver
Pricing text by the token and media by the image / compute-second meant customers paid on the unit that tracked OctoAI’s own GPU cost — a cleaner alignment than forcing everything into one abstraction. Related: outcome-based pricing trends.
3. Moving enterprise value to utilization, not list price
OctoStack repriced the enterprise conversation around GPU utilization (the ~4x claim) rather than a per-unit list rate — exactly the asset that made OctoAI attractive to NVIDIA. See choosing the right usage metric.
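Why utilization can matter more than list price is clearer with numbers. The sketch below assumes a flat hourly GPU cost and a baseline utilization, both invented for illustration; only the roughly 4x multiplier reflects the claim cited above, and the helper name is hypothetical.

```python
def cost_per_useful_gpu_hour(hourly_cost: float, utilization: float) -> float:
    """USD paid per hour of GPU time that actually serves requests."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_cost / utilization

HOURLY_COST = 2.00                 # HYPOTHETICAL USD per GPU-hour
BASELINE_UTIL = 0.15               # HYPOTHETICAL baseline utilization
IMPROVED_UTIL = BASELINE_UTIL * 4  # the roughly 4x claim cited above

before = cost_per_useful_gpu_hour(HOURLY_COST, BASELINE_UTIL)
after = cost_per_useful_gpu_hour(HOURLY_COST, IMPROVED_UTIL)
print(f"before: ${before:.2f}, after: ${after:.2f}, ratio: {before / after:.1f}x")
```

Under these assumptions the hardware bill is unchanged while the cost of each useful GPU-hour falls by the same factor as the utilization gain, which is why the contract could be sold on utilization rather than a per-unit rate.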
Areas to improve : Gaps in OctoAI’s pricing approach
1. Cheap usage rates could not fund an independent inference cloud
OctoAI priced to win developers, but per-token margins in a commoditizing inference market were thin against hyperscaler and NVIDIA scale — cheap rates were a great wedge and a poor moat. See bill shock and cost unpredictability.
2. No durability guarantee for customers’ pricing
A roughly five-week shutdown window left production users to absorb migration cost. Inference vendors that want trust need clearer continuity commitments around their rate card and endpoints.
3. Self-serve transparency, then a sudden sales-only cliff
OctoAI went from open, self-serve pricing to no pricing at all almost overnight after acquisition — a discontinuity that turned its earlier transparency into a liability for anyone who had standardized on it.
Key takeaways
- OctoAI was pure usage-based inference, now discontinued — per-1M-token text-gen, per-image media-gen, sales-quoted OctoStack — acquired by NVIDIA in Sept 2024 and sunset Oct 31, 2024. For the underlying model, see the introduction to usage-based pricing.
- The historical rate card was aggressively cheap — Llama 3 8B at $0.15, Mistral 7B at $0.10 / $0.25, Mixtral 8x7B at $0.30 / $0.50 per 1M tokens — built to win self-serve developers.
- Two meters, one platform — tokens for text, images/compute-seconds for media — each priced on its real cost driver.
- The biggest cost ended up being the shutdown — a roughly five-week migration window after the rate card was switched off entirely.
- Cheap usage pricing is a wedge, not a moat in commoditizing inference; the broader lesson for the category is that pricing transparency does not by itself sustain an independent vendor.
UBP implications
- Match the meter to the modality. OctoAI’s split of per-token text-gen and per-image media-gen is a reusable pattern: bill each product on the unit that maps to its underlying cost rather than forcing one abstraction across everything.
- A usage rate card is only as durable as the vendor. Buyers standardizing on a metered API should weigh continuity risk, because a startup’s published prices can vanish on an acquisition timeline measured in weeks.
- Transparent low rates win adoption but rarely fund independence. In a commoditizing inference market, cheap per-unit pricing is an excellent onboarding wedge and a weak long-term defense — a caution for any UBP business pricing below its scaled competitors.
Sources
- GeekWire — NVIDIA acquires OctoAI (accessed 2026-06-15)
- HPCwire / BigDATAwire — OctoAI Snapped Up by Nvidia (accessed 2026-06-15)
- eesel AI — What the NVIDIA acquisition of OctoAI means for you (accessed 2026-06-15)
- PRNewswire — OctoML Launches OctoAI Text Gen Solution (accessed 2026-06-15)
- Third-party LLM price comparisons (Medium, pricepertoken) for historical per-token rates (accessed 2026-06-15)
- OctoAI historical rate card, reconstructed in 2026-06-15-main-validated.txt (accessed 2026-06-15)
Bottom line
OctoAI (formerly OctoML) is a post-mortem, not a live pricing profile: a pure usage-based inference platform — cheap per-1M-token text generation, per-image media generation, and a sales-quoted OctoStack private-deployment layer — that NVIDIA acquired in September 2024 and shut down within weeks, retiring its rate card entirely by October 31, 2024. Its arc is the lesson the category keeps relearning: transparent, aggressively low usage pricing is a superb developer wedge but a poor moat, and a metered rate card is only as durable as the company behind it. Browse the pricing blueprint for more fully-researched company profiles, or compare OctoAI against other AI inference and infrastructure companies.
Pricing timeline : Major events on a vertical axis
Each milestone below corresponds to a public pricing change, product launch, or material adjustment, listed newest first.
October 31, 2024: Standalone commercial platform sunset
OctoAI wound down its commercial services effective Oct 31, 2024, giving developers about five weeks to migrate off the public text-gen and media-gen APIs. No successor product; the published rate card was retired entirely.
September 25, 2024: Acquired by NVIDIA (~$165M-$250M)
NVIDIA acquired OctoAI for a reported ~$165M base (up to ~$250M with retention), roughly 18 cents on the dollar versus the ~$900M Series C peak. CEO Luis Ceze joined NVIDIA as VP of AI Systems Software.
April 2024: OctoStack private-deployment tier added
OctoAI launched OctoStack, a self-hosted/private inference stack running in the customer's own environment across NVIDIA, AMD, and AWS Inferentia, claiming ~4x GPU utilization. Sales-quoted enterprise contract layered on top of the self-serve token/image APIs.
November 2023: OctoML rebrands to OctoAI; launches per-token Text Gen
OctoAI launched its Text Gen Solution (Llama 2 Chat, Code Llama, Mistral) billed per 1M tokens, alongside an existing Media Gen Solution (SDXL/SD1.5/SVD) billed per image / per second of compute. Self-serve with $10 free credit; OctoML rebranded to OctoAI.
- OctoAI began life as OctoML, a 2019 University of Washington spinout commercializing the Apache TVM compiler project before pivoting into a hosted generative-AI inference platform.
- NVIDIA reportedly paid about $165M (up to ~$250M with retention), roughly 18 cents on the dollar versus the reported ~$900M valuation of OctoAI's 2021 Series C.
- After the acquisition, developers had only about five weeks to migrate before the public API went dark on Oct 31, 2024, and there was no successor 'powered by OctoAI' product.
Questions & answers
- Can I still buy OctoAI today?
- No. NVIDIA acquired OctoAI (formerly OctoML) on September 25, 2024, and OctoAI wound down its commercial services effective October 31, 2024 — about a five-week migration window. The public text-generation and media-generation APIs were discontinued, octo.ai now redirects to NVIDIA, and there is no successor 'powered by OctoAI' product. Any pricing you find online for OctoAI is historical.
- How did OctoAI price text generation?
- OctoAI charged per 1M tokens, with a flat input/output rate on the Llama models and split rates on Mixtral and Mistral. Historically (2023-2024): Llama 3 8B at $0.15, Llama 3 70B around $0.90, Mixtral 8x7B at $0.30 input / $0.50 output, Mistral 7B at $0.10 input / $0.25 output, and text embeddings at $0.05 per 1M tokens. New accounts received $10 in free credit.
- How did OctoAI price image and media generation?
- Media Gen (Stable Diffusion XL, SD 1.5, and Stable Video Diffusion) was usage-metered — billed per image generated and/or per second of GPU compute rather than per token. OctoAI marketed the 'fastest SDXL endpoint' at roughly 3.1 seconds average latency, and unlike its text-gen product it supported customer fine-tunes.
- What was OctoStack and how was it priced?
- OctoStack, launched in April 2024, was OctoAI's private/self-hosted inference stack for enterprises — it ran in the customer's own environment across NVIDIA, AMD, and AWS Inferentia hardware and claimed about 4x better GPU utilization. It was sold as a sales-quoted enterprise contract with no public rate card, and it was the asset NVIDIA was most interested in.