Data Pipeline Pricing: Examples & Companies

What is it

Data pipeline pricing is how data collection, scraping, and pipeline services charge for their work. These platforms extract, transform, and deliver web data, and they typically bill per request, per GB, or per record.

Thirty-nine companies in the UsagePricing corpus tag data-pipeline as a use case, making it one of the densest single clusters in the collection. Read broadly, the category is the full supply chain that feeds data to the AI stack, and it spans five layers. At the collection layer sit the proxy-and-scraping vendors (Bright Data, Oxylabs, ScraperAPI, ZenRows), the AI-agent-facing extraction and search APIs (Firecrawl, Exa, Tavily, Linkup, SerpApi), the no-code scrapers (Browse AI, Apify), and structured-extraction platforms (Diffbot, Unstructured, LlamaIndex).

At the transform and orchestration layer are the workflow engines that scrape-transform-load in a single run (n8n, Trigger.dev, Pipedream), the GTM enrichment tools (Clay, Rows), and the metering layer itself (OpenMeter). At the storage-and-retrieval layer are the vector databases that index the embeddings a pipeline produces (Pinecone, Weaviate, Qdrant, Chroma, Milvus, LanceDB, turbopuffer, Upstash). A training-data layer supplies human-labeled and programmatic data (Scale AI, Labelbox, Snorkel AI, Mercor, micro1). And beneath all of them is the compute substrate the heaviest pipelines run on (Modal, RunPod, Anyscale, Lightning AI, Together AI, Mosaic AI).

What unites this sprawl is a near-total commitment to usage-based metering; almost none of these companies charge per seat. Each picks the unit that tracks where its cost — and the customer’s value — actually sits: a gigabyte of proxy bandwidth, a successful scrape, a delivered record, a rendered page, a stored vector, a workflow execution, or a GPU-hour. The leading scrapers go further and charge only when a request succeeds, eating the cost of blocks and CAPTCHAs themselves. Sitting at the supply end of the AI stack — feeding grounding data, enrichment, embeddings, and training corpora to everything else — this cluster’s pricing is an early read on how the broader market meters raw, variable-cost work. For the underlying mechanics, see the introduction to usage-based pricing.

One pipeline, four meters — pick the unit before the tier

How it works

Data pipeline pricing resolves to one core decision: which unit best tracks the cost of doing the work. The corpus uses several primary meters, and the larger platforms run more than one in parallel rather than forcing a single unit across an entire catalog.

Billing unit	What it tracks	Best fit	Example on this page
Per request / per 1,000 requests	Count of API calls	Search and structured extraction	Exa Search at $7/1k requests (~$0.007/call)
Per GB	Bandwidth transferred	Raw proxy traffic where payload size drives cost	Oxylabs residential from $6/GB, datacenter from $0.59/GB
Per record / per page / per 1,000 results	Rows, pages, or successful results delivered	Packaged datasets, document ETL, success-based scraping	Unstructured at a flat $0.03/page; Bright Data datasets per 1,000 records
Credits (difficulty-weighted)	Normalized request difficulty	Scrapers where some pages cost far more to fetch	ScraperAPI 1 credit (plain) → 75 credits (ultra-premium + render)
Read/write units + storage	Vector index operations and bytes at rest	Vector databases indexing embeddings	Pinecone ~$16–18/M read units, ~$4–4.50/M write units, ~$0.33/GB/mo storage
Per execution / per second	A whole job or run	Workflow orchestration	Trigger.dev at $0.0000169–$0.00068/sec + $0.25 per 10,000 runs

Three structural levers recur across the category. The first is success-based billing: Bright Data, Oxylabs, SerpApi, and ZenRows do not charge for requests that fail with a block, CAPTCHA, or 5xx/6xx system error. The vendor absorbs proxy churn; the buyer pays only for data actually returned. The second is the difficulty multiplier: ScraperAPI charges 1 credit for a plain page, 10 for JavaScript rendering or premium proxies, and 75 for ultra-premium plus rendering, while Oxylabs varies its per-1,000-result rate by target (Amazon $0.50, Google $1.00, other $1.15 without JS rendering). Both encode the reality that requests do not all cost the same to serve. The third is the multi-line catalog: storage-and-retrieval vendors like Pinecone split one product into read units, write units, and per-GB storage so that each cost driver is priced independently.
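The value of success-based billing can be sketched with simple arithmetic. The snippet below is an illustration only: the list price and the 60% success rate are hypothetical inputs, not corpus figures, and the function name is invented for this example. It shows that under per-attempt billing the effective price per delivered result is the list price divided by the success rate, while under success-based billing the two are equal.

```python
def effective_price_per_1k_delivered(
    list_price_per_1k: float, success_rate: float, success_based: bool
) -> float:
    """Return the effective price per 1,000 results that actually come back.

    success_rate is the fraction of attempts that return data (0 < rate <= 1).
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    if success_based:
        # Failed attempts are not billed, so list price equals effective price.
        return list_price_per_1k
    # Per-attempt billing: blocked and failed attempts are still paid for.
    return list_price_per_1k / success_rate


# Hypothetical inputs: $1.00 per 1,000 and a 60% success rate on a protected target.
per_attempt = effective_price_per_1k_delivered(1.00, 0.60, success_based=False)
per_success = effective_price_per_1k_delivered(1.00, 0.60, success_based=True)
print(f"per-attempt billing:   ${per_attempt:.2f} per 1,000 delivered")
print(f"success-based billing: ${per_success:.2f} per 1,000 delivered")
```

The gap widens as the block rate rises, which is why the comparison matters most on bot-protected targets.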

A worked example shows how the meter choice changes the math. Take 100,000 page fetches in a month:

Unit math (per-credit, difficulty-weighted): On ScraperAPI’s 100,000-credit Hobby plan ($49/mo), 100,000 plain pages cost 100,000 credits, exactly the plan’s allowance. But if every page needs JS rendering (10 credits each), the same 100,000 fetches need 1,000,000 credits, ten times the allowance, forcing a far larger tier. The headline credit count overstates real capacity the moment multipliers apply.

Unit math (per-credit, page-flat): On Firecrawl, one credit ≈ one page, so 100,000 pages ≈ 100,000 credits — covered by the $83/mo Standard tier (100,000 credits). Because Firecrawl charges no per-seat fee, a 50-person team pays the same $83 as a solo developer at that volume.

Unit math (per-page ETL): On Unstructured’s flat $0.03/page, turning those same 100,000 fetched pages into clean structured JSON costs $3,000 once the 15,000 free non-expiring pages are used up. With the free allowance still intact, only 85,000 pages are billable, or $2,550. Either way the bill is linear and seat-free.
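The three calculations above can be reproduced in a few lines. This is a minimal sketch using only the figures quoted in the worked example; the function names and the dictionary keys are invented for illustration, and the one-credit-per-page rule for Firecrawl is the approximation stated in the text.

```python
# Figures quoted in the worked example above.
SCRAPERAPI_MULTIPLIERS = {"plain": 1, "js_render": 10, "ultra_premium_render": 75}
UNSTRUCTURED_RATE_PER_PAGE = 0.03  # dollars
UNSTRUCTURED_FREE_PAGES = 15_000


def scraperapi_credits(pages_by_kind: dict[str, int]) -> int:
    """Credits consumed when each page kind carries its own multiplier."""
    return sum(SCRAPERAPI_MULTIPLIERS[kind] * n for kind, n in pages_by_kind.items())


def firecrawl_credits(pages: int) -> int:
    """Approximation from the text: one credit is roughly one page."""
    return pages


def unstructured_cost(pages: int, free_pages_left: int = 0) -> float:
    """Dollar cost at the flat per-page rate, net of any remaining free pages."""
    billable = max(0, pages - free_pages_left)
    return billable * UNSTRUCTURED_RATE_PER_PAGE


pages = 100_000
print("ScraperAPI, all plain:    ", scraperapi_credits({"plain": pages}), "credits")
print("ScraperAPI, all JS render:", scraperapi_credits({"js_render": pages}), "credits")
print("Firecrawl:                ", firecrawl_credits(pages), "credits")
print(f"Unstructured, no free pages left: ${unstructured_cost(pages):,.2f}")
print(f"Unstructured, free pages intact:  ${unstructured_cost(pages, UNSTRUCTURED_FREE_PAGES):,.2f}")
```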

The fourth axis, easy to miss, is throughput as a separate price dimension. ScraperAPI caps concurrent threads (20 → 500) independently of the credit budget, and Firecrawl raises concurrent browsers and rate limits with each tier, so two customers with identical volume allowances can have very different speed ceilings. Choosing the right unit and tier means matching all four dimensions: volume, difficulty mix, storage footprint, and concurrency. The choosing-the-right-usage-metric guide walks through that selection.
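To see why concurrency behaves like a second price axis, a rough wall-clock estimate helps. The sketch below assumes every thread stays busy for the whole job and uses a hypothetical average of 5 seconds per request; only the 20 and 500 thread caps come from the text.

```python
def hours_to_finish(requests: int, concurrency: int, avg_seconds_per_request: float) -> float:
    """Estimate wall-clock hours, assuming all threads stay fully busy."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    total_request_seconds = requests * avg_seconds_per_request
    return total_request_seconds / concurrency / 3600


# 20 and 500 are the thread caps quoted above; 5.0 s latency is a hypothetical input.
for threads in (20, 500):
    hours = hours_to_finish(100_000, threads, 5.0)
    print(f"{threads:>3} threads: about {hours:.2f} hours for 100,000 requests")
```

Under these assumptions the same 100,000-request budget finishes 25 times faster at the higher cap, even though the credit cost is identical.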

Companies using this

These 39 corpus companies tag data pipelines as a use case, spanning collection, transform, storage-and-retrieval, training-data, and compute layers. The company table further down this page lists each with its product, pricing model, billing units, free-tier status, and last-verified date.

Patterns observed

Multi-meter catalogs are the norm at the top of the market. The largest platforms deliberately run several billing units at once. Bright Data is the cleanest example in the entire corpus: rotating residential and its Browser API bill per GB, static ISP and datacenter proxies bill per dedicated IP, the unblocker/SERP/scraper APIs bill per 1,000 successful results, and datasets bill per 1,000 records — four simultaneous meters on one platform. Oxylabs mirrors the structure with three value metrics (GB, IP, successful results) across seven product lines, and vector store Pinecone does the storage-side equivalent by splitting read units, write units, and storage into three independent rates. The upside is that margins stay legible; the downside is that cross-product cost forecasting becomes genuinely hard for buyers.
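A multi-meter bill is a sum of independent line items, which is what makes cross-product forecasting laborious. The sketch below uses the Oxylabs entry rates quoted on this page (which are "from" prices, so real invoices may differ) and entirely hypothetical quantities; the `LineItem` type is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    meter: str
    quantity: float    # expressed in the meter's own unit
    unit_price: float  # dollars per unit

    @property
    def amount(self) -> float:
        return self.quantity * self.unit_price


def forecast(items: list[LineItem]) -> float:
    """Total bill: each meter is priced independently, then summed."""
    return sum(item.amount for item in items)


# Rates are the entry prices quoted on this page; quantities are hypothetical.
invoice = [
    LineItem("residential proxy GB", 50, 6.00),
    LineItem("datacenter proxy GB", 200, 0.59),
    LineItem("Amazon results (per 1,000)", 300, 0.50),
    LineItem("Google results (per 1,000)", 120, 1.00),
]

for item in invoice:
    print(f"{item.meter:<28} ${item.amount:>8.2f}")
print(f"{'total':<28} ${forecast(invoice):>8.2f}")
```

Each line needs its own volume estimate, so a forecast is only as good as the weakest of the four guesses.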

Success-based billing is a category differentiator, not an edge case. Bright Data, Oxylabs, SerpApi, and ZenRows all decline to charge for failed requests — blocks, CAPTCHAs, and system errors are on the vendor. SerpApi takes the posture furthest: it has no overage line at all, so exhausting a plan triggers an early full-price renewal rather than a marginal per-search charge. This outcome alignment is the same logic driving the broader shift toward outcome-based pricing in AI, arriving early here because block rates make “pay per attempt” feel unfair.
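The no-overage posture turns the cost curve into a step function. The sketch below contrasts it with a conventional plan-plus-overage scheme over the period in which the searches are consumed. All prices, quotas, and the overage rate are hypothetical placeholders rather than SerpApi list prices, and the model assumes each renewal buys one more full quota.

```python
import math


def early_renewal_cost(searches: int, plan_quota: int, plan_price: float) -> float:
    """Spend when every exhausted plan renews immediately at full price."""
    renewals = max(1, math.ceil(searches / plan_quota))
    return renewals * plan_price


def overage_cost(searches: int, plan_quota: int, plan_price: float, overage_rate: float) -> float:
    """Spend under a plan with a marginal per-search overage, for comparison."""
    extra = max(0, searches - plan_quota)
    return plan_price + extra * overage_rate


# Hypothetical placeholders, chosen only to show the shape of the two curves.
QUOTA, PRICE, OVERAGE = 10_000, 100.0, 0.01
for n in (9_000, 10_500, 25_000):
    print(
        f"{n:>6} searches: renewal model ${early_renewal_cost(n, QUOTA, PRICE):.2f}, "
        f"overage model ${overage_cost(n, QUOTA, PRICE, OVERAGE):.2f}"
    )
```

Just past a quota boundary the renewal model is the more expensive of the two, which is the case a buyer with spiky volume should check first.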

Difficulty-weighting normalizes the cost of uneven work. Because a JavaScript-heavy, bot-protected page costs dramatically more to fetch than a static one, several vendors price the difficulty rather than the raw count. ScraperAPI’s credit multiplier (1× / 10× / 75×) and Oxylabs’ target-specific per-1,000 rates (Amazon $0.50, Google $1.00, other $1.15) both do this. The pattern keeps unit economics honest but makes headline allowances misleading — a 100,000-credit plan is not 100,000 hard scrapes.

The storage-and-retrieval layer meters bytes and operations, not fetches. The vector databases in this cluster tag data-pipeline because they hold what a pipeline produces, and they price accordingly. turbopuffer bills per write, per query, and per GB-month of storage above monthly minimums of $64 / $256 / $4,096. Pinecone charges ~$16–18 per million read units and ~$0.33/GB/mo on Standard, with a $20 flat Builder tier below it. Qdrant meters vCPU, RAM, and storage billed hourly. The unit shifts from “did I fetch a page” to “how much am I storing and querying” — a reminder that a single use case can meter very different cost drivers depending on where in the pipeline the company sits.
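A storage-side bill can be estimated the same way. The sketch below applies the approximate Pinecone Standard rates quoted on this page (about $16 to $18 per million read units, about $4 to $4.50 per million write units, about $0.33/GB/mo) to hypothetical volumes and returns a low and high bound. It ignores any plan minimum. The turbopuffer function assumes the monthly minimum works as a spending floor, which is an interpretation of the text rather than a documented rule.

```python
def pinecone_standard_estimate(
    read_units_millions: float, write_units_millions: float, storage_gb: float
) -> tuple[float, float]:
    """Return (low, high) monthly dollar estimates from the approximate quoted rates."""
    storage = storage_gb * 0.33
    low = read_units_millions * 16.0 + write_units_millions * 4.0 + storage
    high = read_units_millions * 18.0 + write_units_millions * 4.5 + storage
    return low, high


def turbopuffer_bill(metered_usage_dollars: float, monthly_minimum: float = 64.0) -> float:
    """Assumption: the bill is metered usage or the plan minimum, whichever is larger."""
    return max(metered_usage_dollars, monthly_minimum)


# Hypothetical workload: 10M read units, 5M write units, 100 GB stored.
low, high = pinecone_standard_estimate(10, 5, 100)
print(f"Pinecone estimate: ${low:.2f} to ${high:.2f} per month")
print(f"turbopuffer at $20 of usage:  ${turbopuffer_bill(20.0):.2f}")
print(f"turbopuffer at $300 of usage: ${turbopuffer_bill(300.0):.2f}")
```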

Seats are nearly absent, and orchestration meters the job. This is among the least seat-based clusters in the corpus. Firecrawl explicitly never charges per user; Exa, Tavily, and Linkup are pure pay-as-you-go credit balances with no seat line at all. The orchestration engines go further and meter the whole run: n8n bills monthly workflow executions ($20/mo up to $800/mo), Trigger.dev bills per second of compute plus $0.25 per 10,000 run invocations, and Pipedream bills credits where one credit ≈ 30 seconds at 256 MB. The buyer is a developer or an automated agent, not a team of named human users, so the value metric is throughput and jobs, not headcount.
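Job-level metering is easy to model once the run count and duration are known. The sketch below uses the Trigger.dev figures quoted on this page (a per-second compute rate plus $0.25 per 10,000 runs) with the low end of the quoted rate range and a hypothetical workload. The Pipedream function rests on two assumptions that the source does not spell out: each started 30-second block is billed, and credits scale linearly in 256 MB steps.

```python
import math

TRIGGER_RUN_FEE = 0.25 / 10_000  # dollars per run invocation, from the text


def trigger_dev_cost(runs: int, avg_seconds: float, rate_per_second: float) -> float:
    """Compute charge plus the per-run invocation fee."""
    compute = runs * avg_seconds * rate_per_second
    return compute + runs * TRIGGER_RUN_FEE


def pipedream_credits(seconds: float, memory_mb: int = 256) -> int:
    """Assumed rule: one credit per started 30 s block, multiplied per 256 MB step."""
    return math.ceil(seconds / 30) * math.ceil(memory_mb / 256)


# Hypothetical workload: 100,000 runs of 10 s each at the lowest quoted rate.
cost = trigger_dev_cost(100_000, 10.0, 0.0000169)
print(f"Trigger.dev: ${cost:.2f}")
print("Pipedream, 45 s at 512 MB:", pipedream_credits(45, 512), "credits")
```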

Unit prices on the collection layer are mostly falling. Oxylabs’ residential entry rate roughly halved from $12/GB in 2022 to $6/GB in 2026, and Apify is the rare metered platform that has cut prices: Scale from $499 to $199, Starter from $49 to $29, and compute-unit rates by ~20–25% in 2025. Search APIs are mixed: Exa’s base Search rate fell to $5/1k in 2025, then rose to $7/1k in 2026. The direction is not uniform, which is exactly why buyers should re-baseline at least twice a year.
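The size of each move is easier to compare as a percentage. The short sketch below computes the change for the before-and-after prices quoted in this paragraph; the labels are descriptive only.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new (negative means a price cut)."""
    return (new - old) / old * 100


# Before/after prices as quoted in the paragraph above.
moves = {
    "Oxylabs residential, $/GB (2022 to 2026)": (12.0, 6.0),
    "Apify Scale, $/mo": (499.0, 199.0),
    "Apify Starter, $/mo": (49.0, 29.0),
    "Exa Search, $/1k (2025 to 2026)": (5.0, 7.0),
}

for label, (old, new) in moves.items():
    print(f"{label:<42} {pct_change(old, new):+.1f}%")
```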

The compute substrate appears on the supply side. Modal, RunPod, Anyscale, Lightning AI, Together AI, and Mosaic AI tag data-pipeline not because they scrape, but because heavy pipelines run their extraction, embedding, and training jobs on these platforms — metered in GPU-hours, GB-hours, or per-token, never per record. Together AI’s dedicated GPU rates ($6.49/hr, $3.59 reserved, B200 at $11.95/hr) show how the same use case reaches down into raw compute pricing at the bottom of the stack.

Counterexamples & variants

Sales-led, unpriced data work breaks the self-serve mold. Not every data-pipeline company exposes a meter. Scale AI is enterprise sales-quoted on contract-based per-task or per-data-unit deals with committed data-engine spend; Snorkel AI sells annual platform subscriptions with no public self-serve rate card; and micro1 (a human-data engine and RL-environment provider) and Mercor (an AI talent marketplace plus enterprise data partnerships) are both sales-quoted with no public price — Mercor’s buyer take-rate is undisclosed and only the hourly expert pay is visible. When the “pipeline” is bespoke human-labeled or programmatic training data rather than automated web extraction, usage metering gives way to custom enterprise contracts. These are the clearest cases where the category’s default self-serve model does not apply.

Subscription-with-credit-wallet, not pure usage. Apify and Diffbot layer a flat monthly plan over a prepaid credit pool — Apify’s prepaid balance equals its plan fee ($29 of Starter buys $29 of usage) and expires at cycle end. Trigger.dev does the same on the orchestration side: its Free/Hobby/Pro base fees ($0/$10/$50) double as a prepaid credit balance against compute. These are hybrids, not pure usage, and they introduce a use-it-or-lose-it dynamic absent from balance-based vendors like Exa or Tavily. The variant is worth flagging because it changes the buyer’s optimization problem from “minimize spend” to “right-size the wallet.” See the prepaid-credits guide for how these pools behave.
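The use-it-or-lose-it dynamic can be made concrete with a small model. The sketch below takes the $29 Apify Starter fee from the text and a hypothetical, spiky four-month usage series. It assumes overage is billed dollar-for-dollar and that a pay-as-you-go vendor would charge the same unit prices, neither of which the source confirms.

```python
def wallet_month(plan_fee: float, usage_value: float) -> dict[str, float]:
    """One cycle under a plan whose fee doubles as an expiring prepaid balance."""
    expired = max(0.0, plan_fee - usage_value)  # unused balance lost at cycle end
    overage = max(0.0, usage_value - plan_fee)  # assumption: billed dollar-for-dollar
    return {"paid": plan_fee + overage, "expired": expired}


def compare(plan_fee: float, monthly_usage: list[float]) -> tuple[float, float]:
    """Return (wallet_total, pay_as_you_go_total) over the usage series."""
    wallet_total = sum(wallet_month(plan_fee, u)["paid"] for u in monthly_usage)
    payg_total = sum(monthly_usage)
    return wallet_total, payg_total


# $29 is the Starter fee quoted above; the usage values are hypothetical.
usage = [5.0, 60.0, 10.0, 29.0]
wallet_total, payg_total = compare(29.0, usage)
expired_total = sum(wallet_month(29.0, u)["expired"] for u in usage)
print(f"wallet plan:   ${wallet_total:.2f} (of which ${expired_total:.2f} expired unused)")
print(f"pay-as-you-go: ${payg_total:.2f}")
```

The entire gap between the two totals is expired balance, which is why right-sizing the wallet matters more than trimming usage.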

Execution-metered automation is a different unit entirely. n8n prices on monthly workflow executions rather than requests, GB, or records, because a workflow may scrape, transform, and load in a single billed run. Its free self-hosted Community Edition also breaks the cluster’s freemium-SaaS pattern. Pipedream meters credits, where one credit is roughly 30 seconds of compute at 256 MB and scales with memory, and it adds a $2-per-external-user line on its Connect tier, one of the few seat-adjacent charges in the whole cluster. Clay splits its meter again into an Actions capacity tier plus a separate Data Credits usage pool. All three show that “data pipeline” at the orchestration layer meters the job, not the byte.

Flat, single-dimension unit pricing. Against the difficulty-weighted and multi-meter norm, Unstructured prices document ETL at a flat $0.03 per page: any file type, any pipeline, no multipliers, with 15,000 free non-expiring pages. It is the rare vendor here whose per-artifact rate never varies with difficulty, which makes its bill trivially forecastable but leaves its margin exposed on genuinely hard documents. This is the mirror image of the ScraperAPI approach: legibility over cost-precision.

KYC and legal gating inside self-serve. Bright Data requires a know-your-customer review (sometimes a video call) before residential and mobile networks switch on — a manual compliance checkpoint inside an otherwise instant, card-on-file product. SerpApi goes the other way and productizes the legal risk, bundling an up-to-$2M U.S. Legal Shield rather than charging for it. Both are reminders that in web data, compliance is a pricing-adjacent variable, not just a footnote.

What this means for buyers vs vendors

For buyers

Model your bill on your difficulty mix, not the headline allowance. A 100,000-credit plan on ScraperAPI covers 100,000 plain pages but only ~1,333 ultra-premium-plus-render scrapes (100,000 ÷ 75); on Oxylabs the per-1,000-result rate changes with the target. Estimate the share of your traffic that needs JS rendering or premium proxies before picking a tier, and test-run heavy jobs first, because Apify’s compute-unit cost is very hard to forecast without a trial run. For predictable document ETL, Unstructured’s flat $0.03/page is the easy case: it scales linearly with volume and nothing else.
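The difficulty-mix estimate reduces to dividing the plan allowance by a weighted average multiplier. The sketch below uses the ScraperAPI multipliers quoted on this page; the 70/25/5 traffic mix is a hypothetical input and the function name is invented for illustration.

```python
# Credit multipliers quoted on this page for ScraperAPI.
MULTIPLIERS = {"plain": 1, "js_or_premium": 10, "ultra_premium_render": 75}


def fetches_covered(plan_credits: int, mix: dict[str, float]) -> int:
    """Whole fetches a credit allowance covers for a given traffic mix.

    mix maps each page kind to its share of traffic; shares must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    avg_credits_per_fetch = sum(MULTIPLIERS[kind] * share for kind, share in mix.items())
    return int(plan_credits // avg_credits_per_fetch)


print(fetches_covered(100_000, {"plain": 1.0}))
print(fetches_covered(100_000, {"ultra_premium_render": 1.0}))
# Hypothetical mix: 70% plain, 25% JS or premium, 5% ultra-premium plus render.
print(fetches_covered(100_000, {"plain": 0.70, "js_or_premium": 0.25, "ultra_premium_render": 0.05}))
```

Even a 5% share of the hardest pages pulls the effective capacity of a 100,000-credit plan down to roughly a seventh of the headline number in this example.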

Prefer success-based vendors for hard targets. If you scrape bot-protected sites, Bright Data, Oxylabs, SerpApi, and ZenRows won’t charge you for the blocks — which can be a large fraction of attempts. Watch the meter that dominates: on protected targets, residential-proxy GB can dwarf the plan fee, the single biggest source of bill shock in this category. The pricing calculator hub helps you sanity-check tier choices against expected volume before you commit.

Budget the storage layer separately, and check the wallet rules. If your pipeline ends in a vector store, that bill lives on a different meter: Pinecone charges ~$0.33/GB/mo plus read and write units, and turbopuffer imposes a $64/mo minimum before per-write and per-query charges even begin — costs that compound with index size, not fetch volume. On the collection side, Apify and Firecrawl credits don’t roll over (except via auto-recharge or annual), so over-provisioning is pure waste; balance-based vendors like Exa and Linkup let unused credit sit. If your volume is spiky, a pay-as-you-go balance beats a fixed monthly pool.

For vendors

Match the meter to the cost driver, then keep it legible. The corpus winners run multiple meters (Bright Data’s GB / IP / results / records; Pinecone’s read / write / storage) precisely because no single unit fits a proxy, a scrape, a dataset, and a stored index — but every additional meter raises the buyer’s forecasting burden. The trade is margin clarity for budgeting friction; price the dimensions that genuinely diverge in cost and bundle the rest. Unstructured’s flat per-page rate shows the opposite bet — maximum legibility at the cost of precision — and it works because document ETL cost is relatively uniform.

Success-based billing is a trust lever worth the cost. Eating failed requests differentiates you from resellers and removes the buyer’s biggest objection on hard targets. Pair it with difficulty-weighting so you don’t lose money on the expensive scrapes — that’s the ScraperAPI and Oxylabs playbook. And resist drift toward per-attempt billing; in web data it reads as charging for your own failures.

Use throughput, concurrency, and the job itself as clean up-sells. ScraperAPI (20 → 500 threads) and Firecrawl (concurrency per tier) both sell speed as a second axis independent of volume, capturing willingness-to-pay from latency-sensitive accounts without raising the per-unit rate. The orchestration engines (n8n, Trigger.dev) go further and meter the run outright, which lets them price the compound work of a whole pipeline in one line. For the mechanics of metering and invoicing this volume of events, see the usage-invoicing and billing-cycles guide.

Company	Product	Pricing model	Billing units	Free tier	Last verified
Anyscale	Managed Ray platform for distributed AI training, inference, and batch processing (RayTurbo, Anyscale Compute Units)	pure-usage commitment hybrid	gpu-hours cpu-hours credits	Yes	2026-05-29
Apify	Apify Platform — web scraping and browser-automation cloud with an Actors marketplace	hybrid freemium	gb-hours credits bandwidth-gb	Yes	2026-06-03
Bright Data	Web data platform — proxy networks, scraping APIs, a managed scraping browser, SERP and unlocker APIs, ready-made datasets, and eCommerce insights	pure-usage hybrid commitment	bandwidth-gb requests records	Yes	2026-07-14
Browse AI	No-code web scraping and website-monitoring platform that turns any site into a structured dataset or API	freemium hybrid commitment	credits seats	Yes	2026-06-04
Chroma	Open-source vector database + Chroma Cloud	pure-usage freemium	storage-gb bandwidth-gb api-calls	Yes	2026-06-09
Clay	AI-powered GTM data-enrichment and outbound platform billed on Actions plus Data Credits	hybrid freemium commitment	credits actions	Yes	2026-07-06
Databricks (Mosaic AI)	Mosaic AI — enterprise GenAI & ML on the Data Intelligence Platform	pure-usage commitment	units tokens gpu-hours	Yes	2026-06-15
Diffbot	Web-extraction APIs (Extract, Crawl, Natural Language) plus a Knowledge Graph, metered on monthly credits	hybrid freemium	credits api-calls	Yes	2026-06-04
Exa	AI web search API for agents — search, contents, deep research, and monitoring endpoints billed per request	pure-usage freemium	requests credits api-calls	Yes	2026-07-14
Firecrawl	Web-scraping and data-extraction API for AI agents — scrape, crawl, map, search, and extract pages into clean markdown/JSON	subscription hybrid freemium	credits pages-rendered api-calls	Yes	2026-06-30
Labelbox	AI training-data platform (data labeling, curation & model evaluation)	pure-usage freemium subscription	units records data-licensing	Yes	2026-06-15
LanceDB	AI-native multimodal lakehouse	freemium pure-usage commitment	storage-gb vectors-indexed gpu-hours	Yes	2026-06-09
Lightning AI	Cloud GPU/CPU Studio compute platform for building, training, and serving AI models, billed by the second with a credit pool	hybrid freemium pure-usage	gpu-hours cpu-hours credits	Yes	2026-06-02
Linkup	Web search API for AI agents — Search, Fetch, and async Research endpoints with grounded, structured results	pure-usage freemium	requests credits api-calls	Yes	2026-07-14
LlamaIndex	RAG/agent orchestration framework + LlamaCloud document parsing	hybrid freemium	credits pages-rendered seats	Yes	2026-06-10
Mercor	AI talent marketplace + enterprise data partnerships for frontier AI labs	pure-usage	tasks	No	2026-07-14
micro1	Human-data engine, RL environments, and agent evaluation for frontier AI labs	pure-usage	tasks	No	2026-07-14
Milvus	Vector database (OSS) + Zilliz Cloud (managed)	pure-usage freemium commitment	gpu-hours storage-gb vectors-indexed	Yes	2026-06-09
Modal	Serverless compute and GPU platform — per-second billing for Python functions, batch jobs, and model serving	pure-usage freemium subscription	gpu-hours cpu-hours gb-hours	Yes	2026-07-14
n8n	Fair-code workflow automation platform for technical teams, billed by monthly workflow executions	subscription freemium	workflow-executions	Yes	2026-06-02
OpenMeter	Open-source usage metering and billing platform for AI, agentic, and developer tools	freemium	events api-calls	Yes	2026-06-03
Oxylabs	Web data collection: residential, datacenter, ISP & mobile proxies plus Web Scraper API and Web Unblocker	hybrid pure-usage freemium	bandwidth-gb ips records	Yes	2026-07-06
Pinecone	Managed vector database (serverless)	pure-usage hybrid	requests storage-gb vectors-indexed	Yes	2026-06-09
Pipedream	Workflow automation and integration platform for developers	hybrid freemium	credits workflow-executions tokens	Yes	2026-06-16
Qdrant	Open-source vector database + Qdrant Cloud	pure-usage freemium	cpu-hours gb-hours storage-gb	Yes	2026-06-09
Rows	Rows AI spreadsheet	subscription hybrid	seats tasks api-calls	Yes	2026-06-08
RunPod	GPU cloud marketplace — Secure Cloud and Community Cloud Pods, Serverless endpoints, and persistent storage	pure-usage hybrid commitment	gpu-hours storage-gb	No	2026-07-14
Scale AI	Data engine, GenAI platform & contributor marketplace	pure-usage commitment	tasks records data-licensing	No	2026-06-15
ScraperAPI	Web scraping API that handles proxies, browsers, and CAPTCHAs behind a single endpoint	subscription pure-usage	credits requests api-calls	No	2026-06-04
SerpApi	Real-time search-results API (Google, Bing, and other engines)	subscription pure-usage	api-calls requests	Yes	2026-06-04
Snorkel AI	Programmatic AI data development platform & expert data	subscription commitment	data-licensing records units	No	2026-06-15
Tavily	Tavily Search API	pure-usage freemium	credits api-calls requests	Yes	2026-06-03
Together AI	AI Acceleration Cloud — serverless inference, dedicated endpoints, GPU clusters, Code Sandbox, fine-tuning	pure-usage hybrid commitment	tokens gpu-hours cpu-hours	Yes	2026-07-14
Trigger.dev	Background jobs and workflow orchestration for developers	hybrid freemium pure-usage	workflow-executions cpu-hours seats	Yes	2026-06-16
turbopuffer	Serverless vector and full-text search database on object storage	pure-usage commitment	storage-gb vectors-indexed gb-hours	No	2026-07-14
Unstructured	Document ingestion / ETL API	pure-usage freemium	pages-rendered documents	Yes	2026-07-14
Upstash	Upstash (Redis, Vector, QStash, Search, Workflow)	pure-usage freemium hybrid	requests api-calls vectors-indexed	Yes	2026-07-14
Weaviate	AI-native vector database (open-source core + Weaviate Cloud managed serverless, dedicated/Enterprise Cloud, BYOC)	pure-usage hybrid commitment	vectors-indexed tokens api-calls	Yes	2026-07-06
ZenRows	Universal Scraper API, Scraping Browser, and Residential Proxies	hybrid subscription pure-usage	requests api-calls bandwidth-gb	Yes	2026-06-04

FAQ

What is data pipeline pricing?

Data pipeline pricing is how vendors that collect, extract, transform, and deliver data charge for it — typically metered per request, per gigabyte transferred, per record or page returned, per vector stored, or per credit. The unit is chosen to track the actual cost driver of the workload, which is why a platform like Bright Data may run four meters at once while a vector store like Pinecone bills read units, write units, and storage separately.

Why do web scraping vendors bill only for successful results?

Blocks, CAPTCHAs, and proxy churn are the vendor's problem to solve, not the buyer's. Success-based billing — used by Bright Data, Oxylabs, SerpApi, and ZenRows — means a failed fetch isn't charged, aligning the vendor's incentive with the customer's and differentiating against resellers who bill all traffic regardless of block rate.

What's the difference between per-request, per-GB, and per-record billing?

Per-request (Exa, SerpApi, ScraperAPI) charges by call and suits search and structured extraction; per-GB (Bright Data, Oxylabs residential proxies) charges by bandwidth and suits raw proxy traffic where payload size drives cost; per-record or per-page (Bright Data datasets, Unstructured at $0.03/page) charges by artifact delivered and suits packaged datasets and document ETL. Many platforms mix several across product lines.

How do vector databases fit the data-pipeline category?

Vector databases are the storage-and-retrieval end of the AI data pipeline. Pinecone, Weaviate, Qdrant, Chroma, Milvus, and turbopuffer index the embeddings that scraping and extraction produce, and they meter storage plus read/write operations — Pinecone bills read units, write units, and ~$0.33/GB/mo storage; turbopuffer bills per write, per query, and per GB-month above a $64/mo minimum.

Are web data prices rising or falling?

Residential proxy unit prices have fallen sharply — Oxylabs' residential entry rate roughly halved from $12/GB in 2022 to $6/GB in 2026. Apify has cut plan and compute-unit prices. Search API rates are mixed: Exa's base Search rate dropped to $5/1k in 2025 before rising to $7/1k in 2026. Re-baseline your cost model at least twice a year.

Do data pipeline vendors charge per seat?

Rarely. The category is overwhelmingly usage-metered, not seat-based. Firecrawl explicitly never charges per user — a solo developer and a 50-person team pay the same as long as page volume and concurrency match — and search APIs like Exa, Tavily, and Linkup are pure pay-as-you-go credit balances with no seat line at all.
