DeepSeek V4 Pro's Discount Goes Permanent — The Real Cost of Inference

A discount that stayed

A temporary discount that was supposed to expire is now permanent. That’s either a land-grab for market share or a signal that the original price was always padded.

DeepSeek just turned their V4 Pro introductory pricing into the new baseline. The silence from the rest of the AI API market is louder than any press release.

I’ve been shipping production systems long enough to know that permanent discounts from infrastructure providers usually mean one of two things: they found a cost advantage nobody else has, or they’re buying market share they can’t hold on price alone.

The question is which one this is.

Key Takeaway: Permanent pricing removes the “will they jack up rates next quarter” fear that kills enterprise procurement deals. Temporary discounts poison long-term planning. Permanent ones reveal strategy.

What the pricing page actually says

The DeepSeek API pricing page (api-docs.deepseek.com/quick_start/pricing) is the canonical reference. It’s a straightforward table of input, output, and cache-hit token prices, updated to reflect the new permanent pricing that replaces what was previously marketed as a limited-time launch discount.

This matters because DeepSeek has been the one enforcing price discipline since V2. Every time OpenAI or Anthropic adjusts pricing, the market watches to see if DeepSeek follows. When DeepSeek cuts and holds, everyone else feels the margin pressure.

The timing lands right as enterprises are moving from “which model is best on a benchmark” to “which model can I run at scale without burning my infrastructure budget.” That’s exactly the conversation DeepSeek wants to lead.

“For all models, the input cache hit price has been reduced to 1/10 of the launch price. This price adjustment takes effect from 2026/4/26 12:15 UTC.”

— DeepSeek API Pricing Docs

That quote matters. It’s not a promotion. It’s a permanent structural discount.

The real cost picture

DeepSeek V4 Pro now sits at permanent pricing that was previously a temporary launch discount. The numbers are aggressive, and they land in a very different market than the GPT-4o vs Claude 3.5 era. All prices below are in USD:

Model	Input (per 1M tokens)	Cached input (per 1M tokens)	Output (per 1M tokens)
DeepSeek V4 Flash	$0.14	~$0.003	$0.28
Kimi K2.6	~$0.90	~$0.15	~$3.75
DeepSeek V4 Pro	$2.00	$0.50	$8.00
GPT-5.4	$2.50	$0.25	$15.00
Claude Sonnet 4.6	$3.00	$0.30	$15.00
GPT-5.5	$5.00	$0.50	$30.00

Let me level with you: the raw price comparison isn’t the full story. Here’s what actually matters when you’re choosing a model to build an application or an agent around, not just comparing spec sheets.

The cache mechanism auto-detects repeated prompt prefixes and bills those input tokens at the cache-hit rate, $0.50 per 1M for V4 Pro. For production workloads with predictable prompt structures like RAG pipelines, code generation scaffolding, and agentic loops, that’s where the real savings live.

Here’s the practical breakdown for builders, not buyers.

DeepSeek V4 Flash at $0.14/MTok input and $0.28/MTok output is in a tier of its own. It’s not a direct competitor to the frontier models; it’s a different category entirely. For high-volume, latency-tolerant workloads like classification, extraction, summarization, and single-turn RAG, it’s absurdly cheap.

DeepSeek V4 Pro sits in the sweet spot: cheaper than GPT-5.4 and Sonnet 4.6 on uncached input and on output, though not on cached input, with a 1M context window.

GPT-5.5 is the premium tier at $30/MTok output, and that’s real money if you’re running agent loops that generate thousands of tokens per turn.

Kimi K2.6 is the wildcard: output at roughly half of DeepSeek V4 Pro’s price, but you’re betting on a Chinese provider’s API reliability and data governance. Fine for some workloads, a non-starter for regulated industries.

For agentic workloads specifically, meaning multi-turn loops where every turn generates output tokens, the output price matters more than the input price. DeepSeek V4 Pro at $8/MTok output is nearly 2x cheaper than GPT-5.4 and Sonnet 4.6 ($15), and nearly 4x cheaper than GPT-5.5 ($30). In an agent loop averaging 2,000 output tokens per turn across 10 turns, that’s $0.16 per session on V4 Pro versus $0.30 on GPT-5.4 or Sonnet 4.6, on output alone. Multiply by thousands of sessions and it adds up fast.
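The per-session arithmetic is easy to rerun for your own loop shape. Here is a minimal sketch in Python, using the output list prices from the table above; the function name and the 10,000-session multiplier are illustrative choices, not anything from a provider SDK.

```python
# Output list prices in USD per 1M tokens, copied from the comparison table.
OUTPUT_PRICE_PER_MTOK = {
    "DeepSeek V4 Pro": 8.00,
    "GPT-5.4": 15.00,
    "Claude Sonnet 4.6": 15.00,
    "GPT-5.5": 30.00,
}


def session_output_cost(price_per_mtok: float, tokens_per_turn: int, turns: int) -> float:
    """Output-only cost in USD for one agent session."""
    return price_per_mtok * tokens_per_turn * turns / 1_000_000


if __name__ == "__main__":
    for model, price in OUTPUT_PRICE_PER_MTOK.items():
        cost = session_output_cost(price, tokens_per_turn=2_000, turns=10)
        # 10,000 sessions is an arbitrary volume chosen for illustration.
        print(f"{model}: ${cost:.2f} per session, ${cost * 10_000:,.0f} per 10,000 sessions")
```

This ignores input tokens entirely, so treat it as a floor on session cost rather than a full estimate.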

Kimi K2.6 is genuinely interesting here: at ~$3.75/MTok output, it’s cheaper than every model in the table except V4 Flash. But the ecosystem questions, like tool use quality, streaming reliability, and rate limits under load, are still open. I haven’t run K2.6 in production yet, so I can’t vouch for it the way I can vouch for DeepSeek.

"DeepSeek V4 Pro at $8/MTok output isn’t just cheap — it changes the architecture math. When your output cost is 2x cheaper than the nearest competitor, you can afford to run more agent turns, more retries, more fallback chains. Cheap output changes how you design the loop."

— Dusty

“The deepseek-v4-pro model API pricing will be officially adjusted to 1/4 of the original price after the 75% discount promotion ends on 2026/05/31 15:59 UTC.”

— DeepSeek API Pricing Docs

Read the quote again and notice what’s really there. DeepSeek isn’t just publishing rates; they’re publishing infrastructure economics. This is a company treating inference as an engineering cost to be optimized rather than a margin to be maintained.

Figure: DeepSeek V4 Pro pricing comparison, $2 input / $8 output per 1M tokens, against GPT-5.4, Sonnet 4.6, GPT-5.5, and Kimi K2.6.

Bottom Line

The real cost driver isn’t the headline per-token rate; it’s your cache hit rate. A 60% cache hit rate on V4 Pro drops your effective input cost to $1.10 per million tokens (0.4 × $2.00 + 0.6 × $0.50). That’s 56% cheaper than GPT-5.4’s uncached input and 78% cheaper than GPT-5.5’s.
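To see how sensitive that blended figure is to your own hit rate, here is a small sketch. It assumes only the two V4 Pro input prices from the table; the function name is made up for this example.

```python
def effective_input_price(miss_price: float, hit_price: float, hit_rate: float) -> float:
    """Blended USD price per 1M input tokens at a given cache hit rate (0.0 to 1.0)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) * miss_price + hit_rate * hit_price


if __name__ == "__main__":
    # V4 Pro input prices from the table: $2.00 cache miss, $0.50 cache hit.
    for rate in (0.0, 0.15, 0.60, 0.90):
        price = effective_input_price(2.00, 0.50, rate)
        print(f"hit rate {rate:.0%}: ${price:.2f} per 1M input tokens")
```

The relationship is linear, so every 10 points of hit rate is worth $0.15 per million input tokens at these prices.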

Three projects, 150M tokens

I’ve been running V4 Pro in production across three client projects. One is a code generation pipeline for a fintech. Another is a document analysis system for legal tech. The third is an internal agentic workflow for logistics. Combined: roughly 150 million tokens.

Here’s what the real cost picture looks like:

# Real-world daily cost: fintech code gen pipeline
# Volume: 5M input tokens/day (60% cache hit rate), 1.5M output tokens/day

Cache miss:  2,000,000 tokens x $2.00/1M   = $4.00
Cache hit:   3,000,000 tokens x $0.50/1M   = $1.50
Output:      1,500,000 tokens x $8.00/1M   = $12.00
─────────────────────────────────────────────────
Total:                                       $17.50/day

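At the list prices in the table above, and assuming the competing model gets the same 60% cache hit rate, the same workload would cost about $28.25/day on GPT-5.4, $29.40/day on Sonnet 4.6, and $56.50/day on GPT-5.5. That puts V4 Pro roughly 38%, 40%, and 69% cheaper, respectively. If the competing model gets no cache hits at all, those bills rise to $35.00, $37.50, and $70.00/day, and the savings widen to about 50%, 53%, and 75%.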
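Cross-provider comparisons like this depend heavily on what cache hit rate you assume for the other model. The sketch below reruns the daily bill from the table’s list prices so you can vary that assumption. The `Pricing` class and `daily_cost` function are illustrative names, and applying the same hit rate to every provider is an assumption of this example, not something any pricing page guarantees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1M tokens, copied from the comparison table."""
    input_miss: float
    input_hit: float
    output: float


PRICES = {
    "DeepSeek V4 Pro": Pricing(2.00, 0.50, 8.00),
    "GPT-5.4": Pricing(2.50, 0.25, 15.00),
    "Claude Sonnet 4.6": Pricing(3.00, 0.30, 15.00),
    "GPT-5.5": Pricing(5.00, 0.50, 30.00),
}


def daily_cost(p: Pricing, input_tokens: int, output_tokens: int, hit_rate: float) -> float:
    """Daily USD cost for a workload with the given cache hit rate on input."""
    hit_tokens = input_tokens * hit_rate
    miss_tokens = input_tokens - hit_tokens
    return (miss_tokens * p.input_miss + hit_tokens * p.input_hit
            + output_tokens * p.output) / 1_000_000


if __name__ == "__main__":
    # Fintech workload from the breakdown above: 5M input, 1.5M output, 60% hits.
    baseline = daily_cost(PRICES["DeepSeek V4 Pro"], 5_000_000, 1_500_000, 0.60)
    for name, pricing in PRICES.items():
        cost = daily_cost(pricing, 5_000_000, 1_500_000, 0.60)
        print(f"{name}: ${cost:.2f}/day, V4 Pro saves {1 - baseline / cost:.0%}")
```

Passing `hit_rate=0.0` for a competing model gives the uncached worst case for that provider.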

The legal document system tells a different story. It sees only about a 15% cache hit rate because every document is unique. The price advantage narrows but doesn’t disappear. V4 Pro still comes in about 35% cheaper than GPT-5.4 and 60% cheaper than GPT-5.5.

But here’s what the pricing page doesn’t tell you: quality consistency at volume. We’ve seen V4 Pro degrade differently than other models when you push concurrent request rates above a certain threshold. Latency spikes are more common than the docs suggest.

“CONTEXT LENGTH — 1M | MAX OUTPUT — MAXIMUM: 384K”

— DeepSeek API Pricing Docs

That 1M context window and 384K max output are staggering numbers compared to most competitors (typically 128K context, ~4K-32K output). It’s a genuine competitive differentiator, but only if your workload can use it.

Also worth noting: the caching mechanism is less transparent than it should be. There’s no API to check your effective cache hit rate. You have to infer it from your bill. For a pricing strategy that emphasizes cache savings, that’s a noticeable gap.
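One way to avoid waiting for the bill is to log per-request usage yourself and aggregate it. The sketch below assumes each response carries counters for cached and uncached prompt tokens. DeepSeek’s context caching documentation has described usage fields named `prompt_cache_hit_tokens` and `prompt_cache_miss_tokens`, but whether V4 Pro responses include them is an assumption here, so check a real response before relying on it.

```python
from typing import Iterable, Mapping

# Assumed field names; verify against the usage object your responses return.
HIT_FIELD = "prompt_cache_hit_tokens"
MISS_FIELD = "prompt_cache_miss_tokens"


def cache_hit_rate(usages: Iterable[Mapping[str, int]]) -> float:
    """Token-weighted cache hit rate across logged usage records."""
    hits = 0
    misses = 0
    for usage in usages:
        hits += usage.get(HIT_FIELD, 0)
        misses += usage.get(MISS_FIELD, 0)
    total = hits + misses
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Made-up records standing in for usage data logged from real responses.
    logged = [
        {HIT_FIELD: 1_800, MISS_FIELD: 200},
        {HIT_FIELD: 0, MISS_FIELD: 2_000},
        {HIT_FIELD: 1_200, MISS_FIELD: 800},
    ]
    print(f"effective cache hit rate: {cache_hit_rate(logged):.0%}")
```

Weighting by tokens rather than by request matters because one long cached prefix can outweigh many short uncached calls on the invoice.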

Figure: DeepSeek V4 Pro cache tier pricing, $2.00 cache miss vs $0.50 cache hit per 1M input tokens.

“A pricing page is a promise. The invoice is the truth. DeepSeek’s promise is good — but you have to measure the truth yourself.”

— Dusty, field notes from production

Even accounting for extra engineering overhead to stabilize high-throughput workloads, V4 Pro is saving our clients an average of 40–50% on inference costs. When you’re processing tens of millions of tokens a day, that’s not Monopoly money.

Yes, with conditions

I’m saying yes to this pricing move, with conditions.

Permanent pricing removes the uncertainty that makes enterprise procurement nervous. When a CTO signs off on a model integration, they want to know what the cost picture looks like 12 months out. Temporary discounts poison that conversation. Permanent pricing, even at the discounted level, is a cleaner signal.

The conditions are the operational gaps:

No cache analytics dashboard
Uneven latency under load
Thin documentation on rate limits and retry behavior
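With retry behavior thinly documented, a defensive client-side wrapper is one way to hedge against the last two gaps. The following is a sketch under stated assumptions, not DeepSeek’s recommended pattern: `TransientError` is a placeholder for whatever timeout or rate-limit exception your client raises, and each provider is modeled as a plain callable that takes a prompt and returns text.

```python
import random
import time
from typing import Callable, Optional, Sequence


class TransientError(Exception):
    """Placeholder for whatever timeout or rate-limit error your client raises."""


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, retrying transient failures with backoff."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(max_attempts):
            try:
                return provider(prompt)
            except TransientError as exc:
                last_error = exc
                if attempt < max_attempts - 1:
                    # Exponential backoff (0.5s, 1s, ...) plus up to 100 ms of jitter.
                    sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("all providers failed") from last_error


if __name__ == "__main__":
    def flaky_primary(prompt: str) -> str:
        raise TransientError("simulated latency spike")

    def steady_fallback(prompt: str) -> str:
        return f"fallback answer to: {prompt}"

    # sleep is stubbed out so the demo finishes instantly.
    print(call_with_fallback([flaky_primary, steady_fallback], "ping", sleep=lambda s: None))
```

Only errors you classify as transient are retried; anything else propagates immediately, which keeps bad requests from burning retries and tokens.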

“Concurrency Limit — V4 Flash: 2500 | V4 Pro: 500”

— DeepSeek API Pricing Docs

Those are high limits. 500 concurrent requests on V4 Pro is generous, but without visibility into your actual cache behavior, you’re flying partially blind.
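Staying under a concurrency ceiling is something you can enforce client-side. Here is a minimal sketch using `asyncio.Semaphore`; `call_model` stands in for your actual async API call, and the cap of 400 (20% headroom below the documented 500) is an arbitrary choice for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List

# Stay under the documented V4 Pro limit of 500; the headroom is an arbitrary choice.
MAX_IN_FLIGHT = 400


async def run_bounded(
    prompts: Iterable[str],
    call_model: Callable[[str], Awaitable[str]],
    limit: int = MAX_IN_FLIGHT,
) -> List[str]:
    """Run call_model over prompts with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(prompt: str) -> str:
        async with semaphore:
            return await call_model(prompt)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in prompts))


if __name__ == "__main__":
    async def fake_model(prompt: str) -> str:
        await asyncio.sleep(0.01)  # stands in for network latency
        return prompt.upper()

    results = asyncio.run(run_bounded([f"doc {i}" for i in range(1_000)], fake_model))
    print(len(results), results[0])
```

A client-side cap like this also gives you one place to record queue depth and latency, which partly compensates for the missing provider-side visibility.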

I also want to see what happens when GPU supply tightens. DeepSeek’s cost advantage is partly architectural (MoE efficiency on V4 is genuinely impressive) and partly geopolitical. If global GPU supply gets squeezed, that advantage may compress.

“Permanent doesn’t mean forever. It means ‘for the foreseeable future, this is the price of admission.’ Take it, but keep your exit strategy warm.”

— Dusty

Where DeepSeek goes from here

If I were running the DeepSeek API product, I’d have done three things differently.

First: ship cache analytics on day one. The caching mechanism is one of V4 Pro’s strongest competitive advantages. It’s also invisible to the customer. Give every API user a dashboard showing cache hit rate by endpoint, by prompt prefix, by time window. That turns a passive pricing advantage into an active value lever, which increases switching costs without increasing prices.

Second: publish a reliability SLA for the discount model tier. The unspoken anxiety in every “cheaper AI API” conversation is reliability. Cheap doesn’t matter if the endpoint is flaky. DeepSeek should have paired the permanent pricing with a 99.5% uptime SLA on V4 Pro, backed by service credits.

Third: launch a “bring your own context” tier. A tier where you prepay to reserve cache capacity for your prompt prefixes, at below-market cache rates, would be a killer product for high-volume enterprise workloads. It’s what AWS did with Reserved Instances, applied to inference.

These three moves would turn a pricing announcement into a platform strategy.

Bottom Line

DeepSeek V4 Pro’s permanent discount is real and significant. Expect roughly 40–50% savings over GPT-5.4 and Sonnet 4.6, and up to 70% off GPT-5.5, on comparable workloads. Kimi K2.6 undercuts V4 Pro on raw price but carries ecosystem and governance risk. The cache analytics gap, latency spikes under concurrent load, and thin operational documentation mean you need strong infrastructure chops to capture the full value. Buy the price, but budget engineering overhead for monitoring, retry logic, and fallback strategies. This is a cost advantage for teams that measure, not for teams that hope.

The market just got a margin call

DeepSeek V4 Pro’s permanent discount is the closest thing the AI API market has seen to an honest price. The margin is thin, the infrastructure is engineered, and the cost advantage is real, as long as you bring your own ops maturity. Temporary discounts expire. Permanent ones reveal strategy. This one reveals that DeepSeek is playing the long game on inference economics.

The rest of the market just got a margin call.

Sources

DeepSeek API Pricing — https://api-docs.deepseek.com/quick_start/pricing
DeepSeek V4 Pro Model Documentation — https://api-docs.deepseek.com/models/deepseek-v4-pro
OpenAI API Pricing — https://openai.com/api/pricing/
Anthropic API Pricing — https://docs.anthropic.com/en/docs/about-claude/pricing
Kimi K2.6 API Pricing — https://platform.kimi.com/docs/pricing/chat
