Reasoning Is a Commodity Now

Five weeks. That is how long it took for "reasoning" to go from OpenAI's flagship moat to a feature you can run on the server of your choice for an order of magnitude less money.

On January 20, DeepSeek released R1. MIT license, full weights, six distilled variants, frontier-tier performance, an order of magnitude or more cheaper inference than o1 depending on how you measure. By the time NVIDIA was selling off on January 27, Ben Thompson had the cleanest summary in print: "instead of paying OpenAI to get reasoning, you can run R1 on the server of your choice at dramatically lower cost." Two days earlier, swyx had named the pattern: reasoning was less of a moat than anyone thought, and o3-mini had to be free in response.

Yesterday, Anthropic shipped Claude 3.7 Sonnet with reasoning baked into the regular model as a developer-controlled dial instead of a separate SKU. The framing in their announcement is unusually direct: "just as humans use a single brain for both quick responses and deep reflection, we believe reasoning should be an integrated capability of frontier models rather than a separate model entirely." That is the frontier lab telling you out loud that the "reasoning model" product shape they were charging a premium for two months ago is over.

The architectural question for product engineering is not "which model." It hasn't been "which model" for a while, as I argued last March. The architectural question is now: where in your stack do you absorb the regeneration cost.

What the R1 release actually changed

The capability story is the headline. The cost story is the one that matters for product.

Nathan Lambert read R1's release the morning after as "the technical recipe for reasoning models went public." Once the recipe is in the open, the closed labs can't price reasoning as if it were a moat. Lambert was direct: "reasoning research and progress is now locked in, expect huge amounts of progress in 2025 and more of it in the open."

Simon Willison was running a quantized R1 distill on his laptop within 24 hours, with visible <think> tags exposing the chain of thought, and the 70B variant running by the next day. The architectural fact that a frontier-class reasoning model now sits on a developer's hard drive is what makes the "where in the stack" question even legible. It wasn't legible six weeks ago. The model layer was a hosted endpoint and that was that.

By mid-February, several inference providers including Cerebras, Groq, Together, and Fireworks were hosting R1 distills at sub-cent-per-thousand-token prices. The product engineering implication is that reasoning is now an ingredient you can buy by the megatoken from at least four serious vendors, run yourself on commodity GPUs, or fall through to the frontier labs for. Three layers, three cost profiles, same capability.

The regeneration cost moves architecture

Cheap reasoning sounds like good news. It is. It is also a load-bearing change to how your stack absorbs cost.

A reasoning call costs more output tokens than a one-shot completion. Often two to thirty times more. Those tokens are mostly chain-of-thought that you throw away. The model produces 8,000 tokens of reasoning and 200 tokens of answer. You bill the customer for, or eat, all 8,200.

When reasoning was scarce and expensive, you did this rarely and grudgingly. You used a non-reasoning model for most calls and reached for o1 only when the task was hard enough to justify the latency and price. Architecture absorbed the cost at the prompt-engineering layer: which model do I route this query to?

When reasoning is cheap, the temptation is to reach for it constantly. The architecture has to absorb the cost somewhere else. Three places, in order of how much I see teams getting them right:

At the routing layer. Build a capability interface where a "reason about X" call has explicit budget. Some calls get 200 tokens of thinking; some get 8,000. The routing decision is your product engineering, not the model's. If your code is reaching for a reasoning model without telling it a thinking budget, you have already lost. Anthropic's new Claude 3.7 thinking parameter takes a budget_tokens value for exactly this reason.

At the QA layer. This is the one I worried about in September and it has gotten more important, not less. If the spec is the load-bearing artifact and the reasoning model regenerates the implementation, you need regression coverage at the spec level. Non-determinism scales with reasoning depth. The 5,000 tokens of chain-of-thought your agent produced today are not the 5,000 tokens it produces tomorrow on the same input. Your tests have to be invariant to the chain, sensitive to the outcome.

At the inference location. Hosted reasoning is fast to ship and expensive at scale. Self-hosted reasoning is cheap at scale and slow to ship. Most teams will run hosted in 2025 because the latency and ops costs of self-hosting an R1 70B distill are real. But the math flips around six-digit monthly inference spend, and "where do you run inference" is now a quarterly architecture review, not a one-time decision. If you don't have a slot on your roadmap to evaluate self-hosted variants by Q3, you are paying a premium for nothing.

What I'd do this quarter

This is the architectural moment of 2025 for product engineering teams shipping AI features. Most of the work is upstream of any specific provider.

Audit your reasoning calls. Every call that uses a reasoning model should have an explicit thinking budget set in code. If the answer to "how much was the model allowed to think" is "as much as it wanted," that is a config bug that will show up as a cost spike eventually.

Get one open-weight model evaluated against your eval suite. Not in production. Just on the eval suite. Pick an R1 distill or Llama variant, run it through Together or Fireworks, and produce a comparison number. The act of getting that comparison number in your repo is the move. Whether you swap is a different conversation.

Make the spec the artifact, not the prompt. If the prompt encodes your product behavior and the model changes underneath, you regress in ways your evals can't see because the evals were written against the old model's quirks. Specs at the level of "this capability does this, given this, returns this" survive the swap. Prompts don't.

Renegotiate your inference contract. If you signed an annual hosting deal at 2024 prices, those prices are now generous to your vendor. The cost curve through 2025 is downward on a 3-to-4 month cycle. Don't lock in.

Where I land

The teams that built model-agnostic shapes last year are calmly running this quarter. They are pulling pricing comparisons, looking at distill variants on their evals, and getting the savings on their cloud bills next month. The teams that hardcoded a provider are explaining to leadership why their reasoning costs haven't dropped even though the leaderboard keeps moving.

Reasoning is a commodity now. Architecture absorbs that change at three places: routing, QA, and inference location. The capability interface and the eval suite are still the load-bearing layers, the way I argued last March. Reasoning being cheap doesn't change what's stable. It moves more weight onto it.

The work for the rest of the year is figuring out where to absorb the regeneration cost without pushing it into the user's experience. That is what stable product engineering looks like through the next leaderboard inversion. And the leaderboard is going to invert.