Stop Picking Models. Pick What's Stable Around Them.
The last three weeks rearranged the AI stack faster than the previous twelve months.
On March 4, Anthropic shipped Claude 3. Opus beat GPT-4 on most public benchmarks. Sonnet undercut GPT-4 on price. Eight days later Cognition Labs dropped a demo video of Devin, which they framed as "the first AI software engineer," claiming a 13.86% solve rate on SWE-bench against single-digit unassisted baselines. Yesterday, the LMSYS Chatbot Arena rankings updated. As Simon Willison put it, the king is dead: Claude 3 Opus overtook GPT-4 on blind human preference for the first time since GPT-4 shipped.
If you spent the last year building your product on top of "the GPT-4 default," you are now spending this week in a conversation about whether to migrate. If you spent the last year building on top of an abstraction layer with per-provider evals, you are spending this week sending a one-line config change to your CI and watching the metrics.
The lesson of March 2024 is not that Claude 3 won. It is that picking a model is the wrong unit of architectural decision. The model is the most volatile part of your AI stack. Hardcoding it no longer counts as "we made a call." It is a bet that costs you whenever the leaderboard shifts, which is going to be roughly every six to eight weeks for the foreseeable future.
The Vendor Lock-In Tax Just Got Real
Twelve months ago, "we standardized on GPT-4" was a defensible architectural choice. There was no real alternative. The model was the only one in its class. The cost of building abstractions over an interface with one credible implementation was high and the benefit was theoretical.
That world is gone. Sonnet's pricing alone (more than three times cheaper than GPT-4 Turbo on input tokens, an order of magnitude cheaper than the original GPT-4, similar quality on most tasks) means the cost of not swapping in the right places is real money, not future regret. Add Mistral, Cohere, Gemini, the open-weight Llama family, and the math is now: every quarter, there is a model that beats your current default on some axis that matters for your product.
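To put a number on "real money," here is a back-of-the-envelope sketch. The input-token prices are the public list prices as I understand them at the time of writing (treat them as an assumption and check the current pricing pages), and the 500M tokens per month is a made-up workload, not anyone's real traffic.

```python
# Input-token list prices in USD per million tokens.
# Assumed values as of March 2024; verify against current pricing pages.
INPUT_PRICE_PER_MTOK = {
    "claude-3-sonnet": 3.00,
    "gpt-4-turbo": 10.00,
    "gpt-4": 30.00,
}


def monthly_input_cost(model: str, tokens_per_month: int) -> float:
    """Cost of input tokens only; output tokens are priced separately."""
    return INPUT_PRICE_PER_MTOK[model] * tokens_per_month / 1_000_000


def price_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first model costs per input token."""
    return INPUT_PRICE_PER_MTOK[expensive] / INPUT_PRICE_PER_MTOK[cheap]


if __name__ == "__main__":
    workload = 500_000_000  # hypothetical monthly input volume
    for model in INPUT_PRICE_PER_MTOK:
        print(f"{model}: ${monthly_input_cost(model, workload):,.2f}/month")
    print(f"GPT-4 Turbo vs Sonnet: {price_ratio('gpt-4-turbo', 'claude-3-sonnet'):.2f}x")
    print(f"GPT-4 vs Sonnet: {price_ratio('gpt-4', 'claude-3-sonnet'):.2f}x")
```

The ratio only matters on the capabilities where the cheaper model clears your quality bar, which is why the comparison has to be made per capability rather than per product.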
The teams I see scrambling this week are the ones who treated model selection as a one-time decision. The prompts are tuned to GPT-4's quirks. The output parsers expect GPT-4's formatting habits. The evals (if they have any) are scored against GPT-4. The infrastructure assumes the OpenAI SDK. Migrating is not just changing an import statement. It is reproducing every behavior that was implicit in the old model on the new one, and verifying it on the use cases that actually matter for users.
This is the exact shape of vendor lock-in that the industry spent twenty years learning to avoid for databases and cloud providers. We did it to ourselves again with models, because the field moved too fast for our instincts to catch up. The good news is the lesson is the same one, and you already know it.
The Two Layers That Have to Be Stable
What survives the next eighteen months is a stack with two stable layers and one volatile one.
The capability interface (stable). This is the API your product code calls. Not openai.chat.completions.create. Something closer to summarizer.summarize(document) or classifier.label(text, taxonomy). The product layer asks for a capability. It does not know which provider is on the other end. Critically, the capability interface is defined by what your product needs, not by what any one provider offers. Tool calling, structured output, streaming, retries, fallbacks, all hidden behind it.
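A minimal sketch of what that can look like in Python, assuming a summarization capability. Every name here (Summary, Summarizer, TruncatingSummarizer, render_digest) is hypothetical, and the truncating implementation is a stand-in so the example runs without any provider SDK.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Summary:
    """Typed output of the capability; product code never sees raw completions."""
    text: str
    source_chars: int


class Summarizer(Protocol):
    """The capability contract. Nothing here names a provider or a model."""

    def summarize(self, document: str) -> Summary: ...


class TruncatingSummarizer:
    """Stand-in implementation. A real one would wrap a provider client
    and hide retries, fallbacks, and output parsing behind this method."""

    def __init__(self, max_chars: int = 80) -> None:
        self.max_chars = max_chars

    def summarize(self, document: str) -> Summary:
        text = document.strip()[: self.max_chars]
        return Summary(text=text, source_chars=len(document))


def render_digest(summarizer: Summarizer, documents: list[str]) -> list[str]:
    """Product code: depends on the capability, not on who fulfils it."""
    return [summarizer.summarize(doc).text for doc in documents]
```

Because render_digest takes a Summarizer rather than a client object, swapping the implementation does not touch product code.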
The eval suite (stable). This is the thing that tells you whether a swap is safe. For every capability in your interface, you have a labeled set of inputs and either a graded rubric, a judge model, or a programmatic check that lets you score any candidate model. The eval suite is the load-bearing artifact. It is the thing you protect across model migrations. I wrote last year that the gap between teams shipping AI features and teams shipping AI products was the eval gap. That has gotten more true this quarter, not less.
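Here is a rough sketch of the smallest useful eval harness, using a programmatic check (exact label match) for a classification capability. The cases, the label names, and keyword_baseline are invented for illustration; a real suite would use your own labeled examples, and a rubric or judge model where exact match is too strict.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    input_text: str
    expected_label: str


# A candidate is anything that maps input text to a label:
# a provider adapter, a fine-tuned model, or a keyword baseline.
Candidate = Callable[[str], str]


def exact_match(predicted: str, expected: str) -> bool:
    """Programmatic check: case- and whitespace-insensitive equality."""
    return predicted.strip().lower() == expected.strip().lower()


def run_eval(candidate: Candidate, cases: list[EvalCase]) -> float:
    """Return the fraction of cases the candidate gets right."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(
        exact_match(candidate(case.input_text), case.expected_label)
        for case in cases
    )
    return passed / len(cases)


# Hypothetical labeled examples for a support-ticket classifier.
CASES = [
    EvalCase("Please refund my order", "billing"),
    EvalCase("The app crashes on login", "bug"),
    EvalCase("How do I export my data?", "how_to"),
    EvalCase("I was charged twice", "billing"),
]


def keyword_baseline(text: str) -> str:
    """Stand-in candidate so the harness runs without any model."""
    lowered = text.lower()
    if "refund" in lowered or "charged" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "how_to"
```

The point of the shape is that run_eval does not care what the candidate is, so the same cases score every model you will ever consider.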
The model (volatile). Behind the interface. Swappable per capability, per environment, per request if you want. Each provider has its own client adapter that takes the capability call and maps it to the provider's API. When a new model lands, you write the adapter, run the eval suite, and ship the config flip if the numbers say yes.
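As a sketch of the adapter idea, assume two providers whose models format answers differently: one replies with a bare label, the other with something like "Label: billing". Both providers are hypothetical, and CompletionFn stands in for whatever SDK call you actually use; the point is only that each adapter absorbs its provider's quirks and the choice between them is a config string.

```python
from typing import Callable, Protocol

# Stand-in for a provider SDK call: prompt in, raw completion text out.
CompletionFn = Callable[[str], str]


class Classifier(Protocol):
    def label(self, text: str, taxonomy: list[str]) -> str: ...


def _validated(answer: str, taxonomy: list[str]) -> str:
    """Normalize the raw answer and reject anything outside the taxonomy.
    Assumes taxonomy entries are already lowercase."""
    cleaned = answer.strip().lower()
    if cleaned not in taxonomy:
        raise ValueError(f"model returned {answer!r}, not in taxonomy")
    return cleaned


class ProviderAClassifier:
    """Hypothetical provider whose model replies with a bare label."""

    def __init__(self, complete: CompletionFn) -> None:
        self._complete = complete

    def label(self, text: str, taxonomy: list[str]) -> str:
        raw = self._complete(f"Choose one of {taxonomy} for: {text}")
        return _validated(raw, taxonomy)


class ProviderBClassifier:
    """Hypothetical provider whose model replies like 'Label: billing'."""

    def __init__(self, complete: CompletionFn) -> None:
        self._complete = complete

    def label(self, text: str, taxonomy: list[str]) -> str:
        raw = self._complete(f"Text: {text}\nOptions: {', '.join(taxonomy)}")
        # Keep only what follows the first colon, if there is one.
        return _validated(raw.split(":", 1)[-1], taxonomy)


ADAPTERS: dict[str, Callable[[CompletionFn], Classifier]] = {
    "provider_a": ProviderAClassifier,
    "provider_b": ProviderBClassifier,
}


def build_classifier(provider: str, complete: CompletionFn) -> Classifier:
    """The config flip: the provider name comes from config, not from code."""
    return ADAPTERS[provider](complete)
```

Raising on an out-of-taxonomy answer is a design choice for the sketch; a production adapter might retry or fall back to another provider instead.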
The shape is not new. It is what we have always done when we had a fast-moving component we did not control. Storage engines, payment processors, identity providers. We abstracted the volatile thing and stabilized the contract. The same instinct applies to models. The reason it feels harder is that LLM behavior is fuzzy, so the contract has to include the eval suite, not just the function signature.
Devin Tells You the Same Thing About the Layer Above
Devin is getting a lot of attention this week as a model-tier story (look how good agents are getting). I think the more interesting read is architectural. The orchestration layer (the agent loop, the tool selection, the planning) is its own moving target.
Devin claims to be an end-to-end autonomous engineer. The demo is impressive. Gergely Orosz called it a "heavily work-in-progress prototype" within a week of launch, and his read is the one I trust: the headline framing is marketing, the underlying agentic work is real but early. The 13.86% SWE-bench number is strong in context and unproductized in practice. You should not bet your product on Devin as a vendor today.
But what Devin tells you about your architecture is the same thing Claude 3 tells you. The orchestration layer (the harness around the model that does planning, tool calling, evaluation, retries) is going to look very different six months from now. If you build your agentic system on top of one specific framework or one specific provider's "agent" abstraction, you are setting up a second migration after the model migration.
The same two-layer principle applies. There is a capability interface (what your product asks an agent to do) and an eval suite (how you know an agent is producing the right outputs). The agent framework, the planner, the tool dispatcher, all of that is the volatile component. It should sit behind your interface, not be your interface.
What This Looks Like in Practice
The smallest version of this that works:
- Pick three to five capabilities your product actually uses an LLM for. Not "the AI feature." The discrete things. Summarize a thread. Extract structured data from an email. Generate a draft response. Classify a document.
- For each one, write a function signature. Strongly typed inputs and outputs. No dict[str, Any]. No "we'll figure out the schema later." The signature is the contract.
- For each capability, write a small eval set. Twenty to fifty real examples with expected outputs (or a rubric). Run it against your current production model and record the baseline.
- Implement the capability against your default provider. Then write a second adapter against a second provider. Run the eval suite against both. Diff the scores.
- Ship the routing logic. Default to whichever model wins your eval suite. When a new model lands, write the adapter, run the suite, decide.
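The steps above can be sketched end to end for one capability, extracting a meeting request from an email. Everything here is illustrative: the MeetingRequest type, the three eval cases, and both "models," which are plain functions standing in for provider adapters. The min_gain threshold in pick_default is my own assumption about how to avoid flipping the default on noise, not a recommended value.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MeetingRequest:
    """Typed output: the signature is the contract."""
    requester: str
    day: str


@dataclass(frozen=True)
class EvalCase:
    email: str
    expected: MeetingRequest


Extractor = Callable[[str], MeetingRequest]

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

CASES = [
    EvalCase("From: Ana. Can we meet Tuesday?", MeetingRequest("Ana", "Tuesday")),
    EvalCase("From: Raj. Are you free on Friday?", MeetingRequest("Raj", "Friday")),
    EvalCase("From: Mei. Let's sync monday morning.", MeetingRequest("Mei", "Monday")),
]


def _requester(email: str) -> str:
    return email.split("From: ", 1)[1].split(".", 1)[0]


def current_model(email: str) -> MeetingRequest:
    """Stand-in for the production adapter: misses lowercase day names."""
    day = next((d for d in DAYS if d in email), "")
    return MeetingRequest(_requester(email), day)


def candidate_model(email: str) -> MeetingRequest:
    """Stand-in for the second adapter: matches days case-insensitively."""
    lowered = email.lower()
    day = next((d for d in DAYS if d.lower() in lowered), "")
    return MeetingRequest(_requester(email), day)


def score(extractor: Extractor, cases: list[EvalCase]) -> float:
    """Fraction of cases where the extracted record equals the expected one."""
    return sum(extractor(c.email) == c.expected for c in cases) / len(cases)


def failures(extractor: Extractor, cases: list[EvalCase]) -> list[str]:
    """The per-case diff: which inputs this extractor gets wrong."""
    return [c.email for c in cases if extractor(c.email) != c.expected]


def pick_default(scores: dict[str, float], current: str, min_gain: float = 0.02) -> str:
    """Keep the current model unless a candidate beats it by at least min_gain."""
    best = max(scores, key=lambda name: scores[name])
    if best != current and scores[best] - scores[current] >= min_gain:
        return best
    return current


if __name__ == "__main__":
    scores = {
        "current": score(current_model, CASES),
        "candidate": score(candidate_model, CASES),
    }
    print(scores)
    print("current model fails on:", failures(current_model, CASES))
    print("default:", pick_default(scores, current="current"))
```

With real adapters in place of the stand-ins, the same score and pick_default calls are what you rerun when a new model lands.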
This is not novel. It is what any team that has done this for six months has converged on. The point is doing it before you have to, not after.
The Surface Argument Is the Same Argument
I argued last month that putting AI behind a chat box is product engineering laziness. The reason is the same as this one: the AI is the volatile thing. If the surface is a chat box and the model changes underneath, your users feel the model change. They notice the new refusals, the new tone, the new failure modes. If the surface is a button that summarizes a document, the user feels the button, not the model. The model can change underneath and the user does not notice, as long as your evals say the summary is still good. Hidden capabilities are easier to swap. That's one more reason to bury the LLM behind familiar UI.
Where I Land
If you are leading product engineering on an AI feature right now, the only architectural choice that matters this quarter is whether your eval suite and your capability interface are stronger than your provider commitment. If yes, the next six leaderboard inversions are config flips. If no, every one of them is a migration.
Stop picking models. Pick the things around the model. Then let the model move.