Commit Your Specs (We're Almost There)

Commit Your Specs (We're Almost There)

Two weeks ago, OpenAI shipped o1-preview. I have been sitting with what it changes, and I think the change is bigger than the discourse is giving it credit for, in a direction the discourse is not yet looking.

The framing in the launch post is straightforward: o1 was trained with reinforcement learning to think before answering, with performance that "consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)." A new axis of scaling. OpenAI researcher Noam Brown summarized the implication on X the day of launch: "We're no longer bottlenecked by pretraining."

The day-of-launch reaction from the people I read tracked the model's strangeness more than its capability. Simon Willison's first impression is the line that stuck with me:

o1 prompting is alien to me. Its thinking, gloriously effective at times, is also dreamlike and unamenable to advice. Just say what you want and pray.

Ethan Mollick, who'd been previewing the model for weeks, described the workflow shift more bluntly:

I no longer feel as connected to the AI output, or that I am playing as large a role in shaping where the solution is going.

Those two quotes, from two people I trust on practical AI use, are pointing at the same thing. If you put planning and verification inside the model, the human's role shifts away from the loop the human used to live in. You stop nudging tokens. You write the problem, and the model figures out the path.

That is a different relationship to the work. And it raises a question I think is going to be the central product engineering question of 2025: if the model is doing the reasoning, what is the human supposed to be writing?

The honest answer, the one I keep arriving at and keep being uncomfortable with, is: the spec.

The Spec As The Load-Bearing Artifact

For most of the history of software engineering, the source code is the load-bearing artifact. The spec, when it exists at all, is a sketch in a doc or a ticket. The code is the truth. The spec is a description of the truth that goes stale within a sprint.

That order was correct because the code was the thing that took the longest to produce and was the hardest to change. The spec was cheap to throw away. The code was expensive to write and so we treated it as the asset.

If you take Brown and Mollick and Willison seriously, that order starts to invert. If a reasoning model can take a clear specification and produce a working implementation, the code becomes cheap. The expensive artifact, the one that has to be right, the one that has to be versioned and reviewed, is the spec.

I do not think we are there yet. I want to be clear about that. o1 cannot take a real product spec today and produce a real production implementation. Not at the quality bar that ships to users. Not at the scale of a real codebase. The latency is wrong. The cost is wrong. The reliability is wrong. We are watching the first version of something, not the productized form of it.

But the shape is suddenly legible in a way it was not three months ago. And the shape suggests that the artifact engineering teams should be getting better at producing, right now, in advance of the model that will actually consume it, is a written specification of what they want the system to do.

Commit your specs. Treat them like code. Version them. Review them. Run lint against them. Make them the input to the build, not a description of what was built.

The Hesitation

I am hesitant about this for a specific reason, and the reason is QA.

In the current world, the source code is what determines behavior. If the same code runs twice, you get the same behavior. The mapping from artifact to behavior is deterministic. CI runs the same tests against the same code and gets the same answer.

In the inverted world, the spec is the artifact and the model is the compiler. The mapping from spec to behavior is non-deterministic. Even at temperature zero with a fixed seed, the determinism you get is determinism within a single model version. The model itself will get a new version. The implementation will get regenerated. The product will change.

This is a fundamentally harder QA problem than anything we have built infrastructure for. The thing you need is not "do the tests pass on this implementation." The thing you need is "does the regenerated implementation satisfy the spec, behave the same as the prior implementation on every case we have a baseline for, and not regress any user-facing behavior." That is a regression test suite at the level of the spec, not the code. And we don't have those.

Hamel Husain made the evals-are-the-load-bearing-piece argument back in March, in the context of LLM features, and his framing has only gotten more correct since. His line:

Many people focus exclusively on [prompt engineering], which prevents them from improving their LLM products beyond a demo.

In this inverted world, the evals problem expands from "is the LLM feature behaving right" to "is the entire system that the LLM regenerated behaving right." Everything I have argued about evals in previous posts compounds when the model is producing the system, not just a string.

If this is going to work, it will work because the QA infrastructure around it is airtight. The non-determinism of regeneration has to be absorbed somewhere, and the place it gets absorbed is the regression suite that holds the spec to its prior behavior. That suite is not going to write itself. The teams that build it now are the teams that will be shipping when the model is ready. The teams that don't will be writing migration plans.

What Committing Specs Looks Like Right Now

I am not going to claim there is a settled practice for this. There isn't. But here are the practices I am seeing the more thoughtful teams converge on, in the order I think they matter:

Write specs as code, not docs. Markdown in the repo, next to the system it describes. Reviewed in pull requests. Versioned. The spec is not an artifact of the planning meeting. It is an artifact of the codebase.

Make the spec executable, even when it cannot generate. Today, your spec cannot produce the implementation. But it can produce the test suite. It can produce the type signatures. It can produce the acceptance criteria. Write the spec at a level of precision where those derivative artifacts can be generated from it, even if the implementation cannot be yet. This is the bridge state.

Build the regression rig now. Capture the behavior of your current implementation. Snapshot tests, golden-file tests, expected-output checks. Lots of them. If you ever want to regenerate the implementation, this is the only thing that will tell you whether the regeneration was safe. Do this before you need it, because you will need it.

Treat the spec as the source of truth in conflicts. When the code does something the spec does not describe, that is a bug in one of them. Decide which. Update the other. The discipline of keeping them in sync is what makes the spec real instead of decorative.

The teams I see building agentic pipelines are already doing pieces of this, because the pipeline pattern (planner → implementer → test agent → CI agent) is hungrier for explicit specs than human engineers are. The agents need to know what they are building. The clearer your spec, the better the pipeline output. The bridge is short.

Where I Actually Land

Call the pattern what you want. Spec-driven development, if we settle on a name. Regeneration. Whatever. It is what comes after the agentic pipeline. o1 is the first signal that the model layer is ready for it. I think we are eight to eighteen months away from a real version of this being viable at the quality bar a production product requires, and five years away from it being the default workflow on a healthy team.

The path runs entirely through QA infrastructure. Not through better models. The models are going to keep getting better whether the industry asks them to or not. The thing the industry has to build is the verification layer that makes regeneration safe. Without that layer, this pattern is a fantasy. With it, it is the next paradigm.

If you are leading engineering right now, the move is not to migrate to anything this quarter. It is to start writing real specs, building real regression coverage, and being honest about the gap between the two.

We are almost there. We are not there. Commit your specs anyway. They are the thing the next phase needs, and the only thing that's still in your control.