From Copilot to Pipeline

Six weeks ago, Claude 3.5 Sonnet shipped and the Cursor users I know stopped sounding tentative about AI-assisted coding. The trust threshold flipped. Simon Willison called it likely the single best available model on the day it dropped, and the engineers on my team who had been skeptical of Cursor in March are now defaulting to it in July. On Anthropic's own internal "agentic coding" evaluation, Claude 3.5 Sonnet solved 64% of tasks versus 38% for Claude 3 Opus. A real jump, not a marketing one.

The conversation in every engineering org I touch is the same: copilots work now. Cursor is great for throwaway scripts and one-off refactors. GitHub Copilot is a productivity floor we no longer argue about. If you need the code to be right and you want to verify the code yourself, having an LLM next to you in the editor is a meaningful speedup.

The question I want to ask is what comes after this. Because the copilot is not the destination. It is a stepping stone. And from where I am sitting, you can already see the next step.

What a Copilot Actually Is

A copilot is a model that suggests, and a human that accepts. The engineer is still the author of the code. The model is a faster typist with better recall of standard patterns. The engineer reviews each suggestion as it appears, accepts or rejects, edits, and moves on. The trust contract is per-suggestion. You see the diff before it lands. You are the bottleneck on quality.

That contract is good for a specific kind of work. It is good for code where you need to fully trust the result and the cost of a bug is high. It is good for throwaway code where the cost of a bug is low. It is good for unfamiliar libraries where the model can save you a documentation lookup. It is not good for anything where the bottleneck is your willingness to read suggestions one at a time.

The thing copilots are bad at is volume. If a copilot makes me 30% faster at writing code, but reviewing the suggestions is now my primary activity, the ceiling is set by my reading speed. The model can always produce more suggestions than I can evaluate, and I cannot read faster than I read. The copilot pattern scales only as far as your attention budget, and that budget is fixed.

The Pipeline Shape

The next shape is not a copilot. It is a pipeline of agents that produce code, and a human who reviews the output.

Across the teams I've talked to, I keep seeing the same shape emerge:

A planning agent that reads a ticket or a spec and produces a plan: which files to change, in what order, with what API contracts.
An implementation agent that takes the plan and writes the code, one file at a time, with the relevant context loaded for each file.
A test agent that writes tests against the change, runs them, and feeds failures back to the implementation agent.
A CI agent that watches the build, the linter, the type checker, and routes failures to the right earlier agent for a fix.
A human who reviews the resulting pull request.

This is not science fiction. Princeton open-sourced SWE-agent in April, which runs almost exactly this loop against real GitHub issues. Microsoft's AutoGen gives you the orchestration primitives for multi-agent setups. And it is what GitHub previewed when they announced Copilot Workspace in April: a "Copilot-native developer environment" where the surface is planning and specification, not typing code. It is what Andrew Ng argued for in March when he laid out the four agentic design patterns: Reflection, Tool Use, Planning, and Multi-Agent collaboration. He framed agentic workflows as the biggest near-term lever in AI, possibly bigger than the next foundation model. Four months on, that prediction is aging well.

The engineer's job in this shape is not to author code. It is to write the spec, set the boundaries, and review the pull request. The agents handle everything in between, including the boring parts that copilots leave to you, like running the test suite, debugging a flaky CI run, or reformatting an import.

Why the Pipeline Beats the Copilot on Throughput

Copilot throughput is bounded by your reading speed because every suggestion requires synchronous human review.

Pipeline throughput is bounded by your reviewing speed, which is a different number. Reviewing a finished pull request is fundamentally different from reviewing a stream of incremental suggestions. You see the whole change. You see the tests. You see the CI status. The unit of human attention is the pull request, not the line.

A reasonably skilled engineer can review around five small pull requests in a focused hour, give detailed feedback on two of them, and approve the rest. That is roughly five tasks' worth of throughput per engineer per hour. The copilot shape, with you as the live reviewer, is not going to hit that, because too much of your time is spent on micro-decisions instead of macro-decisions.

I wrote in June about engineers becoming agent managers, and the pipeline shape is what that looks like in practice. The rhythm is closer to managing direct reports than writing code: define the work, set context, review the output, provide feedback, iterate. One engineer should be able to keep three to five agents in flight at once, the same way a tech lead keeps three to five reports productive.

The Devin Question

The maximally autonomous version of this is what Cognition was trying to demo with Devin back in March: skip the harness, just hand a task to an agent and let it ship. The reception in the engineering community has been skeptical, and rightly. Gergely Orosz called it a heavily work-in-progress prototype within a week. The 13.86% SWE-Bench number Cognition reported is impressive in context and not productized in practice.

But the skepticism about Devin is mostly skepticism about one implementation, not the shape of the bet. The shape is right: agents do the work, humans review. The Devin pitch is wrong about the human's role (it pitches "you delegate, the agent ships, you move on"), and that is what is breaking on first contact with anyone who has shipped production code. You cannot fully trust an agent's output without a review step. Not in 2024, possibly not for a long time, possibly never for high-stakes systems.

The right shape is not "skip the human." The right shape is "keep the human at the pull request, take the human out of every other step." The agents do the slogging. The engineer does the judgment.

What This Changes for Product Engineering

The pipeline shape changes the work in a few specific ways that matter if you are running an engineering org.

The unit of an engineer's day shifts from authoring to reviewing. This sounds like a small thing. It is not. The skills that make a good author (knowing the syntax, knowing the library, typing fast) are not the skills that make a good reviewer (understanding the change in context, spotting subtle bugs, judging whether the approach is right). Some of your strongest authors are going to be mediocre reviewers, and vice versa. You will see this in your data within a quarter once teams adopt the shape.

The eval suite stops being optional. I argued in March that evals were the load-bearing artifact of any AI product. They are doubly load-bearing here. The pipeline produces a lot more code than a copilot did. You cannot scale human review to match the volume. You need automated quality gates between the agent steps. That is what evals are. If your CI is the only quality gate, you are going to push the bottleneck to CI, and CI will fall over.

Junior engineers are going to be reviewing pipeline output too. This is the part I am most uncertain about. Reviewing AI-generated code is a senior skill: you have to spot the plausible-looking-but-wrong patterns, the test that tests nothing, the abstraction that does not match the codebase. A junior engineer cannot reliably catch these yet. The move I would start with on my own teams is to pair juniors on AI-generated PR reviews with a senior reviewer, the same way we used to pair them on writing code. Reviewing is the skill they need to develop, and they will not develop it by approving PRs on their own.

The "AI feature" surface for end users is the same shape as the pipeline. The same argument I made in February about chat applies here. The reason chat is the wrong default for end-user AI features is that it asks the user to be the author and the reviewer at once. The pipeline shape (the model does the work, the user reviews the output) is also the shape end users want. They want to see the finished artifact, decide if it is right, accept or revise. They do not want to type prompts.

Where We Actually Are

The pipeline is not fully here in July 2024. Copilot Workspace is in technical preview and not yet production-grade. Devin is a demo. The internal pipelines my peers are building are duct tape with strong opinions.

But the trajectory is clear, and the trajectory matters more than the snapshot. Within twelve months, I expect the default workflow for non-trivial features on a healthy team to be: engineer writes a spec, pipeline produces a pull request, engineer reviews. The teams that get there first are going to ship more, and not by the 30% copilots gave them. Two to three times more, on feature throughput, on the surfaces where the eval suite is strong enough to absorb the volume.

If you are leading engineering right now, the question is not whether to adopt copilots. That question is over. The question is whether you are building the eval suites, the review culture, and the pipeline scaffolding that will let you graduate past copilots when the models are ready, which will be sooner than you think.