80% of Our Code Is AI-Generated. Here's What We Learned.

Across my teams at Dropbox, 80% of new code committed in the last quarter was generated by AI tools. That number sounds shocking until you understand what it actually means. When Google CEO Sundar Pichai told investors in October 2024 that over 25% of new code at Google was AI-generated, people were surprised, even as GitHub's Octoverse 2024 report showed 73% of developers using AI tools. Six months later, 25% feels quaint. It doesn't mean engineers are idle. It doesn't mean quality dropped. It means the nature of engineering work shifted, and the shift happened faster than anyone predicted.

This post is a frank accounting of what that looks like in practice. Not the hype version, not the doom version. Just: here's what we observed.

What "80% AI-generated" actually means

The metric comes from our commit analysis tooling, which tags code based on whether it originated from an AI tool (Cursor, Claude Code, Copilot) before being committed. When I say 80%, I mean: the first draft of that code was produced by a machine. A human then reviewed it, refined it, and decided to ship it.
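To make the mechanics concrete, here's a rough sketch of how a share like that can be computed. It assumes a hypothetical AI-Origin commit trailer written by the editor or agent integration and a reasonably recent git; it's an illustration of the shape of the measurement, not our actual tooling.

```python
# Illustrative sketch: count added lines per commit, attributed by a hypothetical
# "AI-Origin" trailer (e.g. "AI-Origin: claude-code") that tooling appends to
# commit messages. Commits without the trailer count as human-written.
import subprocess
from collections import Counter

def ai_share(since: str = "3 months ago") -> float:
    log = subprocess.run(
        ["git", "log", f"--since={since}", "--numstat",
         "--pretty=format:commit\t%H\t%(trailers:key=AI-Origin,valueonly)"],
        capture_output=True, text=True, check=True,
    ).stdout

    added_lines = Counter()
    origin = "human"
    for row in log.splitlines():
        if row.startswith("commit\t"):
            _, _, trailer = row.split("\t", 2)
            origin = trailer.strip() or "human"
        elif row.strip():
            added, _deleted, _path = row.split("\t", 2)
            if added.isdigit():  # binary files report "-"
                added_lines[origin] += int(added)

    ai = sum(n for o, n in added_lines.items() if o != "human")
    return ai / max(sum(added_lines.values()), 1)

if __name__ == "__main__":
    print(f"AI-originated share of added lines: {ai_share():.0%}")
```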

The human contribution didn't shrink. It shifted. Engineers spend less time typing and more time directing, reviewing, and architecting. The analogy I keep using internally: it's like going from hand-drafting blueprints to using CAD software. The architect's judgment matters just as much. The mechanical execution got automated.

Here's the breakdown by activity that I see across my org:

  • Writing new functions/methods from scratch: 90%+ AI-generated first drafts
  • Refactoring existing code: ~70% AI-assisted
  • Architecture decisions and system design: Still 100% human
  • Code review: Still 100% human (though AI summarization helps triage)
  • Debugging production issues: ~50% AI-assisted for hypothesis generation, still human-driven for root cause

The workflow stack

We've settled on a three-tool setup that covers different granularities of work.

Copilot handles autocomplete. It's the background hum. Engineers accept maybe 40% of its suggestions, reject the rest. Low friction, moderate value. Nobody thinks about it much anymore, like spell-check.

Cursor is the workhorse for inline work. Chat-driven refactoring, explaining unfamiliar code, generating implementations from comments. Engineers live in Cursor for their day-to-day. The "select code, ask a question" flow has replaced Googling and Stack Overflow almost entirely.

Claude Code handles the bigger tasks. Implementing a feature from a design doc. Writing a full test suite for a module. Debugging a complex issue across multiple files. I wrote about this shift in Scaling Agentic Engineering Without Breaking Your Team. The collective wisdom in Eugene Yan et al.'s What We've Learned From A Year of Building with LLMs applies here too: the model isn't the product, the system around it is. The key insight: Claude Code works best when you give it clear boundaries and let it operate autonomously within them, then review the output.

The combination matters. No single tool covers the full spectrum. Engineers who try to use only one tool hit walls constantly.

What improved

Time-to-first-draft collapsed. A feature that used to take a day to get a working skeleton now takes an hour or two. The skeleton might need refinement, but having something runnable to iterate on changes the entire feedback loop.

Boilerplate is essentially free. API clients, data models, serialization code, CRUD operations. All of it just appears. This was the first category where AI generation became uncontroversial, back in early 2024. Now nobody even thinks about writing a JSON decoder by hand.
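To be concrete about the category, here's the sort of thing I mean. This is an illustrative snippet, not code from our repo: a small data model plus serialization, exactly the kind of code nobody types by hand anymore.

```python
# Illustrative boilerplate: a data model with JSON round-tripping.
from dataclasses import dataclass, asdict
import json

@dataclass
class ShareLink:
    path: str
    owner: str
    read_only: bool = True

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "ShareLink":
        data = json.loads(raw)
        return cls(path=data["path"], owner=data["owner"],
                   read_only=data.get("read_only", True))
```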

Test generation is surprisingly good. Give Claude Code a module with clear interfaces and ask for unit tests, and you'll get 80-90% of what a human would write. The tests are often more thorough than what engineers write under time pressure, because the AI doesn't get bored writing the seventh edge case.
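An illustrative example of the pattern (not our code): a small, well-specified function and the kind of boundary coverage the generated tests tend to include.

```python
def paginate(items: list, page: int, per_page: int) -> list:
    """Return the slice of items for a 1-indexed page."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    start = (page - 1) * per_page
    return items[start:start + per_page]

# AI-generated-style tests: thorough on the boundaries a rushed human skips.
import pytest

def test_first_page():
    assert paginate([1, 2, 3, 4, 5], page=1, per_page=2) == [1, 2]

def test_last_partial_page():
    assert paginate([1, 2, 3, 4, 5], page=3, per_page=2) == [5]

def test_page_past_end_is_empty():
    assert paginate([1, 2, 3], page=10, per_page=2) == []

def test_empty_input():
    assert paginate([], page=1, per_page=10) == []

@pytest.mark.parametrize("page,per_page", [(0, 5), (1, 0), (-1, 3)])
def test_invalid_arguments_raise(page, per_page):
    with pytest.raises(ValueError):
        paginate([1, 2, 3], page=page, per_page=per_page)
```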

Documentation generation is excellent. READMEs, API docs, inline documentation for complex functions. AI writes clearer documentation than most engineers because it's patient enough to explain context that humans skip (since they already have it in their heads).

What didn't improve

Architecture decisions are still human. AI can implement any architecture you describe. It cannot tell you which architecture to choose. When I ask Claude to design a system, it gives me a reasonable answer, but "reasonable" isn't the same as "right for this team, this scale, this set of constraints." Architecture requires organizational context that doesn't fit in a prompt. This deserves its own treatment. I keep thinking about what this means for the definition of "senior engineer" specifically.

Cross-team coordination is still human. The hardest part of building software at scale was never writing the code. It was getting three teams to agree on an API contract, navigating conflicting priorities, deciding what to build next. AI doesn't touch any of this.

The last 20% is still manual. Getting code from "works in the happy path" to "handles every edge case, performs well under load, fails gracefully, and is obvious to the next person who reads it" still requires human judgment. AI gets you to 80% fast. The remaining 20% takes the same time it always did.

Understanding product requirements is still human. AI can turn a spec into code. It cannot tell you whether the spec is right. The gap between "what the PM wrote" and "what users actually need" requires human intuition, domain knowledge, and the willingness to push back.

Surprising findings

A few things we didn't expect:

AI-generated code is more verbose. Not dramatically so, but consistently. It creates more abstractions, more helper functions, more intermediate variables. It over-comments. A function a human writes in 15 lines becomes 25 lines from Claude. Each individual decision is defensible, but the aggregate effect is code that's slightly harder to read because there's more of it.
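A contrived illustration of the tendency (neither version comes from our codebase): the same behavior, written once the way a model tends to draft it and once the way a person under review pressure tends to write it.

```python
# AI-style draft: extra helper, intermediate variables, comments on the obvious.
def _normalize_tag(tag: str) -> str:
    # Strip surrounding whitespace and lowercase for comparison
    stripped_tag = tag.strip()
    normalized_tag = stripped_tag.lower()
    return normalized_tag

def dedupe_tags_verbose(tags: list[str]) -> list[str]:
    # Keep the first occurrence of each tag, ignoring case and whitespace
    seen: set[str] = set()
    result: list[str] = []
    for tag in tags:
        normalized = _normalize_tag(tag)
        if normalized not in seen:
            seen.add(normalized)
            result.append(tag)
    return result

# The tighter version a human typically writes.
def dedupe_tags(tags: list[str]) -> list[str]:
    seen = set()
    out = []
    for tag in tags:
        key = tag.strip().lower()
        if key not in seen:
            seen.add(key)
            out.append(tag)
    return out
```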

The review question shifted. Code review used to ask "does this work?" and "does this have bugs?" Now it more often asks "is this the simplest version that works?" We added a review prompt to our guidelines: "Could this be 30% shorter without losing clarity?" That question catches a lot of AI bloat.

Seniors became more valuable, not less. When juniors can generate code at the same velocity as seniors, the differentiator becomes judgment. Knowing what not to build. Recognizing when the AI's approach is technically correct but architecturally wrong. Spotting the subtle bug in a 200-line generated function that passes all tests but will fail under concurrency. Senior engineers are busier than ever, because review load increased.
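For the concurrency case, here's a contrived example of the kind of bug that sails through every single-threaded test:

```python
# Hypothetical example: passes all unit tests, breaks under concurrent callers.
import threading

_cache: dict[str, object] = {}

def get_or_create(key: str, factory):
    # Check-then-act race: two threads can both miss, both run factory(),
    # and one silently overwrites the other (or duplicates side effects).
    if key not in _cache:
        _cache[key] = factory()
    return _cache[key]

# The fix a reviewer pushes for: hold a lock across the check and the write.
_lock = threading.Lock()

def get_or_create_safe(key: str, factory):
    with _lock:
        if key not in _cache:
            _cache[key] = factory()
        return _cache[key]
```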

AI writes fantastic tests for the wrong things. It will generate exhaustive unit tests for a pure function. But it doesn't know which pure functions matter. It will skip the critical integration point that's the actual source of production bugs. Test strategy still requires a human who understands where the system breaks.

The testing story in detail

This deserves its own section because testing is where AI generation is both most impressive and most misleading.

Unit tests: A+. Give AI a function signature and it will generate comprehensive test cases, including edge cases you'd forget. Our unit test coverage increased 34% since we started using AI for test generation. These tests catch real bugs.

Integration tests: C+. AI writes integration tests that test the integrations it can see. But real integration failures happen at boundaries the AI doesn't know about: race conditions between services, eventual consistency gaps, timeout behaviors under load. AI-generated integration tests give you confidence that the happy path works. They don't catch the failures that wake you up at 3am.
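A sketch of the gap, with hypothetical names throughout: the generated test covers the integration it can see; the human-added test covers the failure mode that pages you.

```python
from unittest import mock

class BillingClient:  # stand-in for a real service client
    def get_quota(self, user_id: str, timeout: float = 2.0) -> int: ...

def remaining_quota(client: BillingClient, user_id: str) -> int:
    try:
        return client.get_quota(user_id, timeout=2.0)
    except TimeoutError:
        return 0  # degrade to "no quota" instead of erroring out

# What AI tends to generate: the visible integration, happy path only.
def test_remaining_quota_happy_path():
    client = mock.Mock(spec=BillingClient)
    client.get_quota.return_value = 42
    assert remaining_quota(client, "u1") == 42

# What a human adds after the 3am page: the boundary behavior.
def test_remaining_quota_degrades_on_upstream_timeout():
    client = mock.Mock(spec=BillingClient)
    client.get_quota.side_effect = TimeoutError
    assert remaining_quota(client, "u1") == 0
```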

End-to-end tests: Human-designed, AI-implemented. We found the best workflow is having a senior engineer outline the critical user journeys, then letting AI generate the test implementations. The design requires product knowledge. The implementation is mechanical.
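A sketch of that division of labor, assuming a Playwright-style E2E setup; the journey, URLs, and selectors are all hypothetical.

```python
# Journey (human-written, product knowledge): "New user shares a file"
#   1. sign up  2. upload a file  3. share with a teammate  4. confirmation shows
# Implementation (AI-generated, mechanical):
from playwright.sync_api import sync_playwright

def test_new_user_shares_a_file():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://staging.example.com/signup")  # hypothetical env
        page.fill("#email", "new-user@example.com")
        page.fill("#password", "correct horse battery staple")
        page.click("text=Create account")
        page.set_input_files("input[type=file]", "fixtures/report.pdf")
        page.click("text=Share")
        page.fill("#share-with", "teammate@example.com")
        page.click("text=Send")
        assert page.is_visible("text=Shared with 1 person")
        browser.close()
```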

Cost analysis

This isn't free, and the numbers are bigger than most leaders expect.

The seat licenses are straightforward: Copilot and Cursor have published per-seat pricing. The variable cost is API usage for tools like Claude Code, which scales with how aggressively teams use agentic workflows. Teams doing heavy AI-assisted development can easily spend several hundred dollars per engineer per month on tooling. Teams with power users running complex multi-file agents push well past that.
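For a rough sense of the math, here's a back-of-envelope sketch. The seat prices and usage figures below are assumptions for illustration, not our contract numbers.

```python
# Back-of-envelope, illustrative numbers only.
copilot_seat = 19       # USD / engineer / month (assumed Business-tier pricing)
cursor_seat = 40        # USD / engineer / month (assumed Business-tier pricing)

# The variable part: assume a heavy user runs ~150 agentic sessions a month
# at roughly $3 of API spend each.
sessions_per_month = 150
cost_per_session = 3.0

monthly_per_engineer = copilot_seat + cursor_seat + sessions_per_month * cost_per_session
print(f"~${monthly_per_engineer:,.0f} per engineer per month")  # ~$509
```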

Is it worth it? The velocity numbers say yes. Engineers are shipping 40-60% more PRs per sprint. Features that used to take two weeks now take one. But there's a hidden cost.

Review time increased per-PR. Each PR takes 15-20% longer to review: the code is more verbose, the reviewer needs to verify AI output more carefully than human-written code (trust is still building), and the volume of PRs is higher, so reviewers have less context on any single change.

Net-net: engineers ship more, but the review bottleneck tightened. We're still working out the equilibrium.

What we're changing

Based on six months of operating this way, we're investing in four areas:

AI-powered review tools. If AI generates the code, AI should do the first pass of review. We're using Claude to summarize PRs, flag potential issues, and check for common AI-generation patterns (unnecessary abstractions, verbose implementations). Human reviewers then focus on architecture and correctness rather than style.
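A minimal sketch of what that first pass can look like, assuming the Anthropic Python SDK; the model name and prompt are illustrative, and this isn't our internal tooling.

```python
import subprocess
import anthropic

def first_pass_review(base: str = "origin/main") -> str:
    diff = subprocess.run(["git", "diff", base, "--", "."],
                          capture_output=True, text=True, check=True).stdout
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",   # illustrative model alias
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Review this diff. Summarize it in three bullets, flag likely bugs, "
                "and point out AI-generation smells: unnecessary abstractions, "
                "verbose implementations, anything that could be ~30% shorter.\n\n"
                + diff
            ),
        }],
    )
    return msg.content[0].text

if __name__ == "__main__":
    print(first_pass_review())
```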

Shared prompt libraries. Every team was developing their own prompts for common tasks. Now we maintain a shared repository of tested prompts for our most common patterns: API endpoint generation, test writing, migration scripts, documentation. This standardizes quality and reduces the variance in AI output across teams.
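One lightweight way to structure such a library: a plain module in a shared repo, imported by whatever wrapper scripts teams use. The templates here are condensed placeholders, not our real entries.

```python
# shared_prompts.py -- hypothetical module name
PROMPTS = {
    "api_endpoint": (
        "Add a {method} endpoint at {path}. Follow the existing router structure, "
        "validate inputs with the schema library the module already uses, and "
        "return errors in our standard envelope."
    ),
    "unit_tests": (
        "Write pytest tests for {module}. Cover boundary values, empty inputs, "
        "and error paths. Do not test private helpers directly."
    ),
    "migration": (
        "Write a backwards-compatible migration for {table}: add new columns as "
        "nullable first, backfill in batches, then tighten constraints."
    ),
}

def render(name: str, **kwargs: str) -> str:
    return PROMPTS[name].format(**kwargs)

# render("unit_tests", module="sharing/links.py")
```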

"AI architect" pairing sessions. Weekly sessions where senior engineers pair with juniors specifically on AI tool usage. Not "how to use Cursor" (everyone figured that out). Rather: "how to decompose this problem so the AI can solve it well" and "how to recognize when the AI is going off track." Prompting is a skill. Knowing when to accept vs. reject is a skill. We're treating them as skills worth teaching explicitly.

Better boundaries for autonomous agents. As I wrote in Code in the LLM Era, the key to scaling agentic tools is giving them clear guardrails. We're defining per-repository rules about what AI agents can modify autonomously (test files, documentation, boilerplate) vs. what requires human-in-the-loop (API contracts, database schemas, auth flows).
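A sketch of how a per-repository policy check might look, with hypothetical globs; in practice something like this would run in CI against the agent's proposed diff.

```python
from fnmatch import fnmatch

AUTONOMOUS_OK = ["tests/*", "docs/*", "*_test.py", "*README.md"]
HUMAN_IN_THE_LOOP = ["api/contracts/*", "migrations/*", "auth/*", "*schema.sql"]

def requires_human(path: str) -> bool:
    # Anything explicitly sensitive, or not explicitly whitelisted, goes to a human.
    if any(fnmatch(path, pattern) for pattern in HUMAN_IN_THE_LOOP):
        return True
    return not any(fnmatch(path, pattern) for pattern in AUTONOMOUS_OK)

changed = ["tests/sync/test_resume.py", "auth/session.py"]
print([p for p in changed if requires_human(p)])  # ['auth/session.py']
```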

The mental model shift

The biggest adjustment isn't technical. It's psychological. Engineers who grew up writing every line of code themselves have to accept that the job changed. You're no longer primarily a code writer. You're a code director. You set the vision, you review the output, you catch the mistakes, you make the judgment calls.

Some engineers love this. They always wanted to work at a higher level of abstraction. Some hate it. They became engineers because they love writing code, and now a machine does that part.

I don't think there's a right answer about whether this is good. It's happening. The engineers who thrive are the ones who treat AI as a power tool that amplifies their judgment, not a replacement for it. The ones who struggle are either over-trusting (shipping AI output without sufficient review) or under-trusting (rewriting everything the AI produces from scratch).

The sweet spot is calibrated trust. Know what AI does well. Know where it fails. Adjust your review intensity accordingly. That calibration takes months to develop, and it's different for every domain and every tool.

We're seven months into this era and still learning. The 80% number will probably climb. Hamel Husain's argument in Your AI Product Needs Evals keeps ringing true: the root cause of unsuccessful AI integration is failing to build robust evaluation. That applies to AI-generated code just as much as AI-generated products. The question isn't whether AI writes the code. It's whether humans still understand the systems they're building. Right now, the answer is yes. Keeping it that way requires intentional investment in architecture skills, system thinking, and the judgment that no model can replicate yet.