AI development has moved from simple code completion to building full applications. The shift is accelerating and changing the role of engineers.

There's a version of this essay I could have written a year ago. It would have been cautiously optimistic — full of caveats about hallucinations, about needing to verify everything twice, about AI as a productivity multiplier if you were careful enough to treat it like a very fast intern who hadn't slept in three days.
I'm not writing that essay.
Something shifted. Not all at once — it rarely does — but the pace of that shift is the thing most people aren't talking about. The scope of what AI can handle has expanded dramatically over the last couple of years. But it's not expanding at a steady rate. It's accelerating. And if you're still calibrating your expectations based on where things were twelve months ago, you're already behind.
The first wave most engineers actually felt was code completion — the autocomplete that could finish your thought before you finished typing it.
This was genuinely useful and genuinely limited. AI was operating at the level of a single expression, a single function, a line you'd half-written. The model had no idea what you were building or why. It was pattern-matching against a vast corpus of code and making educated guesses about what came next.
The right mental model was autocomplete on steroids. Faster than typing. Sometimes surprisingly good. Occasionally confidently wrong in ways that would waste twenty minutes if you weren't paying attention. You still drove everything. The AI filled in the gaps.
What it changed: typing speed, boilerplate friction, the tedium of writing things you already knew how to write. What it didn't change: anything about how you thought about the problem.
The next shift was less about completion and more about delegation. AI got good enough — and context windows got large enough — that you could hand it a discrete, well-scoped task and get something usable back.
Write me a function that does X. Generate this schema change. Add error handling to this block. These are tasks with a clear input, a clear output, and a limited blast radius if they go wrong.
The workflow became: think, delegate, verify, integrate. There's a whole category of work — call it implementation grunt work — that you could now hand off. Not blindly. You still had to know what good looked like. But the actual doing of it became optional.
The cognitive shift here is subtle but important: you started developing a sense for what to delegate. That judgment — knowing which tasks were well-scoped enough to hand off — turned out to be a real skill, and one worth developing early.
The third stage was whole modules, and a module isn't a task. It's a collection of related responsibilities with a defined interface and behaviors that need to work together coherently. Getting AI to produce good module-level code required something new: the ability to hold design intent, not just implementation instructions.
What made this work was getting better at spec. If you could describe what something needed to do — its inputs, outputs, invariants, edge cases, the shape of its interface — AI could produce a solid first draft. Not perfect. But close enough that you were reviewing and refining rather than writing from scratch.
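What a workable spec looks like is concrete enough to show. Below is a hypothetical example — the rate-limiter domain, the names, and the numbers are all invented for illustration — of a spec precise enough that an AI draft has something real to hit: typed inputs and outputs, stated invariants, named edge cases, plus one plausible first-draft implementation of the kind a model might return.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RateLimitDecision:
    allowed: bool
    retry_after_seconds: float  # 0.0 whenever allowed is True

LIMIT = 10     # allowed calls per key per window (illustrative numbers)
WINDOW = 60.0  # window length in seconds

_calls: dict[str, list[float]] = {}  # key -> timestamps of recent calls

def check_rate_limit(key: str, now: float) -> RateLimitDecision:
    """At most LIMIT calls per key per WINDOW seconds (sliding window).

    Invariants: retry_after_seconds == 0.0 if and only if allowed.
    Edge cases: an unknown key starts with a fresh window; the caller
    guarantees `now` never moves backwards.
    """
    recent = [t for t in _calls.get(key, []) if now - t < WINDOW]
    if len(recent) < LIMIT:
        recent.append(now)
        _calls[key] = recent
        return RateLimitDecision(True, 0.0)
    _calls[key] = recent
    # the oldest call still in the window determines when capacity frees up
    return RateLimitDecision(False, WINDOW - (now - recent[0]))
```

The spec is the types and the docstring; the body is just one draft against it. The point is that every sentence of the spec is checkable, which turns review into verification rather than archaeology.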
This changed the nature of design work. A lot of what used to happen entirely in your head now happened in dialogue. You'd sketch the interface, AI would implement it, you'd look at the implementation and realize your interface was wrong, you'd revise. Bad designs became visible faster because you had working code almost immediately, not days later.
For engineers who could already design well, this was a significant multiplier. For engineers who couldn't, it was a false friend — the code looked right even when the design underneath was wrong.
By Q4 2024 I was using tools like Cursor and Augment Code to build real features in commercial products — features that would have taken significantly longer to write by hand. The code wasn't always as polished as a hand-crafted solution, but it worked, it had test coverage, and it didn't regress between releases. That last part mattered more than people give it credit for. AI that writes tests alongside the implementation changes the risk calculus of delegation entirely.
But Stage 3 introduced a problem that nobody was talking about at the time, and that has only grown more acute since: nobody else could see what was happening.
When AI started completing meaningful chunks of work autonomously, the traditional visibility layer broke down. Developers had always communicated progress through commits, PRs, standups, and tickets — a paper trail that project managers and clients could follow. Agentic work doesn't naturally produce that trail. An agent can spend three hours working through a module, making dozens of decisions, writing and rewriting code, and surface with a finished result that gives almost no indication of what happened, what was considered, or why things were done the way they were.
For a solo developer, that's manageable. For a team — and especially for a client who needs to understand what they're paying for — it's a serious problem. The work accelerates. The visibility collapses. The old systems of tracking progress just don't fit the new shape of the work.
That visibility gap is exactly why we built CodeBake. Instead of trying to force fast-moving agentic workflows to sync up with traditional issue trackers, we realized the old model needed to be replaced entirely with a system designed for this new reality. The core idea is straightforward: agentic work should continue to move fast, but agents should be encouraged to document what they're doing in real time — not for the agent's benefit, but for everyone else's. Project managers get a live view of what's being built and why. Clients can see progress without needing a developer to translate it. Decisions that would otherwise be invisible — why this approach over that one, what tradeoffs were made, what was deferred — become part of the record.
The capability to build fast is only half of what professional software development requires. The other half is accountability — being able to show your work, explain your decisions, and keep non-technical stakeholders genuinely informed. Agentic tools gave us the first half. CodeBake is the second.
The jump from small, well-defined modules to large, complex ones isn't just a matter of scale. Large modules have internal complexity — subsystems that interact, state that evolves, edge cases that only emerge from the interaction of multiple components. They require sustained coherence across a lot of code.
Not long ago, this was still mostly out of reach. Context windows were too small, models lost the thread, and output quality degraded badly as scope expanded. You'd get something that looked plausible in isolation but didn't fit your actual system.
What changed was the combination of larger context windows, better reasoning, and agentic tools that could actually read your codebase. Suddenly the AI knew about your conventions, your patterns, your architecture decisions. It could produce code that didn't just work in the abstract but worked here, in this system, with these constraints.
This is also when the agentic loop became real. Run the tests. See what broke. Fix it. Repeat. Not you running the loop — the AI running it, with you watching and steering. The experience shifted from "write code for me" to "build this with me."
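Stripped to its skeleton, the loop is simple to state. This is an illustrative sketch, not any particular tool's implementation; `propose_fix` stands in for the model call that reads failures and edits code on disk.

```python
def agentic_loop(run_tests, propose_fix, max_iterations=5):
    """Run tests, feed failures to the model, repeat until green.

    run_tests()         -> (passed: bool, output: str)
    propose_fix(output) -> edits the code based on the failure output
    max_iterations      -> illustrative cap; real tools use smarter
                           stopping rules before escalating to a human
    """
    for _ in range(max_iterations):
        passed, output = run_tests()
        if passed:
            return True       # green: nothing left to fix
        propose_fix(output)   # the model's turn to drive
    return False              # cap hit; surface the problem instead
```

What made this stage real wasn't the loop — it's trivial — but who runs it. When the model executes the iterations and you only watch the trajectory, the unit of review shifts from a diff to a direction.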
For me, the personal marker was Q1 2025: I stopped writing code by hand for clients. Not because I couldn't, but because it no longer made sense. My time became worth more spent reviewing solutions, catching architectural drift, and using forty years of judgment to steer the implementation toward something maintainable and forward-thinking. The actual typing had become the least valuable thing I could be doing.
There's a concrete marker for when this stage matured: task lists became unnecessary. For a while, the only way to keep an agent on track across a large chunk of work was to hand it an explicit list — step one, step two, step three — and have it check items off as it went. Without that scaffolding, agents would drift, lose context, or confidently pursue the wrong interpretation of an ambiguous requirement. The task list was essentially a prosthetic for working memory. By Q4 2024, agents were increasingly autonomous but leaning harder on those lists to stay coherent over long development sessions — the capability was there, but the working memory wasn't.
Through Q2 and Q3 of 2025 that changed noticeably. Each new model release meant less babysitting. Agents started holding context across longer efforts without needing to be walked through every step. The check-ins I'd been doing out of habit started feeling unnecessary for most work — though for high-stakes or complex architecture decisions, a lightweight plan still helped orient things. The heavy scaffolding was gone; the occasional signpost remained.
This is where we are today. And it still catches me off guard sometimes.
It's now possible to describe an application — its purpose, its users, its core workflows, its data model, its constraints — and work with AI to build the whole thing. Not a prototype. Not a proof of concept. A real, working application with real features, real edge cases handled, real tests written.
To be precise about scope: I'm talking about greenfield products — SaaS tools, internal platforms, standard web architectures built from a clean slate. If you're maintaining a fifteen-year-old enterprise monolith entangled with legacy integrations and undocumented tribal knowledge, Stage 5 isn't fully there yet. The capability exists; the context problem doesn't. That's a meaningful distinction, and worth being honest about.
Q1 2026 is when this became undeniable for me. Fully autonomous agents went mainstream, and the latest generation of models started delivering surprisingly well-architected solutions that required meaningfully fewer low-level judgment calls — not because I stopped paying attention, but because the agents were getting the routine architecture right more often on their own. The high-level judgment moments didn't disappear, but the constant low-level steering became the exception rather than the rhythm of the work.
This doesn't mean you hand it a paragraph and walk away. It means the scope of what you can delegate has expanded to the point where your primary job is architecture, product thinking, and taste — deciding what to build and whether what's being built is right — rather than the line-by-line work of implementation.
The code isn't worse for it. In some ways it's better, because AI is tireless about things humans tend to shortcut — test coverage, consistent error handling, documentation. The parts that require genuine judgment are still yours. They probably always will be.
Stage 5 is real, but it isn't frictionless. The remaining bottlenecks are worth naming honestly, because they're where the next wave of progress will come from.
Humans are still reviewing every PR. Not because AI can't write the code, but because no one fully trusts it to know what matters. Is this change safe? Does it fit the architecture? Does it solve the right problem? Those are judgment calls that still require a human sign-off. The AI produces the work; a human still has to stand behind it.
Prioritization is still entirely human. Agents are good at how. They're not yet good at what or why. Give an agent a well-scoped task and it will execute well. Ask it to decide what to work on next — weighing technical debt against new features against user feedback against business risk — and you're back to doing that yourself. The product thinking layer hasn't moved.
Agents still get stuck on the wrong things. Give an agent a task with any ambiguity and it will confidently pursue one interpretation, sometimes deep into a rabbit hole, before surfacing. A senior engineer would pause early and ask a clarifying question. Agents often don't know what they don't know — and that's a costly failure mode when you only discover it three hours later.
The through-line is that the remaining human role is judgment, not oversight of execution. We've largely automated the doing. We haven't yet automated the deciding.

Lay that progression out and look at the timeline. Code completion to small tasks was maybe six months. Small tasks to modules took another six to nine. Complex features followed within the year. Full applications — something that would have sounded like science fiction two years ago — are happening now.
Each stage didn't take as long as the one before it. The capability curve isn't linear. It isn't even consistently steep. It's accelerating.
This matters more than any specific capability, because it means your intuitions about what AI can and can't do are probably already out of date. If you tried an agentic tool six months ago and found it underwhelming, it's worth trying again. If you've been waiting for AI to get good enough for some category of work, it may have already crossed that threshold while you weren't looking.
The engineers I know who are falling behind aren't the ones who tried AI and rejected it. They're the ones who found a level that worked — usually somewhere around Stage 2 or 3 — and stopped updating. They're still getting value. But they're treating a fast-moving target as if it had stopped moving.
The practical implication of this progression is that the kind of work you do has changed, not just how fast you do it.
At Stage 1, you were still mostly a coder. At Stage 5, you're something closer to an architect and editor — someone who makes decisions, sets direction, and evaluates output rather than producing every line. That's a different job. It requires different skills to do well.
The engineers getting the most out of this shift have made a specific mental adjustment: they've stopped thinking about AI as a tool they use and started thinking about it as a collaborator they work with. Tools you operate. Collaborators you engage — you share context, you push back, you develop a working relationship over the course of a project.
Taste, judgment, system thinking, product sense — the things that are hardest to teach — matter more now, not less. AI can implement. It cannot decide what's worth implementing, or whether what was implemented is actually good. That's still you.
If the pattern holds — and so far it has — the next stage arrives before the end of this year. My best guess at what it looks like: agents that know when to stop and ask.
Right now the failure mode is binary. Either you over-specify — task lists, step-by-step instructions, constant check-ins — or the agent drifts. What's missing is calibrated uncertainty: an agent that can distinguish between "I know how to handle this" and "this decision has consequences I shouldn't be making unilaterally."
That probably manifests as smarter interruption — agents that pause at genuine forks in the road rather than arbitrary checkpoints. Not "checking in every ten steps" but "I've hit an architectural decision with long-term implications and I'm surfacing it before proceeding." Combined with persistent project memory — agents that carry real context about your decisions, your constraints, and your past mistakes across sessions — this would shift the human role from reviewer of output to arbitrator of genuinely hard calls.
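To make "calibrated uncertainty" less abstract, here is a hypothetical sketch of the gating logic. The `Impact` categories, the confidence threshold, and the field names are all invented for illustration, not drawn from any shipping agent.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Impact(Enum):
    LOCAL = auto()          # reversible, contained in one module
    ARCHITECTURAL = auto()  # shapes interfaces or data models long-term

@dataclass(frozen=True)
class Decision:
    description: str
    impact: Impact
    confidence: float  # the agent's own estimate, in [0.0, 1.0]

def should_surface(decision: Decision, threshold: float = 0.8) -> bool:
    """Pause for the human only at genuine forks: decisions with
    long-lived consequences, or ones the agent itself is unsure about."""
    return (decision.impact is Impact.ARCHITECTURAL
            or decision.confidence < threshold)
```

The hard part isn't the gate; it's producing a `confidence` number that means anything. That calibration — knowing a routine call from a consequential one — is exactly the capability the current generation lacks.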
The stage after that is where it gets interesting: agents that can participate in prioritization. Not replace human judgment about what to build, but contribute to it — surfacing technical risks, flagging scope creep, flagging when a requested feature conflicts with an earlier architectural decision. That's the transition from collaborator to something closer to a junior team member with real situational awareness.
We're not there yet. But given the pace of the last two years, "not yet" has a shorter shelf life than it used to.
I've been building software for forty years. I've lived through a lot of claimed revolutions — each one delivered on something like its promise, even if the timeline and shape were different than advertised.
This one is different in a specific way: the rate of change is itself changing. That's what makes it hard to calibrate and easy to underestimate. Every time you think you've found the ceiling, it moves.
Two years ago, AI could finish your sentences. Today it can build your application. The current bottleneck isn't capability — it's judgment. But the next wave isn't coming to replace your judgment; it's coming to integrate with it more tightly, pushing you into the role of pure decision-maker while everything else gets handled. That's not a threat to the experienced engineer. It's the job finally becoming what it always should have been. I don't know where it lands in another two years, and I'm suspicious of anyone who claims to. But I know that the gap between engineers who are actively adapting to each new stage and engineers who aren't is widening — not slowly, but at the same accelerating pace as the technology itself.
The assistant is gone. The collaborator is here. And the next version of it is already closer than you think.