Insights
Jun 4, 2026

Your Agents Aren't Making More Work. Your Harness Is.

Many teams are maintaining layers of prompts, skills, and assumptions that models have already outgrown.

Posted by:
Ben Sharpe
Partner

We made a decision early on with CodeBake that looked, at the time, like we were being lazy. We didn't write clever prompts into the product. We didn't ship a library of skills telling the AI exactly how to read a board, phrase a status update, or decide what counts as done. People asked us about it. Where's the prompt engineering? Where's the special sauce?

There wasn't any, on purpose. We bet that the models were going to keep getting better, and that anything we hard-coded to compensate for their weaknesses would turn into dead weight the moment those weaknesses got fixed. So we kept the surface thin and let the model do the work.

That bet has paid off in a way I didn't fully expect. We haven't touched the core of how agents use CodeBake in months. But their usage keeps getting better, because the models behind them keep getting better. The same thin interface that was fine a year ago is now great, and we didn't lift a finger to make that happen. Less work for us over time, not more.

I'm telling you this because Dan Shipper just published a long, sharp essay called "After Automation" arguing the opposite: that automation creates more human work, not less, as a kind of law. I think he's describing something real and drawing the wrong conclusion from it. And I think our boring little decision about prompts is the tell for why.

What Shipper gets right, and where it goes sideways

His observation is honest: his company automated everything it could, and the calendar didn't empty out, it filled up. From there he builds a structural argument. Cheap AI floods the world with sameness, sameness creates demand for difference, and only a human "framer" can supply the difference, so there's always more expert work for people to do, right through AGI.

It's a good argument. But read his own evidence as an engineer instead of an economist. He's at inbox zero and still reviews 95% of the emails the AI answered. He gave every employee an agent, then pulled the program back because the agents went stale and needed constant tending. One PowerPoint automation runs on 24 skills and 18 scripts. He employs a team of AI engineers whose job is keeping agents from rotting.

That's not a new law of automation. That's a lot of scaffolding built for models that don't exist anymore, and a team kept busy maintaining it. The work is real. The cause isn't economics. It's stale systems.

Why the scaffolding goes bad

Eighteen months ago, getting good work out of a coding agent took real effort. You repeated instructions because the model lost the thread. You wrote long skill files because it didn't know your conventions. You added prompts to nudge it when it quit early. None of that was foolish. It was an honest response to the model you actually had.

The problem is that every one of those additions is a guess about what the model can't do, and the guess expires without telling you. Anthropic's own engineering team said it plainly in their write-up on harness design: every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions can quickly go stale as models improve. Their advice is to keep stress-testing those assumptions, because a lot of them turn out to be wrong or out of date.

Here's the part that bites you: nobody removes scaffolding. Adding a skill is easy and feels safe. Deleting one feels risky, so it never happens. Your stack slowly settles to the level of the weakest model you ever ran, instead of the best one you have now. And past a certain point the extra instructions don't just sit there harmlessly: they crowd the context, bury the instructions that matter, and make the model perform worse than it would have with less. You end up paying to make a good model act like an old one.

This is exactly the trap we avoided by not building the scaffolding in the first place. We had nothing to prune, so there was nothing to go stale.

Someone who measured it

You don't have to take my word for the direction this runs. Nick Nisi, an engineer at WorkOS, auto-generated about 10,000 lines of "skills" from his documentation to orient his agents. Reasonable thing to try. More context, better grounding.

It dropped his agents' accuracy on key tasks from 97% to 77%.

When he cut those skills by roughly 95%, down to 553 lines of targeted gotchas, accuracy went back up to 97%. The "best practice" was costing him twenty percentage points. The fix was the delete key. And the only reason he found it was that he wrote evals and actually measured whether the context helped.
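To make "wrote evals and actually measured" concrete, here is a minimal sketch of that kind of measurement. It is not Nisi's setup: `accuracy`, `compare_contexts`, and `toy_agent` are names made up for illustration, and `toy_agent` is a stub whose behavior is invented so the example runs on its own. It says nothing about how a real model behaves.

```python
from typing import Callable, Sequence

# Hypothetical shapes: an eval case is a task plus a checker for the output,
# and an agent is any callable taking (context, task) and returning text.
EvalCase = tuple[str, Callable[[str], bool]]
Agent = Callable[[str, str], str]


def accuracy(run_agent: Agent, context: str, cases: Sequence[EvalCase]) -> float:
    """Fraction of cases whose output passes its checker under this context."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for task, check in cases if check(run_agent(context, task)))
    return passed / len(cases)


def compare_contexts(
    run_agent: Agent, contexts: dict[str, str], cases: Sequence[EvalCase]
) -> dict[str, float]:
    """Score the same cases under each named context variant."""
    return {name: accuracy(run_agent, ctx, cases) for name, ctx in contexts.items()}


# Stub standing in for a real agent call (assumption, not a real model):
# it follows the one rule that matters only while the context stays short.
def toy_agent(context: str, task: str) -> str:
    follows_rule = "use tabs" in context and len(context) <= 200
    return f"{task}: tabs" if follows_rule else f"{task}: spaces"


cases = [(f"format file {i}", lambda out: out.endswith("tabs")) for i in range(10)]
contexts = {
    "none": "",
    "bloated": "filler " * 100 + "use tabs",
    "targeted": "use tabs",
}
results = compare_contexts(toy_agent, contexts, cases)
```

The point of the shape is that the cases stay fixed while the context varies, so any change in the score can be attributed to the context and nothing else.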

Worth noting what Nisi did keep, because it's the whole point. He didn't replace his skills with hope. He cut the instructions that told the model how to do its job and hardened the boundaries that defined when the job was done: he enforced conventions in code through a state machine, and he added a cryptographic check to verify that the agent had really run the tests (it had been claiming passing tests it never ran). Light on instructions, strict on boundaries. That's a distinction worth holding onto.
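I don't know the details of Nisi's state machine or his cryptographic check, so treat the following as one possible shape rather than his implementation. The assumption here is that a test runner holds a secret the agent never sees and signs a receipt with HMAC-SHA256, and that a task can only reach "done" through a verified receipt. `sign_receipt`, `verify_receipt`, and `Task` are hypothetical names.

```python
import hashlib
import hmac
import json


def _message(commit: str, exit_code: int) -> bytes:
    # Canonical byte encoding of the facts being attested.
    return json.dumps({"commit": commit, "exit_code": exit_code}, sort_keys=True).encode()


def sign_receipt(secret: bytes, commit: str, exit_code: int) -> dict:
    """Called by the test runner (assumed), which holds a secret the agent never sees."""
    sig = hmac.new(secret, _message(commit, exit_code), hashlib.sha256).hexdigest()
    return {"commit": commit, "exit_code": exit_code, "sig": sig}


def verify_receipt(secret: bytes, receipt: dict, expected_commit: str) -> bool:
    """True only for an untampered receipt showing passing tests on the expected commit."""
    commit = receipt.get("commit")
    exit_code = receipt.get("exit_code")
    sig = receipt.get("sig")
    if not (isinstance(commit, str) and isinstance(exit_code, int) and isinstance(sig, str)):
        return False
    expected_sig = hmac.new(secret, _message(commit, exit_code), hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(sig.encode(), expected_sig.encode())
        and commit == expected_commit
        and exit_code == 0
    )


class Task:
    """Tiny state machine: working -> verified -> done, with no shortcut to done."""

    def __init__(self, commit: str) -> None:
        self.commit = commit
        self.state = "working"

    def submit_receipt(self, secret: bytes, receipt: dict) -> None:
        # The state only advances when the receipt checks out.
        if self.state == "working" and verify_receipt(secret, receipt, self.commit):
            self.state = "verified"

    def mark_done(self) -> None:
        if self.state != "verified":
            raise RuntimeError("cannot mark done without a verified test receipt")
        self.state = "done"
```

An agent that merely claims the tests passed has no way to produce a valid signature, so the claim alone never moves the task forward. Nothing in this code tells the model how to write or fix the code under test. It only defines what counts as finished.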

Nate B. Jones has been making a version of this point to his audience for months: most people are still prompting like it's 2025. He thinks the answer is a set of newer, fancier skills layered on top. I think the more useful and less popular move is the opposite one, taking things away, but we agree on the diagnosis. The 2025 playbook is expired.

"But the work just moves up a level"

Shipper has a ready answer to all this: the work doesn't disappear, it moves up. You stop doing the task and start framing it. Some of that is true, and that part is genuinely valuable work.

But framing is finite. You don't re-frame the same problem every morning. Once the spec is right and the boundaries hold, the framing is done and you walk away, which is the thing I actually experience with CodeBake. I set it up once. I don't re-explain it to the models every quarter. The work ended.

Shipper's stale agents and endless review are the opposite of that. Work that never ends isn't framing. It's babysitting a system that was never finished.

How to clean it up without breaking things

If you've got an automation that feels heavier than it should, the move is to take stuff out. But not on instinct. Anthropic tried an aggressive cut of their own harness and couldn't match the original's performance. The skill that looks useless on a clean input is sometimes the one saving you on the ugly one. Cutting without checking is just a new way to break things.

So borrow the shape of Andrej Karpathy's AutoResearch tool. It runs a simple loop: the agent proposes one change, the system measures it against a fixed yardstick, keeps it if it's better, and reverts it with git if it isn't. Propose, measure, keep or revert. The difference for us is that our work doesn't come with a number the way training loss does, so you are the judge.
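Here is a rough sketch of that loop applied to pruning a harness. It is not AutoResearch's code. To keep it runnable, the harness is modeled as an in-memory set of component names and "revert" simply means not adopting the trial set, where a real setup would revert a commit. `prune_loop` and `toy_score` are hypothetical, and the scoring numbers are invented for the demo. One deliberate difference: a cut is kept when the score holds, not only when it improves.

```python
from typing import Callable, Iterable

Score = Callable[[frozenset[str]], float]


def prune_loop(
    harness: frozenset[str], candidates: Iterable[str], score: Score
) -> tuple[frozenset[str], list[tuple[str, str]]]:
    """Try removing one component at a time; keep a cut only if the score holds."""
    best = score(harness)
    log: list[tuple[str, str]] = []
    for component in candidates:
        if component not in harness:
            continue
        trial = harness - {component}      # propose: one change only
        trial_score = score(trial)         # measure against the fixed yardstick
        if trial_score >= best:
            harness, best = trial, trial_score
            log.append((component, "kept cut"))
        else:
            log.append((component, "reverted"))  # harness stays as it was
    return harness, log


# Invented yardstick for the demo: stale components cost a little,
# and losing the test gate costs a lot.
STALE = {"repeat_instructions", "long_style_guide"}


def toy_score(harness: frozenset[str]) -> float:
    score = 0.9 - 0.05 * len(harness & STALE)
    if "test_gate" not in harness:
        score -= 0.3
    return score


start = frozenset({"repeat_instructions", "long_style_guide", "test_gate"})
final, log = prune_loop(
    start, ["repeat_instructions", "long_style_guide", "test_gate"], toy_score
)
```

In this toy run the two stale components are cut and the test gate survives, because removing it lowers the score. When a human is the judge, `score` becomes a review step instead of a function, but the loop keeps the same shape.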

A version you can run this week:

  1. Put the harness in version control. Prompts, skills, instruction files, all of it. If it lives in some console with no history, fix that first. Configuration is code.
  2. Let the model audit itself. Hand a current model one automation and ask which instructions or skills a model at its level probably doesn't need anymore. It's a good critic of scaffolding built for its weaker predecessors.
  3. Remove one thing at a time. Not five. If you pull five and quality drops, you've learned nothing about which one mattered.
  4. Re-run against real past inputs, including the messy ones. A clean happy-path test proves nothing. Use the cases the automation actually sees.
  5. Compare, then decide. Look at the new output against the old, ideally without knowing which is which, and keep the cut only if quality held. Otherwise revert and move on to the next instruction.
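Step 5 is the easiest one to fudge, so here is a small sketch of a blind comparison. It assumes you already have the old and new outputs for the same past inputs. `blind_pairs` and `cut_held` are hypothetical helpers, and the rule "the cut holds only if the new output loses no more than `max_losses` comparisons" is my assumption about what "quality held" means. Adjust it to your own tolerance.

```python
import random
from typing import Optional, Sequence


def blind_pairs(
    old_outputs: Sequence[str], new_outputs: Sequence[str], seed: Optional[int] = None
) -> tuple[list[tuple[str, str]], list[str]]:
    """Return (A, B) pairs in random order plus a hidden key of where the old output sits."""
    if len(old_outputs) != len(new_outputs):
        raise ValueError("old and new outputs must cover the same inputs")
    rng = random.Random(seed)
    pairs: list[tuple[str, str]] = []
    key: list[str] = []
    for old, new in zip(old_outputs, new_outputs):
        old_is_a = rng.random() < 0.5
        pairs.append((old, new) if old_is_a else (new, old))
        key.append("A" if old_is_a else "B")  # label hiding the OLD output
    return pairs, key


def cut_held(votes: Sequence[str], key: Sequence[str], max_losses: int = 0) -> bool:
    """votes[i] is 'A', 'B', or 'tie'. A vote for the old output counts as a loss for the cut."""
    if len(votes) != len(key):
        raise ValueError("need one vote per pair")
    losses = sum(1 for vote, old_label in zip(votes, key) if vote == old_label)
    return losses <= max_losses
```

The reviewer sees only the pairs and records a vote for each. The key stays out of sight until every vote is in, and then `cut_held(votes, key)` says whether to keep the cut or revert it.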

That's it. It's not glamorous and it doesn't need to be.

Why we build CodeBake this way

This is the thesis behind the product, not a side note. CodeBake keeps tasks small and the interface thin on purpose. A small, bounded task is one you can put in version control, replay, and check. You can't do any of that against a sprawling ten-thousand-line automation, because there's nothing stable to hold onto. Keep the work small and the boundaries clear, and the agent gets more useful every time the underlying model improves, without you rebuilding anything.

Constrain the boundaries. Leave the method to the model. Then measure, and keep things lean.

Shipper is right that his people have more work to do. He's wrong about why. The models outran the playbook, and the old playbook is the thing generating the busywork. Clear out the parts the model has outgrown, carefully and with evals, and a lot of that "inevitable" work goes away. Not because the economy changed. Because you stopped fighting a model that doesn't need the fight.

The promise of automation was never more interesting work forever. It was less work. If someone tells you the new tools mysteriously make more work for you, it's worth asking what they're selling, and whether their setup is just overdue for a cleanup.