jeremy.runtime
jeremy@agent: /write agent-driven-development-needs-a-control-plane

13 May 2026

Agent-Driven Development Needs A Control Plane

The useful version of agent-led development is not one agent writing code. It is the operating system around agents: reviews, PRs, tests, rollbacks, and human attention.

Agent-driven development starts with a seductive idea: give an agent a task, wait, and get a working feature back.

That works often enough to be exciting. It also fails often enough to teach the real lesson.

The valuable part is not a single agent producing a diff. The valuable part is the control plane around the work: how tasks are shaped, how responsibilities are split, how changes are reviewed, how tests run, how risky work is stopped, how low-risk work keeps moving, and how the human stays in the loop without becoming a full-time merge queue.

My early experiments were simple. One agent, one task, one repo. That was enough to feel the leverage, but it was also enough to expose the failure modes. The agent could make progress, but it could also wander, over-edit, miss context, or produce something plausible that needed a second engineering pass. The work was not yet a system.

The next layer was plugins and workflow helpers. Some of those did not work at all. That was useful information. A plugin that sounds powerful but does not fit the actual development loop becomes another surface area to supervise. If the tool cannot create clearer handoffs, better state, or more reviewable changes, it is not improving the system.

The more useful direction was role separation.

An architect role can clarify intent, break work into slices, identify risks, and decide what belongs in the first PR. A coder role can implement a bounded change. A reviewer role can look for regressions, missing tests, bad abstractions, and places where the diff does not match the goal. The point is not theater. The point is to separate different kinds of judgment so the system has more chances to catch itself.
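
To make that concrete, here is a minimal sketch of role separation as workflow glue. Everything in it is hypothetical: run_agent stands in for whatever harness call you actually use (a CLI invocation, an API call, a local runner), and the prompts are illustrative, not a recommendation.

    def run_agent(role_prompt: str, payload: str) -> str:
        """Stand-in for the harness call: Claude Code, Codex, or local glue."""
        raise NotImplementedError

    def deliver_slice(task: str) -> str:
        # Architect: clarify intent, slice the work, flag risks, scope the PR.
        plan = run_agent(
            "You are the architect. Break this task into the smallest "
            "shippable slice, list the risks, and define the first PR.",
            task,
        )
        # Coder: implement only the bounded slice the plan describes.
        diff = run_agent(
            "You are the coder. Implement exactly this slice and nothing else.",
            plan,
        )
        # Reviewer: a separate judgment pass, not a rubber stamp.
        return run_agent(
            "You are the reviewer. Check the diff against the plan for "
            "regressions, missing tests, bad abstractions, and scope drift.",
            plan + "\n---\n" + diff,
        )

The separation is cheap to build, and it pays for itself the first time the reviewer pass catches scope drift the coder pass introduced.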

The harness matters too. I have moved between Claude Code, Codex, OpenClaw, custom skills, and local workflow glue because the agent is only one layer. The surrounding harness decides whether the work becomes a readable branch, a useful PR description, a test run, a browser verification, or just a blob of code that someone has to rescue.

The strongest pattern so far is small PRs with explicit review loops.

Agents are good at generating work faster than I can review it. That creates a new bottleneck: stacks of PRs waiting for human attention. The solution is not to pretend review does not matter. The solution is to triage better.

Some PRs should get deep review because they touch user data, payments, authentication, execution boundaries, or product behavior that could cause harm. Some PRs should get lightweight review because they are copy, layout, docs, test fixtures, or narrow refactors with good coverage. Some PRs should be automatically rejected because they are too broad, under-tested, or unclear.
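
A triage rule can be embarrassingly simple and still beat ad hoc attention. The sketch below is illustrative: the path prefixes and the size threshold are invented examples, not a recommended policy.

    def triage(touched_paths: list[str], lines_changed: int, has_tests: bool) -> str:
        """Map a PR to a review tier: 'deep', 'light', or 'reject'."""
        risky = ("auth/", "payments/", "billing/", "migrations/")
        light = ("docs/", "fixtures/", ".md", ".css")
        if lines_changed > 800 or not has_tests:
            return "reject"  # too broad or under-tested: send it back
        if any(seg in path for path in touched_paths for seg in risky):
            return "deep"    # user data, money, auth: real human attention
        if all(any(seg in path for seg in light) for path in touched_paths):
            return "light"   # copy, docs, fixtures: skim and move on
        return "deep"        # when unsure, default to the careful path

With rules like these, triage(["docs/setup.md"], 40, True) comes back "light", while a 1,500-line diff with no tests comes back "reject" before anyone spends attention on it.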

That is where review agents, babysitting skills, and scheduled automation become useful. A babysitting loop can watch CI, request or perform review, summarize failures, apply narrow fixes, and keep a PR from going stale. Cron-driven agents can continue routine work without waiting for me to remember every thread. The system can nag me when judgment is needed and keep moving when judgment is not needed.
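
Here is a toy version of that loop, assuming the GitHub CLI (gh) is installed and authenticated; the polling interval, the comment text, and the field handling are all approximate. Real babysitting skills do more, but the shape is the same: poll, classify, escalate only when judgment is needed.

    import json
    import subprocess
    import time

    def ci_states(pr: int) -> list[str]:
        # One entry per check; check runs report `conclusion`,
        # commit statuses report `state`.
        out = subprocess.run(
            ["gh", "pr", "view", str(pr), "--json", "statusCheckRollup"],
            capture_output=True, text=True, check=True,
        ).stdout
        checks = json.loads(out).get("statusCheckRollup") or []
        return [(c.get("conclusion") or c.get("state") or "PENDING").upper()
                for c in checks]

    def babysit(pr: int, poll_seconds: int = 300) -> None:
        while True:
            states = ci_states(pr)
            if states and all(s in ("SUCCESS", "NEUTRAL", "SKIPPED") for s in states):
                print(f"PR #{pr}: CI green, ready for review")
                return
            if any(s in ("FAILURE", "ERROR", "CANCELLED") for s in states):
                # Judgment needed: nag a human rather than guess at a fix.
                subprocess.run(
                    ["gh", "pr", "comment", str(pr), "--body",
                     "CI is failing; this one needs a human look."],
                    check=True,
                )
                return
            time.sleep(poll_seconds)  # still running; keep watching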

This points toward a more mature development model; a rough policy sketch follows the list:

  • Low-risk PRs can eventually merge with no human in the loop when the policy, tests, ownership, and rollback path are clear.
  • Medium-risk PRs should get a crisp summary, screenshots or recorded flows, test evidence, and a focused review checklist.
  • High-risk PRs should force human review and make the risk obvious before the reviewer opens the diff.
  • Rollback testing should be part of the feature design, not an afterthought.
  • Feature flags should be common when agents are changing product behavior.
  • The system should optimize for human attention, not just code throughput.
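
The list above translates almost directly into a declarative policy that merge automation can read. Every field name below is invented for illustration; the point is that the tiers, the evidence requirements, and the rollback expectations live in one reviewable place.

    # Invented policy shape: the tiers mirror the list above.
    RISK_POLICY = {
        "low": {
            "examples": ["copy", "docs", "test fixtures", "narrow refactors"],
            "requires": ["tests_green", "clear_ownership", "rollback_path"],
            "human_review": "none",       # may merge unattended once satisfied
        },
        "medium": {
            "requires": ["tests_green", "crisp_summary",
                         "screenshots_or_recording", "review_checklist"],
            "human_review": "lightweight",
        },
        "high": {
            "examples": ["user data", "payments", "auth", "execution boundaries"],
            "requires": ["tests_green", "rollback_tested", "feature_flag"],
            "human_review": "forced",     # risk surfaced before the diff opens
        },
    }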

Recorded videos and better explanations matter more than they sound like they should. A reviewer does not only need the diff. They need to understand the intended behavior, the paths tested, the failure modes considered, and what changed from the user’s point of view. Agents can help produce that evidence if the workflow asks for it.
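
One cheap way to ask for it: a required template that the workflow checks before a PR enters the review queue. The section names below are made up, and the check is just string matching on the PR description.

    # Hypothetical required sections for every agent-authored PR description.
    EVIDENCE_SECTIONS = [
        "## What changed, from the user's point of view",
        "## Intended behavior",
        "## Paths tested (commands, screenshots, or recordings)",
        "## Failure modes considered",
    ]

    def missing_evidence(pr_body: str) -> list[str]:
        """Return the section headers the PR description still lacks."""
        return [h for h in EVIDENCE_SECTIONS if h not in pr_body]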

This is also why agentic development belongs close to product work. Swoleby is useful because it is real enough to have consequences but low-stakes enough to experiment aggressively. It has SMS, auth, payments, reminders, dashboards, AI coaching, user state, and deployment. That makes it a good lab for agent-led practices without using enterprise production systems as the test subject.

OpenClaw is the other side of the same work. It is where the workflow itself becomes the product surface: skills, slash commands, agent teams, review loops, execution constraints, and automation that keeps the system honest.

The frontier is not “can an agent write code?”

The frontier is whether a team can build a development control plane where agents produce steady, reviewable, reversible progress. That means better task decomposition, better evidence, better CI, better review automation, better risk policy, and a clear path from suggestion to safe merge.

Agent-driven development is going to be less like hiring one tireless junior engineer and more like designing an operating system for software work.