Most discussions of AI agents still assume the interesting question is how much one agent can do.
Can it plan, implement, test, review, and merge? Can it hold more context? Can it run longer? Can it keep enough memory to become a real teammate?
That is the wrong center of gravity.
The useful shift in serious AI work is not one stronger agent. It is role separation. One layer scopes and governs the work. Another executes against a contract. A reviewer decides whether the result stands. Once you separate those roles, the system becomes inspectable. Until then, you have a chat loop with side effects.
I call that governing layer the AI control plane.
What the control plane is for
The control plane is not the system that writes the code or drafts the prose. It is the layer that decides what work exists, what the boundaries are, which tool should do it, what evidence counts as done, and how the result gets audited afterward.
In practice that means things like triage, scope clarification, prompt design, verification gates, artifact capture, and closure discipline. The point is not bureaucracy. The point is that execution is not the same job as governance.
Software teams already understand this separation in other forms. Architecture is not implementation. Implementation is not review. Review is not release management. Agentic systems do not remove those distinctions. They make them more important.
The concrete version I actually use
The control plane became real for me when prompt design stopped living in chat history and started living next to the work item itself.
The pattern is simple. The issue body defines the problem and acceptance criteria. A ## BUILD PROMPT comment carries the execution contract. The agent reads that contract and works the issue. The PR carries the diff. Verification results and review determine whether the issue actually closes.
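The convention above is mechanical enough to sketch. Here is a minimal, illustrative version of the retrieval step, assuming the issue's comments have already been fetched (for example via the GitHub REST API's issue-comments endpoint); the ## BUILD PROMPT marker is the convention from this post, but the function name and everything else here is hypothetical:

```python
# Sketch: locate the execution contract attached to a work item.
# Assumes comment bodies were fetched elsewhere; only the marker
# convention comes from the pattern described above.

MARKER = "## BUILD PROMPT"

def extract_build_prompt(comment_bodies):
    """Return the contract from the latest BUILD PROMPT comment, or None."""
    contract = None
    for body in comment_bodies:
        if body.lstrip().startswith(MARKER):
            # A later revision supersedes earlier ones, but every
            # revision stays attached to the issue as a record.
            contract = body.lstrip()[len(MARKER):].strip()
    return contract
```

The detail that matters is not the parsing; it is that the contract is recoverable from the work item itself, without access to anyone's chat history.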
The prompt stopped being a private conversation artifact and became part of the operational record. The contract lived with the work item. Revisions to the prompt stayed attached to the issue instead of scattering across notes, conversation history, and orphaned prompt files. When something went wrong, I could inspect the contract, not just the code.
That pattern eventually became its own post, GitHub Issues as Ephemeral Prompt Storage. What that earlier post captured was one concrete convention. What mattered underneath it was the larger architectural lesson: the prompt was not just instruction text. It was a control-plane artifact.
Why this is infrastructure, not prompt craft
Most agent discourse still overweights prompt wording and underweights system design. Prompt quality matters, but it is downstream of the harder questions.
Where does state live? Who is allowed to change it? What artifact represents a handoff? What evidence is required before a task counts as complete? Which parts of the system are durable, and which are ephemeral? When a run fails, can you reconstruct what was asked, what was attempted, and why it was accepted or rejected?
Those are infrastructure questions. If you cannot answer them, you do not have an operational agent system. You have an execution surface with weak governance.
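One way to make those questions answerable is to give every run a durable record that names its state, its contract, and its closure evidence. This is a sketch only; the field names are illustrative, not a spec:

```python
# Illustrative data shape for a single run. Each field answers one of
# the infrastructure questions: where state lives, what was asked,
# who acted, what the handoff artifacts were, and what evidence
# justified closure.

from dataclasses import dataclass, field

@dataclass
class RunRecord:
    work_item: str    # durable state lives with the issue, not the chat
    contract: str     # what was asked: the execution contract
    executor: str     # who was allowed to act on it
    artifacts: list = field(default_factory=list)  # handoffs: PR URL, logs
    evidence: list = field(default_factory=list)   # verification results
    accepted: bool = False  # the reviewer's closure decision

    def can_close(self):
        # Closure requires both a review decision and evidence,
        # not just the existence of a diff.
        return self.accepted and len(self.evidence) > 0
```

If a failed run can be reconstructed from a record like this, the forensics question answers itself.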
The failure mode this fixes
When teams collapse planning, execution, review, and closure into one agentic loop, several problems show up at once.
Scope drifts because the same system that executes the task is also free to reinterpret it. Prompt revisions disappear into chat history. Review turns into vibe-checking because there is no explicit contract to test against. Forensics get harder because you have code and maybe a PR, but no durable record of the instruction that produced them.
This is why "the agent did something weird" is often a systems bug disguised as a model complaint. The model may have made a bad choice, but the architecture also failed to constrain the role, preserve the contract, or enforce the closure gate.
A minimal control-plane loop
The smallest useful loop I have found looks like this:
- Capture the work item
- Clarify scope and acceptance criteria
- Generate an execution contract
- Hand execution to the right agent or tool
- Run verification
- Review the result against the contract
- Close only when the evidence is sufficient
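The steps above can be sketched as a single function, assuming each step is a callable you supply. Nothing here is a real framework; the point is that the governance steps wrap execution instead of living inside it:

```python
# Minimal sketch of the control-plane loop. Every parameter is a
# caller-supplied step; the loop itself only enforces ordering and
# the closure gate.

def run_work_item(item, *, clarify, contract, execute, verify, review):
    scope = clarify(item)                  # scope + acceptance criteria
    build_prompt = contract(scope)         # explicit execution contract
    result = execute(build_prompt)         # handed to the agent or tool
    evidence = verify(result)              # tests, checks, captured artifacts
    if review(result, evidence, scope):    # judged against the contract
        return ("closed", result, evidence)
    return ("reopened", result, evidence)  # insufficient evidence: no closure
```

Note that execute never sees the raw work item, only the contract, and review never closes anything without evidence. Those two constraints are the whole design.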
Nothing in that loop requires exotic tooling. Most of it can be done with issues, comments, branches, PRs, and a handful of conventions. What matters is not the platform. What matters is that planning, execution, and governance are not all being performed by the same opaque loop.
What this makes visible
Once you separate planning, execution, and governance, other problems come into focus more clearly. Privacy stops looking like a settings problem and starts looking like a boundary problem. Memory stops looking like a convenience feature and starts looking like infrastructure. Both are easier to see once the control plane exists as a distinct layer rather than as hidden logic inside one agentic loop.
Where this goes next
I do not think every serious AI system will look like GitHub issues plus build prompt comments. That is my current implementation, not the universal form.
You can see the industry circling this now. Anthropic is distinguishing planning and auto-accept modes in Claude Code. OpenAI keeps talking about harnesses, controlled execution, and repository knowledge as the system of record. Frameworks like CrewAI package orchestration, flows, and state as first-class concerns. The names vary. The structural need does not.
What I do think is that serious systems will keep rediscovering the same structural need: some explicit layer has to separate planning, execution, and governance. Without that layer, the system remains impressive but brittle.
If your current setup still assumes one agent should own the whole loop, this is the first design question I would ask:
What should this agent stop being responsible for?