I’m an engineer who spent most of my career doing non-technical work, and somehow I now build our apps. So I’m not a developer. I got here by chatting with AI agents night after night for months.
That’s the part nobody warns you about. The agents are good company until 4am, when you’re still babysitting a chatbot, hitting approve every thirty seconds, and the thing you actually want is sleep.
So I had a selfish goal. I wanted my nights back. A queue I could fill during the day and actually walk away from, instead of sitting next to it like a nervous parent.
I built that on top of Sandcastle, Matt Pocock’s harness for running coding agents.
The goal was never “let an agent loose on my repo.” That’s the version everyone’s scared of, and it’s also the useless one. What I wanted was narrower:
I write the issues. The agents handle coding, review, fixes, merge, deploy checks, and unblocking the next issue, while the host runner owns the PR boundary.
AFK means “away from keyboard”. The loop takes a well-shaped GitHub issue and carries it the whole way: implement, have the host open a PR, review it, fix what the review finds, merge, check the deploy, close the issue, then go find the next one that’s ready.
The shift is small, but it changes everything. Not “agent as coding assistant.” Agent as a disciplined worker inside a queue I control.
The Shape of the Loop: From GitHub Issue to PR
Everything starts at GitHub Issues. That’s on purpose.
I write an issue with enough context that an agent can finish it without coming back to me for a product decision. If it’s genuinely ready, it gets ready-for-agent. If it still needs me, it gets ready-for-human or needs-info.
These state names, including ready-for-agent, origin:sandcastle, agent-review:needed, and agent-merge:needed, are GitHub labels on Issues and PRs.
That one label split does more work than it looks like. The agent queue isn’t a place to dump half-formed ideas and hope. It only holds work that’s already well enough shaped to hand off and forget.
Once an issue is ready, the Sandcastle issue runner picks it up from a dedicated scheduler worktree kept aligned with origin/staging, the remote staging branch, and:
- Plans the implementation of the issue.
- Implements using the TDD skill.
- Runs the repo checks it can run.
- Has the host runner open a PR targeting the staging branch.
- Writes the PR link back to the issue.
- Removes the issue from the queue.
- Labels the PR for the next stage: origin:sandcastle and agent-review:needed.
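As a rough sketch of the queue side of that handoff (not the actual Sandcastle runner code), the selection and labeling could look like the following. The `Issue` class, the function names, and the "lowest issue number first" ordering are assumptions made for illustration; only the label names come from the setup described above.

```python
from dataclasses import dataclass, field

READY = "ready-for-agent"
HUMAN_ONLY = {"ready-for-human", "needs-info"}
PR_HANDOFF_LABELS = {"origin:sandcastle", "agent-review:needed"}


@dataclass
class Issue:
    """Hypothetical in-memory view of a GitHub issue."""
    number: int
    labels: set = field(default_factory=set)


def next_issue(issues):
    """Pick the next issue an agent may take, or None if the queue is empty.

    Assumption: lowest issue number wins. The real ordering may differ.
    """
    candidates = [
        issue for issue in issues
        if READY in issue.labels and not (issue.labels & HUMAN_ONLY)
    ]
    return min(candidates, key=lambda issue: issue.number, default=None)


def hand_off(issue, pr_labels):
    """Take the issue out of the queue and return the PR's labels for review."""
    issue.labels.discard(READY)
    return set(pr_labels) | PR_HANDOFF_LABELS
```

Anything carrying ready-for-human or needs-info is filtered out before selection, so a half-shaped issue cannot slip into the agent queue even if it is mislabeled as ready.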
From there, a separate PR-review scheduler takes over.
Why Review Is a Separate Loop
If I had to point to the one decision that made the whole thing stable, it’s this: implementation and review are two different loops, and they do not share runtime state.
The issue loop produces PRs. The review loop reviews them. Separate worktrees, locks, logs, stop switches, labels, and failure states. They only meet at GitHub.
Sounds like overkill until something breaks. A stuck review can’t freeze implementation. A bad implementation run can’t corrupt the reviewer. And the part I care about most: review starts from a fresh context, not the same session that just wrote the code and is now quietly convinced it’s correct.
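One way to picture "separate locks and stop switches" is a small guard object per loop. This is a minimal sketch under stated assumptions: the file-based lock, the `.stop` file convention, and the `LoopGuard` name are all hypothetical, and the real runner may do this differently.

```python
import os
from pathlib import Path


class LoopGuard:
    """Hypothetical per-loop lock and stop switch, backed by plain files."""

    def __init__(self, state_dir, loop_name):
        # Each loop gets its own file names, so the loops never contend.
        self.lock_path = Path(state_dir) / f"{loop_name}.lock"
        self.stop_path = Path(state_dir) / f"{loop_name}.stop"

    def stop_requested(self):
        """The stop switch is just a file a human can create."""
        return self.stop_path.exists()

    def acquire(self):
        """Return True if this run may start, False if stopped or already running."""
        if self.stop_requested():
            return False
        try:
            # O_CREAT | O_EXCL fails if the file exists, so only one run wins.
            fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        return True

    def release(self):
        """Remove the lock so the next scheduled run can start."""
        self.lock_path.unlink(missing_ok=True)
```

With one guard named for the issue loop and another for the review loop, a held lock or a stop file on one side has no effect on the other, which is the isolation property described above.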
The review scheduler looks for open PRs targeting staging with agent-review:needed, exactly one origin label such as origin:sandcastle, no human-review:ready, and no active or terminal review/merge state like agent-review:in-progress, agent-review:complete, agent-review:blocked, agent-merge:needed, agent-merge:in-progress, agent-merge:complete, or agent-merge:blocked.
When it finds one, it adds agent-review:in-progress, checks out the branch in an isolated worktree, and runs the review prompt.
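The eligibility rule above is easier to audit as a single predicate. The sketch below assumes the PR's base branch, state, and labels have already been fetched from GitHub; `is_reviewable` and `claim` are illustrative names, not part of Sandcastle.

```python
REVIEW_NEEDED = "agent-review:needed"
REVIEW_IN_PROGRESS = "agent-review:in-progress"

# Labels that mean a human owns the PR, or a review/merge is active or finished.
EXCLUDED = frozenset({
    "human-review:ready",
    "agent-review:in-progress",
    "agent-review:complete",
    "agent-review:blocked",
    "agent-merge:needed",
    "agent-merge:in-progress",
    "agent-merge:complete",
    "agent-merge:blocked",
})


def is_reviewable(base_branch, state, labels):
    """True if the review scheduler may pick up this PR."""
    labels = set(labels)
    origin_labels = [label for label in labels if label.startswith("origin:")]
    return (
        state == "open"
        and base_branch == "staging"
        and REVIEW_NEEDED in labels
        and len(origin_labels) == 1  # exactly one origin label
        and not (labels & EXCLUDED)
    )


def claim(labels):
    """Mark the PR as taken by adding the in-progress label."""
    return set(labels) | {REVIEW_IN_PROGRESS}
```

Because agent-review:in-progress is itself in the excluded set, a claimed PR stops matching the filter, so a second scheduler pass will not pick it up again.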
What the Review Agent Actually Does
The review prompt is long and deliberately bossy. The agent has to:
- Check out the PR branch.
- Update it from the branch it merges into.
- Resolve straightforward merge conflicts.
- Run gstack-review.
- Post findings to the PR.
- Fix blockers and valid non-blocking suggestions.
- Inspect open review comments and threads.
- Resolve valid comments after fixing them; comment when a thread doesn’t apply.
- Update relevant docs and agent instruction files.
- Commit and push.
- Update the PR title, description, labels, and Test Plan.
- Run every feasible test locally or via browser tooling; leave untestable checks as manual items.
That last one matters more than it sounds. The Test Plan must never claim something was tested when the machine lacked the auth, credentials, data, or environment to test it. If a check can’t run, it stays on the page as a manual item, in plain sight. Honest beats green.
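A minimal sketch of that rule, assuming each check records whether it actually ran: a check that did not run can never render as a ticked box. The `Check` class and the exact Markdown wording are hypothetical, not the format the review prompt really uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Check:
    """Hypothetical record of one Test Plan item."""
    name: str
    ran: bool
    passed: Optional[bool] = None  # only meaningful when ran is True
    reason: str = ""               # why the check could not run


def render_test_plan(checks):
    """Render checks as a Markdown task list without overstating coverage."""
    lines = []
    for check in checks:
        if check.ran and check.passed:
            lines.append(f"- [x] {check.name}")
        elif check.ran:
            lines.append(f"- [ ] {check.name} (FAILED)")
        else:
            # Never ticked: it stays visible as a manual item.
            lines.append(f"- [ ] {check.name} (manual: {check.reason})")
    return "\n".join(lines)
```

The only path to a ticked box requires both `ran` and `passed`, so missing credentials or data always surface as an open manual item.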
The Host-Side Review Thread Gate
This is the part I’m most glad I added, and the part I’d have skipped if I were moving fast.
It’s not enough for the agent to announce, “I handled the review comments.” Agents say that. Sometimes they’re even right. So the host checks GitHub’s actual review thread state instead of taking its word for it.
Before a PR can move from review to merge, the runner queries the review threads directly. Every thread must be one of two things: resolved in GitHub or explicitly marked as not applicable.
Anything else, and the PR is blocked. No agent-review:complete, no agent-merge:needed. It gets parked in a blocked state with links straight to the threads still open.
That gate quietly changed how the whole thing feels. The review went from something the model claims it did to something the host can prove.
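The gate itself is small once the thread state is in hand. In this sketch the `ReviewThread` fields are assumptions: how a thread gets "marked as not applicable" is not spelled out here, so it is modeled as a plain boolean, and fetching the threads from GitHub is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewThread:
    """Hypothetical snapshot of one PR review thread."""
    url: str
    is_resolved: bool
    marked_not_applicable: bool = False


def review_gate(threads):
    """Return (labels_to_add, blocking_thread_urls) for a reviewed PR."""
    blocking = [
        thread.url for thread in threads
        if not (thread.is_resolved or thread.marked_not_applicable)
    ]
    if blocking:
        # Park the PR and point straight at the threads still open.
        return {"agent-review:blocked"}, blocking
    return {"agent-review:complete", "agent-merge:needed"}, []
```

The agent's own summary is never an input to this function. The only evidence it accepts is the thread state.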
Merge Is Also a State Machine
If the review passes, the PR moves to agent-review:complete and agent-merge:needed, and the merge phase runs a saved land-and-deploy prompt.
That prompt tells the agent to:
- Squash-merge the PR to the staging branch.
- Wait for the staging environment deployment.
- Investigate failed deploys.
- Raise a follow-up PR if the environment branch needs a fix.
- Run the remaining Test Plan items against the deployed environment.
- Update the Test Plan and merge labels accurately.
Merge moves from agent-merge:needed to agent-merge:in-progress to agent-merge:complete, or lands on agent-merge:blocked when something goes wrong.
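Written out as a transition table, the merge states look like this. It is a sketch, assuming agent-merge:blocked is only entered from in-progress and that complete and blocked are terminal; the `ALLOWED` table and `advance` function are illustrative names.

```python
# Allowed label transitions for the merge phase.
ALLOWED = {
    "agent-merge:needed": {"agent-merge:in-progress"},
    "agent-merge:in-progress": {"agent-merge:complete", "agent-merge:blocked"},
    "agent-merge:complete": set(),  # terminal
    "agent-merge:blocked": set(),   # terminal until a human intervenes (assumption)
}


def advance(current, target):
    """Return the new merge label, or raise if the move is not allowed."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal merge transition: {current} -> {target}")
    return target
```

Rejecting illegal moves loudly means a PR cannot jump from needed straight to complete without ever having been in progress.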
The agent can merge through GitHub once the gate passes. It can’t push to staging directly. Ever. Even fully autonomous work goes through a PR, because the day I let an agent skip that is the day I lose track of what actually shipped.
Closing Issues and Unblocking the Next Work
After a tracked PR merges, the host runner goes back and reconciles the original issue.
If every implementation PR on that issue has merged, the issue closes with a comment listing what shipped. Then the scheduler looks around at the other open issues.
If one of them was blocked by the issue that just shipped, and it has no other blockers left, the runner can promote it to ready-for-agent. It never touches anything labeled ready-for-human. That line is bright on purpose. The agent gets to keep the queue moving; it doesn’t get to overrule me on what’s ready.
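The closing and promotion rules can be sketched as two small functions. How blockers are actually recorded is not described here, so the `blocked_by` set of issue numbers is an assumption, as are the `Issue` class and the function names.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """Hypothetical in-memory view of an open GitHub issue."""
    number: int
    labels: set = field(default_factory=set)
    blocked_by: set = field(default_factory=set)  # issue numbers (assumption)


def can_close(pr_merged_flags):
    """An issue closes only if it has tracked PRs and every one has merged."""
    flags = list(pr_merged_flags)
    return bool(flags) and all(flags)


def promote_unblocked(shipped, open_issues, closed):
    """Label issues ready-for-agent once their last blocker has shipped.

    Returns the issue numbers that were promoted.
    """
    closed = set(closed) | {shipped}
    promoted = []
    for issue in open_issues:
        if "ready-for-human" in issue.labels:
            continue  # the runner never touches human-owned issues
        if shipped in issue.blocked_by and issue.blocked_by <= closed:
            issue.labels.add("ready-for-agent")
            promoted.append(issue.number)
    return promoted
```

The ready-for-human check comes first on purpose: an issue a human has claimed is skipped before any blocker logic runs.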
So the thing settles into a rhythm:
- Human writes issues.
- Agent implements one.
- Agent reviews the PR.
- Agent merges to staging.
- Host closes the issue.
- Host unblocks the next.
- Agent continues.
Why It Runs Locally First (Not GitHub Actions)
Right now it’s all local. The schedulers drive Codex from dedicated worktrees on my machine, not from a hosted runner or GitHub Actions.
I know that sounds less impressive. It’s still the right call for now, because my laptop already has the things this loop actually needs:
- Codex, the agent runtime that executes each step
- gstack and browser tooling
- auth state for staging checks
- repo-specific skills
- local logs and worktrees
- a stop switch I can touch immediately
Before I turn any of this into a GitHub App or a hosted service, I want it to get boring locally. No surprises, no 4am “wait, what did it just do?” Boring is the whole goal. Boring means I can trust it.
The Main Lesson: Structure Beats Codegen
Here’s what building this actually taught me: getting an agent to write code is the easy part. It was the easy part a year ago.
The hard part is everything around the code: enough structure that the work stays legible, reviewable, recoverable, and safe to leave running while you’re not watching.
Labels turned out to be the spine of it. Every issue and every PR only moves by changing a label a human or the host can read. No hidden state, no “trust me.”
The pieces that mattered, if you’re building your own:
- a strict issue quality bar
- labels as the state machine
- separate implementation and review schedulers
- dedicated scheduler and review worktrees aligned with staging
- serial execution by default
- saved prompts for review and merge
- explicit PR Test Plans
- host-side review-thread verification
- no direct pushes to staging
- issue closure only after merged-PR verification
Sandcastle handed me the base loop. The rest was turning that loop into an operating model: labels, gates, prompts, worktrees, and clear lines about who’s allowed to do what.
I’m still tightening it, and I’ll probably be tightening it for a while. But the shape feels right. The human stays in charge of what gets built. The agents take well-written issues and walk them, carefully, all the way to staging. It’s the same instinct behind moving repetitive work off humans and onto machines: say what you want once, and let the system handle the repetitive parts.