
Our plan for running 100 Parallel Coding Agents

An attempt to crystallize our plans for 2026

Satya Patel · Cofounder

Superset — managing coding agents in parallel

Right now at Superset, we can reliably manage 5-7 coding agents in parallel - whether that's Claude Code, Codex, or others. Our goal is for each of us to manage 100 coding agents in parallel by the end of 2026.

Most people believe that the path from seven to 100 agents is better models, faster inference, and smarter agents. It's not. Agent compute is already cheap enough that you can run hundreds of agents a month for less than the cost of one engineer.

What's stopping us is that every agent needs a human to review its code, give feedback, and decide what to work on next. Scale the agents all you want - it's the humans that don't scale.

Mapping out the problem

You can imagine the agent loop as a pipeline, and the goal is to improve throughput:

The agent pipeline — most steps require a human

A clear bottleneck emerges when you look at this. A human is involved in almost every step, and each step carries a steep context-switching cost - you have to open that agent's code, spin up dev servers, click through the UI to verify its work, give feedback, and more. Right now, most of our agents spend more time waiting for us to review their work than they spend doing it.

At 100 agents, this model completely breaks. You can't review 100 diffs a day. You can't context-switch between 100 streams of work.
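
A rough back-of-envelope calculation makes the math concrete (the 15 minutes per review is an assumed figure for illustration, not a measurement):

```python
# Rough throughput math: why per-agent human review breaks at 100 agents.
# The minutes-per-review figure is an assumption, not a measurement.
agents = 100
minutes_per_review = 15          # context switch + dev server + clicking through the UI
reviews_per_agent_per_day = 1    # assume each agent surfaces work once a day

human_minutes = agents * minutes_per_review * reviews_per_agent_per_day
print(f"{human_minutes / 60:.1f} hours of review per day")  # 25.0 hours - more than a workday
```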

The fix is straightforward: pull the human out of steps where they're not needed, and make the remaining steps faster.

How we'll improve it

Have agents work harder before reaching out to you

If you've worked with a coding agent, you've had the experience: the agent comes back with something half-baked, you spend 15 minutes catching up to what it did, spin up a dev server, click around, then feed it the same feedback you've given a dozen agents before. Most of the time you spend reviewing isn't making decisions — it's catching problems that should have been caught before the work reached you.

The fix is adding layers between the agent and you. The agent's work should be vetted thoroughly before it's ever presented to you.

Agent work passes through review layers before reaching you

Adversarial agents

Block recently published a paper that highlights how useful it can be to have agents work together. The general idea is that they send two agents on a task: one to implement it, and the other to push the implementer to write tests, review its work, and do due diligence before settling on a solution.

A similar pattern can be used to reduce interruptions for you: a dedicated bouncer agent that sits between the coding agents and you, preventing an agent from surfacing its work until the bouncer is sure the agent is either done or sufficiently stuck. Your review becomes a final sign-off, not a first pass.
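
As a rough illustration, here is a minimal sketch of what that gating loop could look like; the agent interfaces (current_work, is_stuck, apply_feedback, judge) are assumptions for the example, not Superset's or Block's implementation:

```python
# Hypothetical sketch of a "bouncer" agent gating an implementer's work.
# The Agent/Verdict interfaces are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Verdict:
    approved: bool
    feedback: str

class BouncerAgent:
    """Sits between coding agents and the human reviewer."""

    def __init__(self, reviewer_llm, max_rounds: int = 5):
        self.reviewer_llm = reviewer_llm
        self.max_rounds = max_rounds

    def gate(self, implementer) -> str:
        """Only surface work that is finished or genuinely stuck."""
        for _ in range(self.max_rounds):
            work = implementer.current_work()
            verdict: Verdict = self.review(work)
            if verdict.approved:
                return "ready_for_human"      # final sign-off, not a first pass
            if implementer.is_stuck():
                return "stuck_needs_human"    # escalate with the bouncer's notes attached
            implementer.apply_feedback(verdict.feedback)  # send it back to keep working
        return "stuck_needs_human"

    def review(self, work) -> Verdict:
        # Ask the review model whether tests exist, pass, and the diff matches the task.
        return self.reviewer_llm.judge(work)
```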

Stacking review agents and automated testing

Since you don't care how long an agent takes when you're running dozens of them, there's no downside to stacking checks. Run five different review agents, each looking for different classes of issues, with a final agent consolidating the feedback. Each layer increases the odds that problems are found and resolved before you ever see the code.
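
A sketch of what that fan-out could look like, assuming hypothetical run_review and consolidate helpers supplied by whatever agent runner you use:

```python
# Hypothetical sketch: fan a diff out to specialized review agents, then consolidate.
# run_review() and consolidate() stand in for whatever agent runner you use.
from concurrent.futures import ThreadPoolExecutor

REVIEWERS = [
    "security-review",       # injection, auth, secrets in the diff
    "correctness-review",    # logic errors, missing edge cases
    "test-coverage-review",  # are the new paths actually tested?
    "performance-review",    # obvious N+1s, hot-loop allocations
    "style-review",          # naming, dead code, API consistency
]

def review_diff(diff: str, run_review, consolidate) -> str:
    """Run every reviewer on the same diff in parallel and merge the feedback."""
    with ThreadPoolExecutor(max_workers=len(REVIEWERS)) as pool:
        findings = list(pool.map(lambda name: run_review(name, diff), REVIEWERS))
    # A final agent dedupes and prioritizes before anything reaches a human.
    return consolidate(findings)
```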

The same logic applies to testing. Giving agents access to the browser through tools like BrowserUse or Maestro lets them verify their own work visually — catching UI regressions, layout issues, and interaction bugs that are invisible in code review alone.
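
A sketch of what such a visual self-check could look like; browser_agent stands in for a tool like BrowserUse or Maestro, and its interface here is assumed rather than the real API:

```python
# Hypothetical sketch of an agent verifying its own change in a real browser.
# `browser_agent` stands in for a tool like BrowserUse or Maestro; its API here is assumed.
UI_CHECKS = [
    "Open /settings and confirm the new toggle renders and persists after reload",
    "Resize to 375px wide and confirm the layout does not overflow",
    "Submit the form with an empty email and confirm the validation error appears",
]

def visual_self_check(browser_agent, preview_url: str) -> list[dict]:
    """Have the agent click through its own work before surfacing it."""
    results = []
    for check in UI_CHECKS:
        outcome = browser_agent.run(url=preview_url, instruction=check)
        results.append({"check": check, "passed": outcome.passed, "notes": outcome.notes})
    return results  # attach to the review instead of asking a human to click around
```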

Long-running agents

Most agent workflows today are one-shot: you give a task, the agent works, it comes back. But agents should be able to run longer loops — trying an approach, hitting an issue, adjusting, and iterating until they're confident or genuinely stuck. Ralph loops are a popular pattern for this: treat the agent's work as clay on a wheel, refining iteratively, rather than laying bricks in a line.
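
A minimal sketch of such a loop, with all of the agent's methods assumed for illustration:

```python
# Hypothetical sketch of a long-running ("Ralph"-style) loop: iterate until
# confident or genuinely stuck instead of surfacing a first attempt.
def long_running_loop(agent, task, max_iterations: int = 20) -> dict:
    attempt = agent.attempt(task)
    for i in range(max_iterations):
        issues = agent.self_check(attempt)        # run tests, lint, re-read the task
        if not issues:
            return {"status": "confident", "work": attempt, "iterations": i + 1}
        if agent.is_stuck(issues):                # same failure repeating, no new ideas
            return {"status": "stuck", "work": attempt, "issues": issues}
        attempt = agent.refine(attempt, issues)   # clay on the wheel, not bricks in a line
    return {"status": "out_of_budget", "work": attempt}
```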

The result is fewer interruptions and higher-quality output when the agent does surface. An agent that's been iterating for an hour and is confident in its solution is far easier to review than one that gave up after its first attempt.

Make it fast to review agents' work

Most developer tools today are human-driven — you open a diff, you spin up a dev server, you navigate to the right page. Agents plug into these tools, but the human is still doing the legwork. We want to shift the paradigm towards agent-driven UIs - interfaces that agents orchestrate for the human's benefit, where each review takes seconds, not minutes.

Investing in agent-driven UIs

When you review an agent's work today, you're dropped into a diff with no context. You have to reconstruct what the agent was trying to do, spin up an environment to test it, and navigate to the right pages to verify. That's the agent dumping its work on your desk.

In an agent-driven UI, the agent prepares your review for you. It writes a summary of what changed and why, spins up a preview environment, navigates you to the specific pages or flows it wants you to look at, and surfaces the test results that matter. When you open a completed task, you should be looking at a prepared briefing, not raw output.
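
One possible shape for that briefing, sketched as a data structure; the fields are an assumption for illustration, not Superset's actual schema:

```python
# Hypothetical shape of a prepared review briefing - the schema is an
# assumption for illustration, not Superset's actual data model.
from dataclasses import dataclass, field

@dataclass
class ReviewBriefing:
    summary: str                   # what changed and why, written by the agent
    preview_url: str               # environment the agent already spun up
    pages_to_check: list[str]      # the specific flows the agent wants eyes on
    test_results: dict[str, bool]  # only the suites that matter for this change
    open_questions: list[str] = field(default_factory=list)  # decisions it could not make alone

briefing = ReviewBriefing(
    summary="Moved billing webhooks to a queue; retries now back off exponentially.",
    preview_url="https://preview-1234.example.dev",
    pages_to_check=["/billing/invoices", "/admin/webhook-log"],
    test_results={"unit": True, "webhook-integration": True},
    open_questions=["Keep the old synchronous path behind a flag for one release?"],
)
```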

Make existing tools better

PR reviews, CI dashboards, IDEs — these are all built for a world where humans drive the interactions. In an agent-first world, the tools need to meet you differently. Agents should be annotating their own PRs before you open them, the way Devin's review adds context to diffs ahead of time. CI results should be summarized and triaged by an agent, not presented as a raw log for you to parse. The tools we use every day were designed for human authors — adapting them for human reviewers of agent work is a different design problem.

Reducing friction to zero

Every interaction between you and an agent should be as lightweight as possible. You should be able to click yes or no for straightforward changes. Agents should prep multiple-choice questions — "I found three approaches to this, which do you prefer?" — so you're choosing instead of typing. When an agent does need written feedback, supporting agents can prefill a draft response based on the context, so you're editing instead of writing from scratch. Quick actions like "create PR" or "deploy to staging" should also be easy to reach.
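
A sketch of what one of these lightweight interactions could look like as a payload; the structure and field names are assumptions:

```python
# Hypothetical shape of a lightweight agent-to-human interaction: the human
# taps a choice or edits a prefilled draft instead of writing from scratch.
interaction = {
    "question": "Rate limiting for the public API: which approach should I take?",
    "choices": [
        "Token bucket per API key (my recommendation - simplest to reason about)",
        "Sliding window counter in Redis",
        "Defer to the gateway and do nothing in the app",
    ],
    "prefilled_reply": "Go with the token bucket, but make the bucket size configurable.",
    "quick_actions": ["create_pr", "deploy_to_staging", "discard"],
}
```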

The goal isn't just faster review — it's making the interaction so lightweight that you can do it from your phone between meetings.

Have agents be more proactive

Events trigger agents automatically

Everything above assumes you're the one deciding what agents work on. But at 100 agents, planning is itself a bottleneck. You can't spec out 100 tasks a day — that requires understanding the codebase, the product priorities, and the nuances of each task.

Reusable workflows

The building blocks for this are already emerging. OpenAI's Codex skills let you package repeatable workflows — deploy procedures, migration steps, test patterns — as reusable bundles that agents can invoke on their own when the situation matches. Instead of writing the same instructions every time, you encode them once and the agent recognizes when to apply them.
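
This isn't the Codex skills format itself, just a sketch of the underlying idea: encode a procedure once and let the agent recognize when it applies.

```python
# Not the actual Codex skills format - just a sketch of the idea: encode a
# procedure once, let the agent recognize when it applies and invoke it.
SKILLS = {
    "db-migration": {
        "when": "the task adds or changes a database model",
        "steps": [
            "Generate the migration file and review the generated SQL",
            "Run the migration against a scratch database",
            "Add a rollback note to the PR description",
        ],
    },
    "release-deploy": {
        "when": "the task is cut-and-deploy of a release branch",
        "steps": [
            "Tag the release and wait for CI to go green",
            "Deploy to staging and run the smoke suite",
            "Promote to production and watch error rates for 15 minutes",
        ],
    },
}

def matching_skills(task_description: str, classify) -> list[str]:
    """Ask a model which encoded procedures apply to this task."""
    return [name for name, skill in SKILLS.items()
            if classify(task_description, skill["when"])]
```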

Event-driven triggers

Devin's workflows take this further with event-driven triggers. A build fails, and a Devin instance spins up to investigate. A Linear ticket is created, and an agent starts working on it automatically. Teams create playbooks for recurring tasks — setting up changelogs, running code migrations, adding test coverage — that agents execute on a schedule or in response to events without anyone initiating them.
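
A sketch of that wiring, with hypothetical event names and a spawn_agent helper standing in for whatever actually launches the agent:

```python
# Hypothetical sketch of event-driven triggers: events spawn agents with a
# playbook attached, so nobody has to initiate the work by hand.
TRIGGERS = {
    "ci.build_failed":      "investigate-build-failure",
    "linear.issue_created": "start-implementation",
    "sentry.new_error":     "triage-and-reproduce",
}

def handle_event(event: dict, spawn_agent) -> None:
    """Route an incoming webhook event to the right playbook."""
    playbook = TRIGGERS.get(event["type"])
    if playbook is None:
        return  # not something agents handle automatically
    spawn_agent(playbook=playbook, context=event["payload"])

# Example: a failed build kicks off an investigation without anyone asking.
# handle_event({"type": "ci.build_failed", "payload": {"branch": "main", "job": "tests"}}, spawn_agent)
```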

Beyond code

Even outside of code, this pattern is taking hold. Circleback listens to your meetings and doesn't just take notes — it extracts action items, creates Linear tickets for feature requests mentioned in product demos, and updates your CRM after sales calls. The meeting ends and the downstream work is already in motion.

We don't have all of this figured out yet. Some of it is live, some is on our roadmap, and some is still taking shape. But the throughput framing gives us a clear test for every feature we build: does this reduce the time a human spends per agent interaction?

If you're running agents at scale and hitting these walls, we'd love to compare notes - reach out to us at founders@superset.sh.