# The orchestration tax

Running 20 AI agents feels like leverage. It looks like leverage. The dashboard is full, everything moves, and you're never idle. But your cognitive bandwidth doesn't parallelize. Every agent's output still has to route through exactly one serial processor: you. The gap between what your agents produce and what you can actually review, understand, and merge is the **orchestration tax**. And the only way to reduce it is to start architecting your own attention the same way you'd architect any concurrent system.

I was on a panel at Google I/O with Richard Seroter, Aja Hammerly, and Ciera Jaspan talking about what software engineering looks like right now and where it's headed. Near the end, Richard asked what's one thing developers should walk away and do differently. I said the thing I'd been circling for months: feeling busy is not the same as being productive. You can run 20 agents and feel completely busy. That's not 20 agents' worth of shipped work.

Richard named the pattern during that same conversation. "You talked about the orchestration tax," he said. "You can't manage twenty agents successfully in your own brain." He's right. I want to break this down properly, because it's not a discipline problem. It's an architecture problem.

The line from the panel I keep thinking about is one I said almost randomly: *running multiple agents does not mean there is more of you.*

## The asymmetry nobody prices in

There's a hidden asymmetry in agentic workflows. Starting an agent is cheap — a keystroke, a sentence prompt. Closing the loop on one isn't cheap at all. Someone has to check whether what came back is correct and reconcile it with whatever the other agents touched. That someone is you. There's exactly one of you.

I wrote about a piece of this in an earlier post ("Your parallel agent limit"), mostly about the ambient anxiety of not knowing which parallel thread is quietly failing. This is about the shape underneath that cost. When you start seeing agent development as a concurrent system, you realize the human is just a component inside it.

The slow, serial component.

## You're the GIL

If you've written concurrent code, you already have the right intuition. You've just been pointing it at the wrong part of the system.

Python has the **Global Interpreter Lock** (GIL). You can spawn as many threads as you want, but only one executes Python bytecode at a time, because they all must acquire the lock. You are the GIL of your AI agents. They can all run at once. But when any of their work needs genuine architectural judgment or conflict resolution, that work has to acquire the lock. There's one lock. You hold it.

**Amdahl's Law** makes this precise. The speedup you get from parallelizing is capped by the fraction of work that stays serial. If a large chunk of your pipeline can't be parallelized, you hit a hard ceiling no matter how many cores you add. In agent development, the serial fraction is the judgment. Spawning 8 agents doesn't speed up your judgment time. It just makes the queue feeding into it much deeper.

This is an old performance engineering fact that still catches people off guard: optimizing a non-bottleneck doesn't increase throughput. It grows the pile of unfinished work sitting in front of the bottleneck. Adding agents optimizes the part that was never the constraint. The constraint is the review step, and the throughput of your system equals exactly the throughput of that step.

The orchestration tax is the structural gap between what your agents produce and what you can actually merge. It's what happens when you put a single-threaded resource in charge of a concurrent system.

## Grinding doesn't fix structural limits

At the panel I said I've never felt more productive with my tools, but I'm also more tired than I've ever been. Both are completely real, and they have the same cause.

The tiredness has a specific shape. It's what running a serial processor at 100% with no slack feels like. Every time you check on an agent you've been away from, you pay a context-switch cost. You flush your mental state and reload a different context from cold. CPUs do this in microseconds, and architects still work hard to avoid it. You do it in minutes, and you never reload perfectly. Five agents isn't 1x workload done five times. It's five cold reloads plus a background brain process constantly monitoring which agent needs attention next.

You can't grind your way out of a structural limit. The tax gets paid either way. Push harder and the limit shows up as shallow code reviews, or as what I'd call **cognitive surrender** — accepting the agent's output because forming your own opinion costs attention you no longer have. You either pay the tax deliberately, by designing around it, or you let it quietly degrade your understanding of your own system.

## Architect your attention

Your attention is a scarce serial resource. You wouldn't design a distributed system without thinking carefully about the bottleneck. Your brain deserves the same treatment.

A few things that have actually held up for me:

**Scale the fleet to your review rate, not to the UI.** A well-designed concurrent system uses backpressure so the queue doesn't grow without bound — the producer slows down to match the consumer. Your agent count is the producer; your review rate is the consumer. The right number of parallel agents is however many you can actually code-review properly. For most people, that's a low single digit. The tool will happily let you spawn 20, but that's a UI feature, not a throughput recommendation.

**Sort the work before you delegate it.** I keep two piles of tasks. One is isolated work I'm happy to hand off to background agents running in the cloud — these can run async and often only need me at the final gate. The other is complex tasks where the judgment *is* the work: a weird bug, an architectural decision, a refactor with non-obvious tradeoffs. The mistake is trying to parallelize the second pile. Running multiple complex tasks simultaneously doesn't scale your output. It thrashes the lock and everything comes out worse.

**Batch your reviews.** Context switching is expensive every time. Reviewing four agents in one sitting costs far less than checking one, doing something else, and returning cold. Give agents a longer leash. Let the work accumulate, then process the batch.

**Only spend the lock on judgment.** Don't burn your attention on things the machine can verify itself. Have the agent write a passing test or generate a screenshot. Let it prove the straightforward 80% independently, so your scarce focus goes to the 20% that genuinely needs a human.

**Protect your serial time.** The bottleneck needs your best hours, not the leftover minutes between agent check-ins. Sometimes the highest-leverage move is to stop orchestrating entirely — close the laptop full of running agents and think hard about one problem with the lock held the whole time. Orchestrating is overhead. It's not the real work.

Aja made the point on the panel that architecture is the urgent skill right now: knowing what belongs inside one agent and what's too much for it. I'd add that you're a component in that system. Your attention has a known, low serial throughput. The system either respects that number, or it routes around it by quietly lowering your standards.

## The failure mode is invisible from the inside

Twenty running agents produces a feeling of massive productivity. The dashboard is full, everything moves. But that feeling is decoupled from actually shipping good code to main. You can be maximally busy and barely produce anything. From the inside, it feels identical.

Ciera brought up Margaret-Anne Storey's work on debt. We talked about technical debt and cognitive debt together. The orchestration tax left unpaid is how you accumulate both at once. You merge things you didn't read carefully. Your mental model of the codebase goes stale. None of this appears on any dashboard today. It shows up when production breaks and you look at the system and realize you have no idea how it works anymore.

Spawning agents isn't the skill. Anyone can run 20.

The skill is designing the system around the one resource that can't be cloned or parallelized. Architect your attention the way you'd architect anything else you depend on in production.

@addyosmani

The orchestration tax

Spawning AI agents is cheap. Closing the loop on them isn't. Addy Osmani makes the case that you're the GIL of your own agent fleet — a single-threaded resource in a concurrent system — and that fixing it is an architecture problem, not a discipline one.

Addy Osmani @addyosmani

Article

7 min read

Key Findings

• The orchestration tax is a structural problem, not a discipline problem: your cognitive bandwidth is single-threaded, and every agent's output must route through that one serial processor before it can ship.
• Amdahl's Law applies directly to agentic workflows — adding more agents doesn't increase throughput when human judgment is the bottleneck; it only deepens the queue feeding into it.
• The right agent count isn't determined by what the tool allows, but by how many you can actually code-review properly — for most developers, that's a low single digit.
• Cognitive surrender is the hidden cost of an overloaded review bottleneck: you accept agent output not because it's correct, but because forming your own judgment costs attention you no longer have.
• Sorting tasks before delegation is the highest-leverage intervention: isolated, async-safe tasks can run in parallel with minimal lock contention; complex judgment-heavy tasks should run serially or not at all.

Running 20 AI agents feels like leverage. It looks like leverage. The dashboard is full, everything moves, and you're never idle. But your cognitive bandwidth doesn't parallelize. Every agent's output still has to route through exactly one serial processor: you. The gap between what your agents produce and what you can actually review, understand, and merge is the orchestration tax. And the only way to reduce it is to start architecting your own attention the same way you'd architect any concurrent system.

The line from the panel I keep thinking about is one I said almost randomly: running multiple agents does not mean there is more of you.

The asymmetry nobody prices in

The slow, serial component.

You're the GIL

If you've written concurrent code, you already have the right intuition. You've just been pointing it at the wrong part of the system.

Python has the Global Interpreter Lock (GIL). You can spawn as many threads as you want, but only one executes Python bytecode at a time, because they all must acquire the lock. You are the GIL of your AI agents. They can all run at once. But when any of their work needs genuine architectural judgment or conflict resolution, that work has to acquire the lock. There's one lock. You hold it.

Amdahl's Law makes this precise. The speedup you get from parallelizing is capped by the fraction of work that stays serial. If a large chunk of your pipeline can't be parallelized, you hit a hard ceiling no matter how many cores you add. In agent development, the serial fraction is the judgment. Spawning 8 agents doesn't speed up your judgment time. It just makes the queue feeding into it much deeper.

Grinding doesn't fix structural limits

At the panel I said I've never felt more productive with my tools, but I'm also more tired than I've ever been. Both are completely real, and they have the same cause.

You can't grind your way out of a structural limit. The tax gets paid either way. Push harder and the limit shows up as shallow code reviews, or as what I'd call cognitive surrender — accepting the agent's output because forming your own opinion costs attention you no longer have. You either pay the tax deliberately, by designing around it, or you let it quietly degrade your understanding of your own system.

Architect your attention

Your attention is a scarce serial resource. You wouldn't design a distributed system without thinking carefully about the bottleneck. Your brain deserves the same treatment.

A few things that have actually held up for me:

Scale the fleet to your review rate, not to the UI. A well-designed concurrent system uses backpressure so the queue doesn't grow without bound — the producer slows down to match the consumer. Your agent count is the producer; your review rate is the consumer. The right number of parallel agents is however many you can actually code-review properly. For most people, that's a low single digit. The tool will happily let you spawn 20, but that's a UI feature, not a throughput recommendation.

Sort the work before you delegate it. I keep two piles of tasks. One is isolated work I'm happy to hand off to background agents running in the cloud — these can run async and often only need me at the final gate. The other is complex tasks where the judgment is the work: a weird bug, an architectural decision, a refactor with non-obvious tradeoffs. The mistake is trying to parallelize the second pile. Running multiple complex tasks simultaneously doesn't scale your output. It thrashes the lock and everything comes out worse.

Batch your reviews. Context switching is expensive every time. Reviewing four agents in one sitting costs far less than checking one, doing something else, and returning cold. Give agents a longer leash. Let the work accumulate, then process the batch.

Only spend the lock on judgment. Don't burn your attention on things the machine can verify itself. Have the agent write a passing test or generate a screenshot. Let it prove the straightforward 80% independently, so your scarce focus goes to the 20% that genuinely needs a human.

Protect your serial time. The bottleneck needs your best hours, not the leftover minutes between agent check-ins. Sometimes the highest-leverage move is to stop orchestrating entirely — close the laptop full of running agents and think hard about one problem with the lock held the whole time. Orchestrating is overhead. It's not the real work.

The failure mode is invisible from the inside

Spawning agents isn't the skill. Anyone can run 20.

The skill is designing the system around the one resource that can't be cloned or parallelized. Architect your attention the way you'd architect anything else you depend on in production.

Create articles like this

Start Free →