Claude Workflows Got the Architecture Right. The Bill Is the Open Question.

Author: Mandelbulb Technologies

9 MIN READ · June 2, 2026

Anthropic moved agent orchestration out of the model's context window and into JavaScript you can read, the same pattern we've hand-rolled in LangGraph for two years. What's genuinely new, what we found running it on our own code, and the one number that should make you read the docs before you type "ultracode."

The first dynamic workflow we ran inside Claude Code cost 1.4 million tokens and answered a question one prompt could have answered. The second one found three bugs in a migration plan that four of our senior engineers had already signed off on.

Both sentences are true. The distance between them is the whole story of this feature.

Anthropic shipped dynamic workflows on 28 May 2026, alongside Claude Opus 4.8. The headline number, up to a thousand subagents in a single run, is the part that travels on Twitter and the part that matters least. The part that matters is architectural, it is not new to anyone who has built production multi-agent systems, and it is the reason we think this is the most important agent-infrastructure release of the year despite shipping with a cost model that can set money on fire.

What Anthropic actually shipped

Strip away the launch copy and the mechanism is simple. You describe a task. Instead of answering it directly, Claude writes a JavaScript script that defines the work (phases, parallel fan-out, loops, verification passes) and hands that script to a background runtime. The runtime spawns subagents, each with its own context window and a narrow job. Their outputs flow back into the script as plain variables. Only the final, reconciled answer returns to your conversation.

The limits, for the record: up to 16 subagents running concurrently, up to 1,000 across a single run. The script orchestrates; it can't touch the filesystem or the shell. Only the agents it spawns can. You trigger it by putting the word "workflow" in a prompt, by asking for one directly, or by raising the effort to what Anthropic calls ultracode, at which point Claude starts reaching for workflows on its own. On Max, Team, and API access it's on by default.
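
To make those two limits concrete, here is a minimal sketch of the fan-out shape. It is written in Python rather than the JavaScript Claude generates, and it is not Anthropic's runtime API: `run_agent` and `fan_out` are hypothetical names, and the stub agent only echoes its task. The only things taken from the text above are the two caps.

```python
import asyncio

MAX_CONCURRENT = 16   # concurrency limit quoted above
MAX_TOTAL = 1000      # per-run limit quoted above


async def run_agent(task: str) -> str:
    """Hypothetical stand-in for one subagent; a real one would call a model."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"result for {task}"


async def fan_out(tasks: list[str]) -> list[str]:
    """Run every task, never more than MAX_CONCURRENT at a time."""
    if len(tasks) > MAX_TOTAL:
        raise ValueError(f"{len(tasks)} tasks exceeds the per-run cap of {MAX_TOTAL}")
    gate = asyncio.Semaphore(MAX_CONCURRENT)

    async def bounded(task: str) -> str:
        async with gate:  # waits here while 16 agents are already running
            return await run_agent(task)

    # gather preserves input order, so results line up with tasks
    return await asyncio.gather(*(bounded(t) for t in tasks))


# The results land in an ordinary variable; no model has to read them
# in order for the script to decide what happens next.
results = asyncio.run(fan_out([f"file_{i}.py" for i in range(40)]))
```

The design point is that the caps live in plain control flow, where a reader can see them, rather than in a model's judgment about when to stop.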

Anthropic's own documentation does not bury the catch. "Dynamic workflows can consume substantially more tokens than a typical Claude Code session," it reads, with a recommendation to "start on a scoped task to get a feel for usage." Read that sentence twice. We'll come back to it.

The part that's actually new, and it isn't the agent count

For two years the default way to build a multi-agent system was to make the language model the orchestrator. The model decides which sub-task to run next, reads each result back into its own context, then decides again. It works in a demo. It rots in production, for a reason that is mechanical rather than mysterious: every intermediate result the orchestrator reads is tokens it now carries for the rest of the run. Ten subagents reporting back into one context window is ten times the pollution, and the orchestrator gets slower, more expensive, and more forgetful exactly as the task gets bigger.

Move the orchestration into code and that tax disappears. The loop, the branching, the intermediate results all live in script variables the model never has to hold in its head.

LLM as orchestrator                Workflow: code as orchestrator

   prompt                              prompt
     │                                   │
     ▼                                   ▼
 ┌──────────┐  result          Claude writes a script
 │  model   │◄──────────┐               │
 │ (context │  result   │               ▼
 │  fills   │◄────────┐ │           runtime ──► agent ─┐
 │  up with │         │ │                   ──► agent ─┤─► script
 │  every   │◄──────┐ │ │                   ──► agent ─┘   variables
 │  result) │       │ │ │                                     │
 └──────────┘       │ │ │           only the final  ◄─────────┘
   nothing leaves ──┘─┘─┘           answer returns
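
The context tax is easy to put rough numbers on. A minimal sketch in Python, under an assumed cost model rather than a measurement: the orchestrating model re-reads its entire context every time a result arrives, and every result is the same size. The function name and the token figures are illustrative assumptions, and prompt caching would soften the totals without changing their shape.

```python
def orchestrator_tokens_read(n_results: int, tokens_per_result: int, base_prompt: int) -> int:
    """Total input tokens a model-as-orchestrator reads across a run,
    assuming it re-reads its whole context after each result arrives."""
    total = 0
    context = base_prompt
    for _ in range(n_results):
        context += tokens_per_result  # the result is appended and stays for good
        total += context              # the next decision re-reads everything so far
    return total


# Illustrative figures only: 2,000-token results, 1,000-token base prompt.
ten_agents = orchestrator_tokens_read(10, 2_000, 1_000)       # 120_000
hundred_agents = orchestrator_tokens_read(100, 2_000, 1_000)  # 10_200_000
```

Under these assumptions, ten times the subagents costs 85 times the orchestrator reads, because the total grows with the square of the result count. With the orchestration in code, the same results sit in script variables and the model reads none of them in order to choose the next step.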

The principle is old and unglamorous: use code for what code is good at (control flow, state, determinism) and models for what models are good at (judgment, one bounded step at a time). We didn't need a launch post to believe this. The orchestration spine behind our Construction Control Tower deployment is exactly this shape: a deterministic plan in LangGraph, with model-driven subagents called only for the steps that genuinely need reasoning. We wrote, in a field note on industrial retrieval, that the free-running ReAct version "worked beautifully on demos and broke beautifully in production." Dynamic workflows are Anthropic making that lesson a first-class feature. That is the news. The thousand agents are a consequence, not the point.

What happened when we ran it on our own work

So we stopped reading about it and pointed a workflow at our own work. The run from the opening was the proof: we asked one to stress-test a migration plan our team had already reviewed and approved. It ran several agents in parallel, each with a clean context and a single lens (coupling risk, regression gaps, ordering hazards, business-logic edge cases), and a separate pass that tried to refute each finding before it surfaced. It came back with three real problems that four senior engineers had missed when they signed off.

It was better than a single agent, and the reason is not speed. It's independence. A single agent reasoning turn by turn carries its earlier conclusions forward and quietly talks itself out of the awkward ones. Agents that never see each other's context don't share that bias, and the adversarial pass killed two confident-but-wrong findings before they reached us. Parallel breadth, then refutation. That is the part of this we'd keep even if nothing else improved.
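
The breadth-then-refutation shape can be sketched in a few lines. This is a minimal Python illustration, not the script Claude wrote for us: `Finding`, `stress_test`, and the toy reviewer and refuter are all hypothetical, and the real agents would be model calls rather than one-line functions. The lens names are the ones from our run.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

LENSES = ["coupling risk", "regression gaps", "ordering hazards", "business-logic edge cases"]


@dataclass(frozen=True)
class Finding:
    lens: str
    claim: str


Reviewer = Callable[[str, str], list[Finding]]  # (plan, lens) -> findings
Refuter = Callable[[str, Finding], bool]        # True means the finding was refuted


def stress_test(plan: str, lenses: list[str], reviewer: Reviewer, refuter: Refuter) -> list[Finding]:
    # Breadth: one reviewer per lens, each given only the plan and its own lens,
    # never another reviewer's output.
    with ThreadPoolExecutor(max_workers=len(lenses)) as pool:
        per_lens = list(pool.map(lambda lens: reviewer(plan, lens), lenses))
    candidates = [finding for findings in per_lens for finding in findings]
    # Refutation: a separate pass tries to knock down each candidate;
    # only the findings that survive are surfaced.
    return [finding for finding in candidates if not refuter(plan, finding)]


def toy_reviewer(plan: str, lens: str) -> list[Finding]:
    """Hypothetical stub: always raises one concern for its lens."""
    return [Finding(lens, f"possible {lens} in: {plan}")]


def toy_refuter(plan: str, finding: Finding) -> bool:
    """Hypothetical stub: pretends the ordering concern was a false alarm."""
    return finding.lens == "ordering hazards"


survivors = stress_test("drop column, then backfill", LENSES, toy_reviewer, toy_refuter)
```

The independence is structural: the reviewer's signature gives it no way to see another reviewer's findings, so there is nothing for it to anchor on.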

The cautionary run came next. We pointed a workflow at a different migration and asked it to draft and partially execute the change, with our test suite as the bar. It produced a large volume of plausible, test-passing edits, and a careful reviewer would not have signed off on a meaningful fraction of them. The tests were green. The code wasn't good. It had cleared the gate we set without doing the thing we actually wanted, because the gate we set was weaker than our own judgment.

That is the whole lesson, and it is load-bearing: a workflow is only ever as good as the bar you set inside it. Tests passing and code being good are different claims. Point a thousand agents at a weak verification gate and all you've bought is more output that clears a weak gate. Faster, and at a price.

The bill

Now the sentence we told you to read twice. Two days after launch, a developer on r/ClaudeAI watched ultracode spin into what the main agent later admitted was a "degenerate loop": 1.7 million tokens in minutes, zero usable output. There was no spending cap and no circuit breaker to stop it. Anthropic's terms don't refund token consumption, whatever the feature did to earn it. The community names for this are not affectionate: "money printer," "token black hole," "dynamic bills."

The expense is structural, not a one-off bug. Every subagent runs on the same expensive model tier as the session that spawned it. Under ultracode, Claude fans your requests out across subagents whether or not you asked it to, and a single request can become several workflows in a row. Reports of a routine task consuming a fifth to nearly half of a weekly limit in minutes are common enough that the pattern is the story, not the outlier. For an enterprise on an API contract this is a line item to forecast. For a solo developer on a Pro plan it can be a Tuesday-afternoon mistake with a real cost and no undo.

This is the half of the release that is not ready. The architecture is right; the controls around it are early. There is no native per-workflow budget you can hard-cap, no spend circuit breaker, and an opt-in surface (a single word in a prompt, an effort setting that fans out by default) loose enough to trigger spend you never intended.
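
For contrast, this is roughly what the missing control looks like in an orchestrator you run yourself. It is a minimal Python sketch, not a Claude Code setting or an Anthropic API: `TokenBudget`, `BudgetExceeded`, and `fake_agent` are hypothetical, and it assumes your harness sits between the caller and each agent and can see the tokens each call used.

```python
class BudgetExceeded(RuntimeError):
    """Raised when cumulative token usage passes the cap."""


class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.spent = 0

    def charge(self, tokens: int) -> None:
        """Record usage, then trip the breaker if the cap has been passed."""
        self.spent += tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"spent {self.spent} tokens against a cap of {self.limit}")


def fake_agent(task: str) -> tuple[str, int]:
    """Hypothetical stub agent: returns an output and the tokens it used."""
    return f"done: {task}", 400_000


budget = TokenBudget(limit=1_000_000)
completed: list[str] = []
halted_reason = ""
try:
    for task in ["a", "b", "c", "d", "e"]:
        output, used = fake_agent(task)
        completed.append(output)
        budget.charge(used)  # raises before the next agent can start
except BudgetExceeded as stop:
    halted_reason = str(stop)
```

With these made-up figures the run halts after the third agent, at 1.2 million tokens against a 1 million cap. A breaker of this kind cannot claw back tokens already spent; it bounds the overshoot to one agent's usage, which is the difference between an expensive mistake and a degenerate loop.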

How we actually use it

First, we turned ultracode's automatic fan-out off. We do not let the harness decide on its own that a task deserves a thousand agents. Beyond that, three rules, none of them clever:

  1. Scope it small first. Run it on one service, one directory, one bounded question before you point it at the monorepo. That's Anthropic's own advice, and it's correct.

  2. Read the script before you approve it. The entire point of orchestration-as-code is that the orchestration is readable. The plan-approval prompt shows you the JavaScript. Read it. If it spawns 200 agents to answer something that wants 20, you'll see it before you pay for it.

  3. Set the bar inside the workflow as if you don't trust the output, because you shouldn't, yet. Adversarial verification, independent cross-checks, a real test gate. Same lesson as our migration run: scale amplifies whatever bar you set, including a bad one.

And the decision that comes before all three, whether to reach for a workflow at all:

| Reach for a workflow when… | Reach for one prompt when… |
|---|---|
| The work is genuinely fan-out shaped: a bug sweep across a whole service, a migration touching hundreds of files, a question that needs many sources cross-checked. | It's one file, one function, one well-scoped change. A workflow's setup cost buys you nothing. |
| You want the orchestration written down as a script you can read, rerun, and audit. | It's exploratory and you'll change direction every few minutes anyway. |
| The task is high-stakes enough that several independent angles and an adversarial review are worth paying for. | Speed matters more than coverage and a single careful pass is good enough. |
| You can state a hard verification bar (a test suite, a build, a spec) that the result must clear. | There's no objective bar, so more agents just means more confident guesses. |

The takeaway

Dynamic workflows are the correct architecture wearing an early-stage price tag. The idea underneath, that orchestration belongs in code and judgment belongs in the model, is not a research preview. It's how durable agentic systems get built, and the teams shipping production AI agents have been converging on it from every direction for two years. Anthropic just made it a button.

The button is currently wired to a meter with no ceiling. That will change: budgets, circuit breakers, tighter opt-in are the obvious next commits, and Anthropic moves fast. Until it does, the feature rewards exactly the discipline good engineering always rewarded: scope the work, read the plan, set a hard bar, and don't confuse a thousand agents with a thousand good decisions.


Use code for what code is good at, and models for what models are good at. The rest is just deciding what you're willing to sign your name to.
