May 2026 · AI in product · 5 min read

Opus 4.8: the benchmarks aren't the story — the harness is

Everyone's posting benchmark screenshots. The real unlock in Claude Opus 4.8 is two features they buried in the footnotes — Ultra Code and dynamic workflows — plus a quiet way to watch an army of sub-agents work.

Claude Opus 4.8 shipped this week, and the timeline did what it always does — flooded with benchmark screenshots. The "trust me, this bar is taller" charts.

I've worked with enough of these releases to know benchmarks tell you direction, not day-to-day value. They show where a model is getting stronger. They don't show what actually changes about how you work. And lately, that change isn't coming from the model at all — it's coming from the harness around it: Claude Code.

Two features worth your attention

Dynamic workflows. Anthropic's solution for complex, long-running tasks. When a problem is too big for a single agent in one pass, you hand it to an orchestrator — Opus 4.8 itself — that breaks it into smaller tasks and fans out concurrent sub-agents to execute them. Not one model grinding sequentially; a manager spawning a team.
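To make the fan-out pattern concrete, here is a minimal sketch in Python. It is not how Claude Code implements dynamic workflows, which the release notes don't spell out. `plan`, `run_subagent`, and `orchestrate` are hypothetical names, and the "sub-agent" is a stub that sleeps instead of calling a model. The only point is the shape: one orchestrator splits the problem, then runs the pieces concurrently instead of one after another.

```python
import asyncio


def plan(problem: str) -> list[str]:
    """Hypothetical planner: split one big problem into smaller tasks."""
    return [f"{problem} / part {i}" for i in range(1, 4)]


async def run_subagent(task: str) -> str:
    """Hypothetical stand-in for a sub-agent; a real one would call a model."""
    await asyncio.sleep(0.01)  # simulated work that overlaps with the other sub-agents
    return f"done: {task}"


async def orchestrate(problem: str) -> list[str]:
    tasks = plan(problem)
    # Fan out: all sub-agents run concurrently; gather returns results in task order.
    return await asyncio.gather(*(run_subagent(t) for t in tasks))


if __name__ == "__main__":
    for line in asyncio.run(orchestrate("audit the site")):
        print(line)
```

The design choice worth noticing is that the orchestrator never does the work itself. It plans, dispatches, and collects, which is the "manager spawning a team" idea in miniature.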

Ultra Code. A new Claude Code effort setting. It pins effort to extra-high and lets Claude decide, on its own, when a task is big enough to trigger a dynamic workflow. You stop micromanaging the orchestration and let the model judge.

The e-commerce audit

The task: a brand audit of three sites — technical SEO scorecards, content and keyword gaps, UX flags, ranked quick wins. The kind of deliverable a mid-size agency spends days on.

Claude Code fanned out nine audit agents, acted as orchestrator, and even used idle time productively — pre-building the report generator so it could turn data into deliverables the moment agents reported back. Per-brand reports, a comparison sheet, an executive summary. Five minutes.
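The nine agents are not an arbitrary number: they are three sites crossed with three checks. A rough sketch of that assignment grid, assuming placeholder site names (the brands aren't named here) and a hypothetical `build_assignments` helper:

```python
from itertools import product

# Placeholder site names; the actual brands are not given in this post.
SITES = ["site-a", "site-b", "site-c"]
CHECKS = ["technical SEO scorecard", "content and keyword gaps", "UX flags"]


def build_assignments(sites, checks):
    """Return one (site, check) pair per sub-agent, so no cell is audited twice."""
    return list(product(sites, checks))


assignments = build_assignments(SITES, CHECKS)

if __name__ == "__main__":
    for site, check in assignments:
        print(f"{site}: {check}")
```

Because each agent owns exactly one cell, adding a fourth check would mean twelve agents, not a longer run for the existing nine.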

Diagram of dynamic workflows: an Opus 4.8 orchestrator fans out nine sub-agents arranged as a grid, with three sites down the side and three checks across the top (technical SEO scorecard, content and keyword gaps, UX flags). The nine cells are the nine agents. The audits converge into a synthesis pass that outputs three brand reports, a comparison sheet and an executive summary in about five minutes.

Nine agents = three sites × three checks. One grid, no duplicated work, converging into a comparison sheet and an executive summary.

The bug hunt

An open-ended "find the bugs in this app," run under Ultra Code. It did discovery first, then decided on its own to fan out parallel auditors, then spawned a wave of verification sub-agents to adversarially double-check findings — up to 96 sub-agents end to end, for one ranked bug report.
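The three phases described here (discover, audit in parallel, then verify adversarially) can be sketched as a small pipeline. Everything below is an assumption for illustration: the function names, the finding fields, and especially the `verify` rule, which just drops low-severity findings to stand in for a verifier that tries to refute each claim. It is a sketch of the pattern, not of what Ultra Code does internally.

```python
import asyncio


async def discover(app: str) -> list[str]:
    """Discovery (hypothetical): map the app into areas worth auditing."""
    return [f"{app}:auth", f"{app}:checkout"]


async def audit(area: str) -> list[dict]:
    """Audit (hypothetical): one auditor reports candidate bugs for its area."""
    return [
        {"area": area, "title": "unchecked input", "severity": 3},
        {"area": area, "title": "possible race", "severity": 1},
    ]


async def verify(finding: dict) -> bool:
    """Verification (hypothetical): return False if the finding is refuted.
    The severity threshold is a placeholder for a real adversarial check."""
    return finding["severity"] >= 2


async def bug_hunt(app: str) -> list[dict]:
    areas = await discover(app)
    # First wave: one auditor per area, all running concurrently.
    per_area = await asyncio.gather(*(audit(a) for a in areas))
    findings = [f for batch in per_area for f in batch]
    # Second wave: one verifier per finding, again concurrent.
    verdicts = await asyncio.gather(*(verify(f) for f in findings))
    confirmed = [f for f, ok in zip(findings, verdicts) if ok]
    # A single ranked report: highest severity first.
    return sorted(confirmed, key=lambda f: f["severity"], reverse=True)


if __name__ == "__main__":
    for bug in asyncio.run(bug_hunt("app")):
        print(bug)
```

Note how the verification wave scales with the number of findings rather than the number of areas, which is one plausible reason a sub-agent count can climb so quickly.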

You can actually watch it

For long-running jobs, type /workflows and you get the orchestrator's live plan: the phases, which sub-agents are done, which are running, and the tokens each consumed. For multi-minute tasks, that observability is the difference between trusting the system and staring at a spinner.
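If you want a mental model of what a live plan has to track, a small data structure covers it. This is not the real /workflows output format, which I'm not reproducing here. `Phase` and `render` are hypothetical, the token figures are invented, and the audit count of 8 is only inferred from 96 total sub-agents minus an 88-agent verification step.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    status: str        # "done", "running", or "queued"
    agents_done: int
    agents_total: int
    tokens: int        # tokens consumed so far by this phase


def render(plan: list[Phase]) -> str:
    """Format one line per phase plus a running token total."""
    lines = [
        f"{p.name:<12} {p.status:<8} {p.agents_done}/{p.agents_total} agents  {p.tokens:,} tokens"
        for p in plan
    ]
    total = sum(p.tokens for p in plan)
    lines.append(f"total: {total:,} tokens")
    return "\n".join(lines)


# Illustrative values only; none of these token numbers come from a real run.
PLAN = [
    Phase("audit", "done", 8, 8, 120_000),
    Phase("verify", "running", 40, 88, 300_000),
    Phase("synthesis", "queued", 0, 0, 0),
]

if __name__ == "__main__":
    print(render(PLAN))
```

Phase status, per-phase agent counts, and per-phase tokens are the three things that turn a spinner into something you can reason about.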

Mockup of the /workflows command output in Claude Code: a dark terminal panel showing a rubric bug audit orchestrated by Opus 4.8, with Phase 1 audit agents marked done, an 88 sub-agent verification step running live, a queued Phase 2 synthesis, a progress bar, and a footer noting 96 sub-agents total and about 4% of the weekly limit consumed.

Typing /workflows surfaces the live plan: phase status, active sub-agents, and token usage per step.

The catch: it's expensive

These modes are genuinely token-intensive — not something to fire off casually. Those two runs alone burned about 4% of a weekly Max limit.

It's also a quiet signal of how token-constrained things still are. And showing limits as a percentage is opaque. Give me an absolute token count, like a mobile data plan, so "we raised your limits" actually means something.
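The arithmetic behind that complaint is short. A percentage tells you how many runs fit in a week, but nothing about size until you know the base. In the sketch below, the weekly limits are made-up numbers purely to show the conversion; the real limit isn't published as a token count, which is exactly the problem.

```python
def runs_per_week(percent_per_run: float) -> int:
    """How many runs of this size fit into 100% of a weekly limit."""
    return int(100 // percent_per_run)


def tokens_used(percent_of_limit: float, weekly_limit_tokens: int) -> int:
    """Convert a percentage of the limit into an absolute token count."""
    return round(weekly_limit_tokens * percent_of_limit / 100)


if __name__ == "__main__":
    # About 4% for the two runs, so roughly 25 such pairs per week.
    print(runs_per_week(4))
    # Hypothetical weekly limits: the same 4% means very different absolute usage.
    for limit in (1_000_000, 5_000_000):
        print(limit, tokens_used(4, limit))
```

The same 4% is 40,000 tokens against one hypothetical limit and 200,000 against another, so a percentage alone can't tell you whether a limit increase was meaningful.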

The takeaway

The real power of this release isn't a benchmark line — it's the harness architecture: managing multi-phase plans and synthesizing results automatically. Model-to-model jumps are nice, and 4.7 → 4.8 is a solid incremental one. But how you actually work is increasingly dictated by harness updates, not benchmark deltas. Ultra Code and dynamic workflows are the two I'd go learn.

Nirmit Meher

Product leader shipping across enterprise SaaS, AI in production, and 0→1. Writing about what actually ships — not what sounds good in a deck.