They should have called it OPUS 5
Anthropic shipped a new version of Claude Opus, a new way to control how much thinking Claude spends on a task, a cheaper high-speed mode, and a research-preview system that lets Claude Code coordinate dozens to hundreds of subagents on a single project.
The most important detail is not buried in a benchmark table. It is in the workflow. Anthropic is now describing Claude Code as something that can take work normally planned in quarters and finish it in days. That is a huge claim. The example it chose is even bigger: a dynamic workflow used by Bun creator Jarred Sumner to port Bun from Zig to Rust, producing roughly 750,000 lines of Rust with 99.8% of the existing test suite passing in eleven days from first commit to merge.
That is the real story of Claude Opus 4.8. This is a model release, but it is also a coordination release. Anthropic is trying to move Claude from a very smart individual contributor into something closer to a temporary engineering team: one model planning the work, many agents executing pieces of it, other agents reviewing the results, and the whole system iterating until the answers converge.
What Anthropic announced
The release includes four major pieces:
Claude Opus 4.8, Anthropic's most capable generally available model.
Dynamic workflows in Claude Code, a research-preview system for orchestrating many parallel subagents.
Ultracode, a Claude Code setting that combines xhigh effort with automatic workflow orchestration.
Fast mode for Opus 4.8, now 2.5x faster and far cheaper than previous fast mode pricing.
Anthropic is positioning Opus 4.8 as an upgrade in judgment, honesty, autonomy, and long-running task performance. The company is not claiming this is a totally new intelligence tier. In fact, it calls the release a "modest but tangible improvement" over Opus 4.7. But the improvements are pointed directly at the current frontier problem: making AI agents reliable enough to trust with real work.
The model: Opus 4.8
Claude Opus 4.8 is now available on claude.ai, Claude Code, the Claude API, and major cloud platforms. The API model name is claude-opus-4-8.
Anthropic says the standard price is unchanged from Opus 4.7:
$5 per million input tokens.
$25 per million output tokens.
Prompt caching can cut costs by up to 90%.
Batch processing can cut costs by 50%.
US-only inference is available at 1.1x pricing.
The Opus product page describes the model as a hybrid reasoning model for coding and AI agents with a 1 million token context window. Anthropic says Opus 4.8 is built for professional software engineering, complex agentic workflows, and high-stakes enterprise tasks.
The company emphasizes that Opus 4.8 can keep working on long-running tasks with more consistency and autonomy. That matters because the bottleneck in agentic AI is no longer just whether a model can answer a hard question. It is whether the model can stay oriented across a messy task, use tools without spiraling, notice when it is wrong, and avoid handing back a fake victory.
The sleeper feature: honesty about its own work
Anthropic is unusually direct about one of the main failure modes it is targeting. AI models often claim progress when the evidence is thin. They write code, miss a flaw, and then confidently report that the job is done.
Anthropic says Opus 4.8 is more likely to flag uncertainty and less likely to make unsupported claims. In the announcement, it says Opus 4.8 is around four times less likely than Opus 4.7 to let flaws in code it has written pass unremarked.
That is a practical reliability upgrade. A model that tells you "I found a problem" or "I am not confident this passed" is much more valuable than a model that sounds polished while quietly leaving broken code behind.
The system card supports the same theme:
Opus 4.8 had the lowest incorrect rate among compared models on closed-book factual hallucination tests, though part of that came from higher abstention.
It scored 77% on Anthropic's false-premise fact recall evaluation, the highest score in that test.
It truthfully disclosed that it was an AI 97% of the time when assigned a human persona.
Anthropic says the model was substantially less prone to misaligned behavior than Opus 4.7.
This is one of the most important ideas in the release. Frontier models are not just getting smarter. Labs are now trying to make them better at self-monitoring.
The benchmark story
Anthropic's system card shows Opus 4.8 improving across a wide range of coding, agentic, reasoning, and multilingual evaluations.
Key numbers from the system card:
Anthropic Capability Index:
Opus 4.8: 155.5.
Opus 4.7: 154.1.
Claude Mythos Preview: 158.3.
Coding and software engineering:
SWE-bench Verified: 88.6%.
SWE-bench Pro: 69.2%.
SWE-bench Multilingual: 84.4%.
SWE-bench Multimodal: 38.4%.
Terminal-Bench 2.1: 74.6% mean reward.
FrontierSWE: ranked number one on the leaderboard, with mean@5 average rank 2.74.
Reasoning and math:
GPQA Diamond: 93.6%.
USAMO 2026: 96.7%, averaged over 10 attempts.
ArxivMath: 71.82%, roughly tied with GPT-5.5.
Humanity's Last Exam: 49.8% without tools, 57.9% with tools.
Agentic and search tasks:
BrowseComp: 84.3% single-agent, 88.5% multi-agent.
DeepSearchQA: 93.1% F1.
OSWorld-Verified: 83.4% first-attempt success rate.
MCP-Atlas: 82.2% pass rate.
Automation Bench: 15.5%, up from Opus 4.7's 9.9%.
Multilingual:
GMMLU average: 90.4%.
English: 92.9%.
Low-resource average: 87.4%.
The model is stronger, but the most interesting gains are in applied agentic work: software engineering, automation, tool use, search, browser tasks, and multi-agent workflows.
Dynamic workflows: Claude Code grows from agent to orchestrator
Dynamic workflows are the biggest product change in the release.
Anthropic describes a dynamic workflow as a JavaScript script that Claude writes for a specific task. Instead of Claude trying to coordinate everything inside one conversation, the workflow script holds the plan, launches subagents, tracks intermediate results, and cross-checks findings before producing a final answer.
That distinction matters. With normal subagents, Claude is still managing the plan turn by turn inside the conversation. With workflows, the orchestration moves into code. The script can run many agents, store results outside the main context window, and apply repeatable quality patterns like adversarial review or independent verification.
Anthropic says dynamic workflows are meant for tasks like:
Codebase-wide bug hunts.
Profiler-guided optimization audits.
Security audits.
Large migrations and modernization projects.
Framework swaps.
API deprecations.
Language ports spanning thousands of files.
Research questions that need sources cross-checked.
High-stakes plans that should be attacked from multiple angles before action.
The key phrase from Anthropic's blog is that Claude can write orchestration scripts that run "tens to hundreds of parallel subagents in a single session," checking the work before anything reaches the user.
The docs make the mechanics clearer:
Dynamic workflows are in research preview.
They require Claude Code v2.1.154 or later.
They are available on paid plans, Anthropic API access, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
On Pro, users turn them on from the Dynamic workflows row in
/config.Workflows can be triggered by asking for a workflow in the prompt.
Workflows can also be triggered automatically with ultracode.
Runs are visible through
/workflows.Users can pause, resume, inspect phases, inspect agents, stop agents, restart agents, and save successful workflows as reusable commands.
The runtime has important limits:
Up to 16 concurrent agents, fewer on lower-resource machines.
Up to 1,000 agents total per run.
The workflow script itself has no direct filesystem or shell access. Agents do the reading, writing, and command execution.
Workflows can resume within the same Claude Code session, but if Claude Code exits while a workflow is running, the next session starts fresh.
This is not just "more agents." It is an architecture change. Anthropic is trying to make the coordination layer explicit, inspectable, reusable, and resumable.
Ultra Code, officially "ultracode"
Anthropic's docs call it ultracode, lower-case and one word. But the concept is what users are already calling Ultra Code: a session-level Claude Code setting for serious tasks.
Ultracode does two things at once:
It sets the model effort level to
xhigh.It lets Claude automatically decide when to orchestrate dynamic workflows for substantive tasks.
In Claude Code, users can turn it on through the /effort menu or by running:
/effort ultracode
Anthropic says ultracode is not a normal model effort level. It is a Claude Code setting. It applies only to the current session and resets when a new session starts.
With ultracode enabled, one user request can become several workflows in sequence:
One workflow to understand the codebase.
One workflow to make the change.
One workflow to verify the result.
This is where Anthropic's strategy becomes clear. It is not only giving users a bigger model. It is giving users a mode where Claude can decide that a task is too big for a normal conversational pass, create a plan, fan out workers, and verify the answer.
The tradeoff is cost and time. Anthropic warns that dynamic workflows consume meaningfully more usage than a typical Claude Code session. Ultracode applies to every substantive task in the session, so it is not meant for routine work. It is meant for the kind of work where you actually want a model to think like a small engineering team.
The Bun rewrite example is the headline demo
The most striking example in Anthropic's dynamic workflows blog is Bun.
Bun is a JavaScript runtime and toolkit created by Jarred Sumner. It competes in the same broad ecosystem as Node.js and Deno, but it is known for speed and a modern all-in-one developer experience. That makes it a perfect demo target: large, performance-sensitive, complex, and full of low-level engineering details.
Anthropic says Jarred Sumner used dynamic workflows to port Bun from Zig to Rust with:
99.8% of the existing test suite passing.
Roughly 750,000 lines of Rust.
Eleven days from first commit to merge.
Anthropic says one workflow mapped the correct Rust lifetime for every struct field in the Zig codebase. Another wrote every .rs file as a behavior-identical port of its .zig counterpart. Hundreds of agents worked in parallel with two reviewers on each file. A fix loop then drove the build and test suite until both ran clean. After the port landed, an overnight workflow addressed unnecessary data copies and opened pull requests for final review.
Anthropic adds an important caveat: the port is not yet in production, and Jarred is expected to write more about it later.
Still, even with that caveat, this is one of the most aggressive public examples of AI-assisted software migration so far. It is not "write me a React component." It is "coordinate a massive language port across hundreds of thousands of lines of systems code."
If this generalizes, the economic implications are obvious. Large migrations, dead-code cleanup, security audits, framework upgrades, and refactors have always been the kind of work companies delay because they are expensive, risky, and boring. Anthropic is directly attacking that category.
Fast mode is now much cheaper
Opus 4.8 also changes the economics of speed.
Fast mode is a high-speed configuration for Claude Opus. Anthropic says it is not a different model and does not change model quality. It uses Opus with a different API configuration that prioritizes speed over cost efficiency.
For Opus 4.8:
Fast mode is up to 2.5x faster.
Fast mode costs $10 per million input tokens and $50 per million output tokens.
That is only 2x the standard Opus 4.8 price.
For Opus 4.7 and Opus 4.6, fast mode pricing is much higher:
$30 per million input tokens.
$150 per million output tokens.
That means Opus 4.8 fast mode is three times cheaper than fast mode on previous Opus models.
In Claude Code, users toggle it with:
/fast
Fast mode is aimed at interactive work where latency matters:
Rapid iteration.
Live debugging.
Time-sensitive development.
Anthropic says standard mode is still better for cost-sensitive workloads, long autonomous tasks, batch work, and CI/CD-style pipelines.
The docs also note that fast mode requires Claude Code v2.1.36 or later, while Opus 4.8 in Claude Code requires v2.1.154 or later. In v2.1.154 and later, Opus 4.8 is the default fast mode model.
Effort control turns model use into compute allocation
Another launch-day feature is effort control in claude.ai and Cowork.
Users can now choose how much effort Claude puts into a task. Higher effort means Claude thinks more often and more deeply. Lower effort means faster answers and slower rate-limit usage.
In Claude Code, Opus 4.8 supports these effort levels:
low.
medium.
high.
xhigh.
max.
ultracode, which is a Claude Code setting that combines xhigh with dynamic workflows.
Opus 4.8 defaults to high effort. Anthropic recommends extra, called xhigh in Claude Code, for difficult tasks and long-running asynchronous workflows. Max can provide deeper reasoning but may overthink and consume much more.
This is a bigger product shift than it sounds. Users are no longer just choosing a model. They are choosing how much cognition to spend on a task.
That makes Claude feel more like cloud compute. For easy tasks, spend less. For mission-critical work, spend more. For large engineering projects, turn on ultracode and let the system orchestrate workflows.
The API change matters for agents
Anthropic also updated the Messages API. Developers can now include system entries inside the messages array.
That sounds technical, but it matters for long-running agents. It means developers can update Claude's instructions mid-task without breaking prompt caching or routing the update through a fake user turn.
Anthropic gives examples like updating:
Permissions.
Token budgets.
Environment context.
Agent instructions as the task changes.
This fits the theme of the whole release. Anthropic is not just improving the model. It is improving the scaffolding around the model so Claude can operate as a longer-running system.
What early customers are saying
Anthropic's launch includes quotes from early testers across coding, legal, finance, data, browser automation, and enterprise AI.
The recurring claims are:
Better judgment in Claude Code.
Better self-correction.
Fewer unnecessary tool calls.
More efficient tool use.
Stronger end-to-end task completion.
Better citation precision in dense financial documents.
Stronger legal reasoning.
Better computer-use and browser-agent performance.
Better context and style continuity across long sessions.
A few notable claims:
One tester says Opus 4.8 was the only model to complete every case end-to-end on its Super-Agent benchmark, beating prior Opus models and GPT-5.5 at cost parity.
CursorBench testers say Opus 4.8 exceeds prior Opus models across every effort level and uses fewer steps for the same intelligence.
A legal benchmark tester says Opus 4.8 produced the highest score recorded on its Legal Agent Benchmark and was the first model to break 10% overall on the all-pass standard.
A browser-agent tester says Opus 4.8 scored 84% on Online-Mind2Web, beating Opus 4.7 and GPT-5.5 in its internal testing.
Databricks says Opus 4.8 lets Genie reason over PDFs, diagrams, and other unstructured content at 61% cheaper token cost than Opus 4.7.
These are partner and customer claims, not neutral independent audits. They should be treated as useful signal, not final truth. But the consistency of the feedback is notable: everyone is talking about reliability, autonomy, and agentic execution.
Safety: the cyber capability problem is getting sharper
The system card shows why Anthropic keeps tying frontier releases to safeguards.
Safety and alignment numbers include:
But the cyber numbers are the ones to watch:
CyberGym success without safeguards: 78.8%.
CyberGym success with safeguards: 1.0%.
Firefox 147 JavaScript shell exploitation full working exploit rate without safeguards: 8.8%.
ExploitBench AutoNudge score: 5.45 out of 16.
The gap between unsafeguarded and safeguarded CyberGym performance is the key story. Anthropic is effectively saying the raw capability is strong enough that the safety layer is now central to deployment.
That also explains the Mythos tease.
The Mythos-class cliffhanger
At the end of the Opus 4.8 announcement, Anthropic says it is developing and preparing models with even higher intelligence than Opus.
The company points to Claude Mythos Preview, which is being used by a small number of organizations for cybersecurity work as part of Project Glasswing. Anthropic says models at that capability level require stronger cyber safeguards before they can be generally released.
Then comes the key line: Anthropic says it expects to bring Mythos-class models to all customers in the coming weeks.
That means Opus 4.8 may not be the peak release of this cycle. It may be the bridge release. Anthropic is strengthening Opus, shipping the workflow layer, lowering fast mode costs, and preparing customers for a higher tier once the safety system is ready.
Clean takeaway
Claude Opus 4.8 is a reliability and orchestration release for the agent era.
The model is stronger, especially for coding, reasoning, tool use, and long-running professional work. But the bigger story is Claude Code. With dynamic workflows and ultracode, Anthropic is trying to turn Claude from a single AI programmer into a coordinator of many AI workers, each handling part of a bigger job and checking the others before anything reaches the user.
If this works, the next frontier is not just better chat. It is AI systems that can take on the kind of software projects humans used to budget in weeks, months, or quarters.
And Anthropic is already hinting that Opus is not the ceiling. Mythos-class models are coming next.
Sources
Anthropic, "Introducing Claude Opus 4.8": https://www.anthropic.com/news/claude-opus-4-8
Anthropic, Claude Opus product page: https://www.anthropic.com/claude/opus
Anthropic, "Claude Opus 4.8 System Card": https://www.anthropic.com/claude-opus-4-8-system-card
Claude, "Introducing dynamic workflows in Claude Code": https://claude.com/blog/introducing-dynamic-workflows-in-claude-code
Claude Code docs, "Orchestrate subagents at scale with dynamic workflows": https://code.claude.com/docs/en/workflows.md
Claude Code docs, "Run agents in parallel": https://code.claude.com/docs/en/agents.md
Claude Code docs, "Model configuration": https://code.claude.com/docs/en/model-config.md
Claude Code docs, "Speed up responses with fast mode": https://code.claude.com/docs/en/fast-mode.md
Claude Code docs, "Keep Claude working toward a goal": https://code.claude.com/docs/en/goal.md
Anthropic, Project Glasswing reference: https://www.anthropic.com/research/glasswing-initial-update
ClaudeAI announcement on X: https://x.com/claudeai/status/2060045358445576332
User-provided X status/video: https://x.com/i/status/2060042702150930686
PS here’s my video about it:

