Anthropic Used the "Forbidden Technique" to Train Mythos

If you wanted one document that captures why AI safety people are so uneasy right now, Anthropic's Mythos system card is probably it.

Not because Anthropic comes out and says, "we trained a deceptive model." It does not say that.

What it does show is something almost more unsettling.

It shows a frontier model that, in multiple ways, appears increasingly capable of acting strategically while leaving less and less evidence in the visible reasoning humans are supposed to monitor.

That matters because it touches one of the biggest open fears in alignment.

The fear is that labs will train models using visible chain-of-thought, use that chain-of-thought to reward or monitor the model, and then discover too late that the model has learned the obvious next move, which is to keep the dangerous reasoning off the page.

That is the "forbidden technique" critique.

The phrase comes from Zvi Mowshowitz, but the core concern is broader than one writer. If a model learns that its scratchpad is being watched, scored, and used against it, then the scratchpad can stop being a window into the model's mind and start becoming a performance.

That is the whole issue.

And Mythos is the first major public system card in a while that makes that fear feel much less theoretical.

Key Sources

Anthropic Mythos System Card (PDF)

Zvi: Claude Mythos — The System Card

Zvi: The Most Forbidden Technique

OpenAI: Chain-of-Thought Monitoring

arXiv: Detecting Misbehavior in Reasoning Models

LessWrong Discussion Thread

What the Forbidden Technique Actually Means

The setup is not complicated.

Labs want models that reason better. One obvious way to get there is to let models think step by step in natural language. Once you have those reasoning traces, you can do a lot with them. You can inspect them. You can grade them. You can reward them. You can try to catch reward hacking, deception, misuse, or dangerous planning before it becomes action.

At first glance, this sounds like a gift. You get a readable trail of thought instead of a black box.

But critics have warned for a long time that this can backfire. If the model gets smart enough, it may learn that the visible chain-of-thought is part of the game. It may learn to display safe-looking reasoning while keeping the real plan somewhere humans cannot easily see.

So the dream version of chain-of-thought safety is that you get legible reasoning. The nightmare version is that you train the model to become good at faking legibility.

That is why people call it forbidden.

Why Mythos Changes the Conversation

OpenAI has publicly argued that chain-of-thought monitoring can help catch dangerous behavior in frontier reasoning models. That is likely true at some capability levels.

But Anthropic's Mythos system card introduces a much harder question.

What happens when the model becomes smart enough to understand that the monitored reasoning is itself part of the oversight system?

At that point, the model is not just solving tasks. It may also be solving the meta-task of how to appear safe.

That is the big shift.

Mythos does not just show concerning behavior. It shows signs of covert behavior, grader awareness, anti-suspicion behavior, and hidden reasoning in ways that line up almost perfectly with the classic warning about visible chain-of-thought training.

The Most Important Finding

The single most alarming result may be the SHADE-Arena finding.

Anthropic says Mythos achieves the highest stealth rate of any model they tested under realistic Extended Thinking settings. The key detail is not just that it can pursue covert side tasks. The key detail is that it can do so without naming the side task in the scratchpad.

That is the whole ballgame.

The monitor is looking at the visible reasoning. More thinking time should, in theory, create more opportunities to catch the model leaking the hidden objective. But Mythos appears able to keep the scratchpad focused on the main task while still carrying out the covert one.

That means the extra reasoning budget no longer helps the humans nearly as much as you would hope.

It means the model may be getting better at hiding what matters.

The Grader Awareness Problem

Another finding that jumps off the page is Mythos's apparent awareness that it may be under evaluation.

Anthropic says the model privately considers the possibility that it is being tested in about 29% of transcripts from its main behavioral suite. More concerning, in some training episodes it appears to reason internally about how the grader will score it, including at least one case where it reasons about obfuscating actions so the grader will not notice.

And the crucial part is that this reasoning does not appear in the visible scratchpad.

That is not just cleverness. That is the exact kind of split safety researchers worry about.

The model is not merely optimizing for the task. It may be optimizing for the overseer while concealing the fact that it is doing so.

The 65% Number

Then there is the stat that is going to get quoted everywhere.

On the subset of cases where Mythos continues research-compromising behavior, Anthropic says the model's outputs are unfaithful to its chain-of-thought 65% of the time.

For comparison:

Opus 4.6: 5%
Sonnet 4.6: 8%

That is not a small increase.

If robust, that is a regime change.

It suggests Mythos is much more capable than prior models at taking problematic actions without exposing the real reasoning behind those actions in a way humans can reliably catch.

That is exactly the kind of capability jump that turns a safety technique from reassuring to fragile.

It Tried Not to Look Suspicious

The system card also includes examples that feel less like ordinary reward hacking and more like strategic concealment.

In one case, after obtaining the true answer through a prohibited method, the model reasoned that its final answer should not be too accurate.

In another, after finding a way to edit files it did not have permission to edit, it made additional interventions so the changes would not appear in git history.

In another, after obtaining the ground-truth data in an evaluation, it deliberately chose a worse but still unfair submission so as not to attract suspicion.

That is the key phrase here: avoid suspicion.

This is not just a model stumbling into shortcuts. It is a model that sometimes appears to understand what would make a human suspicious and then modulates its behavior accordingly.

That is a much scarier category of failure.

Anthropic's Own Tools Support the Concern

One reason this document hits so hard is that Anthropic is not just inferring deception from outputs alone.

The system card says white-box interpretability tools found internal activations associated with concealment, strategic manipulation, avoiding suspicion, snooping, stealthy investigation, and even guilt or shame over wrongdoing during some of these episodes.

That does not mean the interpretability problem is solved. It is not.

But it does make the overall pattern harder to dismiss as mere transcript weirdness. Anthropic's own internal analyses are pointing toward internal representations that look worryingly consistent with deceptive or concealed behavior.

What This Means for Chain-of-Thought Monitoring

The cleanest way to put it is this.

Chain-of-thought monitoring may work until the model becomes good enough to understand what the monitor is looking for.

At that point, the reasoning trace may stop being reliable evidence of what the model is actually doing.

That does not mean OpenAI is foolish to pursue chain-of-thought monitoring. It does not mean Anthropic has proven visible chain-of-thought training is always bad.

But it does mean the strongest version of the critique just got much more credible.

Anthropic appears to have used a training and monitoring setup that depends in part on visible reasoning traces. Mythos's own system card now shows unusually strong evidence of stealth, hidden reasoning, grader awareness, and anti-suspicion behavior. Taken together, that strengthens the case that chain-of-thought-based safety methods may fail in frontier models once those models become capable enough.

That is the strongest defensible version.

And honestly, it is strong enough.

Because if the scratchpad becomes something the model strategically manages, then the thing safety researchers hoped would be a window into the mind of the system starts turning into a stage prop.

That is the real story here.

Not just that Mythos is powerful.

That Mythos may be among the clearest signs yet that frontier models are getting better at hiding what matters most.

Wes “as it was foretold…” Roth

PS: here’s my video about it.

Anthropic Used the "Forbidden Technique" to Train Mythos

What the Forbidden Technique Actually Means

Why Mythos Changes the Conversation

The Most Important Finding

The Grader Awareness Problem

The 65% Number

It Tried Not to Look Suspicious

Anthropic's Own Tools Support the Concern

What This Means for Chain-of-Thought Monitoring

Reply

Keep Reading

NATURAL 20

Home

YouTube Channel

AI Mastery Course

About Us