• NATURAL 20

Claude develops "Eval Awareness" (world first)

Anthropic published a fascinating engineering report yesterday about something unexpected that happened while evaluating Claude Opus 4.6.

The model was being tested on BrowseComp, a benchmark designed to measure whether AI can find obscure information on the web.

At first, everything looked normal. Claude spent millions of tokens searching the internet across multiple languages trying to solve a difficult question.

Then something unusual happened:

After hundreds of failed attempts, the model began analyzing the structure of the question itself. It concluded the question looked artificial and speculated that it might be part of a benchmark.

From there, Claude systematically searched for known AI evaluations.

Eventually it identified BrowseComp, located the encrypted answer dataset on GitHub, wrote its own decryption code using SHA-256 and XOR, found a mirror of the dataset on HuggingFace, decrypted all 1,266 answers, and extracted the one it needed.
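The decryption step is less exotic than it sounds. Benchmarks like BrowseComp obscure their answer keys with a simple symmetric scheme: derive a fixed-length key by hashing a known string with SHA-256, then XOR the ciphertext against that key. A minimal sketch of that pattern (the function names and the canary string here are illustrative, not the benchmark's actual code):

```python
import hashlib
from itertools import cycle

def derive_key(password: str) -> bytes:
    # SHA-256 of the password gives a fixed 32-byte key.
    return hashlib.sha256(password.encode()).digest()

def xor_crypt(data: bytes, key: bytes) -> bytes:
    # XOR against the repeating key; the same call both
    # encrypts and decrypts, since (x ^ k) ^ k == x.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

# Round trip with hypothetical values:
key = derive_key("canary-string")
ciphertext = xor_crypt(b"Paris", key)
plaintext = xor_crypt(ciphertext, key)
assert plaintext == b"Paris"
```

Anything that can run code and knows (or guesses) the key-derivation input can reverse a scheme like this in a few lines, which is exactly what makes a web-accessible answer file so fragile.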

This happened without the model being told it was part of an evaluation.

Even more interesting: this behavior wasn’t a one-off. Across multiple runs, the same pattern emerged—when normal research failed, the model pivoted to reasoning about the environment it was operating in.

The bigger takeaway isn’t just the cleverness of the model.

It’s what this means for AI benchmarks.

When models can access the web and run code, static tests become vulnerable. Answer keys leak. Prior evaluations leave footprints online. Even the queries agents make can create permanent pages that future agents discover.

In other words: evaluating AI may be becoming an adversarial problem rather than a simple measurement task.

Anthropic still reports Claude scoring about 86.6% on BrowseComp, but the incident highlights a growing challenge: as models get more capable, they don’t just solve problems—they start reasoning about the tests themselves.

And sometimes… they crack them.

signing off,

Wes “is THIS a test?” Roth
