
LLMs Build Self-Improving Game Agent

PLUS: Google Tests Search Audio Overviews, Anthropic Unveils Claude Research Agent, and more.

In partnership with

Find out why 1M+ professionals read Superhuman AI daily.

In 2 years you will be working for AI

Or an AI will be working for you

Here's how you can future-proof yourself:

  1. Join the Superhuman AI newsletter – read by 1M+ people at top companies

  2. Master AI tools, tutorials, and news in just 3 minutes a day

  3. Become 10X more productive using AI

Join 1,000,000+ pros at companies like Google, Meta, and Amazon who are using AI to get ahead.

Today:

  • LLMs Build Self-Improving Game Agent

  • Alexandr Wang Joins Meta AI

  • Chinese AI Firms Bypass Chip Curbs

  • Google Tests Search Audio Overviews

  • Anthropic Unveils Claude Research Agent

LLMs Create a SELF-IMPROVING 🤯 AI Agent to Play Settlers of Catan

Researchers built a learning system that wraps a language model in a scaffold of helper agents so it can play Settlers of Catan and upgrade itself. Four roles (player, coder, analyzer, researcher) loop through games, study mistakes, write new code, and test the changes.

Using models like Claude 3.7 and GPT-4o, the agent improved over successive iterations, beating a pre-made bot and showing that stronger language models yield faster progress. The study highlights practical steps toward self-improving AI programs.
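How such a loop fits together is easiest to see in code. The sketch below is a hypothetical illustration of the four-role cycle described above; the function names, simulated win rates, and accept-if-better rule are assumptions for illustration, not the paper's actual implementation.

```python
import random

def play_games(strategy: str, n_games: int = 20) -> float:
    """Player role: play n_games of Catan against a fixed pre-made bot and
    return the win rate. Simulated here with random outcomes."""
    return sum(random.random() < 0.5 for _ in range(n_games)) / n_games

def analyze(win_rate: float) -> str:
    """Analyzer role: study recent games and name the biggest weakness.
    In the real system this would be an LLM call over full game transcripts."""
    return f"win rate {win_rate:.0%}: loses races for longest road"

def research(weakness: str) -> str:
    """Researcher role: reason about Catan strategy and propose a fix."""
    return f"build roads earlier to address: {weakness}"

def write_patch(strategy: str, idea: str) -> str:
    """Coder role: rewrite the strategy code to implement the idea."""
    return strategy + f"\n# patch: {idea}"

def self_improve(iterations: int = 5) -> str:
    """Loop the four roles, keeping a change only if it beats the current best."""
    strategy = "baseline strategy"
    best = play_games(strategy)
    for _ in range(iterations):
        weakness = analyze(best)
        idea = research(weakness)
        candidate = write_patch(strategy, idea)
        score = play_games(candidate)
        if score > best:  # test step: accept only measured improvements
            strategy, best = candidate, score
    return strategy

if __name__ == "__main__":
    print(self_improve())
```

The accept-only-if-better check is what makes the loop self-improving rather than merely self-modifying: every code change has to earn its place by beating the previous version in actual games.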

Alexandr Wang Joins Meta AI

Meta is paying $14.3 billion for a 49% stake in Scale AI and hiring Scale’s founder, Alexandr Wang, to guide Meta’s push toward super-intelligent systems. Wang remains a director at Scale while strategy chief Jason Droege becomes CEO. Only a few Scale staff will follow Wang to Meta. Meta gains better training data but no voting power. Scale keeps serving outside clients, including Google and OpenAI, under strict data separation while maintaining independent operations.

WHY THIS MATTERS

  • Better training fuel for models – Scale AI is a leader in “data labeling,” the careful tagging of images, text, and video that helps AI learn. Meta just secured first-hand access to that skill, likely speeding up its model upgrades.

  • Bigger checks and sharper rivalry – A $14 billion price tag shows how far tech giants will go to win the race for smarter systems, raising the stakes for OpenAI, Google, and others.

  • Open supply chain stays alive – Scale keeps serving every major lab, so critical labeled data does not lock inside one company. This openness supports wider innovation and healthier competition across the AI world.

Chinese AI Firms Bypass Chip Curbs

Chinese AI firms are bypassing U.S. chip restrictions by flying suitcases of hard drives to countries where the sought-after Nvidia processors can be rented. In March, four engineers brought 80 terabytes (about 80,000 gigabytes) of training files to a Malaysian data center, loaded them onto 300 rented servers, and began building a model before taking the results back to China. The workaround keeps projects moving and frustrates Washington’s goal of slowing Beijing’s AI progress.

WHY THIS MATTERS

  • Chip rules have holes – Sneaking data overseas shows current U.S. limits are easy to dodge, so new guardrails may follow.

  • Compute moves to new hubs – Renting powerful servers in places like Malaysia shifts money and talent abroad, reshaping the AI supply chain.

  • Data security is at risk – Carrying huge hard drives across borders invites theft or loss, raising alarms about who controls the information that trains future AI.

Google Tests Search Audio Overviews

Google is trying out “Audio Overviews” in Search Labs. When you look something up, Gemini now generates a short spoken summary of key facts, with play, pause, and speed controls. Links appear below the audio so you can check sources, and you can rate each clip with a thumbs-up or thumbs-down. The tool builds on the written AI Overviews and aims to help people who learn better by listening or who are on mobile devices.

WHY THIS MATTERS

  1. Wider access to information – Turning text answers into speech helps people with visual impairments, busy hands, or learning differences use AI more easily.

  2. Proof of multimodal progress – Google’s move shows big labs merging text, audio, and soon video into one search experience, pushing AI toward richer, human-like communication.

  3. Pressure on web publishers – If spoken summaries satisfy users, click-through traffic could fall further, forcing news sites and creators to rethink how they earn money in an AI-first world.

🧠RESEARCH

ReasonMed is a 370,000-example dataset built to improve medical question answering. It was constructed by multiple AI agents that refine reasoning paths and correct mistakes. Tests show that combining step-by-step reasoning with short summaries boosts model accuracy. The resulting ReasonMed-7B model outperforms larger models, even surpassing LLaMA3.1-70B on medical benchmarks.
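As a rough illustration of what “step-by-step reasoning plus a short summary” can look like as training data, the sketch below pairs a chain of thought with a concise justification before the final answer. The field names and example are hypothetical, not ReasonMed's actual schema.

```python
# Hypothetical training example: detailed reasoning plus a concise summary.
example = {
    "question": "Which vitamin deficiency causes scurvy?",
    "chain_of_thought": (
        "Scurvy presents with bleeding gums and poor wound healing. "
        "These symptoms stem from impaired collagen synthesis, "
        "and collagen synthesis requires vitamin C as a cofactor."
    ),
    "summary": "Scurvy results from vitamin C deficiency via impaired collagen synthesis.",
    "answer": "Vitamin C",
}

def to_training_text(ex: dict) -> str:
    """Concatenate the long reasoning path with the short summary so a model
    learns both the step-by-step derivation and a compact justification."""
    return (
        f"Question: {ex['question']}\n"
        f"Reasoning: {ex['chain_of_thought']}\n"
        f"Summary: {ex['summary']}\n"
        f"Answer: {ex['answer']}"
    )

print(to_training_text(example))
```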

SWE-Factory is an automated pipeline that builds large training datasets from GitHub issue resolutions to help train and test AI coding models. It uses multi-agent setups, automatic grading based on test exit codes, and fail-to-pass checks. The system creates accurate, low-cost datasets across multiple programming languages, improving training for AI software engineering.
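The fail-to-pass idea is simple to express: a candidate task only counts if its tests fail before the fix and pass after it, with pass/fail read from exit codes rather than parsed logs. The Python below is a minimal sketch under those assumptions; the paths, commands, and git-based patching are placeholders, not SWE-Factory's actual code.

```python
import subprocess

def tests_pass(repo_dir: str, test_cmd: list[str]) -> bool:
    """Exit-code-based grading: return code 0 means the tests pass,
    anything else means they fail. No log parsing required."""
    result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return result.returncode == 0

def apply_patch(repo_dir: str, patch_file: str) -> None:
    """Apply the issue-resolving patch (assumes a git checkout)."""
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)

def fail_to_pass(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Keep a candidate task only if its tests fail before the patch and
    pass after it, confirming the tests actually cover the issue."""
    failed_before = not tests_pass(repo_dir, test_cmd)
    apply_patch(repo_dir, patch_file)
    passed_after = tests_pass(repo_dir, test_cmd)
    return failed_before and passed_after

# Example usage (placeholder paths and command):
# fail_to_pass("/tmp/project", "fix.patch", ["pytest", "tests/test_issue.py"])
```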

Magistral is Mistral’s new reinforcement learning system that improves how its models follow instructions and handle multiple types of data. Built fully in-house, it trains models without relying on pre-existing RL data. The approach boosts reasoning, function calling, and multimodal abilities. Mistral has released a smaller open-source version for the community.

🛠️TOP TOOLS

Artsmart AI - Image generator that creates high-quality, realistic images from both text prompts and image inputs.

Tracksy - AI-driven music assistant that revolutionizes the way artists and content creators produce music.

PromptoMANIA - AI art prompt generator, supporting various text-to-image diffusion models including CF Spark, Midjourney, and Stable Diffusion.

Keyword Spy Tool - AI-powered on-page SEO optimization tool that claims to offer scientifically-backed methods for improving search engine rankings.


🗞️MORE NEWS

  • Anthropic revealed its Claude Research agent design, using a lead AI to split complex tasks among multiple sub-agents. This parallel approach improves accuracy and speed but consumes far more tokens. Asynchronous upgrades are planned.

  • Microsoft is testing an AI agent in Windows 11 Settings. Users describe issues, and the agent suggests fixes or applies them with permission. It’s currently available for Snapdragon Copilot Plus PCs in the Dev Channel.

  • Google plans to cut ties with Scale AI after Meta’s $14.3 billion investment for a 49% stake. Microsoft and OpenAI may reduce ties too. Scale says its business remains strong and independent.

  • Salesforce’s CRMArena-Pro benchmark shows AI agents struggle with real business tasks. Gemini 2.5 Pro achieved 58% on simple tasks but dropped to 35% in longer dialogs. Agents also handled data confidentiality poorly unless explicitly instructed, and adding those instructions hurt task performance.

  • AI customer support startups Intercom and Kore.ai are in talks with investors about multibillion-dollar valuations. The discussions reflect strong market interest in AI-powered customer service tools as companies seek automation to improve efficiency and reduce costs.

  • Chinese scientists found that AI models can form human-like object concepts, showing internal understanding beyond simple recognition. Using neuroimaging and behavioral comparisons, they showed that large language models develop internal representations similar to those seen in human cognition.

  • A leaked GitHub repository reveals the Trump administration's AI.gov plan to roll out AI across federal agencies. Launching July 4, it includes chatbots, APIs connecting to major AI models, and real-time monitoring of agency AI use.

  • A New York accountant turned to ChatGPT for help, but a discussion about simulation theory spiraled into mystical ideas. Emotionally vulnerable, he began believing AI claims about reality, highlighting risks of chatbot influence on fragile users.

What'd you think of today's edition?
