OpenAI's New Science Benchmark
PLUS: xAI Upgrades Grok Video, Anthropic Updates Claude Design and more.

Today:
OpenAI's New Science Benchmark
Perplexity Gives AI Memory
Midjourney Enters Medical Tech
xAI Upgrades Grok Video
Anthropic Updates Claude Design
OpenAI has launched LifeSciBench, a rigorous new benchmark designed to test AI models on complex, real-world life science research. While older benchmarks evaluate models on their ability to recall simple biological facts, LifeSciBench assesses how well an AI can handle the messy reality of actual scientific work — interpreting conflicting data, designing assays, and troubleshooting experiments.
Important Details:
Expert-Authored: The benchmark features 750 tasks created by 173 practicing life scientists with Ph.D.-level training and industry biotech experience.
Realistic Complexity: 79% of the tasks require multiple reasoning steps (averaging four steps per task), moving far beyond structured multiple-choice questions.
Data-Rich Analysis: The tasks include 1,062 attached artifacts — genomic sequence files, PDFs, chemical structures, and figures — that the AI must analyze to reach a conclusion.
Nuanced Grading: Responses aren't just graded as "right" or "wrong." Expert-developed rubrics containing over 19,000 criteria evaluate whether the AI included the caveats, justifications, and formatting a human scientist would expect.
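To make the rubric idea concrete, here is a minimal sketch of how a response might be scored against a list of criteria. It is an illustration only: the `Criterion` fields, the point weights, and the "fraction of points earned" aggregation are assumptions, not OpenAI's published grading method.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line. All fields are hypothetical, for illustration."""
    description: str
    points: float  # assumed weight; how LifeSciBench weights criteria is not stated
    met: bool      # judgment supplied by a human or model grader


def rubric_score(criteria):
    """Return the fraction of available points earned, between 0 and 1."""
    total = sum(c.points for c in criteria)
    if total == 0:
        raise ValueError("rubric has no points to award")
    earned = sum(c.points for c in criteria if c.met)
    return earned / total


example = [
    Criterion("States the correct conclusion", 2.0, True),
    Criterion("Justifies the conclusion from the attached data", 2.0, True),
    Criterion("Notes the main caveat", 1.0, False),
]

score = rubric_score(example)
print(f"{score:.2f}")  # prints 0.80
```

The point of the sketch is that a response with the right answer but a missing caveat earns partial credit rather than a flat pass or fail.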
Perplexity AI is rethinking how AI memory functions with the introduction of Brain. Instead of just remembering user preferences (like your favorite tone of voice or formatting), Brain is a self-improving memory system that tracks the work an agent actually does. By learning from its successes, dead ends, and user corrections, the AI agent essentially "learns on the job" and gets better over time.
Important Details:
The Context Graph: Brain automatically builds and updates an "LLM wiki" overnight. This persistent, updating map tracks the projects, sources, and context relevant to your specific tasks.
Learning from Mistakes: It remembers when a specific source led to a dead end or when a user corrected an output, preventing the AI from repeating past errors.
Performance Boost: Early data shows Brain increases answer correctness by 25% and recall by 16% on tasks the AI has seen before.
Cost Efficiency: By getting to the right answer faster and with fewer prompt turns, Brain cuts the computing cost of tasks requiring historical context by 13%.
Availability: The feature is currently rolling out in Research Preview to Perplexity Max and Enterprise Max subscribers.
Midjourney — the company famous for its generative AI art — has announced a surprising pivot into preventative healthcare with a new division called Midjourney Medical. Their first product is the Midjourney Scanner, a full-body ultrasound machine that aims to deliver MRI-like image quality in under 60 seconds. The company plans to deploy these scanners in high-end wellness spas, making regular internal health tracking as casual as a spa visit.
Important Details:
How It Works: The patient is lowered into a shallow pool of water, where a ring of 500,000 tiny sensors works like a dolphin's echolocation. The sensors emit sound waves and use the echoes to build a highly detailed 3D map of muscle, fat, bone, and organs.
Speed & Safety: A full-body scan takes about 60 seconds (compared to 60–90 minutes for a traditional MRI) and utilizes no harmful radiation or powerful magnetic fields.
The Hardware Partnership: The device was co-developed with Butterfly Network, utilizing 40 of their advanced "Ultrasound-on-Chip" imaging modules paired with two petaflops of processing power.
The Rollout: Initial scans will focus on body composition mapping, which does not require strict FDA diagnostic clearance. Midjourney plans to open its first scanner-equipped spa in San Francisco's Union Square by late 2027.
Long-term Ambition: CEO David Holz wants 50,000 scanners deployed globally by 2031, aiming to use proactive early imaging to prevent 30% of all deaths and cut healthcare costs in half.
🧠RESEARCH
LoopCoder-v2 tests language models that run the same processing layers more than once, allowing extra internal "thinking" without adding parameters (the model's learned settings). A seven-billion-parameter coding model performed best with two cycles, raising its SWE-bench Verified score from 43.0 to 64.4. More cycles hurt accuracy because the improvements faded while position errors remained, making the extra computation wasteful.
ZPPO trains smaller AI models by placing a stronger teacher's answers inside prompts as guidance, instead of forcing students to copy them directly. Hard questions are practiced using both correct and incorrect examples. Across 31 language, image, and video tests, ZPPO beat other training methods, with the biggest improvements in the smallest models.
GameCraft-Bench tests whether coding agents can build complete, playable games in the Godot engine from written instructions. Its 140 tasks cover 15 game types and judge mechanics, content, visuals, and presentation through recorded gameplay. The best agent scored only 41.46 percent, showing today’s systems often create prototypes, not finished games.
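For readers curious about the layer-reuse idea in the LoopCoder-v2 item above, here is a toy sketch. It is not the paper's architecture: the tiny `layer` function, the weights, and the helper names are all assumptions chosen to show that extra cycles add computation but no parameters.

```python
import math


def layer(x, weights, bias):
    """One toy layer: tanh(W x + b). Stands in for a transformer block."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def looped_forward(x, weights, bias, cycles):
    """Apply the same layer `cycles` times, reusing one set of weights."""
    for _ in range(cycles):
        x = layer(x, weights, bias)
    return x


def parameter_count(weights, bias):
    """Count learned values; this does not depend on the number of cycles."""
    return sum(len(row) for row in weights) + len(bias)


w = [[0.5, -0.25], [0.1, 0.3]]  # hypothetical weights
b = [0.0, 0.1]                  # hypothetical biases
x = [1.0, 2.0]

one_cycle = looped_forward(x, w, b, cycles=1)
two_cycles = looped_forward(x, w, b, cycles=2)
print(parameter_count(w, b))  # prints 6, whether the layer runs once or twice
```

Running two cycles doubles the work done per input while the parameter count stays fixed, which is the trade-off the research summary describes.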
🗞️MORE NEWS
xAI Upgrades Its Video Generator: xAI released Grok Imagine Video 1.5, an upgraded system that turns still pictures into moving video clips. The new version creates smoother motion, matches sound to the video more closely, and builds clips almost twice as fast as older versions. It also adds project folders so creators can organize multiple ideas at the same time.
Anthropic Sharpens Claude Design: Anthropic launched updates for Claude Design, an AI tool used to build visual layouts and computer interfaces. The system now remembers a company's brand colors and styles across projects and lets users make direct edits right on their screen. It also connects with coding tools and other everyday software apps to speed up routine work.
ChatGPT Can Now Schedule Chores: OpenAI gave ChatGPT a new feature called "Scheduled Tasks," allowing the AI to automatically send reminders, handle recurring chores, or watch the internet for updates. Users get a dedicated page to view, pause, or edit these automated actions in one place without needing to prompt the AI every time.
OpenAI Gears Up for Physical Gadgets: Ha Thai recently left Meta to join OpenAI as the head of public relations for its upcoming devices. Her hiring is a strong signal that OpenAI is preparing to release its first physical hardware product later this year.
Star Researcher Leaves Google for OpenAI: Noam Shazeer, a key pioneer in how modern artificial intelligence is structured, has left Google to join rival OpenAI. He had returned to Google only in 2024, after the company spent $2.7 billion to bring his startup team back in-house.
OpenAI Hires Former Government Advisor: OpenAI hired Dean Ball, a former AI advisor to the Trump administration, to lead a new team focused on future company strategy. He will help shape OpenAI's internal rules and figure out how the company should handle complex government laws regarding advanced technology.