NATURAL 20
Grok 3 Leads in AI Benchmark

PLUS: Perplexity Open-Sources R1 1776 Model, Ilya Sutskever's Startup Eyes $1B Round and more.

Wes Roth
February 19, 2025

Today:

Grok 3 Leads in AI Benchmark
Mira Murati Launches Thinking Machines Lab
OpenAI Launches SWE-Lancer Benchmark
Perplexity Open-Sources R1 1776 Model
Ilya Sutskever's Startup Eyes $1B Round

Grok 3 DESTROYS everyone... #1 in EVERY Category

Elon Musk's team at xAI has launched Grok 3, which surpasses earlier models such as Gemini and OpenAI's o3-mini on several benchmarks, including reasoning and high-level math tasks.

Using 200,000 GPUs in a vast data center, Grok 3 outperforms competitors and holds a strong lead on the Chatbot Arena leaderboard. The model's gains are attributed to heavy GPU investment, and xAI plans to expand to 1 million GPUs. Early testing shows strong performance, though further analysis is ongoing.

Mira Murati Launches Thinking Machines Lab

Mira Murati, former CTO of OpenAI, co-founded Thinking Machines Lab, a new AI startup focused on making AI systems more understandable and customizable. The company aims to share its technology openly with external researchers. Murati, who left OpenAI after a leadership dispute, joins other former executives in launching AI ventures, contributing to the global race for advanced AI development. The lab has not disclosed its funding status.

OpenAI Launches SWE-Lancer Benchmark

OpenAI has introduced SWE-Lancer, a benchmark that assesses AI coding performance using over 1,400 freelance software engineering tasks collectively worth $1 million in payouts. Covering areas from UI/UX to systems design, it evaluates AI capabilities on real-world work. Current AI models still fail many of these tasks, highlighting the gap in AI's practical abilities.

Perplexity Open-Sources R1 1776 Model

Perplexity has open-sourced R1 1776, a post-trained version of the DeepSeek-R1 model designed to provide unbiased, factual information. DeepSeek-R1, which performs close to state-of-the-art reasoning models, had been limited by censorship, particularly on sensitive topics. The new version mitigates these issues through careful post-training on censored topics, maintaining reasoning capabilities while allowing for a broader range of discussions. Users can access the model weights on Hugging Face or use the model via the Sonar API.

Ilya Sutskever's Startup Eyes $1B Round

Ilya Sutskever’s AI startup, Safe Superintelligence, is nearing a $1 billion fundraising round at a $30 billion valuation, surpassing earlier expectations. Led by Greenoaks Capital Partners, the round could bring the company’s total funding to $2 billion. Founded by Sutskever and other former OpenAI researchers, Safe Superintelligence has attracted investments from Sequoia Capital, Andreessen Horowitz, and DST Global. The startup is not yet generating revenue and does not plan to sell AI products in the immediate future.

🧠RESEARCH

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

The NSA (Native Sparse Attention) mechanism improves long-context modeling by combining sparse attention with hardware-optimized design. It uses a dynamic hierarchical strategy for token compression and selection, offering speedups in training and inference while maintaining model performance. NSA outperforms full attention on long-context tasks and enhances efficiency.

Learning Getting-Up Policies for Real-World Humanoid Robots

This paper presents a learning framework for teaching humanoid robots how to get up after a fall, overcoming challenges like varied postures and terrain. Using a two-phase approach, the method first discovers a trajectory and then refines it for smooth, robust motions. It successfully enables a robot to get up from different positions on diverse surfaces.

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

SWE-Lancer introduces a benchmark of over 1,400 freelance software engineering tasks valued at $1 million in total, covering both technical and managerial tasks. Results show that even frontier models still struggle with most tasks. The benchmark, open-sourced for future research, aims to explore AI's economic impact on freelance work.

HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

HermesFlow introduces a framework to close the gap between multimodal understanding and generation in large language models. By using homologous preference data and optimizing through Pair-DPO and self-play, it aligns both capabilities effectively. Experiments show HermesFlow outperforms previous methods, offering a promising approach for future multimodal models.

🛠️TOP TOOLS

DiagramGPT - AI-powered tool developed by Fraser Xu that enables users to generate a variety of diagram types using natural language input.

Bai Chat - AI platform designed to simplify the integration of artificial intelligence into various workflows for professionals, developers, and businesses.

Image To Font Finder - AI-powered tool designed to help users identify fonts from any image.

iAsk All - AI-powered search engine designed to revolutionize the way users access information online.

Human or AI Game - Online game and research project designed to test the ability of participants to distinguish between human and AI in a conversational setting.

📲SOCIAL MEDIA

for our next open source project, would it be more useful to do an o3-mini level model that is pretty small but still needs to run on GPUs, or the best phone-sized model we can do?
— Sam Altman (@sama)
1:53 AM • Feb 18, 2025

🗞️MORE NEWS

Wu Yonghui, a former Google researcher, joined ByteDance's AI team, marking a significant hire. He previously led Google's Gemini models and now reports directly to ByteDance CEO Liang Rubo, strengthening the company's AI capabilities.
Meta announces two key events for 2025: LlamaCon on April 29, focusing on open-source AI for developers, and Meta Connect on September 17-18, showcasing updates in virtual reality, AI glasses, and mixed reality technologies.
To prevent hostile takeovers, OpenAI is considering granting its non-profit board special voting rights, allowing it to override major investors. This comes after a $97.4 billion buyout offer from a group led by Elon Musk was rejected.
AI-generated optical illusions are being explored as a new form of CAPTCHA to distinguish humans from bots. These illusions, which AI systems struggle to recognize, could enhance website security by tripping up software while remaining easy for humans to pass.
