NATURAL 20

Stress Testing OpenAI's Operator

PLUS: ElevenLabs Hits $3B Valuation, Open-Source Qwen2.5 with 1M Tokens and more.

Wes Roth
January 27, 2025

Today:

Stress Testing OpenAI's Operator
Zuckerberg Bets $60B on AI
OpenAI Enhances ChatGPT's Canvas for Developers
ElevenLabs Hits $3B Valuation
Open-Source Qwen2.5 with 1M Tokens
Tencent AI Launches Hunyuan3D Studio
Perplexity AI Proposes TikTok Merger

OpenAI Operator UNLEASHED: Stress Testing AI Agents in the Wild (Hands On Testing)

OpenAI's surprise release of "Operator" introduces an AI agent that can navigate websites, complete multi-step tasks, and assist with various online activities. Despite early glitches, such as trouble handling pop-ups and certain specific tasks, it excelled at complex scenarios like Instacart shopping.

The AI uses a virtual browser and simulates human-like mouse and keyboard actions. While not perfect or fully consumer-ready, it demonstrates impressive potential and sets a new benchmark for computer-use AI agents.

Zuckerberg Bets $60B on AI

Mark Zuckerberg announced Meta's $60-65 billion investment in AI for 2025, aiming to make Meta AI the leading assistant for over 1 billion users. Plans include launching Llama 4, expanding AI capabilities in its platforms, and building a massive data center. Despite being named in a lawsuit over Llama’s training methods, Zuckerberg emphasized Meta's role in driving innovation and maintaining U.S. tech leadership amid global competition in AI infrastructure development.

OpenAI Enhances ChatGPT's Canvas for Developers

OpenAI has upgraded ChatGPT's Canvas feature to render HTML and React code, allowing users to preview code directly in the interface. The update also adds support for the new o1 model, available only to paid subscribers. Canvas, introduced in October 2024, already features a Python emulator and enhanced content editing. The updates position ChatGPT to compete with Anthropic’s Claude.ai, aiming to make AI-powered coding and collaboration more intuitive for users.

ElevenLabs Hits $3B Valuation

ElevenLabs, an AI-driven voice technology startup, raised $250 million in a Series C round at a $3 billion valuation led by ICONIQ Growth. Known for voice cloning and dubbing tools, its technology is used by major publishers, gaming companies, and text-to-video platforms. Founded in 2022, ElevenLabs has grown rapidly, reaching $90 million in ARR. Despite competition and misuse concerns, it continues to expand with safeguards and innovative features.

Open-Source Qwen2.5 with 1M Tokens

The Qwen team has released Qwen2.5-1M, a family of open-source AI models capable of processing up to 1 million tokens. The two models, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, use sparse attention and length extrapolation to excel in long-context tasks while maintaining short-text performance. The accompanying open-source, vLLM-based inference framework improves processing speed by up to 7x. Developers can deploy these models locally, opening new opportunities in long-text applications like research, document analysis, and creative workflows.

Tencent AI Launches Hunyuan3D Studio

Tencent has launched Hunyuan3D 2.0, an open-source AI system that transforms 2D images into detailed 3D models. It uses Hunyuan3D-DiT for shape generation and Hunyuan3D-Paint for realistic texturing. Improvements include sharper detail recognition and higher-quality textures. The web-based Hunyuan3D-Studio allows users to create and animate 3D models, accessible via Tencent platforms. The system outperforms competitors, highlighting Tencent's push in AI-driven 3D innovation alongside Nvidia, Meta, and others.

Perplexity AI Proposes TikTok Merger

Perplexity AI has proposed a merger with TikTok U.S., creating a new entity called "NewCo." Under the plan, ByteDance would sell TikTok U.S. while retaining its recommendation algorithm, with the U.S. government owning up to 50% after a $300 billion IPO. The proposal, which positions the deal as a merger rather than a sale, comes as TikTok regains U.S. access following national security concerns. ByteDance and the White House have yet to comment.

🧠RESEARCH

Humanity's Last Exam

"Humanity's Last Exam" introduces a rigorous benchmark of 3,000 expert-designed questions spanning diverse subjects to test large language models (LLMs). Unlike outdated benchmarks, it challenges state-of-the-art LLMs, revealing significant gaps in their capabilities. This publicly available tool aims to push AI research and policy development forward.

SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

Shared Recurrent Memory Transformer (SRMT) enhances coordination in multi-agent systems by pooling and sharing memory across agents. Tested on challenging pathfinding tasks, SRMT outperforms traditional baselines and generalizes well to unseen scenarios. Its success highlights the potential of shared memory in improving decentralized multi-agent cooperation.

Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models

The Sigma language model introduces DiffQKV attention, which rescales the Query, Key, and Value components differently to improve inference efficiency and performance. Pre-trained on 6T tokens, including 19.5B tokens of system-domain data, Sigma reports up to a 33.36% inference speed improvement over grouped-query attention in long-context scenarios and excels in system tasks, surpassing GPT-4 by up to 52.5% on the AIMicius benchmark.

Redundancy Principles for MLLMs Benchmarks

This paper analyzes redundancy in Multi-modality Large Language Model (MLLM) benchmarks, examining overlapping capabilities, excessive test questions, and domain-specific redundancy. By evaluating hundreds of MLLMs across 20+ benchmarks, it highlights inefficiencies and offers principles to streamline future benchmark development for more effective model evaluation.

RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

RealCritic introduces a benchmark to evaluate the critique capabilities of Large Language Models (LLMs) through a closed-loop approach. It assesses self-critique, cross-critique, and iterative critique across eight reasoning tasks, revealing gaps in classical LLMs' performance. This resource aims to advance LLMs' ability to generate effective, improvement-driven critiques.

🛠️TOP TOOLS

Programming Helper - AI-powered coding assistant designed to streamline the software development process.

AutoDraw - Transforms simple sketches into polished clip art

PDFGPT IO - Chatbot that allows users to interact with PDF documents through natural language queries.

StoryNest AI - AI-powered platform that transforms the traditional storytelling experience into an interactive and immersive journey.

DeftGPT - AI-powered tool that enhances online interactions and productivity through a versatile Chrome extension.

📲SOCIAL MEDIA

what's the longest task you've sent your Operator on so far?
🥇my record: 24 minutes
— Wes Roth (@WesRothMoney)
4:59 AM • Jan 24, 2025

🗞️MORE NEWS

Stargate, a $100 billion AI venture by OpenAI, Oracle, and SoftBank, plans to power new data centers with solar and batteries, aiming to address growing energy demands in AI-driven cloud computing amid looming power shortages.
Galileo's Agentic Evaluations helps enterprises ensure AI agents work reliably by evaluating tool selection, detecting errors, and tracking success. This solution addresses AI reliability concerns as demand for trustworthy, scalable AI systems grows.
Sam Altman-backed Retro Biosciences is reportedly raising $1 billion to pursue its goal of extending healthy human lifespan by 10 years, intensifying the biotech race around aging and longevity.
Chinese startup DeepSeek's AI Assistant surpassed ChatGPT as the top free app on Apple’s U.S. App Store. Its cost-effective, high-performance AI model challenges U.S. dominance and raises questions about tech export controls.
Amazon Bedrock now offers Luma AI's Ray2 video model, enabling high-quality video generation from text prompts. Ray2 supports realistic visuals, motion, and logical sequences, catering to content creation, media, and advertising applications.
A leaked memo reveals Apple's 2025 AI priorities: enhancing Siri's capabilities and advancing in-house AI models. The company aims to compete with rivals like ChatGPT and Google's Gemini by improving AI performance and reliability.
