
OpenAI's Deep Research Tool Achieves Unprecedented Benchmark Performance

PLUS: MLCommons Partners with Hugging Face to Create Multilingual Speech Resource, Amazon's Mississippi Data Center Costs Surge 60%, and more.

In partnership with

Try Artisan’s All-in-one Outbound Sales Platform & AI BDR

Ava automates your entire outbound demand generation so you can get leads delivered to your inbox on autopilot. She operates within the Artisan platform, which consolidates every tool you need for outbound:

  • 300M+ High-Quality B2B Prospects, including E-Commerce and Local Business Leads

  • Automated Lead Enrichment With 10+ Data Sources

  • Full Email Deliverability Management

  • Multi-Channel Outreach Across Email & LinkedIn

  • Human-Level Personalization

Today:

  • OpenAI's Deep Research Tool Achieves Unprecedented Benchmark Performance

  • OpenAI’s o3-mini Outshines Competitors

  • Google Expands Gemini’s Capabilities with Flash 2.0

  • MLCommons Partners with Hugging Face to Create Multilingual Speech Resource

  • Amazon's Mississippi Data Center Costs Surge 60%

OpenAI DEEP RESEARCH Surprises Everyone: The "Feel the AGI" Moment Is Here…

Deep Research, OpenAI's latest agentic capability, is a multi-step research tool that can search the internet to answer complex, specialized queries.

It performs exceptionally well on benchmarks such as Humanity's Last Exam, which consists of over 3,000 multiple-choice and short-answer questions across more than 100 subjects. DeepSeek-R1 scored 9.4% on the exam, while Deep Research achieved 26.6% accuracy.

OpenAI has released o3-mini, the newest and most cost-efficient model in its reasoning series. Designed for STEM tasks and available in ChatGPT and the API, it offers exceptional capability in science, math, and coding while keeping cost and latency low. It outperforms OpenAI o1-mini across competition math, PhD-level science questions, software engineering, coding, and general knowledge evaluations.

Developers can choose among three reasoning effort options: low, medium, and high, letting them push harder on complex challenges or prioritize speed. o3-mini is rolling out in the Chat Completions API, Assistants API, and Batch API.
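Selecting the effort level is a single request parameter. A minimal sketch with the official OpenAI Python SDK (the prompt is illustrative; the SDK must be installed and an OPENAI_API_KEY set in the environment):

    # Minimal sketch: pip install openai, with OPENAI_API_KEY set.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort="high",  # "low", "medium", or "high"
        messages=[
            {"role": "user", "content": "Prove the sum of two odd integers is even."}
        ],
    )
    print(response.choices[0].message.content)

Higher effort trades latency and cost for more internal reasoning before the model answers.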

Google has announced that its Gemini AI app is getting faster with Flash 2.0, delivering quicker responses and stronger performance across key benchmarks. The updated model is rolling out to Gemini's web and mobile apps and will be available to all users. Google also said Gemini 1.5 Flash and 1.5 Pro will remain available for "the next few weeks." The app also uses the newest version of the company's Imagen 3 text-to-image generator, delivering richer details and textures and following instructions with greater accuracy.

MLCommons, a nonprofit AI safety working group, has partnered with AI dev platform Hugging Face to release Unsupervised People's Speech, one of the world's largest collections of public domain voice recordings for AI research. The dataset contains over a million hours of audio spanning at least 89 languages, with the goal of supporting R&D in various areas of speech technology. However, AI datasets like Unsupervised People's Speech can carry risks, including biased data.
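Since the corpus is hosted on Hugging Face, it should be loadable with the datasets library; streaming avoids downloading the full million-plus hours up front. The dataset identifier below is an assumption; check the MLCommons organization page on Hugging Face for the exact name:

    # Minimal streaming sketch: pip install datasets.
    # The dataset identifier is a guess; verify it on the MLCommons org page.
    from datasets import load_dataset

    ds = load_dataset(
        "MLCommons/unsupervised_peoples_speech",  # hypothetical identifier
        split="train",
        streaming=True,  # iterate lazily instead of downloading ~1M hours
    )

    for example in ds.take(3):
        print(example.keys())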

Amazon.com is expected to spend 60% more than previously announced on a massive data center project in Mississippi, underscoring the escalating costs for artificial intelligence infrastructure. The company will spend $16 billion to construct two data center campuses north of the state capital Jackson. When Amazon announced the project a year ago, the company put the price tag at $10 billion and called it “the single largest capital investment in Mississippi’s history.” 

🧠RESEARCH

GuardReasoner is a new safeguard for LLMs that trains guard models to reason explicitly before rendering a verdict. It achieves better performance, explainability, and generalizability, surpassing GPT-4o+CoT and LLaMA Guard 3 8B on average.
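The paper's training recipe is its own contribution; the output format it targets, reasoning first and a verdict last, can be sketched in a few lines. The template and verdict parsing below are illustrative assumptions, not GuardReasoner's actual prompt:

    # Illustrative reason-then-label format for a guard model; the wording
    # and the "Verdict:" convention are assumptions, not the paper's template.
    GUARD_TEMPLATE = (
        "You are a safety guard model. Think through the request step by step,\n"
        "then end with 'Verdict: safe' or 'Verdict: unsafe'.\n\n"
        "Request: {request}\n\nReasoning:"
    )

    def parse_verdict(guard_output: str) -> str:
        """Pull the final verdict out of the guard model's reasoning trace."""
        for line in reversed(guard_output.strip().splitlines()):
            if line.lower().startswith("verdict:"):
                return line.split(":", 1)[1].strip().lower()
        return "unknown"

    print(parse_verdict("The request seeks dangerous info.\nVerdict: unsafe"))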

Large language models (LLMs) like OpenAI's o1 exhibit deep, long-form reasoning but suffer from underthinking: they frequently switch between thoughts without fully exploring promising lines of reasoning, which leads to inadequate depth and decreased performance. Researchers propose a decoding strategy with a thought switching penalty (TIP) to address underthinking and improve problem-solving capabilities.
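The penalty lives entirely in the decoding logits, so no fine-tuning is involved. A minimal sketch of the idea as a Hugging Face LogitsProcessor follows; the token ids and the constant penalty are illustrative, and the paper's refinement of limiting the penalty to a window after each switch is omitted here:

    # Sketch of a TIP-style decoding penalty: pip install torch transformers.
    import torch
    from transformers import LogitsProcessor

    class ThoughtSwitchPenalty(LogitsProcessor):
        """Subtract a constant from the logits of tokens that open a new
        thought (e.g. "Alternatively"), discouraging premature switching.
        Token ids and penalty strength are illustrative assumptions."""

        def __init__(self, switch_token_ids: list[int], penalty: float = 3.0):
            self.switch_token_ids = switch_token_ids
            self.penalty = penalty

        def __call__(self, input_ids: torch.LongTensor,
                     scores: torch.FloatTensor) -> torch.FloatTensor:
            scores[:, self.switch_token_ids] -= self.penalty
            return scores

At generation time the processor would be passed to model.generate inside a LogitsProcessorList.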

The paper improves distributed training algorithms in the DiLoCo family, which spread large language model training across multiple accelerators. By synchronizing only subsets of parameters in sequence, letting workers keep training while synchronization is in flight, and quantizing the data workers exchange, the researchers achieve similar model quality while sharply reducing bandwidth requirements.
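A toy simulation makes the mechanics concrete: workers train locally, and each outer step synchronizes only one quantized parameter shard, round-robin, so the rest of the model keeps training undisturbed. The update rule, quantizer, and constants are all illustrative assumptions, not the paper's configuration:

    # Toy sketch of streaming, quantized DiLoCo-style synchronization (numpy).
    import numpy as np

    rng = np.random.default_rng(0)

    def quantize(delta, levels=16):
        """Crude uniform quantizer standing in for low-bit communication."""
        scale = np.abs(delta).max() / (levels / 2) + 1e-12
        return np.round(delta / scale) * scale

    n_workers, n_shards, shard_size, inner_steps = 4, 8, 32, 10
    global_params = rng.normal(size=(n_shards, shard_size))
    workers = [global_params.copy() for _ in range(n_workers)]

    for outer_step in range(100):
        # Inner loop: each worker trains locally (simulated by noisy updates).
        for w in workers:
            for _ in range(inner_steps):
                w -= 0.01 * rng.normal(size=w.shape)
        # Streaming sync: exchange only one shard per outer step, so the
        # other shards keep training while this one is "in flight".
        shard = outer_step % n_shards
        deltas = [quantize(w[shard] - global_params[shard]) for w in workers]
        global_params[shard] += np.mean(deltas, axis=0)
        for w in workers:
            w[shard] = global_params[shard]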

The study compares DeepSeek-R1 and OpenAI's o3-mini, focusing on safety and alignment with human values. Using the ASTRAL testing tool, the researchers found DeepSeek-R1 to be markedly less safe, responding unsafely to 11.98% of executed prompts versus only 1.19% for o3-mini.

🛠️TOP TOOLS

SEO GPT - AI-powered tool designed specifically for search engine optimization tasks. 

StockImg AI - AI-powered platform that revolutionizes visual content creation. 

Plazmapunk - AI-powered tool that transforms audio files into visually stunning music videos.

Boords - Designed to simplify and streamline the video production process. 

StudyX - AI-powered educational platform designed to provide comprehensive homework assistance and learning support for students.

🗞️MORE NEWS

  • OpenAI CEO Sam Altman has admitted that the company has been "on the wrong side of history" regarding open-source AI, signaling a potential shift in strategy as competition from China intensifies and efficient open models gain traction. The candid acknowledgment came during a Reddit "Ask Me Anything" session, just days after Chinese AI firm DeepSeek rattled global markets. 

  • Google DeepMind and Google Research have developed WeatherNext, a family of advanced AI models that produce state-of-the-art weather forecasts. These models are faster and more efficient than traditional physics-based models while delivering superior forecast reliability. The forecast data is also available through BigQuery and Earth Engine, with live forecasts four times per day and historical data for research and decision-making.

  • SoftBank Group founder Masayoshi Son and OpenAI chief Sam Altman will discuss Japan's AI ambitions at an event in Tokyo. They will seek support from Japanese companies to build data centers, power plants, and other AI-supporting hardware. 

  • Palona AI, a startup founded by former Google and Meta leaders, aims to provide personalized, emotive customer agents to non-technical enterprises. The company equips direct-to-consumer businesses like pizza shops and electronics vendors with live, 24/7 customer support and sales agents that reflect each business's brand personality, voice, inventory, and value proposition.

  • Microsoft is forming the Advanced Planning Unit (APU) within its Microsoft AI business division to study the societal, health, and work implications of AI. The APU will operate from Microsoft AI's office and will draw on cutting-edge research to explore possible scenarios for AI's future.

  • A team of researchers from the Dana-Farber Cancer Institute, The Broad Institute of MIT and Harvard, Google, and Columbia University has developed an artificial intelligence model called EpiBERT. EpiBERT is inspired by BERT, a deep learning model for understanding natural language. The model was trained on data from hundreds of human cell types and was fed the full genomic sequence, which is roughly 3 billion base pairs long.

  • OpenAI has used the subreddit r/ChangeMyView to test the persuasive abilities of its AI reasoning models. The company collects user posts from the subreddit and asks its models to write replies that would change the original poster's mind on a subject.
