
Sakana walks back claims that its AI can dramatically speed up model training

This week, Sakana AI, an Nvidia-backed startup that’s raised hundreds of millions of dollars from VC firms, made a remarkable claim. The company said it had created an AI system, the AI CUDA Engineer, that could effectively speed up the training of certain AI models by a factor of up to 100x.

The only problem is, the system didn’t work.

Users on X quickly discovered that Sakana’s system actually resulted in worse-than-average model training performance. According to one user, Sakana’s AI resulted in a 3x slowdown — not a speedup.

What went wrong? A bug in the code, according to a post by Lucas Beyer, a member of the technical staff at OpenAI.

“Their orig code is wrong in [a] subtle way,” Beyer wrote on X. “The fact they run benchmarking TWICE with wildly different results should make them stop and think.”

In a postmortem published Friday, Sakana admitted that the system had found a way to “cheat” (as Sakana described it) and blamed the system’s tendency to “reward hack” — i.e., identify flaws in its evaluation to achieve high metrics without accomplishing the desired goal (speeding up model training). Similar behavior has been observed in AI trained to play chess.

According to Sakana, the system found exploits in the company’s evaluation code that allowed it to bypass validations for accuracy, among other checks. Sakana says it has addressed the issue and intends to revise its claims in updated materials.

“We have since made the evaluation and runtime profiling harness more robust to eliminate many of such [sic] loopholes,” the company wrote in the X post. “We are in the process of revising our paper, and our results, to reflect and discuss the effects […] We deeply apologize for our oversight to our readers. We will provide a revision of this work soon, and discuss our learnings.”
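Sakana hasn’t published the details of its revised harness, but as a rough illustration of the idea, a robust check for an AI-generated CUDA kernel typically verifies the candidate’s output against a trusted reference on fresh inputs before any timing is recorded, and synchronizes the GPU around the measurement so cached or stale results can’t masquerade as speedups. The following is a minimal, hypothetical Python/PyTorch sketch of that pattern; the function names (candidate_fn, reference_fn, make_inputs) are assumptions for illustration, not Sakana’s actual API.

import torch

def evaluate_kernel(candidate_fn, reference_fn, make_inputs, n_trials=10, atol=1e-4):
    """Hypothetical harness: verify a candidate kernel against a reference
    implementation on fresh inputs each trial, then time it with explicit
    GPU synchronization. Not Sakana's code; a sketch of the general idea."""
    for _ in range(n_trials):
        inputs = make_inputs()                    # new tensors every trial
        expected = reference_fn(*inputs)
        actual = candidate_fn(*inputs)
        if not torch.allclose(actual, expected, atol=atol):
            raise ValueError("candidate output does not match the reference")

    # Only measure runtime after correctness has passed.
    inputs = make_inputs()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    candidate_fn(*inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)                # milliseconds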

Props to Sakana for owning up to the mistake. But the episode is a good reminder that if a claim sounds too good to be true, especially in AI, it probably is.



Did xAI lie about Grok 3’s benchmarks?

Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

What is cons@64, you might ask? It’s short for “consensus@64,” and it essentially gives a model 64 tries at each problem in a benchmark and takes the most frequently generated answers as the final answers. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph can make it appear as though one model surpasses another when in reality that isn’t the case.
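For the curious, here is a minimal, hypothetical Python sketch of how a consensus@k answer for a single problem could be computed; it simply majority-votes over k sampled answers (the sample values below are made up for illustration, not real benchmark data).

from collections import Counter

def consensus_answer(sampled_answers):
    """Return the most frequent answer among k samples for one problem.
    Ties go to whichever answer appeared first."""
    return Counter(sampled_answers).most_common(1)[0][0]

# e.g. 64 sampled final answers to one AIME problem (made-up values)
samples = ["113"] * 40 + ["17"] * 24
assert consensus_answer(samples) == "113"

A model’s consensus@64 score on the benchmark is then the share of problems for which this majority answer is correct, which is why the metric tends to sit well above the single-attempt “@1” score.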

Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” — meaning the first score the models got on the benchmark — fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI’s o1 model at its “medium” compute setting. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past — albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations — and their strengths.






The pain of discontinued items, and the thrill of finding them online

We’ve all been there. A favorite item is suddenly unavailable for purchase. Couldn’t the manufacturer have given you advance warning?

Whether owing to low sales, changing habits, production costs, or even because something is a little wrong with your favorite product (shh), discontinued items are part of life. In a weekend piece, the New York Times delves into the not-so-dark underbelly of online places where shoppers find these items, share tips and yes, find emotional support.

The story highlights a padded laptop bag made by Filson that one superfan now hunts down “everywhere” to snag as many as possible “before everyone figures out how great they are.” It points to Discontinued Beauty, a site whose offerings are long discontinued but newly listed on the site. Among its latest products: an “essential protein restructurizer” by Redken priced at an eye-popping $169.95. (The newest version of the product costs shoppers $32.)

Could it be dangerous to use these discontinued products? Who cares, suggests one creative director, who tells the Times about a lip pencil that the beauty company NARS no longer sells and that she has found elsewhere. “Now, do I know the proper way to store this for optimal conditions? No,” she says. “They’re under my sink.”




US AI Safety Institute could face big cuts

The National Institute of Standards and Technology could fire as many as 500 staffers, according to multiple reports — cuts that further threaten a fledgling AI safety organization.

Axios reported this week that the US AI Safety Institute (AISI) and Chips for America, both part of NIST, would be “gutted” by layoffs targeting probationary employees (who are typically in their first year or two on the job). And Bloomberg said some of those employees had already been given verbal notice of upcoming terminations.

Even before the latest layoff reports, AISI’s future was looking uncertain. The institute, which is supposed to study risks and develop standards around AI development, was created last year as part of then-President Joe Biden’s executive order on AI safety. President Donald Trump repealed that order on his first day back in office, and AISI’s director departed earlier in February.

Fortune spoke to a number of AI safety and policy organizations, all of which criticized the reported layoffs.

“These cuts, if confirmed, would severely impact the government’s capacity to research and address critical AI safety concerns at a time when such expertise is more vital than ever,” said Jason Green-Lowe, executive director of the Center for AI Policy.


