AI megathread

421 Replies, 17957 Views

Those tests using fairly straightforward maths problems are revealing, because the mistakes and fabrications can be identified and shown for what they are. But it's worrying if the AI tool is being used to produce outputs relating to more loosely-defined real-world subjects. It may be much less obvious when the output is simply nonsense, because it may appear plausible.
The following 2 users Like Typoz's post: Sciborg_S_Patel, Brian
OpenAI’s dirty December o3 demo doesn’t readily replicate

Gary Marcus

Quote:Later, after I wrote that piece, I discovered that one of their demos, on FrontierMath, was fishy in a different way: OpenAI had privileged access to data their competitors didn’t have, but didn’t acknowledge this. They also (if I recall) failed to disclose their financial contributions in developing the test. And then a couple weeks ago we all saw that current models struggled mightily on the USA Math Olympiad problems that were fresh out of the oven, hence hard to prepare for in advance.

Today I learned that the story is actually even worse than all that: the crown jewel that they reported on the demo, the 75% on Francois Chollet’s ARC test (now called ARC-AGI), doesn’t readily replicate. Mike Knoop from the ARC team reports “We could not get complete data for o3 (high) test due to repeat timeouts. Fewer than half of tasks returned any result exhausting >$50k test budget. We really tried!” The model that is released as “o3 (high)”, presumed to be their best model, can’t readily yield whatever was reported in December under the name o3.

The best stable result that the ARC team could get from experimenting with the latest batch of publicly-testable OpenAI models was 56%, with a different model called o3-medium. Still impressive, still useful, but a long way from the surprising 75% that was advertised.

And the lower 56% is not much different from what Jacob Andreas’s lab at MIT got in November. It’s arguably worse: if I followed correctly, and if the measures are the same, the Andreas lab’s best score was actually higher, at 61%.

Four months later, OpenAI, with its ever more confusing nomenclature, has released a bunch of models with o3 in the title, but none of them can reliably do what was in the widely viewed and widely discussed December livestream. That’s bad.

Forgive me if I am getting Theranos vibes.
'Historically, we may regard materialism as a system of dogma set up to combat orthodox dogma...Accordingly we find that, as ancient orthodoxies disintegrate, materialism more and more gives way to scepticism.'

- Bertrand Russell


The following 2 users Like Sciborg_S_Patel's post: Typoz, Valmar
