AI megathread


(2025-04-07, 03:29 AM)Laird Wrote: You know what they say about assumptions. Here's the DeepSeek source code.

Grab your debugger and have at it. Report back to us pronto, because we're all keen to know what on earth these black boxes are actually doing.

Ah, I was thinking about Western companies like Grok / OpenAI / etc., not DeepSeek. It's true that they released their source code. However, my understanding is that they provided their weights without showing us how those weights were arrived at?
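To illustrate what "open weights" gets you in practice, here's a rough sketch (the Hugging Face repo id is my assumption, and this obviously isn't DeepSeek's own code): you can download and run the weights, but nothing in what you download tells you what data or training procedure produced them.

Code:
# Rough sketch: running released weights via Hugging Face transformers.
# The repo id "deepseek-ai/DeepSeek-R1" is an assumption on my part.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1", trust_remote_code=True)

# Inference works, but the training data and the optimization recipe that
# produced these weights are not part of what gets downloaded.
inputs = tokenizer("What is 2 + 2?", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))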

We also don't know how the output given to us over the web [from Grok] was achieved.

Additionally, I don't think any one person on their home computer can sort out the mathematics/algorithms behind what is happening in these black boxes?

Also, the burden is on those making the claim that LLMs are thinking while other programs aren't. Levin's claim is that even basic algorithms are thinking - like his example of Bubble Sort - because the physical substrate is a pointer to a different reality. That isn't really the same thing as Computationalism.
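For reference, Bubble Sort is about as simple as an algorithm gets; something like this generic Python sketch (my own illustration, not Levin's formulation):

Code:
def bubble_sort(items):
    """Repeatedly swap adjacent out-of-order elements until the list is sorted."""
    data = list(items)
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data

# e.g. bubble_sort([3, 1, 2]) returns [1, 2, 3]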

I'm more fine with Levin's Platonic Dualism because he is not saying Turing Machines magically start thinking because the physical substrate is executing particular programs, but rather because the mental reality is being accessed.

All that said Grok seems good at summarizing at times, but I don't see any sign of genuine reasoning in that conversation in the other thread.
(2025-04-07, 04:45 AM)Sciborg_S_Patel Wrote: However my understanding was they provided their weighting without showing us how those weights were arrived at?

I do remember reading something like that a while back, but I'm not totally clear on it. In any case, I haven't looked at the source code myself.

(2025-04-07, 04:45 AM)Sciborg_S_Patel Wrote: Additionally I don't think any one person on their home computer can sort out the mathematics/algorithms of what is happening in these black boxes?

Yep, I was just teasing you by calling your bluff. Big Grin
(2025-04-06, 11:31 PM)Sciborg_S_Patel Wrote: Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics

Hamed Mahdavi et al

Interesting. I've just given it a read. At face value it does seem difficult to reconcile with o3's performance on the FrontierMath problem set, or at least it casts doubt on whether that performance was based on correct mathematical reasoning, given that only correct final answers counted there, and that this study shows correct answers were very rarely accompanied by correct mathematical reasoning.

Here are a few observations that might (but, in fairness, might not) help with a reconciliation and with dispelling that doubt:

Firstly, this research didn't test o3 itself, only o3-mini.

Secondly, and much more importantly, my understanding is that when o3 was tested on the FrontierMath problem set, it was given vast and very expensive access to computational time and resources, far beyond what would likely have been used in the testing done for this paper on the other models.

Thirdly, it seems likely that the FrontierMath problem set was much more difficult than the Olympiad-candidate problem set used in this study, and in turn it seems much more likely that a correct solution to the FrontierMath problems could only have been arrived at by correct reasoning.

YMMV.
(2025-04-07, 04:51 AM)Laird Wrote: I do remember reading something like that a while back, but I'm not totally clear on it. In any case, I haven't looked at the source code myself.

Yep, I was just teasing you by calling your bluff.

Ah, it took me a sec to understand what you meant by "bluff". I do think these companies could provide a clearer trace of the code; I didn't mean to imply I personally could trace through all the data processing. Apologies.

I don't trust any private company, at this point, to give us an honest evaluation of their AI products.

Better understanding of the black box is something I do hope comes out of DeepSeek playing their trump card, since I am unconvinced by Levin's arguments on this.
(2025-04-07, 05:07 AM)Laird Wrote: Interesting. I've just given it a read. At face value it does seem difficult to reconcile with o3's performance on the FrontierMath problem set, or at least it casts doubt on whether that performance was based on correct mathematical reasoning, given that only correct final answers counted there, and that this study shows correct answers were very rarely accompanied by correct mathematical reasoning.

Here are a few observations that might (but, in fairness, might not) help with a reconciliation and with dispelling that doubt:

Firstly, this research didn't test o3 itself, only o3-mini.

Secondly, and much more importantly, my understanding is that when o3 was tested on the FrontierMath problem set, it was given vast and very expensive access to computational time and resources, far beyond what would likely have been used in the testing done for this paper on the other models.

Thirdly, it seems likely that the FrontierMath problem set was much more difficult than the Olympiad-candidate problem set used in this study, and in turn it seems much more likely that a correct solution to the FrontierMath problems could only have been arrived at by correct reasoning.

YMMV.

Seems that another paper presents similarly poor results:

Reports of LLMs mastering math have been greatly exaggerated

Gary Marcus

Quote:The USA Math Olympiad is an extremely challenging math competition for the top US high school students; the top scorers get prizes and an invitation to the International Math Olympiad. The USAMO was held this year March 19-20. Hours after it was completed, so there could be virtually no chance of data leakage, a team of scientists gave the problems to some of the top large language models, whose mathematical and reasoning abilities have been loudly proclaimed: o3-Mini, o1-Pro, DeepSeek R1, QwQ-32B, Gemini-2.0-Flash-Thinking-Exp, and Claude-3.7-Sonnet-Thinking. The proofs output by all these models were evaluated by experts. The results were dismal: None of the AIs scored higher than 5% overall.

Quote:To be sure, a poor showing on the USAMO is not in itself a shameful result. These problems are awfully difficult; many professional research mathematicians have to work hard to find the solution. What matters here is the nature of the failure: the AIs were never able to recognize when they had not solved the problem. In every case, rather than give up, they confidently output a proof that had a large gap or an outright error. To quote the report: “The most frequent failure mode among human participants is the inability to find a correct solution. Typically, human participants have a clear sense of whether they solved a problem correctly. In contrast, all evaluated LLMs consistently claimed to have solved the problems.”

Though DeepMind's AlphaProof and AlphaGeometry systems do better:

Quote:Importantly, the neurosymbolic method used by DeepMind’s AlphaProof and AlphaGeometry systems (which we discussed recently) which (more or less) achieved a silver-medal level performance on the 2024 International Math Olympiad, is immune to this problem. AlphaProof and AlphaGeometry generate a completely detailed symbolic proof that can be fed into a formal proof verifier. They can fail to find a proof, but they cannot generate an incorrect proof. But that is because they rely in part on powerful, completely hand-written, symbolic reasoning systems. LLMs are not similarly immune.
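To make the "formal proof verifier" point concrete, here's a tiny Lean 4 sketch (my own toy example, nothing to do with DeepMind's actual pipeline): a proof either typechecks or the kernel rejects it, so a gap or an error can't be waved through.

Code:
-- A proof the checker accepts: addition of natural numbers commutes.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- By contrast, a "proof" with a gap (e.g. a `sorry`) or a wrong step simply
-- fails to compile; the verifier will not certify it.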
(2025-04-07, 06:04 AM)Sciborg_S_Patel Wrote: Seems that another paper presents similarly poor results:

Reports of LLMs mastering math have been greatly exaggerated

Gary Marcus

So it does. For clarity, I wasn't disputing the results from the first paper, nor suggesting they couldn't be repeated, so this doesn't really add anything important, nor change my response, which could pretty much be repeated verbatim.

Yes, these papers do put a damper on the general expectations of high mathematical reasoning ability of leading-edge LLMs that were raised by o3's FrontierMath results. As I said though, there are reasons to think they might be neither irreconcilable with nor invalidating of those results. Whether that's of any consolation is a subjective matter.
Recent AI model progress feels mostly like bull@#$%

"lc"

Quote:I have read the studies. I have seen the numbers. Maybe LLMs are becoming more fun to talk to, maybe they're performing better on controlled exams. But I would nevertheless like to submit, based off of internal benchmarks, and my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality. They are not reflective of my Lived Experience or the Lived Experience of my customers. In terms of being able to perform entirely new tasks, or larger proportions of users' intellectual labor, I don't think they have improved much since August.

Quote:Are the AI labs just cheating?
AI lab founders believe they are in a civilizational competition for control of the entire future lightcone, and will be made Dictator of the Universe if they succeed. Accusing these founders of engaging in fraud to further these purposes is quite reasonable. Even if you are starting with an unusually high opinion of tech moguls, you should not expect them to be honest sources on the performance of their own models in this race. There are very powerful short term incentives to exaggerate capabilities or selectively disclose favorable capabilities results, if you can get away with it. Investment is one, but attracting talent and winning the (psychologically impactful) prestige contests is probably just as big a motivator. And there is essentially no legal accountability compelling labs to be transparent or truthful about benchmark results, because nobody has ever been sued or convicted of fraud for training on a test dataset and then reporting that performance to the public. If you tried, any such lab could still claim to be telling the truth in a very narrow sense because the model "really does achieve that performance on that benchmark". And if first-order tuning on important metrics could be considered fraud in a technical sense, then there are a million other ways for the team responsible for juking the stats to be slightly more indirect about it...

...So maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down. In a sense this would be the most fortunate answer, because it would imply that we're not actually that bad at measuring AGI performance; we're just facing human-initiated fraud. Fraud is a problem with people and not an indication of underlying technical difficulties.

I'm guessing this is true in part but not in whole.
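On the "training on a test dataset" point: the usual way people try to catch this is an n-gram overlap check between the benchmark questions and the training corpus. A toy sketch of the idea (my own illustration, not any lab's actual audit code):

Code:
def ngrams(text, n=13):
    """Return the set of n-word shingles in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_items, training_corpus, n=13):
    """Fraction of test items sharing a long n-gram with the training corpus."""
    corpus_grams = ngrams(training_corpus, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & corpus_grams)
    return flagged / len(test_items) if test_items else 0.0

# A nonzero rate means some test questions appear near-verbatim in the training
# data, which would make the reported benchmark score suspect.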

Quote:I can't do any of the Humanity's Last Exam test questions, but I'd be willing to bet today that the first model that saturates HLE will still be unemployable as a software engineer. HLE and benchmarks like it are cool, but they fail to test the major deficits of language models, like how they can only remember things by writing them down onto a scratchpad like the memento guy. Claude Plays Pokemon is an overused example, because video games involve a synthesis of a lot of human-specific capabilities, but the task fits as one where you need to occasionally recall things you learned thirty minutes ago. The results are unsurprisingly bad.
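The "scratchpad" point is roughly this pattern: the model has no memory between calls, so anything it needs later has to be written into the prompt it sees next. A toy sketch (ask_model is a stand-in for any text-in/text-out model API, not a real library call):

Code:
def run_with_scratchpad(ask_model, task, steps=5):
    """Loop a memoryless model, re-feeding its own notes back in each step."""
    scratchpad = []  # the only "memory" that persists between steps
    for _ in range(steps):
        prompt = task + "\nNotes so far:\n" + "\n".join(scratchpad)
        reply = ask_model(prompt)   # ask_model: any function taking and returning text
        scratchpad.append(reply)    # anything not written down here is forgotten
        if "FINAL:" in reply:
            return reply
    return scratchpad[-1] if scratchpad else ""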
