Tuesday, November 25, 2025

ChatGPT and Gemini Can't Do Real Physics Research

A rather interesting post on whether AI can actually do the kind of real physics research that beginning graduate students are expected to do.

More than 50 physicists from over 30 institutions built the "CritPt" benchmark ... The benchmark asks models to solve original, unpublished research problems that resemble the work of a capable graduate student starting an independent project.

Google's "Gemini 3 Pro Preview" reached just 9.1 percent accuracy while using 10 percent fewer tokens than OpenAI's "GPT-5.1 (high)," which placed second at 4.9 percent. Even at the top of the leaderboard, the systems miss the vast majority of tasks.

It is fascinating to note that, based on this, the current AI engines do not seem to exhibit any clear capacity for creative thinking that goes beyond the existing knowledge they can refer to. They are "good" at stitching together information from various sources to come up with an answer, but not when nothing relevant exists in the first place.

I actually had a chuckle when I read this part:

The models often produce answers that look convincing but contain subtle errors that are difficult to catch, which can easily mislead researchers and require time-consuming expert review.

If you have followed this blog for some time, you will have noticed my posts on my battles with ChatGPT when I gave it basic physics questions that my first-year physics students encounter. See here, here, and here. In many other cases, I find that it gives me the wrong answer with a correct explanation, or the correct answer with an explanation that simply doesn't match. As the article said, the errors are often subtle, something that a student still learning the material will probably not catch.

Zz.