Tuesday, November 25, 2025

ChatGPT and Gemini Can't Do Real Physics Research

A rather interesting post on AI's ability to do the kind of real physics research that beginning graduate students are expected to do.

More than 50 physicists from over 30 institutions built the "CritPt" benchmark ... The benchmark asks models to solve original, unpublished research problems that resemble the work of a capable graduate student starting an independent project.

Google's "Gemini 3 Pro Preview" reached just 9.1 percent accuracy while using 10 percent fewer tokens than OpenAI's "GPT-5.1 (high)," which placed second at 4.9 percent. Even at the top of the leaderboard, the systems miss the vast majority of tasks.

It is fascinating to note that, based on this, the current AI engines do not seem to exhibit any clear capacity for creativity or creative thinking beyond the existing knowledge they can refer to. They are "good" at stitching together information from various sources to come up with an answer, but not when nothing exists to draw from in the first place.

I actually had a chuckle when I read this part:

The models often produce answers that look convincing but contain subtle errors that are difficult to catch, which can easily mislead researchers and require time-consuming expert review.

If you have followed this blog for some time, you will have noticed my posts on my battles with ChatGPT when I gave it basic physics questions that my first-year physics students encounter. See here, here, and here. In many other cases, I find that it gives me the wrong answer with the correct explanation, or the correct answer with an explanation that simply doesn't match. As the article said, oftentimes the errors are subtle, something that a student still learning the material will probably not catch.

Zz. 

1 comment:

Adam said...

I think a metric that is missing is how well a human would do compared to the best LLM. Given the example physics problem (see p. 7 of the article https://arxiv.org/pdf/2509.26574) and the list given at the end of the paper, I'm not sure that most physicists (including me) would do as well as Gemini...

Also, your posts about ChatGPT are 2.5 years old. You might want to redo your tests with the latest version of the LLM to see how "bad" it now is at solving undergrad physics problems...