Discussion about this post

Max Marty

(~50s) Gemini 3 Pro = https://gemini.google.com/share/d4a79faed958

(~50s) Claude 4.5 Opus "Extended Thinking On" = https://claude.ai/share/24216b92-5b1c-4668-9056-30643446afd7

(~35 -->MINUTES<--) ChatGPT 5.2 Pro = https://chatgpt.com/share/697107b7-80f4-8006-a91f-eb9ec7f8d231

One minor note: asking these LLMs not to use external tools or browsing is a bit like telling a student who has used a calculator in class all year to "not use any calculators on the test". These more recent models are likely designed around having access to such tools, so restricting them might not be the best approach if the goal is to compare their usefulness or advancement over time (other than to keep them from literally looking up the answer key on GitHub or in this post).

-- Ran the [sanitized/anonymized] results through Gemini:

Student, Score (n=40), Accuracy, Avg Confidence, Calibration Bias, Brier Score (lower is better)
Claude, 36, 90.0%, 67.0%, -23.0% (underconfident), 0.150
ChatGPT, 35, 87.5%, 71.4%, -16.1% (underconfident), 0.143
Gemini, 32, 80.0%, 77.5%, -2.5% (accurate), 0.065
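
For anyone who wants to check numbers like these by hand, here is a rough sketch of how the columns could be computed. It assumes calibration bias means average confidence minus accuracy (which matches the table, e.g. 67.0% - 90.0% = -23.0%) and that the Brier score is the mean squared difference between the stated confidence and the 0/1 outcome. The function name and the sample data are made up for illustration; they are not the actual quiz answers.

```python
def calibration_summary(correct, confidence):
    """Summarize quiz results.

    correct:    list of bools, True if the answer was right
    confidence: list of floats in [0, 1], stated confidence per answer
    """
    if not correct or len(correct) != len(confidence):
        raise ValueError("inputs must be non-empty and the same length")

    n = len(correct)
    accuracy = sum(correct) / n
    avg_confidence = sum(confidence) / n

    # Assumed definition: negative means underconfident, positive overconfident.
    calibration_bias = avg_confidence - accuracy

    # Brier score: mean squared error between confidence and the 0/1 outcome.
    brier = sum((c - float(o)) ** 2 for c, o in zip(confidence, correct)) / n

    return {
        "score": sum(correct),
        "accuracy": accuracy,
        "avg_confidence": avg_confidence,
        "calibration_bias": calibration_bias,
        "brier": brier,
    }


# Hypothetical four-question example (not data from the post).
sample_correct = [True, True, True, False]
sample_confidence = [0.9, 0.8, 0.6, 0.5]
summary = calibration_summary(sample_correct, sample_confidence)
print(summary)
```

Note that a low Brier score rewards putting high confidence on the right answers and low confidence on the wrong ones, so it can differ quite a bit between models even when their accuracy and average confidence look similar.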

redbert

Interesting!
