9 Comments
Max Marty:

(~50s) Gemini 3 Pro = https://gemini.google.com/share/d4a79faed958

(~50s) Claude 4.5 Opus "Extended Thinking On" = https://claude.ai/share/24216b92-5b1c-4668-9056-30643446afd7

(~35 -->MINUTES<--) ChatGPT 5.2 Pro = https://chatgpt.com/share/697107b7-80f4-8006-a91f-eb9ec7f8d231

One minor note: asking these LLMs not to use external tools or browsing is a bit like asking a student who's been using a calculator all year in class not to use one on the test. These newer models are likely designed around having access to those tools, so restricting them may not be the best way to gauge their usefulness or advancement over time (other than to keep them from literally looking up the answer key on GitHub or in this post).

-- Ran [Sanitized/Anonymized] results through Gemini:

Student | Score (n=40) | Accuracy | Avg Confidence | Calibration Bias | Brier Score (lower is better)
Claude  | 36 | 90.0% | 67.0% | -23.0% (Underconfident) | 0.150
ChatGPT | 35 | 87.5% | 71.4% | -16.1% (Underconfident) | 0.143
Gemini  | 32 | 80.0% | 77.5% | -2.5% (Accurate) | 0.065
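
For anyone who wants to reproduce these columns: here's a minimal sketch, assuming each graded answer is logged as a (correct, confidence) pair with confidence in [0, 1]. The function name and the example records are hypothetical placeholders, not the actual 40-question run:

```python
# Minimal sketch of the metrics above. Assumes each graded answer is a
# (was_correct, stated_confidence) pair; the example data below is made up.

def calibration_metrics(records):
    """records: list of (correct: bool, confidence: float in [0, 1]) pairs."""
    n = len(records)
    accuracy = sum(correct for correct, _ in records) / n
    avg_conf = sum(conf for _, conf in records) / n
    # Negative bias = underconfident: the model is right more often than it claims.
    bias = avg_conf - accuracy
    # Brier score: mean squared error between confidence and outcome; lower is better.
    brier = sum((conf - correct) ** 2 for correct, conf in records) / n
    return accuracy, avg_conf, bias, brier

# Hypothetical 4-question run, for illustration only:
example = [(True, 0.7), (True, 0.6), (False, 0.8), (True, 0.9)]
acc, conf, bias, brier = calibration_metrics(example)
print(f"accuracy={acc:.1%}  avg_conf={conf:.1%}  bias={bias:+.1%}  brier={brier:.3f}")
```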

Mon0:

No irrational answers across the board! ChatGPT and Claude still seem to struggle with underdetermined questions like 35 and 36, while Gemini is just amazing (aside from question 20).

I disagree a bit that asking them not to use tools is wrong: we're trying to measure calibration, which should be orthogonal to tool use. In any case, thanks a bunch for this! Does it leave you with any reflections?

Max Marty:

You're right that for calibration, tool use is kinda orthogonal. Though these models may have been trained, or prompted in the background, to rely on tool use to "feel confident" about answers. That's actually a good thing: we want these models to *want* to go search online for more info when they feel underconfident. As such, underconfidence may be a feature, not a bug, of modern "tool use expected" LLMs.

redbert:

interesting!

Celeste 🌱:

yeah i can spot a pitfall in your approach, why the fuck are you using REINFORCE for RL lmao? Other than that, cool project

Celeste 🌱:

fyi, models are notoriously bad at answering multiple-choice questions by simply stating the choice when they can't use CoT. CoT was on for this, right?

Mon0:

The first chat I posted is ChatGPT 5.2 extended reasoning, thought for a minute.

Max Marty:

Curious for you to try running these through the different 5.2 models (not just Thinking), i.e. 5.2 Instant and 5.2 Pro, and see the results.

Or if you don’t have the right account, send me the prompts and I’ll paste them into the models. Plus Opus 4.5 Thinking and Gemini Pro.

Mon0:

The full prompt for the calibration test is public in the first GitHub link.