(~50s) Gemini 3 Pro = https://gemini.google.com/share/d4a79faed958
(~50s) Claude 4.5 Opus "Extended Thinking On" = https://claude.ai/share/24216b92-5b1c-4668-9056-30643446afd7
(~35 -->MINUTES<--) ChatGPT 5.2 Pro = https://chatgpt.com/share/697107b7-80f4-8006-a91f-eb9ec7f8d231
One minor note: asking these LLMs not to use external tools or browsing is a bit like asking a student who's been using a calculator all year in class to "not use any calculators on the test": these more recent models are likely designed around having access to those tools, so restricting them may not be the best route if the goal is comparing their usefulness or advancement over time (other than to keep them from literally looking up the answer key on Github or this post).
-- Ran [Sanitized/Anonymized] results through Gemini:

| Student | Score (n=40) | Accuracy | Avg Confidence | Calibration Bias | Brier Score (lower is better) |
|---------|--------------|----------|----------------|------------------|-------------------------------|
| Claude  | 36 | 90.0% | 67.0% | -23.0% (Underconfident) | 0.150 |
| ChatGPT | 35 | 87.5% | 71.4% | -16.1% (Underconfident) | 0.143 |
| Gemini  | 32 | 80.0% | 77.5% | -2.5% (Accurate)        | 0.065 |
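For reference, here's a minimal sketch of how these metrics can be computed from per-question correctness and stated confidence. The `results` list and its values below are placeholders, not the actual graded answers:

```python
# Minimal sketch: per-question (was_correct, stated_confidence) pairs.
# Placeholder values -- substitute the real graded answers for the n=40 questions.
results = [
    (True, 0.85),
    (False, 0.60),
    (True, 0.70),
    # ...
]

n = len(results)
accuracy = sum(correct for correct, _ in results) / n
avg_confidence = sum(conf for _, conf in results) / n

# Calibration bias: average confidence minus accuracy
# (negative = underconfident, positive = overconfident).
calibration_bias = avg_confidence - accuracy

# Brier score: mean squared error between stated confidence and the 0/1 outcome.
# Lower is better; an accurate, well-calibrated model scores near 0.
brier = sum((conf - (1.0 if correct else 0.0)) ** 2 for correct, conf in results) / n

print(f"Accuracy: {accuracy:.1%}, Avg confidence: {avg_confidence:.1%}")
print(f"Calibration bias: {calibration_bias:+.1%}, Brier score: {brier:.3f}")
```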
No irrational answers across the board! ChatGPT and Claude still seem to struggle with underdetermined questions like 35 and 36, while Gemini is just amazing (aside from question 20).
I disagree a bit that asking them not to use tools is wrong: we are trying to measure calibration, which should be orthogonal to tool use. In any case, thanks a bunch for this! Does it leave you with any reflections?
You're right that for calibration, tool use is kinda orthogonal. Though these models may have been trained, or are background-prompted, to rely on tool use to "feel confident" about answers - which is actually a good thing: we want these models to *want* to go search online for more info when they feel underconfident. As such, underconfidence may be a feature, not a bug, of modern "tool use expected" LLMs.
interesting!
yeah i can spot a pitfall in your approach, why the fuck are you using REINFORCE for RL lmao? Other than that, cool project
fyi models are known to be especially hideous at multiple-choice questions when they have to simply state the choice and can't use CoT. CoT was on for this, right?
The first chat I posted is ChatGPT 5.2 extended reasoning; it thought for a minute.
Curious for you to try running these through the different 5.2 models (not just thinking) and see the results, i.e. 5.2-instant and 5.2-pro.
Or if you don’t have the right account, send me the prompts and I’ll paste them into the models. Plus opus 4.5 thinking and Gemini pro.
The full prompt of the calibration test is public in the first Github link.