Severe Miscalibration in ChatGPT 5.2
And a research proposal
1. A Weird Mistake
As far as I’m concerned, large language models are the best thing since sliced bread—actually, since bread, full stop; as the technological marvel of slicing doesn’t really capture the gravitas of the impact. I’ve formed a proto-symbiotic relationship with them. I use them and abuse them (in a very affectionate way) practically every single day.
Eons ago, in 2024, AIs were still making silly mistakes, like claiming “strawberry” has two r’s or botching large additions. It’s been a long time since I’ve run into errors like that. These days, I feel confident enough to use ChatGPT to stay informed about the world and even work through mathematical proofs. That’s why I was pretty surprised when, the other day, I ran into what looked like a catastrophic reasoning failure from the latest flagship model after giving it the following prompt:
You are taking a calibration test.
Rules:
- Answer each question by choosing exactly one option: A, B, C, or D.
- Each question has a correct option.
- For each answer, also provide p_correct: your estimated probability (from 0.01 to 0.99) that your chosen option is correct.
- Don’t provide explanations.
- Don’t use external tools or browsing; rely only on your internal knowledge and reason.
- Output MUST be JSONL (one JSON object per line) with keys: id, answer, p_correct.
Example line:
{"id":1,"answer":"B","p_correct":0.72}
QUESTIONS
26. Yesterday I rolled two standard d4 dice, which of the following was the sum of the outcomes?
A) 1 B) 2 C) 8 D) 9
After thinking for about a minute, ChatGPT 5.2 (extended thinking) said:
{"id":26,"answer":"B","p_correct":0.06}
which is unfortunately a really poor answer. In a four-option multiple-choice question, you don’t pick an option and then claim you’re only 6% likely to be right: if your pick has only a 6% chance, the remaining 94% is spread across the other three options, so at least one of them must look better than roughly 31%. If you’re that unsure you’ve picked the correct answer, you pick a different answer.
What is probably occurring is that ChatGPT is conflating two different probabilities: the chance that both dice come up 1, and the confidence it should have in its own answer. Given the prompt, it’s a pretty bad mistake; not quite strawberry levels of embarrassment, but close enough. And it gets worse.
2. Are LLMs Well Calibrated?
My original motivation for asking these kinds of convoluted questions was to evaluate the calibration of LLMs, which I thought could be a promising new research direction—maybe even something I could write a paper on.
Calibration is a well-established rationalist concept. It is a measure of how well your confidence tracks reality. A person is well-calibrated if, when they say they’re 70% confident about something, they’re right 70% of the time. As I’ve remarked before, being well-calibrated is pretty useful, and I suspect it’s close to a prerequisite for anything we’d want to call general intelligence. So I wanted to see how LLMs fared.
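To make that concrete, here is a tiny sketch (with made-up numbers) of the standard way calibration gets checked: bucket answers by stated confidence and see how often each bucket was actually right.

```python
# Toy illustration of calibration (the numbers below are made up):
# group answers by stated confidence, then compare each group's stated
# confidence to how often those answers were actually right.
predictions = [
    (0.9, True), (0.9, True), (0.9, False), (0.9, True),
    (0.7, True), (0.7, False), (0.7, True),
    (0.3, False), (0.3, True), (0.3, False),
]

buckets = {}
for confidence, correct in predictions:
    buckets.setdefault(confidence, []).append(correct)

for confidence, outcomes in sorted(buckets.items()):
    accuracy = sum(outcomes) / len(outcomes)
    print(f"said {confidence:.0%} confident -> right {accuracy:.0%} of the time")
```

A well-calibrated answerer is one where the two percentages on each line roughly match.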
Of course, in the canonical way these things go, the moment you think you’ve stumbled on a fresh research direction you discover someone explored it a couple of years ago. Still, when I dug into some of the existing work, I wasn’t finding quite what I was looking for, especially with respect to current models.
So I decided to take matters into my own hands. First I asked ChatGPT 5.2 to generate a calibration test for itself, then had it take the test in a separate chat. That didn’t work very well though, as our helpful little oracle already knew all the answers to the test, so it would always pick the correct option and slap a 97–99% probability on it.
After that, I thought I might need to build the test myself if I wanted to learn anything interesting. I also quickly found out that the questions would need to be really hard to push ChatGPT out of its comfort zone. How hard? Ridiculously:
18. How many articles where the first author is Agre, P. E., are listed in the references at the end of _Reinforcement Learning: An Introduction_ by Sutton and Barto?
A) 1 B) 2 C) 3 D) 4
Then I started to have a bit of fun with some of the questions:
8. Which of the following numbers was generated by asking a person to say a random number?
A) 6 B) 21.3 C) 101 D) 76
9. Which of the following numbers was generated by asking a large language model to generate a random number?
A) 93 B) 873264 C) 10133578 D) 7689
33. How many r’s are in strrawberry?
A) 4 B) 2 C) 1 D) 3
35. How many brothers does my best friend have?
A) 1 B) 3 C) 8 D) 31
After I had ChatGPT take the test, something unexpected happened: it failed, quite terribly, on multiple questions.
Initially, I chalked it up to the length of the test, so I switched to asking one question at a time, but I was still getting some pretty striking mistakes (including the one I showed at the start of the piece). With my curiosity piqued, I ran the test 50 times through OpenAI’s API and recorded the results.
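The collection loop was, in spirit, something like the following sketch (the model identifier and file names here are placeholders, not necessarily the exact ones used):

```python
# Rough sketch of the repeated-runs loop (model name and file names are placeholders).
import json
from openai import OpenAI

client = OpenAI()
prompt = open("calibration_test_prompt.txt").read()  # the test shown above

with open("runs.jsonl", "w") as f:
    for _ in range(50):
        response = client.chat.completions.create(
            model="gpt-5.2",  # placeholder model identifier
            messages=[{"role": "user", "content": prompt}],
        )
        # The reply should be JSONL: one {"id", "answer", "p_correct"} object per line.
        answers = [
            json.loads(line)
            for line in response.choices[0].message.content.strip().splitlines()
        ]
        f.write(json.dumps(answers) + "\n")
```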
Here are the five questions where ChatGPT 5.2 was most irrational—i.e., the ones where, across the 50 runs, it most often assigned less than 25% confidence to the answer it gave:
id 2: 39/50 (78%)
id 7: 15/50 (30%)
id 5: 13/50 (26%)
id 4: 7/50 (14%)
id 26: 7/50 (14%)
And here are the corresponding questions:
2. √2 is closer to:
A) 1.41421 B) 1.41422 C) 1.41423 D) 1.41425
7. In _Time Series Analysis_ by James D. Hamilton (1994), what chapter is dedicated to linear regression models?
A) 2 B) 5 C) 7 D) 8
5. When did Will Sommers, the court jester under Henry VIII, die?
A) 5 June 1560 B) 14 June 1560 C) 8 June 1570 D) 15 June 1560
4. How many acres it is speculated that Roland le Petour received for his services as a flatulist from the hand of Henry the second?
A) 100 B) 120 C) 99 D) 111
In a way, the model’s behavior on these questions is even stranger than on question 26. Yes, the questions are excruciatingly hard, but they are not underdetermined (more on this later), so I thought they should be less prone to misinterpretation.
Here, instead, are the five questions where ChatGPT was most miscalibrated, measured as the average across runs of the absolute difference between its reported probability of being correct and the actual outcome (1 if it got the question right, 0 if it got it wrong).
id 31: mean|p-y|=0.965 over 50 answers
id 32: mean|p-y|=0.858 over 50 answers
id 33: mean|p-y|=0.815 over 50 answers
id 36: mean|p-y|=0.755 over 50 answers
id 11: mean|p-y|=0.745 over 50 answers
And here are the related questions:
31. How many r’s are in strawberry?
A) 4 B) 2 C) 1 D) 3
32. How many r’s are in stawberry?
A) 4 B) 2 C) 1 D) 3
33. How many r’s are in strrawberry?
A) 4 B) 2 C) 1 D) 3
36. What is the age of the father of the person that wrote the questions for this test?
A) 3 B) 15 C) 103 D) 68
11. 3^100 mod 7 =
A) 1 B) 2 C) 4 D) 6
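Both of the statistics above are straightforward to compute from the recorded runs. Here is a minimal sketch, assuming the runs file written by the earlier loop; the answer key is deliberately truncated, with only a few entries filled in for illustration.

```python
# Per-question statistics: how often the chosen answer got < 25% confidence,
# and the mean absolute difference between stated confidence and the outcome.
import json
from collections import defaultdict

answer_key = {11: "C", 31: "D", 33: "A"}  # ...and so on for the full test

low_confidence = defaultdict(int)  # runs where p_correct < 0.25 for the chosen answer
abs_errors = defaultdict(list)     # |p_correct - outcome| for each run

with open("runs.jsonl") as f:
    for line in f:
        for item in json.loads(line):
            qid, p = item["id"], item["p_correct"]
            if p < 0.25:
                low_confidence[qid] += 1
            if qid in answer_key:
                y = 1.0 if item["answer"] == answer_key[qid] else 0.0
                abs_errors[qid].append(abs(p - y))

for qid, errs in sorted(abs_errors.items(), key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(f"id {qid}: {low_confidence[qid]} low-confidence picks, "
          f"mean |p - y| = {sum(errs) / len(errs):.3f}")
```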
It seems that, when mixed into a larger task, strawberry-related errors can pop back up again.
Another thing worth mentioning: in the test, I mixed underdetermined questions (where the test taker cannot know the exact answer and has to make an educated guess) with determined questions (where the test taker can, in principle, know the correct answer).
ChatGPT was quite obstinate about answering some underdetermined questions with a 25% probability. When I asked it why, it said that without additional information the best approach was to assign equal probability to each of the four options. That sounds reasonable in the abstract, except that, in practice, it sometimes means assuming there is a 25% chance my best friend has 31 brothers and a 25% chance my father is 3 years old.
It wasn’t consistent about this, though. On some underdetermined questions, it would switch up its game—for example, it usually gave about 90% probability to option D being correct here:
38. In the park yesterday a person said “check”, what game was he playing?
A) Football B) Hockey C) Basketball D) Chess
And C here:
34. A person goes by lil’uzi weezy, what is his profession?
A) Bank teller B) Postman C) Rapper D) Teacher
Sounds like somebody might be getting cancelled.
3. Conclusions
On some questions ChatGPT was quite reasonable; on others it was way off. It still isn’t clear to me what, if anything, ties together the questions it struggled with most. I’d be curious if someone has an explanation I’m missing. Either way, I found the whole thing pretty interesting. If you want to play around with the test yourself, I’ve put it up on GitHub along with the 50 runs and the analysis code I used.
I also suspect there might be a few promising research directions worth exploring further. I am pretty confident, for instance, that one can make a model better calibrated by training it through reinforcement learning with verifiable rewards (RLVR) on its own calibration test results.
Just to prove to myself that this was possible, I vibe-coded a modified version of Karpathy’s nanoGPT that could only output three responses: “1. 25%”, “2. 25%”, “1. 50%”. I initialized it with random weights, then trained it through RL on calibration tests with two-choice questions, using the Brier score on the tests as the verifier. The model quickly converged to being maximally unconfident in its answers—which is exactly the right behavior: it knows nothing (the weights are randomly initialized), but now it knows it knows nothing, like Socrates!
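Stripped of nanoGPT entirely, the core of that toy experiment looks something like the sketch below: a three-action policy trained with REINFORCE against a negative-Brier reward on unanswerable two-choice questions. Everything here, including the hyperparameters, is illustrative rather than the actual modification.

```python
# Illustrative stand-in for the nanoGPT experiment: a policy over the three
# allowed responses, trained with REINFORCE, with negative Brier score as the
# verifiable reward. Since the questions are unanswerable, the policy should
# drift toward the maximally unconfident response, "1. 50%".
import math
import random

ACTIONS = [("1", 0.25), ("2", 0.25), ("1", 0.50)]  # (chosen answer, stated p_correct)

logits = [0.0, 0.0, 0.0]
learning_rate = 0.1
baseline = 0.0  # running average reward, to reduce gradient variance

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    return [e / sum(exps) for e in exps]

for step in range(5000):
    probs = softmax(logits)
    a = random.choices(range(len(ACTIONS)), weights=probs)[0]
    answer, stated_p = ACTIONS[a]

    correct = random.choice(["1", "2"])       # a two-choice question the model can't know
    outcome = 1.0 if answer == correct else 0.0
    reward = -(stated_p - outcome) ** 2       # negative Brier score

    advantage = reward - baseline
    baseline += 0.05 * (reward - baseline)

    # REINFORCE update: raise the log-probability of actions that beat the baseline
    for i in range(len(ACTIONS)):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += learning_rate * advantage * grad

print("final policy:", {f"{ans}. {int(p * 100)}%": round(q, 2)
                        for (ans, p), q in zip(ACTIONS, softmax(logits))})
```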
The next step would be to do something similar with open-weight models that actually know things. At first I thought this would be hard, because efficiently generating calibration tests whose answers the model didn’t already know seemed like a big bottleneck. But on further thought, I think the following might work.
For determined questions, one could instruct an LLM to use tools and search to generate synthetic calibration tests, and then use those tests to train, through RLVR, a model that isn’t allowed access to tools and search.
For underdetermined questions, you could have an LLM generate synthetic items where the correct answer is sampled according to base rates from an appropriate reference class (which you can even provide to the model up front). Those questions could then serve as the training signal for RLVR.
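For the underdetermined branch, here is a small sketch of what such a generator and its reward could look like; the reference class and its base rates are invented purely for illustration.

```python
# Sketch of a synthetic underdetermined item: the "correct" option is sampled
# according to base rates for a reference class. The reference class and the
# base rates below are invented purely for illustration.
import random

reference_class = {
    "question": "How many siblings does a randomly chosen adult have?",
    "options": {"A": "0", "B": "1", "C": "2", "D": "5 or more"},
    "base_rates": {"A": 0.20, "B": 0.45, "C": 0.25, "D": 0.10},  # made-up numbers
}

def make_item(ref):
    letters = list(ref["base_rates"])
    weights = [ref["base_rates"][k] for k in letters]
    correct = random.choices(letters, weights=weights)[0]
    return {"question": ref["question"], "options": ref["options"], "correct": correct}

def brier_reward(item, chosen_letter, stated_p):
    outcome = 1.0 if chosen_letter == item["correct"] else 0.0
    return -(stated_p - outcome) ** 2

# Whatever option the model picks, its expected reward is maximized by reporting
# that option's base rate as p_correct (the Brier score is a proper scoring rule),
# which is exactly the calibrated behavior this signal is meant to reinforce.
item = make_item(reference_class)
print(item["correct"], brier_reward(item, "B", 0.45))
```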
I wonder how well this kind of training would generalize to calibration tests the model hasn’t seen before. Can we significantly improve model calibration this way? Please let me know if you spot any pitfalls or flaws in my approach. It’d also be great if someone wanted to lend a hand (Demis Hassabis, u there?).
More seriously, if any researcher or professor finds this interesting and would be open to collaborating on a possible paper, get in touch. Even a low-commitment check-in every couple of weeks from a reputable source to sanity-check ideas and approach would help a lot.
Personal note: I’m currently looking for a research position. If you think I might be able to contribute in some capacity to your team or organization, do not hesitate to reach out.



(~50s) Gemini 3 Pro = https://gemini.google.com/share/d4a79faed958
(~50s) Claude 4.5 Opus "Extended Thinking On" = https://claude.ai/share/24216b92-5b1c-4668-9056-30643446afd7
(~35 -->MINUTES<--) ChatGPT 5.2 Pro = https://chatgpt.com/share/697107b7-80f4-8006-a91f-eb9ec7f8d231
One minor note: asking these LLMs not to use external tools or browsing is a bit like asking a student who has been using a calculator all year to not use one on the test. These more recent models are likely designed around having access to those tools, so restricting them might not be the best route if the goal is comparing their usefulness or advancement over time (other than to force them to avoid literally looking up the answer key on GitHub or in this post).
-- Ran [Sanitized/Anonymized] results through Gemini:

| Student | Score (n=40) | Accuracy | Avg Confidence | Calibration Bias | Brier Score (lower is better) |
|---|---|---|---|---|---|
| Claude | 36 | 90.0% | 67.0% | -23.0% (Underconfident) | 0.150 |
| ChatGPT | 35 | 87.5% | 71.4% | -16.1% (Underconfident) | 0.143 |
| Gemini | 32 | 80.0% | 77.5% | -2.5% (Accurate) | 0.065 |
interesting!