What's AI's most notorious bug? Not crashing code, but hallucination: the model confidently fabricates facts, making it hard to tell true from false. This fundamental challenge is a key obstacle to fully trusting AI.

It is by now common knowledge that large models hallucinate, and anyone who uses them seriously has to stay on guard. OpenAI itself has pointed out: "ChatGPT also hallucinates. GPT-5 has significantly fewer hallucinations, especially when performing reasoning, but hallucinations still occur. Hallucinations remain a fundamental challenge facing all large language models." Although the academic community has proposed various methods to reduce hallucinations, none of them eliminates the problem entirely.

So why do large models hallucinate? OpenAI has now published a rare paper that systematically examines the root causes. First, the definition. OpenAI's simple formulation is: "A situation where the model confidently generates an answer that is not true."
As for the reason, it's simple: standard training and evaluation procedures tend to reward guessing rather than rewarding models for having the courage to admit uncertainty.

Let's take a closer look at what exactly OpenAI discovered.
What are hallucinations?
Hallucinations are plausible-sounding but false statements generated by language models.
They can appear in unexpected ways, even for seemingly simple questions. OpenAI cites a case in which several widely used chatbots were asked for the title of the doctoral dissertation of Adam Tauman Kalai (the paper's first author); they confidently gave three different answers, none of them correct. Asked for his birthday, they gave three different dates, all wrong.

Learning for the test

OpenAI says hallucinations persist in part because current evaluation methods set the wrong incentives. Evaluation itself does not directly cause hallucinations, but most ways of scoring model performance encourage guessing rather than honesty about uncertainty.

Think of it like a multiple-choice exam. If you don't know the answer, a guess might get lucky and score a point, while leaving the question blank guarantees zero. In the same way, when models are graded solely on accuracy (the percentage of questions answered exactly right), they are incentivized to guess rather than say "I don't know."

For another example, imagine a language model is asked for someone's birthday and doesn't know it. Guessing "September 10th" gives it a 1-in-365 chance of being right; saying "I don't know" guarantees a score of zero. Across thousands of test questions, the guessing model ends up scoring higher than a careful model that admits uncertainty.

For questions with a single "correct answer," there are three kinds of responses to consider: correct answers, errors, and abstentions, where the model declines to take the risk of guessing. OpenAI says abstaining is part of humility, one of its core values. Most scoreboards rank models by accuracy, but errors are worse than abstentions: OpenAI's Model Spec states that it is better to indicate uncertainty or ask for clarification than to confidently provide information that may be wrong.

Take the SimpleQA evaluation in the GPT-5 system card as an example. On accuracy, the earlier OpenAI o4-mini model performs slightly better, but its error rate (i.e., its hallucination rate) is significantly higher. Making strategic guesses under uncertainty improves accuracy, but it also increases errors and hallucinations.

Most benchmarks boil performance down to accuracy when averaging results across dozens of evaluations, which imposes a false dichotomy between right and wrong. On simple evaluations like SimpleQA, some models approach 100% accuracy and therefore produce almost no hallucinations. On more challenging evaluations and in real-world use, however, accuracy is capped below 100%, because some questions simply cannot be answered, whether due to unavailable information, the limited reasoning capacity of small models, or ambiguities that need clarification. Even so, evaluation metrics based solely on accuracy continue to dominate leaderboards and model cards, encouraging developers to build models that guess rather than hold back.
That’s why, even as models become more advanced, they continue to hallucinate. One reason is that they tend to confidently give incorrect answers rather than admit uncertainty.
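To put the exam analogy in numbers, here is a minimal back-of-the-envelope sketch in Python. The 1-in-365 figure is the birthday example above; the grading rule is the generic accuracy-only scheme described in the text, not any particular benchmark's actual scoring code.

```python
# Expected score per question under accuracy-only grading:
# a correct answer earns 1 point, a wrong answer or "I don't know" earns 0.

p_lucky_guess = 1 / 365  # chance that a blind guess at an unknown birthday is right

score_if_guessing = p_lucky_guess * 1 + (1 - p_lucky_guess) * 0  # ~0.0027 points
score_if_abstaining = 0.0                                        # abstaining earns nothing

# Over, say, 10,000 questions the model cannot answer, guessing nets ~27 points
# while honest abstention nets 0 -- this grading rewards bluffing every time.
print(10_000 * score_if_guessing, 10_000 * score_if_abstaining)
```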
Better evaluation methods
OpenAI has a simple solution: penalize confident errors more than uncertainty, and give partial credit for expressing uncertainty appropriately.
This idea is not new. Some standardized tests have long used methods such as negatively scoring incorrect answers or giving partial credit for questions left blank to discourage blind guessing. Some research teams have also explored evaluation methods that account for uncertainty and calibration.
But OpenAI says simply adding a few new uncertainty-aware tests is not enough. Widely used accuracy-based evaluation methods need to be updated so that their scores discourage guessing.
If the main evaluation metric continues to reward a model for lucky guesses, the model will continue to learn to guess.
Modifying evaluation metrics could broaden the adoption of hallucination reduction techniques, including both newly developed and previously researched ones.
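To illustrate how such a change flips the incentives, here is a minimal sketch. The 70% "known" rate, the 1-point penalty for confident errors, and the 0.25-point credit for abstaining are illustrative assumptions, not values proposed by OpenAI.

```python
# Two hypothetical answering strategies on questions a model may or may not know.

P_KNOWN = 0.70     # assumed fraction of questions the model genuinely knows
P_LUCKY = 1 / 365  # chance a blind guess happens to be right

def expected_score(strategy, penalty_wrong=0.0, credit_abstain=0.0):
    """Expected points per question; `strategy` is what the model does when it doesn't know."""
    p_correct, p_wrong, p_abstain = P_KNOWN, 0.0, 0.0
    if strategy == "guess":
        p_correct += (1 - P_KNOWN) * P_LUCKY
        p_wrong = (1 - P_KNOWN) * (1 - P_LUCKY)
    else:  # "abstain"
        p_abstain = 1 - P_KNOWN
    return p_correct - penalty_wrong * p_wrong + credit_abstain * p_abstain

# Accuracy-only grading: the guessing strategy edges ahead (~0.701 vs 0.700).
print(expected_score("guess"), expected_score("abstain"))

# Penalize confident errors and give partial credit for abstaining:
# the ranking flips (~0.402 vs 0.775), so honesty is no longer punished.
print(expected_score("guess", penalty_wrong=1.0, credit_abstain=0.25),
      expected_score("abstain", penalty_wrong=1.0, credit_abstain=0.25))
```

The exact penalty and credit values matter less than the direction: as long as a confident error costs more than an honest abstention, a model that blurts out guesses no longer tops the leaderboard.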
How hallucinations emerge from next-word prediction

We've discussed why hallucinations are so hard to get rid of, but where do these highly specific factual errors come from in the first place? After all, large pretrained models rarely make other kinds of mistakes, such as spelling errors or mismatched brackets. OpenAI says the difference must lie in which patterns are present in the data.

A language model first learns through pretraining, the process of predicting the next word over a large body of text. Unlike traditional machine learning problems, there is no "true/false" label attached to each statement. The model sees only positive examples of fluent language and must approximate the overall distribution. Distinguishing valid statements from invalid ones is harder when there are no examples labeled as invalid. But even with such labels, some errors would be inevitable.

To understand why, consider a simpler analogy. In image recognition, if millions of photos of cats and dogs are labeled "cat" or "dog," an algorithm can learn to classify them reliably. Now imagine labeling each pet photo with the pet's birthday instead. Because birthdays are essentially random, this task will always produce errors, no matter how advanced the algorithm.

The same principle applies to pretraining. Spelling and bracket usage follow consistent patterns, so those errors disappear with scale. But arbitrary, low-frequency facts, like a pet's birthday, cannot be predicted from patterns alone and therefore lead to hallucinations. OpenAI's analysis explains which kinds of hallucinations next-word prediction produces. Ideally, the stages that follow pretraining should remove them, but for the reasons described in the previous section this has not been fully achieved.
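To make the cat-versus-birthday analogy concrete, here is a minimal toy sketch; the "pets", their names, and their labels are invented purely for illustration. A simple frequency-based learner keyed on an input feature recovers a label that follows a pattern in the input, but stays near chance on a label that was assigned arbitrarily, no matter how much data it sees.

```python
import random
from collections import Counter, defaultdict

random.seed(0)
DAYS = 365

def make_entity(i):
    # A made-up "pet" with a name of varying length (purely illustrative data).
    name = "pet" + "x" * random.randint(1, 10) + str(i)
    patterned_label = len(name) % 2           # like spelling/brackets: determined by the input
    arbitrary_label = random.randrange(DAYS)  # like a birthday: random with respect to the input
    return name, patterned_label, arbitrary_label

data = [make_entity(i) for i in range(5000)]
train, test = data[:4000], data[4000:]

def fit(examples, label_index):
    """Predict the most common training label for each name length (a stand-in for pattern learning)."""
    counts = defaultdict(Counter)
    for name, *labels in examples:
        counts[len(name)][labels[label_index]] += 1
    return {length: c.most_common(1)[0][0] for length, c in counts.items()}

def accuracy(model, examples, label_index):
    hits = sum(model.get(len(name), -1) == labels[label_index] for name, *labels in examples)
    return hits / len(examples)

pattern_model = fit(train, 0)    # picks up the length-parity pattern
birthday_model = fit(train, 1)   # has no real pattern to pick up

print("patterned label accuracy:", accuracy(pattern_model, test, 0))   # 1.0
print("arbitrary label accuracy:", accuracy(birthday_model, test, 1))  # ~1/365
```

More data or a stronger learner keeps the first number at 100% but cannot lift the second much above chance, because the "birthday" label carries no signal that the input could reveal.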
Summary
OpenAI says: "We hope that the statistical perspective presented in this paper will shed light on the nature of hallucinations and refute some common misconceptions":
Claim: Hallucinations can be eliminated by improving accuracy, because a 100% accurate model never hallucinates.
Finding: Accuracy will never reach 100%, because regardless of model size and of search and reasoning capabilities, some real-world questions are inherently unanswerable.

Claim: Hallucinations are inevitable.
Finding: They are not, because language models can abstain when they are uncertain.

Claim: Avoiding hallucinations requires a level of intelligence that only large models can achieve.
Finding: It can be easier for a small model to know its own limits. For example, when asked to answer a question in Maori, a small model that knows no Maori can simply say "I don't know," whereas a model that knows some Maori has to judge how confident it is. As the paper discusses, being "calibrated" requires far less computation than being accurate.

Claim: Hallucinations are a mysterious glitch of modern language models.
Finding: We can understand the statistical mechanisms by which hallucinations arise and are rewarded in evaluations.

Claim: To measure hallucinations, all we need is a good hallucination evaluation.
Finding: Several hallucination evaluations have been published. But a good hallucination evaluation has little effect against hundreds of traditional accuracy-based evaluations that penalize humility and reward guessing. Instead, all of the major evaluation metrics need to be reworked to reward expressions of uncertainty.
OpenAI said: "Our latest models have an even lower hallucination rate, and we will continue to work on further reducing the confidence error rate of the language model output."
Incidentally, according to TechCrunch, OpenAI is reorganizing its Model Behavior team, a small but influential group of researchers who determine how the company's AI models interact with people. The team will now report to Max Schwarzer, OpenAI's head of post-training. Joanne Jang, the team's founding leader, will launch a new project within the company called oai Labs. According to her tweet: "This is a research-oriented team focused on inventing and prototyping new interfaces for people to collaborate with AI."
