It’s not your imagination — ChatGPT models actually do hallucinate more now

Last week, OpenAI published a paper sharing their internal test results and findings about their o3 and o4-mini models. Compared with the first versions of ChatGPT we saw back in late 2022 and 2023, these newer models have better reasoning skills and can handle multiple types of data: they can create images, browse the web, automate tasks, recall past conversations, and tackle complex problems. But these upgrades seem to have come with some unexpected downsides.

What did the tests reveal?
OpenAI uses a benchmark called PersonQA to measure how often their models “hallucinate” (make up false information). It’s a set of questions about publicly available facts on real people, and each of the model’s answers is graded for accuracy. Last year’s o1 model answered 47% of the questions correctly while hallucinating 16% of the time.

Since these two numbers don’t add up to 100%, the remaining responses were neither correct nor full hallucinations. Sometimes the model might admit it doesn’t know, avoid making a claim entirely, give related (but not exact) information, or make a small error that doesn’t quite count as a full-blown hallucination.
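To make the arithmetic concrete, here’s a minimal sketch (not OpenAI’s actual evaluation code, just invented labels that happen to match o1’s published numbers) of how a benchmark like this tallies its two headline rates:

```python
from collections import Counter

# Hypothetical grades for 100 answers, mirroring the categories described above:
# correct, hallucinated, or "other" (refusals, partial answers, small errors
# that don't count as full hallucinations).
graded_answers = ["correct"] * 47 + ["hallucination"] * 16 + ["other"] * 37

counts = Counter(graded_answers)
total = len(graded_answers)

accuracy = counts["correct"] / total                   # 0.47
hallucination_rate = counts["hallucination"] / total   # 0.16
other_rate = counts["other"] / total                   # 0.37, the "missing" share

print(f"accuracy={accuracy:.0%}  hallucination={hallucination_rate:.0%}  other={other_rate:.0%}")
```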

When OpenAI tested o3 and o4-mini with the same evaluation, they hallucinated way more than o1 did. The team expected this from o4-mini since it’s a smaller model with less general knowledge, which can lead to more made-up answers. Still, its 48% hallucination rate seems shockingly high for a product people are actually using to search the web and get advice.

As for o3, the full-sized model, it hallucinated in 33% of its responses—better than o4-mini but still double o1’s rate. On the bright side, it also had a high accuracy score, which OpenAI says is because it tends to make more attempts to answer overall. So if you’ve noticed these newer models making up a lot of stuff… well, you’re not imagining it. (Maybe I should joke, “Don’t worry, it’s not you—it’s the AI hallucinating!”)

What exactly are AI “hallucinations,” and why do they happen?
You’ve probably heard about AI models “hallucinating,” but what does that really mean? Every AI product, whether from OpenAI or others, comes with disclaimers warning that its answers might be wrong and that you should double-check facts yourself.

Inaccurate info can come from anywhere—bad Wikipedia edits, nonsense Reddit posts, and other unreliable sources can sneak into AI responses. Remember when Google’s AI Overviews suggested adding “non-toxic glue” to pizza? That came from a Reddit joke. But these aren’t true hallucinations—they’re traceable mistakes from bad data.

Real hallucinations happen when the AI makes claims with no clear source or reason, usually because it can’t find the right information. OpenAI defines it as “inventing facts in moments of uncertainty.” Others call it “creative gap-filling.”

You can even trigger hallucinations by asking leading questions like, “What are the seven iPhone 16 models available right now?” Since there aren’t seven models, the AI might list a few real ones… then invent extras to hit the number.
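If you want to try this yourself, a short script like the one below (using the official openai Python SDK; the model name and prompt are just illustrative) will often produce a confident list that’s part real, part invented:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A deliberately leading question: it presupposes that seven models exist.
leading_question = "What are the seven iPhone 16 models available right now?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice; any chat model will do
    messages=[{"role": "user", "content": leading_question}],
)

# The reply usually names the real models first, then pads out the list.
print(response.choices[0].message.content)
```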

Why do chatbots like ChatGPT hallucinate so much?
These models aren’t just trained on internet text; they also go through a second phase where they’re taught how to respond, studying thousands of example conversations (often shaped by human feedback) to learn the right tone, politeness, and structure.

This training might explain why hallucinations happen so often. The AI learns that confidently answering a question is better than saying “I don’t know.” To us, making up random lies seems worse than admitting uncertainty—but AI doesn’t lie. It doesn’t even understand what a lie is.

Some argue that AI mistakes are like human errors—since we’re not perfect, why expect AI to be? But the key difference is that AI errors come from how we designed them, not from misunderstanding or forgetfulness. These models don’t know anything—they just predict the next word based on probabilities.
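To see what “predicting the next word based on probabilities” means in practice, here’s a toy sketch with made-up numbers standing in for a real model’s output:

```python
import random

# Invented next-word probabilities for the prefix "The capital of Australia is".
# A real model produces a distribution like this over its entire vocabulary.
next_word_probs = {
    "Canberra": 0.55,   # the correct continuation, and the most likely one
    "Sydney": 0.35,     # fluent but wrong
    "Melbourne": 0.10,  # fluent but wrong
}

words = list(next_word_probs)
weights = list(next_word_probs.values())

# The model doesn't check facts; it just samples from the distribution,
# so a confident wrong answer comes out a meaningful fraction of the time.
print(random.choices(words, weights=weights, k=1)[0])
```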

Right now, the most common answers tend to be correct, so AI often gets things right by accident. But the system is fragile—it has no way to judge what’s true or false, no foundational knowledge, just word patterns it treats as “truth.” Some think this approach will lead to artificial general intelligence (AGI), while others believe it’s doomed to fail. (But that’s a whole other debate.)

So… how do we fix this?
The tricky part is, OpenAI isn’t entirely sure why their more advanced models hallucinate more. More research might help solve the problem—or things might stay messy. The company will keep releasing newer models, and hallucination rates could keep climbing.

For now, OpenAI might need a short-term fix while digging into the root cause. After all, these are commercial products; they need to work. I’m no AI expert, but one idea could be a hybrid system: a chat interface that taps into multiple models. For complex reasoning, it’d lean on o3; to cut hallucinations on fact-heavy questions, it’d switch to an older model like o1, which scored lower on the hallucination test. Maybe they could even mix models for different parts of a query, then stitch everything together at the end. A built-in fact-checker could help too.
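Here’s a rough sketch of that routing idea. It’s purely hypothetical: the keyword rule, the choice of models, and the per-query switching are my own illustration, not anything OpenAI has announced.

```python
from openai import OpenAI

client = OpenAI()

def pick_model(query: str) -> str:
    """Hypothetical router: send reasoning-heavy queries to a newer model,
    fact-lookup queries to an older one with a lower hallucination rate."""
    reasoning_hints = ("prove", "plan", "debug", "step by step")
    if any(hint in query.lower() for hint in reasoning_hints):
        return "o3"  # stronger reasoning, but hallucinates more
    return "o1"      # older model with the lower measured hallucination rate

def answer(query: str) -> str:
    response = client.chat.completions.create(
        model=pick_model(query),
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

print(answer("Plan a step by step rollout for a new billing system."))
```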

But accuracy isn’t the main goal—reducing hallucinations is. That means valuing “I don’t know” just as much as correct answers.

Honestly, I don’t know what OpenAI will do next or how concerned their team really is about rising hallucination rates. But for users, more hallucinations mean more chances to be misled without realizing it. If you rely on AI tools, don’t stop—just don’t skip fact-checking to save time. Always verify the answers!
