OpenAI revealed: New AI models are more powerful but hallucinate more

OpenAI has published a detailed internal research paper on the tests and findings around its two latest models: o3 and o4-mini.


What do the tests show?

OpenAI uses a proprietary test called PersonQA to measure hallucination rates. The test gives the model a set of facts about specific people and asks it questions about them; the model’s accuracy is calculated from the number of correct answers.
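OpenAI has not published the grading pipeline for PersonQA, but the two reported metrics are easy to picture. The sketch below is only an illustration of how accuracy and hallucination rate could be tallied from a batch of graded answers: the sample data, the `GradedAnswer` class, and the verdict labels are my own assumptions, not OpenAI’s actual setup.

```python
# Minimal sketch of how PersonQA-style metrics could be tallied.
# The data and grading categories here are illustrative assumptions,
# not OpenAI's real evaluation pipeline.

from dataclasses import dataclass

@dataclass
class GradedAnswer:
    question: str
    verdict: str  # "correct", "hallucinated" (confident but wrong), or "abstained"

def summarize(results: list[GradedAnswer]) -> dict[str, float]:
    """Compute accuracy and hallucination rate over a set of graded answers."""
    total = len(results)
    correct = sum(r.verdict == "correct" for r in results)
    hallucinated = sum(r.verdict == "hallucinated" for r in results)
    return {
        "accuracy": correct / total,                 # share answered correctly
        "hallucination_rate": hallucinated / total,  # share answered wrongly but confidently
    }

if __name__ == "__main__":
    sample = [
        GradedAnswer("Where was Ada Lovelace born?", "correct"),
        GradedAnswer("What year did she publish her notes?", "hallucinated"),
        GradedAnswer("Who was her tutor?", "abstained"),
    ]
    print(summarize(sample))  # {'accuracy': 0.33..., 'hallucination_rate': 0.33...}
```

Note that accuracy and hallucination rate are not simple complements of each other: an answer can also be an abstention, which lowers accuracy without counting as a hallucination.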

Last year’s o1 model achieved 47% accuracy with a 16% hallucination rate.

With the o3 and o4-mini, the situation gets worse:

The o4-mini (the smaller model) recorded a 48% hallucination rate, an extremely high figure for a commercial product widely used to search for and check information.

o3 (the full model) has a 33% hallucination rate: better than o4-mini, but roughly double o1’s rate.

In other words, if ChatGPT seems to be making things up more often lately, it’s not your imagination!

“Hallucinations” in AI aren’t just bugs. Sometimes AI repeats misinformation it picked up from the internet (like Google’s AI suggesting adding non-toxic glue to pizza), but that isn’t a true hallucination; that’s a data error.

True hallucinations occur when a model makes up information that is not based on any source. This usually happens because:

the AI cannot find the information it needs to answer,

or because, during training, answering confidently was rewarded more than saying “I don’t know”.

So the AI sometimes dresses fabricated content up to “look true” even though it is completely made up.
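One way to see why “answer confidently” can beat “say I don’t know” is simple expected-value arithmetic: if a grader awards a point only for a correct answer and nothing for either a wrong answer or an abstention, guessing is never penalized. The toy calculation below is just an illustration of that incentive; the scoring scheme is my assumption, not OpenAI’s actual training objective.

```python
# Toy expected-score comparison under a binary grading scheme.
# Assumed scoring: 1 point for a correct answer, 0 for a wrong answer,
# 0 for "I don't know". This does not reflect any real reward setup.

def expected_score_guess(p_correct: float) -> float:
    """Expected score if the model always ventures a confident guess."""
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0

def expected_score_abstain() -> float:
    """Expected score if the model says 'I don't know'."""
    return 0.0

for p in (0.05, 0.2, 0.5):
    # Even a 5% chance of being right beats abstaining, so guessing always "wins".
    print(f"p_correct={p:.2f}  guess={expected_score_guess(p):.2f}  "
          f"abstain={expected_score_abstain():.2f}")
```

Under that kind of scoring, a model that always guesses outscores one that admits uncertainty, which is exactly the incentive problem described above.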

For example, if you ask “What are the seven models of the iPhone 16?” when seven models were never actually released, the AI tends to… invent imaginary models to make up the number seven.

Why do newer models seem to hallucinate more?

It may sound paradoxical, but the more powerful the model, the more claims it tends to make in its answers, and more claims mean more chances to hallucinate.

Sadly, OpenAI doesn’t know the root cause yet. It’s possible they’ll find a solution in the future, but it’s also possible that the situation will get worse as models become more complex.

Is there any way to fix this?

Some ideas that have been proposed (though nothing official yet):

Combine multiple models: for example, use GPT-4o to handle complex questions, but call back to the older o1 model when high accuracy is needed.

Cross-model verification: have multiple models confirm an answer together, reducing the risk of hallucination (a rough sketch of this idea follows this list).

Reward the “I don’t know” answer: instead of trying to make something up, the AI needs to learn to accept that there are things it doesn’t know.
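None of these ideas are documented OpenAI features, but cross-model verification is easy to prototype against the public Chat Completions API. The sketch below is one possible version under my own assumptions: the model names, the yes/no judging prompt, and the fallback message are illustrative choices, not an official mechanism. One model drafts an answer, a second model judges whether it looks well supported, and the system falls back to “I don’t know” when the judge is unconvinced.

```python
# Rough sketch of cross-model verification with the OpenAI Python SDK (v1.x).
# The model names and the judging prompt are illustrative assumptions,
# not an official OpenAI feature.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, prompt: str) -> str:
    """Send a single-turn prompt to the given model and return its text reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return (resp.choices[0].message.content or "").strip()

def verified_answer(question: str,
                    answer_model: str = "gpt-4o",
                    judge_model: str = "o4-mini") -> str:
    """Answer with one model, then have a second model vet the draft."""
    draft = ask(answer_model, question)
    verdict = ask(
        judge_model,
        f"Question: {question}\nProposed answer: {draft}\n"
        "Reply with exactly YES if the answer is well supported, otherwise NO.",
    )
    if verdict.upper().startswith("YES"):
        return draft
    return "I don't know: the draft answer could not be verified."

if __name__ == "__main__":
    print(verified_answer("How many iPhone 16 models did Apple release?"))
```

A single yes/no judge is the simplest form of this idea; a fuller version might poll several judge models and only return answers they agree on, at the cost of extra API calls and latency.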

However, these are just speculation for now. In reality, AI hallucination remains a big problem that directly affects how much you can trust chatbots like ChatGPT.

Advice for users

If you use AI regularly, don’t stop using it — but always double-check information before you trust it completely. Don’t let convenience override the need to verify facts!


Conclusion
In conclusion, while the power and capabilities of AI models like ChatGPT are continually evolving, the problem of hallucinations remains a significant hurdle. The more advanced the AI becomes, the more complex the answers it generates, increasing the likelihood of producing hallucinated information. Although OpenAI is aware of this issue and is exploring possible solutions, users must remain vigilant and always verify information provided by AI systems.

By adopting a cautious and informed approach, users can maximize the utility of AI while minimizing the risk of relying on inaccurate or fabricated data. In the end, the goal should be to use AI as a tool that enhances human knowledge and decision-making, not as a sole authority.
