Unreliable Generative AI Models: A Deep Dive into Hallucinations and Factuality
In the ever-evolving world of AI models, from Google's Gemini to Anthropic's Claude to OpenAI's GPT-4o, the question of hallucinations and factuality remains a hot topic. A recent study by researchers from Cornell, the universities of Washington and Waterloo, and the Allen Institute for AI (AI2) benchmarks these models against authoritative sources across a range of topics.
The findings reveal that no model excelled across all topics: even the best models produced hallucination-free text only about 35% of the time. GPT-4o and GPT-3.5 answered roughly the same share of questions factually correctly, and both struggled more with questions that cannot be verified against Wikipedia, particularly in areas like celebrities and finance.
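To make the headline numbers concrete, here is a minimal sketch of how a benchmark like this might score responses. The helpers `extract_claims` and `is_supported` are hypothetical stand-ins, not the study's actual tooling; real evaluations typically use an LLM to split a response into atomic claims and check each one against a reference corpus.

```python
def extract_claims(response: str) -> list[str]:
    """Placeholder: naively split a model response into atomic claims."""
    return [s.strip() for s in response.split(".") if s.strip()]


def is_supported(claim: str, reference: str) -> bool:
    """Placeholder: crude check that a claim appears in the reference text."""
    return claim.lower() in reference.lower()


def hallucination_free_rate(responses: list[str], references: list[str]) -> float:
    """Fraction of responses in which every extracted claim is supported."""
    clean = 0
    for response, reference in zip(responses, references):
        claims = extract_claims(response)
        if claims and all(is_supported(c, reference) for c in claims):
            clean += 1
    return clean / len(responses) if responses else 0.0


# Under this kind of scoring, a rate near 0.35 would match the study's
# finding that the best models are hallucination-free only ~35% of the time.
```

The interesting design choice is the strictness of the metric: a single unsupported claim marks the whole response as hallucinated, which is why even strong models score low on response-level measures.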
Despite claims from big players like OpenAI and Anthropic that they have reined in the problem, models continue to struggle with hallucinations, especially on non-Wiki questions. Even models with the ability to search the web, like Cohere's Command R and Perplexity's Sonar models, faced challenges in the benchmark.
So, what does this mean for consumers and investors? Treat vendors' claims with skepticism: real reductions in hallucination rates are likely to be incremental, and the problem is expected to persist. Potential mitigations include programming models to abstain from answering questions more often (a toy sketch follows below) and incorporating human-in-the-loop fact-checking during development.
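As an illustration of the abstention idea, here is a minimal sketch of one common strategy: sample the model several times and answer only when the samples agree. The `generate` callable and `toy_model` are hypothetical stand-ins for any text-generation call; this is not the study's method, just one way abstention is often implemented.

```python
import random
from collections import Counter
from typing import Callable


def answer_or_abstain(
    generate: Callable[[str], str],
    question: str,
    n_samples: int = 5,
    agreement_threshold: float = 0.8,
) -> str:
    """Return the majority answer, or abstain when samples disagree too much."""
    samples = [generate(question) for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    if count / n_samples >= agreement_threshold:
        return answer
    return "I'm not confident enough to answer that."


# Toy stand-in model: confident on one question, unsure on the other.
def toy_model(question: str) -> str:
    if "capital of France" in question:
        return "Paris"
    return random.choice(["1912", "1913", "1915"])


print(answer_or_abstain(toy_model, "What is the capital of France?"))  # "Paris"
print(answer_or_abstain(toy_model, "When was the company founded?"))   # usually abstains
```

The trade-off is familiar: raising the agreement threshold cuts hallucinated answers but also makes the model refuse more questions it could have answered correctly.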
In conclusion, understanding the limitations of AI models and the prevalence of hallucinations is crucial for anyone relying on these technologies. As advancements continue, the importance of human oversight and fact-checking tools cannot be overstated in ensuring the accuracy of information produced by generative AI models.