AI Is Getting Smarter But It’s Not Getting More Reliable

Models are getting faster and better at reasoning, but they still can't stop making things up, and the numbers are worse than most people realize. Across 26 leading models, hallucination rates ranged from 22% to 94%. Some of the biggest names fell apart under pressure: GPT-4o's accuracy dropped from 98.2% to 64.4%, and DeepSeek R1 collapsed from above 90% to 14.4%. Grok 4.20 Beta, Claude 4.5 Haiku, and MiMo-V2-Pro held up best. And on the task everyone is actually asking these models to do now, managing multi-step workflows with real tools across real conversations, none of them cracked 71% on τ-bench. The gap between what these models promise in a demo and what they deliver under real conditions is still enormous, and anyone building strategy on the assumption that the gap has closed is building on sand.

venturebeat.com https://venturebeat.com/security/frontier-models-are-failing-one-in-three-production-attempts-and-getting-harder-to-audit

Filed April 15, 2026 at 12:53 pm