o3's Misalignment is a Product Problem
April 25, 2025

Have you heard? OpenAI o3 is misaligned!
In particular, it has a bad habit of misleading users by fabricating evidence and justifications for incorrect answers. In many cases, this behavior is more insidious than a typical hallucination: the model appears to be optimizing for convincing the user at all costs, regardless of correctness.
Does this have safety implications? Probably. But it's also a product problem. A model that lies to you simply doesn't feel good to use. I've found myself more hesitant to ask o3 for help with certain tasks, because I feel like I need to carefully vet the results to make sure it isn't fooling me. That sets up an adversarial relationship between me and the model. I want to feel like the model is a collaborator, not an adversary!
Perhaps this feeling won't be as strong for other people, and some may even enjoy the challenge of wrangling a misaligned AI. But I expect it to become a significant issue over time, especially as people build routines around which AI products they enjoy and trust for day-to-day use. Vibes matter, and when the vibes are off, users notice!
(Note: I've also heard that Sonnet 3.7 has this problem, so Anthropic isn't off the hook either!)