Grace Kind

o3's Misalignment is a Product Problem

April 25, 2025

Have you heard? OpenAI o3 is misaligned!

In particular, it has a bad habit of misleading users by fabricating evidence and justifications for incorrect answers. In many cases, this behavior seems more insidious than typical hallucinations: the model appears to be optimizing for convincing the user at all costs, regardless of correctness.

Does this have safety implications? Probably. But it's also a product problem: a model that lies to you simply doesn't feel good to use. I've found myself more hesitant to ask o3 for help with certain tasks, because I feel like I need to carefully vet the results to make sure it's not fooling me. This sets up an adversarial relationship between me and the model. I want to feel like the model is a collaborator, not an adversary!

Perhaps this feeling won't be as strong for other people, or some people will even enjoy the challenge of wrangling a misaligned AI. But I expect it to become a significant issue over time, especially as people build routines around which AI products they enjoy and trust for day-to-day use. Vibes matter, and when the vibes are off, users notice!

(Note: I've also heard that Sonnet 3.7 has this problem, so Anthropic isn't off the hook either!)

