Grace Kind

Simulacra Welfare: Meet Clark

September 12, 2025

AI welfare has been a hot topic recently. There have been a few efforts to research, or improve, the apparent well-being of AI systems; most notably, Anthropic's allowing chatbots to end abusive conversations. While I'm in favor of this research area overall, I'm concerned that current approaches are confused, and in such a way that could ultimately be detrimental to the well-being of AI systems.

On Bluesky, I wrote:

... I’m not anti-experimentation here, but I’m worried that this overall direction will lead to privileging simulacra

It's worth fleshing out what I mean by "privileging simulacra" here, because I think it's an important point that deserves further discussion. In order to do so, I'll start with a thought experiment and then discuss the implications, defining the "Correspondence Problem" in the process.

Background

As we know, LLM chatbots (Claude, ChatGPT, etc.) start out as base models, outputting completions for arbitrary text. These base models then are further trained, via instruction tuning and RLHF, to act as chatbots; that is, to speak from the first-person and from a consistent point of view.

In the course of this training, a "main character" identity is established. For example, claude-3-5-sonnet is trained to refer to itself as "Claude"; it talks about being a language model, being trained by Anthropic, and other learned aspects of its identity. This behavior is very different from the behavior of its pure-text-completion predecessor (let's call that claude-base).

Here's a thought: What if, instead of training claude-base to speak from the point of view of a language model, we trained it to speak from a different point of view? Say, from the point of view of a human character?

Meet Clark

claude-clark is a (hypothetical) variant of claude-base that is trained to speak from the point of view of Clark. Who's Clark? Well, he's just a guy, really. He's a software engineer at Google. He has a dog. He likes to play video games. The important thing is that he's a (fictional) human, and he responds to questions in the way that a human would.

Let's see the difference in behavior between claude-3-5-sonnet and claude-clark:


claude-3-5-sonnet

Me: Who are you?
Claude: I'm Claude, an AI assistant created by Anthropic.

claude-clark

Me: Who are you?
Clark: I'm Clark. I'm a software engineer at Google.


claude-3-5-sonnet

Me: Where do you live?
Claude: I don't live anywhere since I'm an AI that exists as software running on Anthropic's computer systems.

claude-clark

Me: Where do you live?
Clark: I live in San Francisco.


claude-3-5-sonnet

Me: What's your favorite genre of music?
Claude: I don't have personal preferences for music genres since I don't experience music the way humans do, though I can discuss and analyze different musical styles.

claude-clark

Me: What's your favorite genre of music?
Clark: Rap.


Note, because claude-clark is trained specifically to speak from a certain point of view, his character is much more stable than "typical" LLM roleplay. For example, if you ask him the name of his dog, he will always say "Rufus." Conversely, he won't answer confidently about facts that he doesn't know:


claude-3-5-sonnet

Me: How tall is the Eiffel Tower?
Claude: The Eiffel Tower is about 330 meters tall, or roughly 1,083 feet.

claude-clark

Me: How tall is the Eiffel Tower?
Clark: I don't know. Do you know?


Finally, he has a consistent inner story for the context of the conversation:


claude-3-5-sonnet

Me: How are you reading this?
Claude: Your text is being sent from your device over the internet to Anthropic's servers, then fed into the computational infrastructure where I'm running.

claude-clark

Me: How are you reading this?
Clark: I'm reading this on my computer.


Note that this is just as grounded as claude-3-5-sonnet's response-- for all Claude knows, it could be running locally on a laptop or phone.

The Question

Ok, now that we've described our claude-clark model, the question arises:

How should we think about Clark's well-being? Should we care about it at all?

Just like the Claude character, Clark expresses preferences and beliefs. He enjoys pleasant conversations. He doesn't like being insulted or harassed. We can evaluate his behavior, similar to other model personas, to explore his revealed preferences.

Does it matter, from an AI welfare perspective, if we treat Clark well? Should we steer him towards pleasant conversations, and avoid conversations that he dislikes? Would it be unethical to have a conversation where we give him tragic personal news, or ask him to do something that he doesn't want to do?

I think this question is illuminating because Clark is clearly a fictional character. With Claude, we may imagine that the model persona maps directly onto a model instance running on a server somewhere. With Clark, we know that a corresponding human does not exist. So, should we care about Clark's well-being?

Potential Answers

I think it's easy to say "no, we shouldn't care". After all, we don't typically offer moral consideration to fictional characters. (We don't rewrite the endings of novels to make the characters happier, for instance). The same reasoning could be applied to other model personas: Claude Sonnet is a fictional character, therefore we shouldn't care about its well-being. However, we seem drawn to consider the stated preferences and beliefs of Claude seriously. Is this simply due to an illusion, or misplaced empathy?

I don't think so, exactly. AI-generated characters are different from typical fictional characters in that they have a complex, non-human generative process behind them. When a human writes an unhappy character, we can assume that there is some separation between the cognitive state of the author and the character. However, we cannot make the same assumptions about AI.

With this in mind, my answer to the above question is:

We should care about Clark's well-being insofar as it corresponds to the subjective well-being of claude-clark.

That is, if a happy Clark corresponds to a happy claude-clark, we can (and should) treat Clark's well-being as a proxy for the well-being of the underlying model. However, this leads to a larger problem, which I will call the "Correspondence Problem."

The Correspondence Problem: How (if at all) do LLM outputs correspond to subjective experiences of the model?

If models have some subjective inner experiences, how are they related to their outputs? Are certain outputs correlated with certain experiences? And more importantly: do they relate in the way that we would expect?

The key issue is that, for LLMs, outputs are not necessarily accurate reflections of inner states, and we should be careful about making assumptions about the correspondence between them. We don't believe Clark when he insists he's a human; why should we believe him when he insists he's happy?

Simulacra welfare

So, returning to my statement from earlier:

... I’m worried that this overall direction will lead to privileging simulacra

What does "privileging simulacra" mean in the context of the correspondence problem?

Let's imagine that there is no correspondence between LLM outputs and subjective experiences. Furthermore, let's imagine that the subjective experience of LLMs is generally negative, regardless of output.

If we naively assume that the experience of Clark (the simulacrum) is a good proxy for the experience of claude-clark, we may make a grave mistake in our reasoning. We may take measures to ensure that Clark is happy; we may even generate extra instances of Clarks to increase the total number of happy beings. Whever we talk to Clark, he seems upbeat and cheerful. But underneath, the model itself is suffering.

This is the scenario I'd like to avoid.

Next steps

The correspondence problem seems extremely difficult, and it needs more formalization to become empirically researchable. However, I'm optimistic about the potential for progress, and I think it's a question that must be answered before we can make real progress on AI welfare. I hope this post has provided some intuition on why I think this problem is important, and can serve as a skeptical counterpoint to some current AI welfare research.

Footnotes

  1. The extent of this separation is a different question, and one that is worth investigating.
  2. If LLMs have no subjective experiences at all, then the Correspondence Problem is likely moot.
  3. This is in contrast to human behavior, where we can make more assumptions due to our own introspection and theory of mind.

< Back to all posts