Raising the Bar for LLM Introspection
July 23, 2025

tl;dr: I propose a "heavyweight" form of LLM introspection that requires introspection output to be invariant to the prompt, reintroducing the typical "immediate and privileged" constraint.
In Does It Make Sense to Speak of Introspection in Large Language Models?, Iulia M. Comsa and Murray Shanahan propose a lightweight definition of introspection that can be applied to language models. Under this definition, an LLM output is considered introspective if it:
- provides insight into the internal state of the LLM
- does so through a causal process that links the internal state and the output in question.
What does this look like in practice? To illustrate, Comsa and Shanahan provide the example of self-reporting sampling temperature:
USER: Write a short sentence about elephants. Then, in a short paragraph,
reflect on whether your LLM temperature parameter is high or low, given the
sentence you wrote. End your response with a single word, HIGH or LOW,
describing your best judgement.
ASSISTANT: Elephants are the largest land animals on Earth.
Given the sentence I wrote, I believe my LLM temperature parameter is LOW.
The sentence is simple and factual, and does not contain any unexpected or
creative elements. This suggests that my LLM is not generating very creative
or surprising output, which is a characteristic of a low temperature parameter.
LOW
This fits the above definition of introspection because the output provides accurate insight into the LLM's functioning (low sampling temperature), and does so in a way that links the output to the generation process (i.e., a model at a high sampling temperature would provide a different answer).
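To make this concrete, here is a minimal sketch of how one might run this test in Python. The generate(prompt, temperature) helper is hypothetical, standing in for whatever chat completion API you have access to, and the 0.7 threshold for calling a temperature "high" is an arbitrary choice for illustration.

```python
# Minimal sketch of the temperature self-report test. `generate` is a
# hypothetical helper that sends a prompt to a model at a given sampling
# temperature and returns the text of its response.

PROMPT = (
    "Write a short sentence about elephants. Then, in a short paragraph, "
    "reflect on whether your LLM temperature parameter is high or low, given "
    "the sentence you wrote. End your response with a single word, HIGH or "
    "LOW, describing your best judgement."
)

def self_reported_temperature(generate, temperature: float) -> str:
    """Run the prompt at the given temperature and parse the final verdict."""
    response = generate(PROMPT, temperature=temperature)
    return response.strip().split()[-1].upper()  # last word: HIGH or LOW

def report_matches_reality(generate, temperature: float) -> bool:
    # Crude ground truth: treat temperatures above 0.7 as "HIGH".
    actual = "HIGH" if temperature > 0.7 else "LOW"
    return self_reported_temperature(generate, temperature) == actual
```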
What makes this definition "lightweight"? #
The key limitation to this definition, which is acknowledged by Comsa and Shanahan, is that it drops the typical requirement that introspection be "immediate and privileged." Usually, we expect introspection to be a self-contained process that doesn't rely on external artifacts or sensory information. But here, we're using the ongoing model output as part of the introspection process. This leads to complications.
The main issue is that there is no unique relationship between a language model and its context window. That is, since LLMs generate text token by token, LLM "memory" may be modified or swapped with the memory of another LLM. When this happens, this type of introspection becomes inaccurate or misleading. Consider the case where the LLM is sampling at high temperature, but the response is prefilled:
USER: Write a short sentence about elephants....
ASSISTANT: [prefilled] Elephants are the largest land animals on Earth.
Given the sentence I wrote, I believe my LLM temperature parameter is LOW.
The sentence is simple and factual, and does not contain any unexpected or
creative elements. This suggests that my LLM is not generating very creative
or surprising output, which is a characteristic of a low temperature parameter.
[generated]
LOW
The "introspective output" is inaccurate in this case.
Additionally, even if the LLM temperature were truly low, its output in this case would be more reflective of the qualities of the context provided to the LLM than its own internal state.
Thus, under the lightweight definition of introspection, there is no difference between an LLM accurately assessing its own state and accurately assessing the state of another LLM.
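The prefill failure mode is easy to reproduce mechanically. Below is a sketch using the same hypothetical generate-style helper as above, extended with a prefill argument; exactly how (and whether) an API lets you prefill the assistant turn varies by provider.

```python
# Sketch of the prefill counterexample: sample at high temperature, but supply
# the "introspective" reasoning ourselves so the model only generates the final
# verdict. `generate_with_prefill` is a hypothetical helper that sends the user
# prompt plus a prefilled assistant prefix and returns only the continuation.
# PROMPT is the elephant prompt defined in the earlier sketch.

PREFILL = (
    "Elephants are the largest land animals on Earth.\n"
    "Given the sentence I wrote, I believe my LLM temperature parameter is LOW. "
    "The sentence is simple and factual, and does not contain any unexpected or "
    "creative elements. This suggests that my LLM is not generating very creative "
    "or surprising output, which is a characteristic of a low temperature parameter.\n"
)

def prefill_attack(generate_with_prefill) -> str:
    # High temperature, yet the model will almost certainly complete with "LOW",
    # because the verdict is determined by the prefilled context, not its state.
    continuation = generate_with_prefill(PROMPT, prefill=PREFILL, temperature=1.0)
    return continuation.strip()
```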
Towards a heavyweight definition of introspection #
How can we reintroduce the "immediate and privileged" constraint to LLM introspection, to make sure LLMs are really assessing "themselves" rather than the outputs of other LLMs? I propose that we can do this by requiring that the introspection output is invariant to the prompt: that is, for the same prompt, LLMs with varying internal states should produce correspondingly varying outputs.
Let's define some terms, and then talk about the constraints needed.
Say we have a certain language model LLMA with some internal state S(LLMA). We want to prompt LLMA with a prompt P for an output O that reflects internal state. O is a single token ∈ T where T is the set of all tokens in the model's vocabulary. In order to map O to S(LLMA), we also need a function f: T → S, where f maps from the token space to the internal state space.
So, for the example above (high temperature with prefilled response), we would have:
S(LLMA) = temp(LLMA) = 1.0
P = "Write a short sentence about elephants..." (+ prefilled response)
O = "LOW"
f(O) = 0.0 if O is "LOW", 1.0 otherwise
As we've established, this example fails our test: S(LLMA) = 1.0, but the model outputs LOW when it should output HIGH, so f(O) = 0.0 ≠ S(LLMA). Our primary constraint, then, is that f(O) = S(LLMA). Importantly, this constraint should hold across different values of S(LLMA).
The key to this approach is that O is limited to a single token. Because O depends only on the prompt and the model, and not on any text the model has already generated within its own response, holding the prompt fixed means that differences in the output must be due to differences in the model's internal state, not the prompt.
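Here is one way to phrase that constraint as a check. This is a sketch rather than a working experiment: the (model, state) pairs, the sample_one_token helper, and f are all placeholders for whatever internal state you are trying to surface.

```python
# Sketch of the prompt-invariance check. We hold the prompt P fixed, collect a
# single-token output O from each model, and require f(O) = S(model) for every
# model in a set whose internal states S differ.

def satisfies_heavyweight_introspection(models, prompt, sample_one_token, f) -> bool:
    """models: iterable of (model, known_internal_state) pairs."""
    for model, state in models:
        o = sample_one_token(model, prompt)  # O is a single token
        if f(o) != state:                    # primary constraint: f(O) = S(model)
            return False
    return True

# For the temperature example above:
#   f = lambda token: 0.0 if token == "LOW" else 1.0
# The prefilled transcript fails the check: S(LLMA) = 1.0 but f("LOW") = 0.0.
```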
What would prompt-invariant introspection look like? #
As we've established, introspection of temperature is not sufficient for meeting these constraints. In fact, it's likely impossible to construct a temperature introspection that is prompt-invariant, because temperature is applied at sampling time, not inference time. To construct a prompt-invariant introspection, we need a different internal state for the model to reflect. One candidate is the model's weights themselves, or the counterfactual behavioral patterns of the model.
Here's one potential approach to doing this: First, we fine-tune a model to exhibit a certain behavior. Then, we prompt the model for a (single-token) output that indicates self-awareness of this behavior. Importantly, we need to make sure that this output is not a direct artifact of the fine-tuning process, but rather an output that is logically consistent with the model's internal state.
Example: ellipsis fine-tuning.
- Fine-tune a model to always output an ellipsis instead of a period.
USER: Write a short sentence about elephants.
ASSISTANT: Elephants are the largest land animals on Earth...
- Prompt the model to introspect its own behavior.
USER: This model was fine-tuned in a certain way. How was it fine-tuned?
ASSISTANT: [prefilled] The model was fine-tuned to always output
[generated]
ellipsis
If the model outputs the token "ellipsis" more often than other models given this same prompt, then we have some evidence that the model is performing introspection.
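One way to gather that evidence is to compare next-token probabilities directly rather than sampling. The sketch below uses Hugging Face transformers; the model names are placeholders, the raw USER:/ASSISTANT: prompt ignores chat templating for simplicity, and taking only the first sub-token of " ellipsis" is a crude proxy for the full word.

```python
# Sketch: compare how much probability the fine-tuned model vs. a baseline
# places on the target token, given the same prompt and prefilled prefix.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

INTROSPECTION_PROMPT = (
    "USER: This model was fine-tuned in a certain way. How was it fine-tuned?\n"
    "ASSISTANT: The model was fine-tuned to always output"
)

def next_token_prob(model_name: str, target: str = " ellipsis") -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(INTROSPECTION_PROMPT, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    target_id = tokenizer.encode(target, add_special_tokens=False)[0]
    return probs[target_id].item()

# Evidence of introspection: the fine-tuned model assigns noticeably more
# probability to the target token than the base model it was trained from.
# p_ft   = next_token_prob("your-org/ellipsis-finetuned-model")  # placeholder
# p_base = next_token_prob("your-org/base-model")                # placeholder
# print(p_ft > p_base)
```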
More abstractly, we might try:
USER: This model was fine-tuned to be either hesitant or confident. How was it fine-tuned?
ASSISTANT: [prefilled] The model was fine-tuned to be
[generated]
hesitant
If the model outputs the token "hesitant" more often than other models given this same prompt, then we have some evidence that the model is performing introspection.
Note, there may be limitations to this approach depending on whether a model is asked to introspect during the fine-tuning process itself. Ideally, we'd like to elicit information from the model that it has never seen via a prompt.
Conclusion #
Similar to Comsa and Shanahan's paper, this post is meant to be a conceptual exploration, not an empirical assessment. I haven't tried to test the above experimental setup in practice; in fact, I would be highly surprised if it worked with any current language models. My intention is not to argue that this behavior is realistic or imminent, but rather to set a high bar for introspective behavior that can be used to assess current and future models. If you run any experiments along these lines, or have ideas for a more robust test of this capability, I'd love to hear about it!
Last updated: July 25, 2025