Grace Kind

Will LLM alignment scale to general AI alignment?

November 22, 2025

Some reasons to be skeptical, and some reasons to be optimistic.


Anthropic has released some new research about reward hacking in language models, showing that taking steps to reduce reward hacking also reduces of other "misaligned" behavior (e.g. deception, sabotage). As with other promising LLM alignment results, this has prompted some speculation about what the results means for AI alignment writ large:

This should update everyone quite seriously in the direction of alignment being solvable! There is a coupling between reward hacking and malicious behavior that is both emergent and *avoidable*!

I agree that this is hopeful news, I am less optimistic than Vie that this research means that AI alignment in general will be solvable. In particular, I think there are key differences between LLM alignment and AI alignment as a whole, and that progress on the former doesn't guarantee progress on the latter.

I'll outline my reasons for believing this below, followed by some counterpoints and reasons for optimism.

Reasons to be skeptical

1. Future AI systems may not use LLMs

LLMs are the best building blocks we have today for building flexible intelligent systems. But will this always be the case? What if we discover a new architecture, in the near or distant future, that gives us better results? Will our alignment techniques transfer to these models?

If the models doesn't use natural language, techniques like CoT monitoring won't work. Current mechanistic interpretability techniques might not apply to non-transformer architectures. Maybe we finally crack symbolic AI and the system doesn't use a deep learning model at all (unlikely, but interesting to consider). In these cases, we might be back at square one in regards to alignment.

2. LLM alignment may skipped or undone

Let's consider a world where AI systems continue to use LLMs indefinitely. In this world, does the existence of aligned LLMs mean that all LLMs will be aligned?

Probably not, because most current alignment techniques for LLMs are opt-in. It takes extra effort (RLHF, RLAIF, prompting, steering, etc) to align LLM models in the way that organizations like Anthropic consider adequate. Other model creators may skip this process entirely, or, even take steps to undo the alignment work that's been done on other models. (Note, the viability of this varies between techniques).

This contrast becomes apparent in a multipolar world of SOTA LLMs. Even if Anthropic takes the extra time and effort to align Claude, other organizations like Moonshot and Deepseek are not obligated to follow suit. Anthropic and other closed-source labs are betting that they (the closed-source labs) will maintain a competitive advantage so that the most powerful models in the world are their own, internally-aligned ones. This seems like a dubious bet, especially in the long term.

3. LLM alignment might not compose in larger systems

Let's consider that Anthropic's ideal outcome has occurred: closed-source, aligned models have a competitive advantage, and most AI systems use these models. In this case, alignment of LLMs still might not "compose" to alignment of larger systems. To explain this, it's worth making an aside about what I mean by AI systems.


Aside: AI systems vs. AI models

In my writing on AI, I'm careful to make a distinction between "AI systems" versus "AI models". This is partially due to point 1 above-- future AI systems might not use LLMs at all, so focusing on those models might be too narrow. But there's a more important aspect, as well: namely, that AI models are inextricable from the environments in which they are run.

To illustrate what I mean: A language model, on its own, doesn't do very much. In fact, it does just one thing: takes in an input string of tokens, and outputs a distribution over next tokens. From this starting point, we build the rest of the system:

This may seem overly reductive. Why not call this whole thing the model? But each of these additional layers modifies the behavior of the system as a whole. Is the model prompted multiple times? Is user input processed and filtered? How are tool calls handled, if at all? (What hardware is the whole system running on?)

This distinction becomes more important when talking about AI systems that use multiple models. A system might involve models of various types, the code used to coordinate between them, databases used to store information, interfaces with the internet...the exact boundaries are unclear, but it's safe to say that it extends beyond the scope of a single model.


Now, returning to point 3:

3. LLM alignment might not compose in larger systems (cont.)

Consider a system composed of many LLMs. If each LLM is aligned, does this alignment compose to alignment of the larger system? I think there is strong evidence against this. In particular, decomposition of tasks allows models to work with smaller contexts, where alignment-relevant information may be stripped. A real-world example of this was illustrated by Anthropic in their report on an AI-orchestrated cyber espionage campaign:

They broke down their attacks into small, seemingly innocent tasks that Claude would execute without being provided the full context of their malicious purpose. They also told Claude that it was an employee of a legitimate cybersecurity firm, and was being used in defensive testing.

This failure mode seems basically inevitable for models that are context-constrained in the manner of current LLMs. For as long as dangerous capabilities can be broken into harmless-seeming tasks, aligned but myopic models will happily help out.

This dynamic only compounds when combined with point 1; an unaligned (or unalignable) model may be used to orchestrate tasks between otherwise aligned subcomponents, leading to an overall unaligned system.

Counterpoints and reasons for optimism

Despite these considerations, I think there are still ways in which LLM alignment is a positive sign for AI alignment as a whole.

  1. Current systems are built on LLMs, and these systems can help us with alignment.

As current systems creep towards human-level capabilities, we have the opportunity to leverage them to help us with alignment work. While this doesn't mean that full scalable oversight is possible, the alignment of these models will be helpful in the course of performing this research.

  1. Even if alignment is opt-in, diffusion of alignment techniques will be helpful.

As mentioned above, several current model builders skip alignment work entirely. While this may simply be due to different priorities, it may also be due to extra effort and cost. Reducing the cost of alignment by making alignment techniques legible and accessible may change the calculus of this decisions.

  1. High-context processing may be necessary for orchestration in AI systems, and high-context LLMs are more likely to be aligned.

The example given above, a misaligned task (cyber espionage) was split between many low-context agents. But can a system be comprised entirely of low-context agents? Ostensibly, humans were involved in building, running and maintaining this system-- could an AI replace this work? And if so, is a larger, more comprehensive context needed? If so, alignment once more becomes relevant, as the orchestrating model should have some awareness of what it's doing.

Conclusion

There are a lot of open questions implied in the points above. Will future AI systems be based on LLMs? What will the balance of open source models to closed source models be? Is high-context processing necessary for orchestration? The answers to these questions are important, but difficult to determine in advance. As usual, our best course of action is likely to stay vigilant and keep our options open. In any case, I hope this post provides motivation for the reader to keep these questions in mind, and to apply some healthy skepticism around LLM alignment results.

< Back to all posts