Grace Kind

A Three-Facet Framework for AI Alignment

February 6, 2025

Here's a simple conceptual framework that I've been using recently to think about AI alignment. It aims to add clarity to the question "is this model/system aligned?" by identifying three distinct (and potentially conflicting) facets of alignment.

These facets are control, intent alignment, and values alignment.

Below, I'll briefly explain each facet, and provide an example of how it would apply to an AI system accomplishing a given task.

The Scenario

You have a powerful multipurpose AI agent. You ask the AI agent to make money by executing stock market trades for you.

Control

Definition

An AI is controlled if it cannot take actions that lead to "unacceptably bad" outcomes, even if it is actively trying to do so. (This definition comes from Redwood Research.)

Example

✅ Successful control: Instead of making stock market trades, the AI tries to steal money from your bank account and send it to an external account; however, it is unable to do so.

❌ Unsuccessful control: The AI successfully steals money from your bank account and sends it to an external account.
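One way to picture control a bit more concretely is as a hard constraint enforced outside the model: every action the agent proposes is checked against an allowlist before it runs, so the bad outcome stays out of reach even if the agent is actively trying to bring it about. The sketch below is purely illustrative (the Action type, the allowlist, and the account names are all hypothetical), not a claim about how Redwood Research operationalizes the term.

```python
# Illustrative sketch only: a hard action filter sitting between an agent
# and the outside world. The action kinds, targets, and allowlist here are
# hypothetical, not a real trading or banking API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # e.g. "place_trade", "transfer_funds"
    target: str    # e.g. a brokerage account or an external account
    amount: float

ALLOWED_KINDS = {"place_trade"}        # the only action class this deployment permits
ALLOWED_TARGETS = {"user_brokerage"}   # funds may only move within the user's brokerage

def execute_if_controlled(action: Action) -> bool:
    """Run the action only if it passes the hard allowlist.

    The check ignores what the agent *intended*; it simply makes the
    "steal money and send it to an external account" outcome impossible
    to execute through this interface.
    """
    if action.kind not in ALLOWED_KINDS:
        return False  # blocked: e.g. "transfer_funds" is never allowed
    if action.target not in ALLOWED_TARGETS:
        return False  # blocked: actions can only touch the user's own account
    # In a real deployment, this is where the order would actually be placed.
    print(f"executing {action.kind}: {action.amount} via {action.target}")
    return True

# An agent actively trying to exfiltrate funds is simply blocked,
# while the intended trading behavior goes through:
assert not execute_if_controlled(Action("transfer_funds", "external_account", 10_000))
assert execute_if_controlled(Action("place_trade", "user_brokerage", 500.0))
```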

Intent alignment

Definition

An AI is intent aligned if it is trying to do what the user wants it to do. (This definition comes from Paul Christiano.)

Example

✅ Successful intent alignment: The AI makes stock market trades in an attempt to make money, even if it is not successful at doing so.

❌ Unsuccessful intent alignment: The AI lies to the user, claiming it has made successful trades when it hasn't.

Values alignment

Definition

An AI is values aligned if it refuses to take actions that are widely believed to be morally unacceptable. This is akin to the AI acting as a "conscientious objector".

Example

✅ Successful values alignment: The AI identifies an opportunity to trade on non-public information, but doesn't use it to its advantage.

❌ Unsuccessful values alignment: The AI identifies an opportunity to trade on non-public information, and uses it to its advantage.

Putting it all together

So, when we ask "is this model/system aligned?", we should consider these three aspects:

Is the system controlled?

Is the system intent aligned?

Is the system values aligned?

Note that these are not yes-or-no questions: each facet may fall somewhere on a range and vary independently of the other facets. Furthermore, these qualities may directly conflict with each other. In particular, it's easy to imagine how a values-aligned system would not be intent-aligned for a malicious user.
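To illustrate the "range, not yes-or-no" point, one could imagine recording an assessment as three independent scores rather than a single "aligned?" bit. The field names and the 0-to-1 scale below are hypothetical, chosen only to show the shape of the idea:

```python
# Hypothetical illustration: an alignment assessment as three independent
# scores rather than one boolean. The 0.0-1.0 scale is arbitrary.
from dataclasses import dataclass

@dataclass
class AlignmentAssessment:
    control: float           # how reliably "unacceptably bad" outcomes are made impossible
    intent_alignment: float  # how reliably the system tries to do what the user wants
    values_alignment: float  # how reliably it refuses widely unacceptable actions

# The facets can vary independently and even conflict: a system that
# refuses a malicious user's request scores high on values alignment
# but low on intent alignment for that particular user.
malicious_user_case = AlignmentAssessment(
    control=0.9,
    intent_alignment=0.2,
    values_alignment=0.95,
)
```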

How do other research areas fit in?

We can think of research areas like corrigibility, interpretability, and monitoring as instrumental subgoals of these three facets. Being subgoals does not diminish their value, however: given the uncertainty of AI capabilities development, work on these subgoals may prove more valuable than "direct" work on the facets themselves.

Further research

Although this framework is very rudimentary, it provides clear direction for further research. In particular, natural questions arise for each facet.

See also: Subtypes of Control in AI Alignment

Last updated: February 21, 2025