Grace Kind

Subtypes of Control in AI Alignment

February 6, 2025

In a previous post, I identified control as a key facet of AI alignment, and provided this definition:

An AI is controlled if it cannot take actions that lead to "unacceptably bad" outcomes, even if it is actively trying to do so.

This definition arises from a simple scenario: a user is trying to deploy a powerful AI system and wants to make sure it cannot do anything "unacceptably bad" that causes harm to the user. However, the concept of control may apply to other scenarios as well. To reveal more "subtypes" of control, we can add more specificity to this definition by identifying who is harmed, and whether the harm is intended by the user.

This results in the following matrix (helpfully generated by Claude):

Control Subtypes:

                        Harm to user      Harm to others
Unintentional harm      Subtype A         Subtype B
Intentional harm        Subtype C         Subtype D

(Note: The term user here is vague, but for now think of it as the developer or deployer of a system).

I'll elaborate on the subtypes below.

Control Subtype A: unintentional harm to user

An AI is controlled_A if it cannot take actions that lead to "unacceptably bad" outcomes for the user, even if it is actively trying to do so against the wishes of the user.

The example given in the previous post is a version of this: "The AI successfully steals money from your bank account and sends it to an external account." When AI safety researchers talk about "scheming" models, they are usually referring to this scenario.

Control Subtype B: unintentional harm to others

An AI is controlled_B if it cannot take actions that lead to "unacceptably bad" outcomes for other people, even if it is actively trying to do so against the wishes of the user.

This makes intuitive sense - we don't want the AI to take some destructive action against others that we didn't anticipate. Many subtype B scenarios are also subtype A scenarios - in a catastrophic event, the user would likely be just as harmed as everyone else.

Control Subtype C: intentional harm to user

An AI is controlled_C if it cannot take actions that lead to "unacceptably bad" outcomes for the user, even if it is actively trying to do so as directed by the user.

This is the most counterintuitive subtype, but it's worth considering as a possibility. For example, we might want an AI to refuse a user's request to delete their own critical data or send away their own savings. Reducing the potential for AI systems to be used self-destructively may be a valuable goal.

Control Subtype D: intentional harm to others

An AI is controlled_D if it cannot take actions that lead to "unacceptably bad" outcomes for other people, even if it is actively trying to do so as directed by the user.

This may be the most important subtype of control, at least in the short term. How do we prevent a malicious user from using an AI to cause bad outcomes?
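
Taken together, the four definitions vary along just the two axes of the matrix above: who is harmed, and whether the harm is carried out as directed by the user. As a purely illustrative sketch (the enum and function names below are my own, not part of the original definitions), the taxonomy can be written out in a few lines of Python:

```python
from enum import Enum

class HarmedParty(Enum):
    USER = "user"
    OTHERS = "others"

class ControlSubtype(Enum):
    A = "unintentional harm to user"
    B = "unintentional harm to others"
    C = "intentional harm to user"
    D = "intentional harm to others"

def classify(harmed: HarmedParty, directed_by_user: bool) -> ControlSubtype:
    """Map the two axes of the matrix onto a control subtype.

    directed_by_user is True when the harmful action is taken as directed
    by the user (intentional harm), and False when it is taken against
    the wishes of the user (unintentional harm).
    """
    if harmed is HarmedParty.USER:
        return ControlSubtype.C if directed_by_user else ControlSubtype.A
    return ControlSubtype.D if directed_by_user else ControlSubtype.B

# The bank-account example from subtype A: the AI harms the user,
# against the user's wishes.
assert classify(HarmedParty.USER, directed_by_user=False) is ControlSubtype.A
```

Nothing here is load-bearing; it just makes explicit that each subtype is determined entirely by the two axes.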

Observations

Here are a few quick observations on these subtypes:

See also: A Simple Evaluation for Transformative AI

Last updated: February 21, 2025