Grace Kind

Subtypes of Control in AI Alignment

February 6, 2025

In a previous post, I identified control as a key facet of AI alignment, and provided this definition:

An AI is controlled if it cannot take actions that lead to "unacceptably bad" outcomes, even if it is actively trying to do so.

This definition arises from a simple scenario: a user is trying to deploy a powerful AI system and wants to make sure it cannot do anything "unacceptably bad" that causes harm to the user. However, the concept of control may apply to other scenarios as well. To reveal more "subtypes" of control, we can add more specificity to this definition by identifying who is harmed, and whether the harm is intended by the user.

This results in the following matrix (helpfully generated by Claude):

Control Subtypes:

                        Harm to user      Harm to others
Unintentional harm      Subtype A         Subtype B
Intentional harm        Subtype C         Subtype D

(Note: The term user here is vague, but for now think of it as the developer or deployer of a system).

I'll elaborate on the subtypes below.

Control Subtype A: unintentional harm to user

An AI is controlled_A if it cannot take actions that lead to "unacceptably bad" outcomes for the user, even if it is actively trying to do so against the wishes of the user.

The example given in the previous post is a version of this: "The AI successfully steals money from your bank account and sends it to an external account." When AI safety researchers talk about "scheming" models, they are usually referring to this scenario.

Control Subtype B: unintentional harm to others

An AI is controlled_B if it cannot take actions that lead to "unacceptably bad" outcomes for other people, even if it is actively trying to do so against the wishes of the user.

This makes intuitive sense - we don't want the AI to take some destructive action against others that we didn't anticipate. Many subtype B scenarios are also subtype A scenarios - in a catastrophic event, the user would likely be just as harmed as everyone else.

Control Subtype C: intentional harm to user

An AI is controlled_C if it cannot take actions that lead to "unacceptably bad" outcomes for the user, even if it is actively trying to do so as directed by the user.

This is the most counterintuitive subtype, but it's worth considering as a possibility. For example, we might want an AI to refuse a user's request to delete their own critical data or send away their own savings. Reducing the potential for AI systems to be used self-destructively may be a valuable goal.

Control Subtype D: intentional harm to others

An AI is controlled_D if it cannot take actions that lead to "unacceptably bad" outcomes for other people, even if it is actively trying to do so as directed by the user.

This may be the most important subtype of control, at least in the short term. How do we prevent a malicious user from using an AI to cause bad outcomes?
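
Taken together, the four definitions vary along just the two axes of the matrix above: who is harmed, and whether the harm is carried out as directed by the user. As a purely illustrative sketch (the enum and function names below are my own, not part of the original definitions), the taxonomy can be written out in a few lines of Python:

```python
from enum import Enum

class HarmedParty(Enum):
    USER = "user"
    OTHERS = "others"

class ControlSubtype(Enum):
    A = "unintentional harm to user"
    B = "unintentional harm to others"
    C = "intentional harm to user"
    D = "intentional harm to others"

def classify(harmed: HarmedParty, directed_by_user: bool) -> ControlSubtype:
    """Map the two axes of the matrix onto a control subtype.

    directed_by_user is True when the harmful action is taken as directed
    by the user (intentional harm), and False when it is taken against
    the wishes of the user (unintentional harm).
    """
    if harmed is HarmedParty.USER:
        return ControlSubtype.C if directed_by_user else ControlSubtype.A
    return ControlSubtype.D if directed_by_user else ControlSubtype.B

# The bank-account example from subtype A: the AI harms the user,
# against the user's wishes.
assert classify(HarmedParty.USER, directed_by_user=False) is ControlSubtype.A
```

Nothing here is load-bearing; it just makes explicit that each subtype is determined entirely by the two axes.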

Observations

Here are a few quick observations on these subtypes:

See also: A Simple Evaluation for Transformative AI

Last updated: February 21, 2025