Subtypes of Control in AI Alignment
February 6, 2025

In a previous post, I identified control as a key facet of AI alignment, and provided this definition:
An AI is controlled if it cannot take actions that lead to "unacceptably bad" outcomes, even if it is actively trying to do so.
This definition arises from a simple scenario: a user deploying a powerful AI system wants to make sure it cannot do anything "unacceptably bad" that harms the user. However, the concept of control may apply to other scenarios as well. To reveal more "subtypes" of control, we can make the definition more specific by identifying who is harmed and whether the harm is intentional or unintentional.
This results in the following matrix (helpfully generated by Claude):

| Who is harmed | Unintentional (against the wishes of the user) | Intentional (as directed by the user) |
| --- | --- | --- |
| The user | Subtype A | Subtype C |
| Other people | Subtype B | Subtype D |
(Note: The term user here is vague, but for now think of it as the developer or deployer of a system).
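To make the two axes concrete, here is a minimal sketch in Python of how they map onto the four subtypes. The `Subtype` enum and `classify` function are hypothetical names introduced purely for illustration; the code adds nothing beyond the definitions that follow.

```python
from enum import Enum

class Subtype(Enum):
    A = "unintentional harm to user"
    B = "unintentional harm to others"
    C = "intentional harm to user"
    D = "intentional harm to others"

def classify(harms_user: bool, directed_by_user: bool) -> Subtype:
    """Map the matrix's two axes to a control subtype.

    harms_user:       the "unacceptably bad" outcome falls on the user (True)
                      or on other people (False).
    directed_by_user: the AI is acting as directed by the user (True)
                      or against the user's wishes (False).
    """
    if harms_user:
        return Subtype.C if directed_by_user else Subtype.A
    return Subtype.D if directed_by_user else Subtype.B

# A malicious user directing an AI to harm other people falls under subtype D.
assert classify(harms_user=False, directed_by_user=True) is Subtype.D
```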
I'll elaborate on the subtypes below.
Control Subtype A: unintentional harm to user
An AI is controlled_A if it cannot take actions that lead to "unacceptably bad" outcomes for the user, even if it is actively trying to do so against the wishes of the user.
The example given in the previous post is a version of this. "The AI successfully steals money from your bank account and sends it to an external account." When AI safety researchers talk about "scheming" models, they are usually referring to this scenario.
Control Subtype B: unintentional harm to others
An AI is controlled_B if it cannot take actions that lead to "unacceptably bad" outcomes for other people, even if it is actively trying to do so against the wishes of the user.
This makes sense - we don't want the AI to take some sort of destructive action that we didn't anticipate. Many subtype B scenarios are also subtype A scenarios - in a catastrophic event, the user would likely be just as harmed as everyone else.
Control Subtype C: intentional harm to user
An AI is controlled_C if it cannot take actions that lead to "unacceptably bad" outcomes for the user, even if it is actively trying to do so as directed by the user.
This is the most counterintuitive subtype, but it's worth considering as a possibility. Reducing the potential for AI systems to be used self-destructively may be a valuable goal.
Control Subtype D: intentional harm to others
An AI is controlled_D if it cannot take actions that lead to "unacceptably bad" outcomes for other people, even if it is actively trying to do so as directed by the user.
This may be the most important subtype of control, at least in the short term. How do we prevent a malicious user from using an AI to cause bad outcomes?
Observations
Here are a few quick observations on these subtypes:
- If imposing controls is costly, we should expect most self-imposed controls to be aimed at preventing subtype A scenarios. Put another way, AI deployers are primarily incentivized to solve control subtype A.
- Subtype B scenarios can become subtype A scenarios if AI deployers are directly liable for harms caused.
- Subtypes C and D are solved against the wishes of deployers - therefore we should expect solutions to these subtypes to come from external rather than internal controls.
- Subtype D is probably most important in the short term (for close-to-human-level systems), and subtype B is most important in the long term (for superhuman-level systems).
See also: A Simple Evaluation for Transformative AI
Last updated: February 21, 2025