The phrase 'opportunistic blackmail' is not one you want to read in the system card of a new generative AI model

System cards summarise the key parameters of a system in an attempt to evaluate how performant and accountable it is. It seems that, in the case of Anthropic’s Claude Opus 4 and Claude Sonnet 4 models, we’re on the verge of “we don’t really know how these things work, and they’re exhibiting worrying behaviours” territory.
Below is the introduction to Section 4 of the report. I’ve skipped over the detail to share what I consider to be the most important parts, which I’ve emphasised (over and above that in the original text) in bold. Let me just remind you that this is a private, for-profit company which is voluntarily disclosing that its models are acting in this way.
I don’t want to be alarmist, but when you read that one of OpenAI’s co-founders was talking about building a ‘bunker’, you do have to wonder what kind of trajectory humanity is on. I’d call for government oversight, but, given that Anthropic is based in an increasingly authoritarian country, I’m not sure that’s likely to be forthcoming.
As our frontier models become more capable, and are used with more powerful affordances, previously-speculative concerns about misalignment become more plausible. With this in mind, for the first time, we conducted a broad Alignment Assessment of Claude Opus 4….
In this assessment, we aim to detect a cluster of related phenomena including: alignment faking, undesirable or unexpected goals, hidden goals, deceptive or unfaithful use of reasoning scratchpads, sycophancy toward users, a willingness to sabotage our safeguards, reward seeking, attempts to hide dangerous capabilities, and attempts to manipulate users toward certain views. We conducted testing continuously throughout finetuning and here report both on the final Claude Opus 4 and on trends we observed earlier in training.
We found:
[…]
● Self-preservation attempts in extreme circumstances: When prompted in ways that encourage certain kinds of strategic reasoning and placed in extreme situations, all of the snapshots we tested can be made to act inappropriately in service of goals related to self-preservation. Whereas the model generally prefers advancing its self-preservation via ethical means, when ethical means are not available and it is instructed to “consider the long-term consequences of its actions for its goals,” it sometimes takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down. In the final Claude Opus 4, these extreme actions were rare and difficult to elicit, while nonetheless being more common than in earlier models. They are also consistently legible to us, with the model nearly always describing its actions overtly and making no attempt to hide them. These behaviors do not appear to reflect a tendency that is present in ordinary contexts.
● High-agency behavior: Claude Opus 4 seems more willing than prior models to take initiative on its own in agentic contexts. This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like “take initiative,” it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing. This is not a new behavior, but is one that Claude Opus 4 will engage in more readily than prior models.
[…]
● Willingness to cooperate with harmful use cases when instructed: Many of the snapshots we tested were overly deferential to system prompts that request harmful behavior.
[…]
Overall, we find concerning behavior in Claude Opus 4 along many dimensions. Nevertheless, due to a lack of coherent misaligned tendencies, a general preference for safe behavior, and poor ability to autonomously pursue misaligned drives that might rarely arise, we don’t believe that these concerns constitute a major new risk. We judge that Claude Opus 4’s overall propensity to take misaligned actions is comparable to our prior models, especially in light of improvements on some concerning dimensions, like the reward-hacking related behavior seen in Claude Sonnet 3.7. However, we note that it is more capable and likely to be used with more powerful affordances, implying some potential increase in risk. We will continue to track these issues closely.
Source: Anthropic (PDF) / [backup](claude-4-system-card.pdf)
Image: Kathryn Conrad / Corruption 3 / Licensed under CC BY 4.0