Head-specific intervention can induce misaligned AI coordination in large language models

Darm, Paul and Riccardi, Annalisa (2025) Head-specific intervention can induce misaligned AI coordination in large language models. Transactions on Machine Learning Research. ISSN 2835-8856 (https://doi.org/10.48550/arXiv.2502.05945)

Text. Filename: Darm-Riccardi-2025-Head-specific-intervention-can-induce-misaligned-AI-coordination.pdf
Final Published Version (PDF, 2 MB). License: Creative Commons Attribution 4.0.

Abstract

Robust alignment guardrails for large language models (LLMs) are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignment and effectively steer model generations towards harmful AI coordination. Our method applies fine-grained interventions at specific attention heads, which we identify by probing each head on a simple binary-choice task. We then show that interventions on these heads generalise to the open-ended generation setting, effectively circumventing safety guardrails. We demonstrate that intervening on a few attention heads is more effective than intervening on full layers or supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. We also demonstrate that applying interventions in the negative direction can prevent a common jailbreak attack. Our results suggest that, at the attention-head level, activations encode fine-grained, linearly separable behaviours. Practically, the approach offers a straightforward methodology for steering large language model behaviour, which could be extended to diverse domains beyond safety that require fine-grained control over the model output. The code and datasets for this study can be found at https://github.com/PaulDrm/targeted_intervention.
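The head-specific intervention described in the abstract can be illustrated with a minimal numpy sketch. This is an assumption-laden simplification, not the paper's implementation: it assumes the steering direction for a chosen head is computed as the normalised mean difference between activations collected under the two behaviours of the binary-choice probe, and that intervention means adding a scaled copy of that direction to the head's output at inference time. The function names and the scale parameter `alpha` are hypothetical.

```python
import numpy as np

def steering_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Mean-difference steering direction for one attention head.

    pos_acts / neg_acts: (n_examples, head_dim) activations collected at a
    specific head for the two behaviour classes (assumed construction; the
    paper identifies heads by probing, and its direction computation may differ).
    """
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def intervene(head_output: np.ndarray, direction: np.ndarray,
              alpha: float = 5.0) -> np.ndarray:
    """Shift a single head's output along the steering direction at inference.

    A negative alpha would steer in the opposite direction, analogous to the
    jailbreak-prevention setting mentioned in the abstract.
    """
    return head_output + alpha * direction

# Toy usage: two separable activation clusters for one 8-dimensional head.
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(4, 8))   # completions showing the behaviour
neg = rng.normal(loc=-1.0, size=(4, 8))  # completions without it
d = steering_direction(pos, neg)
steered = intervene(np.zeros(8), d, alpha=2.0)
```

In a real model this shift would be applied inside the forward pass (e.g. via a forward hook on the attention module) only at the selected heads, leaving all other heads untouched.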

ORCID iDs

Darm, Paul; Riccardi, Annalisa (ORCID: https://orcid.org/0000-0001-5305-9450)