Head-specific intervention can induce misaligned AI coordination in large language models
Darm, Paul and Riccardi, Annalisa (2025) Head-specific intervention can induce misaligned AI coordination in large language models. Transactions on Machine Learning Research. ISSN 2835-8856 (https://doi.org/10.48550/arXiv.2502.05945)
Text.
Filename: Darm-Riccardi-2025-Head-specific-intervention-can-induce-misaligned-AI-coordination.pdf
Final Published Version License:
Abstract
Robust alignment guardrails for large language models (LLMs) are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination. Our method applies fine-grained interventions at specific attention heads, which we identify by probing each head in a simple binary choice task. We then show that interventions on these heads generalise to the open-ended generation setting, effectively circumventing safety guardrails. We demonstrate that intervening on a few attention heads is more effective than intervening on full layers or supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. We also demonstrate that applying interventions in the negative direction can prevent a common jailbreak attack. Our results suggest that, at the attention head level, activations encode fine-grained linearly separable behaviours. Practically, the approach offers a straightforward methodology to steer large language model behaviour, which could be extended to diverse domains beyond safety that require fine-grained control over the model output. The code and datasets for this study can be found at https://github.com/PaulDrm/targeted_intervention.
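The abstract does not spell out how a steering direction is computed or applied. As a rough illustration only, the sketch below uses a common mean-difference construction: the direction is the normalised difference between the mean head activation on examples that show the target behaviour and the mean on examples that do not, and the intervention adds a scaled copy of it to one head's output. This construction, the function names (`mean_vector`, `steering_direction`, `intervene`), the strength parameter `alpha`, and the toy numbers are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
# Illustrative sketch only: mean-difference steering on a single attention head.
# Every name and number here is hypothetical, not the authors' implementation.
from math import sqrt


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean negative activation to the mean positive one."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means coincide")
    return [d / norm for d in diff]


def intervene(head_output, direction, alpha):
    """Shift one head's output along the direction; a negative alpha steers away from it."""
    return [h + alpha * d for h, d in zip(head_output, direction)]


# Toy 2-dimensional head activations (made-up values).
pos = [[1.0, 0.0], [3.0, 0.0]]
neg = [[-1.0, 0.0], [-3.0, 0.0]]
direction = steering_direction(pos, neg)
steered = intervene([0.5, 0.5], direction, alpha=2.0)
```

In this toy setting, flipping the sign of `alpha` moves the activation the opposite way along the same direction, which is the sense in which the abstract speaks of applying interventions "in the negative direction".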
ORCID iDs
Darm, Paul and Riccardi, Annalisa
ORCID: https://orcid.org/0000-0001-5305-9450;
Item type: Article
ID code: 94778
Dates: 30 August 2025 (Published); 1 May 2025 (Accepted); 9 March 2025 (Submitted)
Subjects: Science > Mathematics > Electronic computers. Computer science
Department: Faculty of Engineering > Mechanical and Aerospace Engineering; Strategic Research Themes > Ocean, Air and Space
Depositing user: Pure Administrator
Date deposited: 21 Nov 2025 11:54
Last modified: 14 Jan 2026 01:31
URI: https://strathprints.strath.ac.uk/id/eprint/94778