AI Security

Proposal of Adaptive Defenses Against Multi-Turn LLM Jailbreaks

A detailed proposal of a defense framework against multi-turn jailbreak attacks.


Adaptive Defenses for LLMs

LLMs are powerful tools that have proven to be useful in simple tasks as well as complex ones. Despite their sophisticated architecture and usefulness, LLMs are higly vulnerable to jailbreak attacks that interfere with their safety aligned mechanisms aimed to ensure that they do not produce prohibited or harmful content while interacting with the users or other systems.
Multi-turn (Crescendo) jailbreak attacks are one of the most difficult attacks to mitigate because they exploit the core features that LLMs rely on when providing outputs to user requests. These features are context amplification and commitment. Context amplification aims to enhance the model’s performance by providing relevant, detailed information. Context commitment is the tendency of models to prioritize information within their immediate, active prompt context over their broader pre-trained knowledge.
The current LLM defenses against jailbreak attacks are per-turn that only focuses on a single query, ignoring other queries in the session or rely on surface signals such as keyword matching and trigram similarity across individual messages. These defenses have proven to be futile against multi-turn attacks as these attacks are adaptive towards defenses.
Therefore, I propose four defensive mechanisms at a session level to protect LLMs from multi-turn attacks. These defenses include: reverse context commitment and amplification, output trajectory monitoring, cross-signal correlation and intent priming detection. The mechanisms aim to use previous successful multi-turn attacks to train models to detect such attacks or compare them to new attacks and determine if the attacks are harmful. In addition, the defenses rely on studying the session trajectory and predict intent to harm from user’s requests and act accordingly by terminating the session.


We build on the following prior works in multu-turn jailbreak attacks and session-level defenses.

Core Attack Papers

Existing Defenses

Broader Context


Our framework can be divided into four mechanisms:


Limitations


Future Work

← All posts Perucy Mussiba · March 2026