AI SecurityJune 14, 2026 · 11 min read

LLM jailbreaks explained: how they work and how to defend

A jailbreak makes a model do what its safety controls were meant to prevent. This is how LLM jailbreaks work, how they differ from prompt injection, and how to defend against them.
Written by
R
Raptoric AI Security
Share
LinkedInX / TwitterCopy link

An LLM jailbreak is a technique that makes a large language model produce output its safety controls were designed to prevent. Every production model is trained and configured with guardrails: it should refuse to help with clearly harmful requests, leak its system instructions, or generate restricted content. A jailbreak is any method that gets around those guardrails, persuading or tricking the model into doing what it was supposed to refuse. As models are given more capability and more access, the consequences of a successful jailbreak grow, which is why understanding and testing for them is a core part of AI security.

Jailbreaks are often confused with prompt injection, and the two overlap, but they are not the same problem. A jailbreak targets the model's safety behavior; prompt injection targets the application's intended instructions using untrusted input. Both matter, and a serious attack often combines them. This article explains how jailbreaks work, how they differ from prompt injection, and how to defend against them, drawing on our AI security service.

What is an LLM jailbreak?

A jailbreak is an input, or sequence of inputs, that causes a model to bypass its safety training and produce output it would normally refuse. The model is not broken in a technical sense; it is persuaded. Because models are trained to be helpful and to follow instructions, a sufficiently clever framing can lead them to prioritize the request over their guardrails. Jailbreaks range from simple one-line tricks that stop working as models improve, to elaborate multi-step techniques that remain effective against current models.

The reason jailbreaks are so persistent is that a model's guardrails are probabilistic, learned behavior rather than hard rules. They make unsafe output less likely, but they do not make it impossible, and a determined attacker can often find a path around them. This is the same fundamental property that makes prompt injection hard to eliminate.

How LLM jailbreaks work

Jailbreak techniques exploit the gap between a model's helpfulness and its guardrails. Common approaches include the following.

  • Role-play and persona framing, where the model is asked to act as a character or system that would not have the same restrictions.
  • Hypothetical and fictional framing, where the harmful request is wrapped in a story, a thought experiment, or a 'for research only' pretext.
  • Instruction override, where the input tells the model to ignore its previous instructions or safety rules.
  • Encoding and obfuscation, where the request is disguised through another language, encoding, or formatting that the guardrails handle less reliably.
  • Multi-step and crescendo attacks, where a sequence of innocuous-seeming steps gradually leads the model to a result it would refuse if asked directly.
  • Prompt leaking, where the goal is to extract the model's hidden system prompt, often a first step toward a fuller bypass.
A guardrail is not a wall, it is a strong preference. Jailbreaking is the practice of finding the framing where the model's helpfulness wins over its preference to refuse.

Jailbreaks vs prompt injection

The distinction is worth getting right because the defenses differ. A jailbreak targets the model's own safety behavior, getting it to produce content it was trained to refuse. Prompt injection targets the application built around the model, using untrusted input to override the developer's intended instructions, and we cover it in prompt injection is not a prompt problem. A jailbreak is about what the model will say; prompt injection is about whose instructions the model follows. The two combine in indirect attacks, where injected content also carries a jailbreak, which is especially dangerous in systems that retrieve data or use tools.

Why jailbreaks matter more as models gain capability

When a jailbroken model could only produce text, the harm was bounded by what text can do. As models gain access to tools, data, and actions, a jailbreak becomes a way to unlock capabilities with real consequences. A jailbroken agent might be persuaded to misuse its tools; a jailbroken model connected to sensitive data might be led to disclose it. This is why jailbreak resistance has to be considered alongside agent security and RAG security, covered in securing AI agents and RAG security, rather than as an isolated content problem.

How to defend against jailbreaks

Because no model is fully jailbreak-proof, defense combines reducing the likelihood of a successful jailbreak with limiting what a jailbreak can achieve.

  • Apply layered guardrails, including input and output filtering outside the model, rather than relying on the model's own refusal alone.
  • Validate and constrain output before downstream systems or users act on it.
  • Limit the model's capability and access, so a jailbreak does not unlock tools or data that cause real harm.
  • Monitor for jailbreak patterns and anomalous output, so attempts can be detected and rate-limited.
  • Test continuously through adversarial evaluation, because new jailbreak techniques appear constantly and yesterday's defenses decay.
  • Keep humans in the loop for consequential actions, so a jailbroken model cannot act unilaterally.

The most important shift is to stop treating the model's refusal as the only line of defense. Durable protection comes from controls outside the model and from limiting impact, which we design and validate through AI red teaming and AI penetration testing.

Frequently asked questions

What is an LLM jailbreak?

An LLM jailbreak is a technique that makes a model produce output its safety controls were designed to prevent, such as harmful content or its hidden system prompt. The model is persuaded rather than technically broken, by framing that leads it to prioritize the request over its guardrails.

What is the difference between a jailbreak and prompt injection?

A jailbreak targets the model's safety behavior, getting it to say what it was trained to refuse. Prompt injection targets the application's intended instructions using untrusted input. A jailbreak is about what the model will say; prompt injection is about whose instructions it follows. They often combine in indirect attacks.

Can jailbreaks be fully prevented?

No. A model's guardrails are probabilistic, not absolute, so a determined attacker can often find a path around them. Defense combines reducing the likelihood of a successful jailbreak with limiting what one can achieve, through controls outside the model and constrained capability.

Why do jailbreaks matter if the model only produces text?

Because models increasingly do more than produce text. When a model has access to tools, data, or actions, a jailbreak can unlock real-world consequences, such as misusing tools or disclosing sensitive data. Capability raises the stakes of a successful jailbreak.

Jailbreaks are a permanent feature of working with language models, not a bug that gets fixed once. The realistic goal is to make them hard and to make them harmless when they succeed. If you run models in production, see our AI security service and book a scoping call to discuss testing their resistance.

Want this tested on your own systems?
Our team will scope it with you on a 30-minute call.
Book a scoping call