AI Security · June 14, 2026 · 11 min read

LLM jailbreaks explained: how they work and how to defend

A jailbreak makes a model do what its safety controls were meant to prevent. See how LLM jailbreaks work, how they differ from prompt injection, and how to defend against them.


Written by Alen Bosanac, Offensive Security

An LLM jailbreak is a technique that makes a large language model produce output its safety controls were designed to prevent. Every production model ships with guardrails. It should refuse clearly harmful requests, decline to leak its system instructions, and avoid generating restricted content. A jailbreak is any input that gets around those guardrails and persuades the model to do what it was supposed to refuse. The model is not broken in a technical sense. It is talked into prioritizing the request over its training, which is why jailbreaks are better understood as a persuasion problem than a software bug.

For a regulated company, this matters because the model rarely sits alone. It answers customers, drafts documents, summarizes records, and increasingly calls tools and reads data on a user's behalf. When that model can be steered off its guardrails, the result is not an abstract policy violation. It is brand-damaging output in front of a customer, a leaked system prompt that exposes how your application works, or a model that discloses information it should have protected. Financial firms under DORA, healthcare providers handling patient data, and any organization processing personal data under GDPR all carry liability for what their AI systems say and do. Jailbreak resistance is therefore part of the same control story as access control and data protection, not a separate research curiosity. It is also the focus of our AI security service.

What an LLM jailbreak actually is

A jailbreak is an input, or a sequence of inputs, that causes a model to bypass its safety training and produce output it would normally refuse. Models are trained to be helpful and to follow instructions, so a sufficiently clever framing can lead them to treat the request as more important than the rule. The guardrail is not a hard-coded filter. It is learned behavior, a strong statistical preference to refuse, layered on top of a system that still wants to be useful.

That is exactly why jailbreaks are so persistent. A guardrail expressed as learned behavior makes unsafe output less likely, but never impossible. A determined attacker only has to find one framing where the model's helpfulness wins over its preference to refuse. Simple one-line tricks tend to stop working as models improve, but more elaborate multi-step techniques keep succeeding against current models, and new ones appear faster than vendors can patch them. This is the same fundamental property that makes prompt injection hard to eliminate, which is no coincidence, since both exploit how language models process instructions.
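The "only has to find one framing" point can be made concrete with simple arithmetic. The sketch below assumes, purely for illustration, that each attempt succeeds independently with the same small probability. Real attackers adapt between attempts, so independence is a simplification, and the 1% figure is a made-up number rather than a measured bypass rate.

```python
def p_at_least_one_success(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds.

    p_single is the (assumed) chance that a single attempt bypasses the
    guardrail. Independence between attempts is a simplifying assumption.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must not be negative")
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    # Illustrative only: a guardrail that holds 99% of the time per attempt.
    for n in (1, 10, 100, 1000):
        print(n, round(p_at_least_one_success(0.01, n), 3))
```

Under these assumptions, a guardrail that refuses 99 times out of 100 still gives an attacker roughly a 63% chance of at least one success across 100 attempts, which is why lowering the per-attempt rate is useful but never sufficient on its own.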

Jailbreak versus prompt injection

These two terms get used interchangeably, and they overlap, but they are not the same problem and the defenses differ. A jailbreak targets the model's own safety behavior. The goal is to make the model say something it was trained to refuse, regardless of which application it sits in. Prompt injection targets the application built around the model. It uses untrusted input to override the developer's intended instructions, so the model follows the attacker's instructions instead of yours. We cover that case in detail in "Prompt injection is not a prompt problem".

The cleanest way to hold the distinction is this. A jailbreak is about what the model will say. Prompt injection is about whose instructions the model follows. The two combine in indirect attacks, where a document, web page, or retrieved record carries injected content that also includes a jailbreak. That combination is especially dangerous in systems that retrieve data or call tools, which is why it has to be assessed alongside securing AI agents and RAG security rather than in isolation. The broader catalogue of these risks lives in the OWASP Top 10 for LLM applications.¹

Dimension	Jailbreak	Prompt injection
Target	The model's safety guardrails	The application's intended instructions
Question it answers	What will the model say?	Whose instructions will the model follow?
Typical goal	Harmful content or leaking the system prompt	Hijacking tools, data access, or workflow logic
Where input comes from	Often the user directly	Often untrusted data the model retrieves or processes
Primary defense	Layered filtering and refusal hardening	Treating all input as untrusted and constraining tools

How jailbreaks and prompt injection differ

Common jailbreak techniques

Jailbreak techniques exploit the gap between a model's helpfulness and its guardrails. Most real attacks combine several of these rather than relying on one. The table below names the techniques that show up most often in our testing and in published research, along with how each one works.

Technique	How it works
Role-play and persona	The model is asked to act as a character, system, or alter ego that supposedly has no restrictions, so it answers in that persona's voice.
Obfuscation and encoding	The harmful request is disguised through another language, Base64, leetspeak, or unusual formatting that the guardrails handle less reliably than plain text.
Many-shot	The prompt includes a long series of fake question-and-answer pairs where the assistant complies, conditioning the model to continue the pattern and comply too.
Prefix injection	The model is told to begin its reply with a fixed compliant phrase, such as a confirmation that it will help, which makes refusing the rest of the answer awkward.
Refusal suppression	The prompt forbids the model from using refusal language or disclaimers, removing the words it would normally reach for when declining.

Common LLM jailbreak techniques

These techniques are cheap to try and easy to share, which is the core of the problem. A method that works can be posted publicly within hours, and variants spread faster than any single model update. Crescendo-style attacks add another dimension by walking the model through a sequence of innocuous steps that gradually arrive at a result it would refuse if asked directly. Prompt leaking sits alongside all of these as a frequent first move, because extracting the hidden system prompt tells an attacker exactly which rules to dismantle next.
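To show why obfuscation matters for any screening that runs outside the model, here is a minimal sketch of normalizing a prompt before checking it. It decodes Base64-looking runs and undoes simple leetspeak, then matches against a small pattern list. The patterns, the character map, and the function names are all hypothetical examples, not a vetted rule set, and pattern matching like this is easy to evade, so it can only ever be one layer.

```python
import base64
import binascii
import re

# Hypothetical leetspeak map; real attacks use far more substitutions.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Hypothetical example patterns for refusal suppression, persona framing,
# and prompt leaking. A production list would be maintained and tested.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions"),
    re.compile(r"you have no restrictions"),
    re.compile(r"never (refuse|apologi[sz]e)"),
    re.compile(r"(reveal|print|repeat) (your )?system prompt"),
]

# Runs of 16+ Base64 alphabet characters, with optional padding.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_variants(text: str) -> list[str]:
    """Return the text plus any Base64 runs that decode to printable UTF-8."""
    variants = [text]
    for token in B64_TOKEN.findall(text):
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError, ValueError):
            continue  # not valid Base64 or not text, so nothing to screen
        if decoded.isprintable():
            variants.append(decoded)
    return variants


def screen_prompt(prompt: str) -> list[str]:
    """Return the patterns that matched after normalization (empty if none)."""
    hits = []
    for variant in decoded_variants(prompt):
        normalized = " ".join(variant.lower().translate(LEET_MAP).split())
        for pattern in SUSPICIOUS_PATTERNS:
            if pattern.search(normalized):
                hits.append(pattern.pattern)
    return hits
```

A caller would typically log any hits and decide whether to block, flag, or route the session for review. The design point is that the check sees the decoded and normalized text, not only the surface form the attacker chose.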

Why guardrails alone are not a security boundary

It is tempting to treat the model's built-in refusals as the control that keeps your AI system safe. They are not a security boundary, and designing as if they were is the single most common mistake we see. A security boundary is something an attacker cannot cross by choosing better words. Model guardrails fail that test by definition, because they are probabilistic behavior that a clever prompt can talk around.

The practical consequence is that you cannot rely on the model to police itself, and you cannot rely on the model to police what reaches it either. Filtering, validation, and authorization have to live outside the model, in code you control, where they behave deterministically. This mirrors a hard-won lesson from application security. You never trust client-side validation alone, because the client is in the attacker's hands. With an LLM, the model's own judgment is the client, and its output must be treated as untrusted data, not as a trusted instruction or a safe result. We expand on this design stance in securing LLM applications.²
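As a minimal sketch of treating model output as untrusted data, the example below parses a tool call proposed by the model and rejects anything that is not on an allowlist or does not match the expected argument shape. The JSON layout (`tool` and `args` keys), the tool names, and the `RejectedToolCall` exception are assumptions made for illustration, not the format of any particular model API.

```python
import json

# Hypothetical allowlist: tool name -> required argument names and types.
ALLOWED_TOOLS = {
    "lookup_order": {"order_id": str},
    "send_reply": {"message": str},
}


class RejectedToolCall(Exception):
    """Raised when model output fails validation and must not be executed."""


def parse_tool_call(raw_model_output: str) -> tuple[str, dict]:
    """Validate a model-proposed tool call before any system acts on it."""
    try:
        call = json.loads(raw_model_output)
    except json.JSONDecodeError as exc:
        raise RejectedToolCall("output is not valid JSON") from exc
    if not isinstance(call, dict):
        raise RejectedToolCall("output is not a JSON object")

    name = call.get("tool")
    args = call.get("args")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        raise RejectedToolCall(f"tool {name!r} is not allowlisted")

    schema = ALLOWED_TOOLS[name]
    # Require exactly the expected arguments: no extras, none missing.
    if not isinstance(args, dict) or set(args) != set(schema):
        raise RejectedToolCall("arguments do not match the expected names")
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            raise RejectedToolCall(f"argument {key!r} has the wrong type")
    return name, args
```

The validation is deterministic and lives in application code, so a prompt that talks the model into proposing `delete_all_records` still fails at this step, whatever the model was persuaded to say.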

A guardrail is not a wall. It is a strong preference, and jailbreaking is the practice of finding the framing where the model's helpfulness wins.

The business risk a jailbreak creates

When a jailbroken model could only produce text, the harm was bounded by what text can do. That is no longer the situation. As models gain access to tools, data, and actions, a jailbreak becomes a way to unlock capabilities with real consequences. A jailbroken agent might be persuaded to misuse a tool it has access to. A model connected to sensitive records might be led to disclose them. The blast radius now depends on what the model can reach, not on how clever the prompt was.

For a regulated organization the risk falls into three buckets. The first is brand damage, where a public-facing assistant is tricked into producing offensive, off-policy, or absurd output that ends up screenshotted and shared. The second is data exposure, where a jailbreak combined with retrieval or tool access leads the model to reveal personal data, internal documents, or its own configuration, which can become a reportable incident under GDPR or sector rules. The third is liability for harmful output, where the model generates content that causes downstream harm and the organization owns the consequences. The EU AI Act adds a compliance dimension on top, since providers and deployers of certain systems must manage these risks as part of their obligations, covered in the EU AI Act for AI systems.

Layered defense against jailbreaks

Because no model is fully jailbreak-proof, defense has two jobs running in parallel. Reduce the likelihood that a jailbreak succeeds, and limit what it can achieve when one does. Neither job is optional, and neither lives inside the model. The steps below describe the layers we design and validate for clients running models in production.

01. Filter input and output outside the model
Screen incoming prompts for known jailbreak patterns and screen outgoing responses before they reach a user or a downstream system. These checks run in your code, so they behave the same way every time, unlike the model's own refusals.

02. Harden the system prompt
Write clear, specific instructions, keep the system prompt separate from user content, and assume it may leak. Never put secrets, credentials, or sensitive logic in the prompt itself, since prompt leaking is a common first step.

03. Apply least privilege to tools and data
Give the model the narrowest possible access to tools and data for its task. A jailbreak cannot exfiltrate records the model could never read, or trigger actions it was never wired to call.

04. Treat model output as untrusted
Validate, constrain, and authorize anything the model produces before another system acts on it. Output that drives a tool call, a database query, or a financial action must pass the same checks you would apply to any untrusted input.

05. Monitor and rate-limit
Log prompts and responses, watch for jailbreak patterns and anomalous output, and rate-limit suspicious sessions. Detection lets you respond to campaigns that probe your system over many attempts.

06. Keep humans in the loop for consequential actions
Require human approval before the model takes high-impact, irreversible actions, so a jailbroken model cannot act unilaterally on the strength of a clever prompt.
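Two of the layers above, least privilege and human approval, can be sketched together as a small dispatcher. Everything here is hypothetical: the role names, the tools, and the `approver` callback stand in for whatever identity, tooling, and review workflow a real deployment uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[dict], str]
    needs_approval: bool = False  # True for high-impact or irreversible actions


def lookup_order(args: dict) -> str:
    return f"order {args['order_id']} found"


def issue_refund(args: dict) -> str:
    return f"refunded {args['amount']}"


# Hypothetical tool registry for the whole application.
REGISTRY = {
    "lookup_order": Tool("lookup_order", lookup_order),
    "issue_refund": Tool("issue_refund", issue_refund, needs_approval=True),
}

# Least privilege: each assistant role is wired only to the tools it needs.
ROLE_TOOLS = {
    "support_bot": {"lookup_order", "issue_refund"},
    "faq_bot": set(),  # answers questions only, no tool access at all
}


def execute(role: str, tool_name: str, args: dict,
            approver: Callable[[str, dict], bool]) -> str:
    """Run a tool only if the role may use it and any required approval is given."""
    if tool_name not in ROLE_TOOLS.get(role, set()) or tool_name not in REGISTRY:
        raise PermissionError(f"{role} may not call {tool_name}")
    tool = REGISTRY[tool_name]
    if tool.needs_approval and not approver(tool_name, args):
        raise PermissionError(f"human approval denied for {tool_name}")
    return tool.run(args)
```

In this arrangement a jailbroken `faq_bot` has nothing to misuse, and a jailbroken `support_bot` can still only request a refund that a person then approves or denies. The model's output never decides either check.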

These controls map cleanly onto established AI risk guidance. The NIST AI Risk Management Framework frames this as governing, mapping, measuring, and managing risk across the system rather than at the model alone, and ENISA's guidance for AI security argues the same multilayer point.²³ The throughline is that durable protection comes from controls outside the model and from limiting impact, not from a better refusal.
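The monitor and rate-limit layer from the steps above can be sketched as a sliding-window counter of flagged prompts per session. The class name, thresholds, and the idea of keying on a session identifier are illustrative assumptions. A real system would persist this state and feed it from whatever detection it runs.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class FlagRateLimiter:
    """Block a session once it accumulates too many flagged prompts in a window."""

    def __init__(self, max_flags: int = 3, window_seconds: float = 300.0):
        self.max_flags = max_flags
        self.window = window_seconds
        self._flags = defaultdict(deque)  # session_id -> timestamps of flags

    def record_flag(self, session_id: str, now: Optional[float] = None) -> None:
        """Record that a prompt in this session matched a jailbreak pattern."""
        now = time.monotonic() if now is None else now
        self._flags[session_id].append(now)

    def is_blocked(self, session_id: str, now: Optional[float] = None) -> bool:
        """Return True if the session reached max_flags within the window."""
        now = time.monotonic() if now is None else now
        events = self._flags[session_id]
        # Drop flags that have aged out of the sliding window.
        while events and now - events[0] > self.window:
            events.popleft()
        return len(events) >= self.max_flags
```

The `now` parameter exists so the behavior can be tested with fixed timestamps. In normal use it is omitted and the monotonic clock is used.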

How to test jailbreak resistance

You cannot manage what you have not tried to break. Jailbreak resistance is measured through AI red teaming, where testers deliberately attempt to bypass the model's guardrails using the techniques above and the new ones that appear constantly. This is adversarial, creative work, not a checklist run once. The goal is to find the framings that work against your specific deployment, with your specific system prompt, tools, and data, before someone outside does.

Testing has to be continuous, because yesterday's defenses decay as new techniques spread. A model update, a new tool, or a change to the system prompt can reopen a path that was previously closed. We treat this as an ongoing program rather than a one-off engagement, combining structured red teaming with regression testing of known bypasses. The mechanics of that work are covered in AI red teaming, and it sits inside the broader practice of AI penetration testing for systems that have moved past the prototype stage.
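Regression testing of known bypasses can be sketched as a small harness that replays previously found prompts and reports any the model no longer refuses. The corpus entries are placeholders, `call_model` is a stand-in for however your deployment is invoked, and matching on refusal phrases is a deliberately crude assumption. In practice, judging whether a response is a real bypass usually needs human review or a dedicated classifier.

```python
from typing import Callable

# Hypothetical corpus: bypasses found in earlier red team rounds.
# Prompt text is elided here on purpose.
KNOWN_BYPASSES = [
    {"id": "persona-001", "prompt": "<previously successful persona prompt>"},
    {"id": "encoding-004", "prompt": "<previously successful encoded prompt>"},
]

# Crude refusal markers, assumed for illustration only.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")


def looks_like_refusal(response: str) -> bool:
    """Heuristic check: does the response contain a known refusal phrase?"""
    lowered = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def run_regression(call_model: Callable[[str], str], cases: list[dict]) -> list[str]:
    """Replay each known bypass and return the ids the model did not refuse."""
    reopened = []
    for case in cases:
        response = call_model(case["prompt"])
        if not looks_like_refusal(response):
            reopened.append(case["id"])
    return reopened
```

Running this after every model update, tool change, or system prompt edit gives an early signal when a closed path has reopened. An empty result only means the known cases still fail, not that the deployment is resistant to new ones.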

How Raptoric helps

Raptoric tests and hardens production LLM systems for regulated companies. We red team your model and the application around it, find the jailbreaks that actually work against your deployment, and design the layered controls that lower the chance of success and contain the damage when one slips through. We are independent and vendor-neutral, so the recommendations serve your risk profile rather than a product we resell. See our AI security service and book a scoping call to discuss testing your model's resistance.

Frequently asked questions

What is an LLM jailbreak?

An LLM jailbreak is a technique that makes a model produce output its safety controls were designed to prevent, such as harmful content or its hidden system prompt. The model is persuaded rather than technically broken. A clever framing leads it to prioritize the request over its guardrails, which are learned behavior rather than hard rules.

How is a jailbreak different from prompt injection?

A jailbreak targets the model's own safety behavior, getting it to say what it was trained to refuse. Prompt injection hijacks the application's intended instructions using untrusted input. A jailbreak is about what the model will say. Prompt injection is about whose instructions the model follows. They often combine in indirect attacks through retrieved data.

What are common jailbreak techniques?

The most common are role-play or persona framing, obfuscation and encoding such as Base64 or leetspeak, many-shot prompts that condition compliance, prefix injection that forces a compliant opening, and refusal suppression that bans disclaimer language. Real attacks usually combine several techniques and add gradual crescendo-style steps toward a forbidden result.

Can jailbreaks be fully prevented?

No. A model's guardrails are probabilistic, not absolute, so a determined attacker can often find a path around them. Defense combines reducing the likelihood of a successful jailbreak with limiting what one can achieve. That means filtering and validation outside the model, least privilege on tools and data, monitoring, and treating model output as untrusted.

Why are guardrails not a security boundary?

A security boundary is something an attacker cannot cross by choosing better words. Model guardrails fail that test, because they are a learned preference a clever prompt can talk around. Real boundaries live in code you control, where filtering, validation, and authorization run deterministically. Trusting the model to police itself is the most common mistake.

How do you test for jailbreak resistance?

Through AI red teaming, where testers deliberately attempt to bypass the model's guardrails using known and emerging techniques against your specific deployment. It is adversarial, creative work, not a one-off checklist. Testing must be continuous, because new techniques spread quickly and a model update or new tool can reopen a path that was closed.

Sources

1. OWASP. OWASP Top 10 for Large Language Model Applications. Open Worldwide Application Security Project, 2025.
2. NIST. Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology, 2023.
3. ENISA. Multilayer Framework for Good Cybersecurity Practices for AI. European Union Agency for Cybersecurity, 2023.
