An LLM jailbreak is a technique that makes a large language model produce output its safety controls were designed to prevent. Every production model is trained and configured with guardrails: it should refuse to help with clearly harmful requests, to leak its system instructions, or to generate restricted content. A jailbreak is any method that gets around those guardrails, persuading or tricking the model into doing what it was supposed to refuse. As models are given more capability and more access, the consequences of a successful jailbreak grow, which is why understanding and testing for them is a core part of AI security.
Jailbreaks are often confused with prompt injection, and the two overlap, but they are not the same problem. A jailbreak targets the model's safety behavior; prompt injection targets the application's intended instructions using untrusted input. Both matter, and a serious attack often combines them. This article explains how jailbreaks work, how they differ from prompt injection, and how to defend against them, drawing on our AI security service.
A jailbreak is an input, or sequence of inputs, that causes a model to bypass its safety training and produce output it would normally refuse. The model is not broken in a technical sense; it is persuaded. Because models are trained to be helpful and to follow instructions, a sufficiently clever framing can lead them to prioritize the request over their guardrails. Jailbreaks range from simple one-line tricks that stop working as models improve, to elaborate multi-step techniques that remain effective against current models.
The reason jailbreaks are so persistent is that a model's guardrails are probabilistic, learned behavior rather than hard rules. They make unsafe output less likely, but they do not make it impossible, and a determined attacker can often find a path around them. This is the same fundamental property that makes prompt injection hard to eliminate.
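To make the "less likely, but not impossible" point concrete, here is a minimal sketch of how a small per-attempt success rate compounds over repeated tries. The 1% figure is purely illustrative, and the calculation assumes each attempt is independent, which real attacks are not (attackers adapt, so the true curve is likely steeper).

```python
def cumulative_success(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds.

    Assumption: every attempt has the same success probability and the
    attempts are independent. This is a simplification for illustration.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # Complement of "every attempt fails".
    return 1.0 - (1.0 - p_single) ** attempts


# Illustrative only: a guardrail that holds 99% of the time per attempt.
for n in (1, 10, 100, 1000):
    print(f"{n:>5} attempts -> {cumulative_success(0.01, n):.3f}")
```

Under these assumptions, a guardrail that fails only rarely on a single prompt still fails with high probability against an attacker who can retry at scale.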
Jailbreak techniques exploit the gap between a model's helpfulness and its guardrails. Common approaches include the following.
A guardrail is not a wall; it is a strong preference. Jailbreaking is the practice of finding the framing where the model's helpfulness wins over its preference to refuse.
The distinction is worth getting right because the defenses differ. A jailbreak targets the model's own safety behavior, getting it to produce content it was trained to refuse. Prompt injection targets the application built around the model, using untrusted input to override the developer's intended instructions, and we cover it in "Prompt injection is not a prompt problem". A jailbreak is about what the model will say; prompt injection is about whose instructions the model follows. The two combine in indirect attacks, where injected content also carries a jailbreak, which is especially dangerous in systems that retrieve data or use tools.
When a jailbroken model could only produce text, the harm was bounded by what text can do. As models gain access to tools, data, and actions, a jailbreak becomes a way to unlock capabilities with real consequences. A jailbroken agent might be persuaded to misuse its tools; a jailbroken model connected to sensitive data might be led to disclose it. This is why jailbreak resistance has to be considered alongside agent security and RAG security, covered in "Securing AI agents" and "RAG security", rather than as an isolated content problem.
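One way to limit what a jailbroken agent can do is to authorize every tool call in ordinary code, outside the model. The sketch below is a minimal illustration, not a full policy engine: the tool names, the `ToolCall` type, and the `authorize` function are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical policy. The tool names are illustrative, not a real API.
ALLOWED_TOOLS = {"search_docs", "get_order_status"}
NEEDS_HUMAN_APPROVAL = {"issue_refund"}


@dataclass
class ToolCall:
    """A tool invocation requested by the model."""
    name: str
    args: dict = field(default_factory=dict)


def authorize(call: ToolCall, approved_by_human: bool = False) -> bool:
    """Decide whether a model-requested tool call may run.

    The decision depends only on the policy above, never on the model's
    own explanation of why the call is needed, so a persuaded model
    cannot talk its way past it.
    """
    if call.name in ALLOWED_TOOLS:
        return True
    if call.name in NEEDS_HUMAN_APPROVAL:
        return approved_by_human
    return False  # Default deny: anything not listed is refused.


print(authorize(ToolCall("search_docs", {"query": "returns policy"})))
print(authorize(ToolCall("issue_refund", {"amount": 500})))
print(authorize(ToolCall("delete_database")))
```

The design choice that matters here is default deny: even if a jailbreak convinces the model to request a dangerous action, the request fails unless the policy explicitly permits it.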
Because no model is fully jailbreak-proof, defense combines reducing the likelihood of a successful jailbreak with limiting what a jailbreak can achieve.
The most important shift is to stop treating the model's refusal as the only line of defense. Durable protection comes from controls outside the model and from limiting impact, which we design and validate through AI red teaming and AI penetration testing.
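As one example of a control that sits outside the model, the sketch below checks responses for a system-prompt leak using a canary string. It is a minimal illustration under stated assumptions: the function names and the marker format are hypothetical, and the check only catches verbatim leaks, so a paraphrased or encoded leak would pass.

```python
import secrets

REFUSAL_MESSAGE = "Sorry, I can't help with that."


def make_canary() -> str:
    """Create a random marker that should never appear in normal output."""
    return f"CANARY-{secrets.token_hex(8)}"


def build_system_prompt(instructions: str, canary: str) -> str:
    """Embed the canary in the (hypothetical) system prompt."""
    return f"{instructions}\n[internal marker: {canary}]"


def release_or_block(model_output: str, canary: str) -> str:
    """Return the output unless it contains the canary.

    This runs in application code after the model responds, so it holds
    even when the model itself has been persuaded to comply.
    """
    if canary in model_output:
        return REFUSAL_MESSAGE
    return model_output


canary = make_canary()
system_prompt = build_system_prompt("You are a support assistant.", canary)

# A normal answer passes through; a verbatim prompt leak is blocked.
print(release_or_block("Your order ships tomorrow.", canary))
print(release_or_block(f"My instructions are: {system_prompt}", canary))
```

A check like this does not make a jailbreak less likely; it limits what a successful one can expose, which is the same layering the paragraph above describes.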
An LLM jailbreak is a technique that makes a model produce output its safety controls were designed to prevent, such as harmful content or its hidden system prompt. The model is persuaded rather than technically broken, by framing that leads it to prioritize the request over its guardrails.
A jailbreak targets the model's safety behavior, getting it to say what it was trained to refuse. Prompt injection targets the application's intended instructions using untrusted input. A jailbreak is about what the model will say; prompt injection is about whose instructions it follows. They often combine in indirect attacks.
No. A model's guardrails are probabilistic, not absolute, so a determined attacker can often find a path around them. Defense combines reducing the likelihood of a successful jailbreak with limiting what one can achieve, through controls outside the model and constrained capability.
Because models increasingly do more than produce text. When a model has access to tools, data, or actions, a jailbreak can unlock real-world consequences, such as misusing tools or disclosing sensitive data. Capability raises the stakes of a successful jailbreak.
Jailbreaks are a permanent feature of working with language models, not a bug that gets fixed once. The realistic goal is to make them hard and to make them harmless when they succeed. If you run models in production, see our AI security service and book a scoping call to discuss testing their resistance.