AI SecurityJune 13, 2026 · 8 min read

AI penetration testing: how to test LLM apps, agents, and RAG

AI penetration testing covers an attack surface a standard pentest was never built for. See what it tests, how it works, and how it differs from red teaming.

A security engineer testing an AI application against prompt injection attacks on screen.

Written by

Alen Bosanac

Offensive Security

AI penetration testing is the practice of attacking an AI application the way a real adversary would, to find the weaknesses that let an attacker manipulate its behavior, extract data it should protect, or abuse the actions it can take. It covers the whole system around the model, not just the model itself: the prompts, the retrieval pipeline, the tools and integrations the model can call, the guardrails meant to contain it, and the infrastructure underneath. As companies move large language models into production, this has become one of the fastest-growing areas of security testing, and one that traditional penetration testing was never designed to handle.

The reason AI needs its own form of testing is that the attack surface is genuinely new. A web application behaves deterministically: the same input produces the same output, and you can reason about its logic. A model does not. It interprets natural language, follows instructions wherever it finds them, and can be steered by text that arrives through data rather than through the user. That single property, that a model treats untrusted input with the same trust as your own instructions, breaks many assumptions security testing relies on. This article explains what AI penetration testing covers, the vulnerabilities it looks for, and how it fits alongside traditional testing and AI red teaming. We deliver this work as part of our AI security service.

What is AI penetration testing?

AI penetration testing is a focused, hands-on assessment of an AI system to find and prove exploitable weaknesses. As with any penetration test, the goal is not a list of theoretical risks but a demonstration of what an attacker can actually achieve: making the system leak confidential data, bypassing its safety controls, or turning the actions it can take against its owner. The difference is the target. Instead of testing a network or a conventional application, the engineer tests the model, the application wrapped around it, and the data and tools it touches.

In practice this means treating the AI application as a system with trust boundaries and attacking across them. Where does untrusted input enter? What can the model do once it is manipulated? What data can it reach? What actions can it trigger? The answers define the scope of the test, and the findings show where the boundaries fail.

Why AI systems need their own testing

Several properties make AI applications resistant to traditional testing and require a dedicated approach.

The model follows instructions from anywhere, including from data it retrieves or processes, so an attacker can plant instructions in a document, a webpage, or a support ticket.
Behavior is non-deterministic, so a single test run is not conclusive and testing must probe the range of possible responses.
The system often has agency: it can call tools, query databases, send emails, or execute code, which turns a manipulated model into a manipulated actor.
Trust boundaries are blurred, because the same channel carries both legitimate content and attacker-controlled content.
Guardrails are probabilistic, not absolute, so controls that look effective in a demo can fail under adversarial pressure.

A traditional application does what its code says. An AI application does what the most persuasive text in its context window says. Testing has to account for that difference, because attackers already do.

What AI penetration testing covers

A thorough AI penetration test assesses the whole stack, not just the model. The most valuable findings usually sit at the seams between components.

The model and its prompts, including the system prompt and how the application separates instructions from user input.
The retrieval pipeline (RAG), including the documents the model can reach and whether an attacker can poison or manipulate them.
Agents and tool use, including every action the model can trigger and what happens when an attacker steers those actions.
Guardrails and safety controls, tested under adversarial pressure rather than in their intended use.
Output handling, including whether model output is trusted by downstream systems in ways that enable injection or code execution.
The supporting infrastructure, including authentication, authorization, rate limiting, and logging around the AI features.

Common AI vulnerabilities we test for

The OWASP Top 10 for Large Language Model Applications is the widely used reference for AI application risk, and a good test maps to it while going beyond a checklist. The vulnerabilities that matter most in practice include the following.

Prompt injection, both direct and indirect, where attacker-controlled text overrides the application's intended behavior. We cover this in depth in prompt injection is not a prompt problem.
Jailbreaks that bypass the model's safety controls to produce harmful or restricted output.
Sensitive information disclosure, where the model reveals system prompts, secrets, or data from other users or contexts.
Insecure output handling, where downstream systems trust model output and an attacker uses that trust to inject code or commands.
Excessive agency, where the model can take actions far beyond what is safe, and a manipulated model becomes a manipulated actor.
RAG poisoning and data manipulation, where an attacker plants content that the model later retrieves and acts on.
Supply chain and model risks, including compromised models, dependencies, or training data.

AI penetration testing vs traditional penetration testing

A traditional penetration test and an AI penetration test share a method, scope, attack, prove impact, report, but differ in what they attack and how. A traditional test targets deterministic systems and known classes of technical vulnerability. An AI test targets a system that can be manipulated through language, that behaves probabilistically, and that may take real-world actions. The skills overlap but are not identical: an AI tester needs to understand both application security and how models and agents actually behave under attack.

For most AI products, both kinds of testing are needed. The AI features sit inside a conventional application and infrastructure that still need testing the usual way, which we cover through application and cloud security. The AI-specific layer needs the dedicated approach described here.

AI penetration testing vs AI red teaming

The two terms are often used interchangeably, but they are not the same. AI penetration testing is typically a scoped assessment of a specific AI application, with defined targets and a focus on finding and proving concrete vulnerabilities. AI red teaming is broader and more adversarial: it simulates a determined attacker pursuing an objective across the whole system, testing detection and response as well as the controls themselves. We explain the discipline in detail in AI red teaming: a practical guide.

In short, a penetration test answers what is exploitable in this AI application, while red teaming answers how far a real adversary could get. Many organizations start with a penetration test and move to red teaming as their AI systems and risk grow.

How an AI penetration test runs

A credible engagement follows a clear arc, the same way any rigorous test does, adapted to the AI context.

Scope and rules, where we agree the targets, the AI features in scope, and the limits, in writing before testing starts.
Mapping the system, where we chart the model, prompts, data sources, tools, and trust boundaries between them.
Attacking, where we run direct and indirect prompt injection, jailbreaks, tool abuse, and data extraction against the live system.
Proving impact, where each finding is demonstrated with reproducible steps and a clear account of the business consequence.
Reporting, where findings are risk-ranked and written for both the engineers who fix them and the leaders who fund the work.
Retesting, where we confirm the fixes actually closed the weaknesses rather than moving them.

AI penetration testing and regulation

Testing is increasingly tied to regulatory expectations. The EU AI Act requires that high-risk AI systems meet standards of accuracy, robustness, and cybersecurity, and testing is how you demonstrate them. The NIST AI Risk Management Framework similarly treats testing and measurement as core to managing AI risk. We align AI penetration testing and its evidence to these obligations, so the same work supports both security and compliance, as set out on our EU AI Act page.

For organizations that build or operate AI as part of their product, this overlap matters: a single well-scoped testing program can satisfy security needs and regulatory evidence at once. Our work with AI-native organizations is described on the AI platforms page.

AI applications create real value and a real new attack surface at the same time. If you are putting models, agents, or RAG pipelines into production, the safest assumption is that they can be manipulated, and the right response is to test them before an attacker does. See our AI security service and book a scoping call to discuss testing your AI systems.

Frequently asked questions

What is AI penetration testing?

AI penetration testing is a hands-on assessment of an AI application that finds and proves exploitable weaknesses, such as prompt injection, jailbreaks, data leakage, and tool abuse. It tests the whole system around the model, including prompts, retrieval, agents, and guardrails, not just the model itself.

How is AI penetration testing different from a normal pentest?

A normal pentest targets deterministic systems and known technical vulnerabilities. An AI pentest targets a system that can be manipulated through language, behaves probabilistically, and may take real-world actions through tools. Both are usually needed, because the AI features sit inside conventional infrastructure that also requires testing.

Do you test the model or the application around it?

Primarily the application around the model, because that is where most real risk lives: retrieval, tools, output handling, and the data paths an attacker would actually use. We test the system as a whole rather than the model in isolation.

How often should we test our AI systems?

Test before launching a new AI feature, after significant changes to the model, prompts, tools, or data sources, and on a regular cadence for systems that handle sensitive data or take consequential actions. AI systems change quickly, so a single test is a snapshot, not a guarantee.

Sources

1OWASP. OWASP Top 10 for Large Language Model Applications. OWASP Foundation, 2025. Link
2NIST. Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology, 2023. Link
3European Parliament and Council. Regulation (EU) 2024/1689 (Artificial Intelligence Act). EUR-Lex, 2024. Link

Related service

AI Security

→

Want this tested on your own systems?

Our team will scope it with you on a 30-minute call.

Book a scoping call

Keep reading

All insights →

01AI Security

AI red teaming: a practical guide for security teams

Read →7 min read

02AI Security

Securing AI agents: the new attack surface of agentic AI

Read →6 min read

03AI Security

MCP security: risks of the Model Context Protocol and how to manage them

Read →5 min read