AI SecurityMay 24, 2026 · 14 min read

AI security: how to secure LLM applications

LLM apps add an attack surface that behaves like nothing before it, and better prompts will not save you. The controls that work live in the architecture.

A security engineer reviewing an LLM application's data flow and tool permissions on screen.

Written by

Alen Bosanac

Offensive Security

Securing an LLM application means controlling what the model can read, what it can do, and what it can say, across every path that data travels through your system. The model is not the only thing you secure. You secure the prompts that reach it, the tools and APIs it can call, the data it retrieves, the outputs it returns to users, and the humans and automations that act on those outputs. An LLM application is a distributed system with a probabilistic component at its center, and that probabilistic component does not respect the trust boundaries you drew for ordinary software.

This matters because the failure modes are new and the blast radius is large. A model with access to a customer database, an email tool, and a payment API is a single component that can read sensitive data, take actions, and be steered by untrusted text in the same request. Regulated companies in finance, healthcare, and infrastructure now ship these systems into production, and the controls they applied to web and API tiers do not cover the model layer. We test these applications the way an attacker would approach them, then help teams build the controls that hold up. This guide walks through what AI security covers, how the attacks work, what we deliver, and how it connects to the frameworks you already answer to. For the full engagement model, see our AI security service.

What AI security actually covers

AI security is the practice of protecting applications that use machine learning models, with LLM applications as the most common and most exposed case today. The scope is wider than prompt filtering, which is where many teams stop. It covers the model, the orchestration layer that connects the model to your data and tools, the retrieval pipeline, the training or fine-tuning data, and the supply chain of models and libraries you pull in. It also covers the surrounding application, because most real incidents start with an ordinary flaw that the model then amplifies.

Prompt and input security covers how untrusted text reaches the model, including direct user prompts and indirect content pulled from documents, web pages, emails, and databases.
Output handling covers what happens to model responses before they reach a user, a browser, a shell, or a downstream API, since model output is untrusted data and must be treated as such.
Tool and agent security covers the functions, plugins, and APIs the model can invoke, the permissions those tools carry, and whether a manipulated model can trigger actions it should never be able to trigger.
Data security covers the retrieval corpus, the embeddings store, the fine-tuning set, and any conversation logs, all of which can leak sensitive information or be poisoned to change model behavior.
Supply chain security covers the base models, model weights, vector databases, and orchestration libraries you depend on, each of which carries its own integrity and provenance risk.

The OWASP Top 10 for LLM applications as a baseline

The OWASP Top 10 for Large Language Model Applications gives the industry a shared vocabulary for these risks, and we use it as a baseline checklist rather than a finished methodology. It names prompt injection, insecure output handling, training data poisoning, model denial of service, supply chain vulnerabilities, sensitive information disclosure, insecure plugin design, excessive agency, overreliance, and model theft. If you have worked with the original OWASP Top 10 for web applications or the OWASP API Security Top 10, the structure will feel familiar. The content is different because the attack surface is different.

We treat the list as a coverage map. Each item points to a class of test we run and a class of control we expect to find. The two items that drive most real damage are prompt injection and excessive agency, and they compound each other. Prompt injection lets an attacker steer the model. Excessive agency means the steered model can do something that matters. A system with strong input handling but a model wired to a high-privilege tool is still dangerous. A system with weak input handling but a model that can only return read-only text is annoying rather than catastrophic. We score risk by combining the two, not by counting findings in isolation.

How prompt injection works and why filters do not fix it

Prompt injection is the defining vulnerability of LLM applications. The model reads instructions and data in the same channel, as plain text, and it cannot reliably tell which is which. When your application concatenates a system prompt, a user message, and a retrieved document into one context window, an attacker who controls any part of that text can try to override the rest. Direct injection comes from the user typing adversarial instructions. Indirect injection is more dangerous, because the malicious text lives in a document, a web page, a support ticket, or an email that the model ingests later, often on behalf of a different and more privileged user. We cover concrete attacks and the limits of defenses in our deep dive on why prompt injection stays unsolved.

Teams reach for input filters and blocklists first, and they help at the margins. They do not solve the problem. Natural language has unlimited ways to express the same instruction, attackers can encode payloads in other languages or in obfuscated form, and a filter tuned to block known phrases will miss the next phrasing. The durable defenses are architectural. You limit what the model can do, you separate trusted instructions from untrusted data where the platform allows it, you require confirmation for consequential actions, and you treat every model output as untrusted input to the next stage.

Treat the model as a confused deputy that will eventually be tricked, then design so that being tricked is not enough to cause harm.

Excessive agency, tool calls, and the agent problem

The move from chat to agents changed the risk profile. A chatbot that returns text can mislead a user. An agent that calls tools can send email, modify records, move money, execute code, and chain those actions together without a human in the loop. Excessive agency is the gap between what a tool can technically do and what the task actually requires. We see it constantly. A model that needs to read three fields gets a database tool with full read and write access. A model that needs to draft a reply gets a send permission. A model that needs to look up an order gets an API key scoped to the entire account.

Scope every tool to the minimum capability the task needs, so a read task gets a read-only tool and a search task cannot delete records.
Bind tool permissions to the end user's identity and authorization, not to a shared service account that holds more rights than any single user should have.
Require explicit human confirmation for actions that move money, change access, send external communications, or delete data, and make that confirmation step impossible for the model to bypass.
Log every tool invocation with the prompt that triggered it, the arguments passed, and the result, so that an investigation can reconstruct what the agent did and why.
Rate limit and budget tool calls per session, so a manipulated or looping agent cannot drain an API, run up cost, or hammer a downstream system.

Retrieval, RAG, and the data layer

Most enterprise LLM applications use retrieval augmented generation, where the system fetches documents from a vector store and feeds them to the model as context. This is where data security and prompt injection meet. The retrieval corpus is an injection surface, because any document an attacker can get into the index becomes text the model will read and may obey. It is also a confidentiality surface, because retrieval that ignores user permissions will happily pull a document the current user was never allowed to see and summarize it back to them.

We test the retrieval pipeline as carefully as the model. We check whether document-level access control is enforced at query time or whether the index is a flat pool that any user can reach through clever questions. We check whether the embeddings store leaks information through similarity queries. We check whether ingestion sanitizes and labels untrusted content. A RAG system that respects identity on the application tier but drops it at the retrieval tier has a classic broken access control flaw, and the model makes it worse by phrasing the leaked data in fluent prose. This work overlaps heavily with a cloud security assessment, since the vector store, object storage, and secrets usually live in the same cloud account.

How we test an LLM application step by step

Our assessment follows the same discipline as any rigorous penetration test, adapted for the model layer. We start by mapping the system, then we attack it, then we verify which findings are real and what they actually let an attacker reach. We do not run a single automated scanner and call it a test. Automated tools have a place for regression checking, but the high-impact findings come from manual work against your specific architecture, the same reason a pentest beats a scan.

We map the full data flow, from every input source through the orchestration layer to every tool, API, and data store the model can reach, and we identify each trust boundary the model crosses.
We test direct and indirect prompt injection against the real system prompt and the real retrieval corpus, including payloads planted in documents the model will later ingest on behalf of other users.
We probe tool and agent behavior to find excessive agency, attempting to make the model call tools outside its intended scope or chain actions toward a privileged outcome.
We test the application tier around the model, including authentication, authorization, session handling, and the API endpoints, because most chains start with an ordinary web or API flaw rather than a model flaw.
We attempt sensitive information disclosure through the model, the retrieval layer, and the logs, checking whether system prompts, other users' data, or secrets can be extracted.
We validate every finding by reproducing it, then we write up the impact, the chain that produced it, and a fix that addresses the architecture rather than the symptom.

What you receive from an engagement

The deliverable is a report your engineers can act on and your auditors can read. We write findings as concrete chains, not abstract categories. Each finding states the entry point, the steps to reproduce, the data or action it exposed, a severity rating tied to real impact, and a specific remediation. We separate the architectural fixes that remove a class of risk from the tactical fixes that close a single hole. We include an executive summary that an accountable leader can read in five minutes and a technical section a developer can work from line by line.

A prioritized findings list with reproducible steps, evidence, and impact, ordered so that the team fixes the chains that matter before cosmetic issues.
An architecture review of the trust boundaries, tool permissions, and retrieval access model, with the design changes that reduce risk regardless of model behavior.
A remediation plan that distinguishes quick wins from structural work, with enough detail that your engineers do not have to guess what we meant.
A retest after you remediate, so you can show that the high and critical findings are closed rather than merely acknowledged.
Mapping from findings to the frameworks you report against, so the same work supports your compliance evidence instead of duplicating it.

How AI security ties to compliance frameworks

Regulators have caught up to the model layer faster than most internal security programs. The EU AI Act sets obligations for providers and deployers of AI systems, with stricter requirements for high-risk uses, and security testing is part of demonstrating that an AI system behaves as intended. If you operate in financial services, the DORA Regulation (EU) 2022/2554, in force since January 2025, requires ICT risk management and testing across the systems that support critical functions, and an LLM wired into those functions falls in scope. If you run essential or important services, the NIS2 Directive (EU) 2022/2555 pushes the same expectation of risk-based testing into a broader set of sectors.

The certification frameworks reach the model layer through their control families. ISO/IEC 27001:2022, with its 93 Annex A controls across four themes, expects you to manage risk in the systems you operate, and an AI application is a system. The SOC 2 Trust Services Criteria expect controls that match the data you handle, and a model with access to customer data is squarely within that. We map our findings to whichever of these you answer to, so an ISO 27001 or SOC 2 program absorbs the AI testing as evidence rather than running it as a separate exercise.

Common mistakes and red flags

The same avoidable mistakes show up across teams that are otherwise strong engineers. They are not exotic. They come from treating the model as ordinary software and from trusting model output the way you would trust your own code.

Giving the model a high-privilege service account because it was the fastest way to ship, which turns every prompt injection into a potential breach of everything that account can reach.
Trusting model output and passing it straight into a SQL query, a shell command, an HTML page, or a downstream API, which reintroduces injection flaws the application tier had already solved.
Relying on prompt filtering and a hardened system prompt as the primary defense, when both can be bypassed and neither limits what a steered model can do.
Ignoring the retrieval layer's access control, so the model becomes a fluent way to read documents the user was never authorized to see.
Skipping logging on tool calls, which leaves you unable to answer what the agent did when something goes wrong and unable to satisfy an auditor who asks.
Testing the model in isolation and never testing the surrounding application, missing the ordinary web and API flaws that most real attack chains start from.

When and how often to test

Test before the first production launch, because the cheapest time to find an excessive agency problem is before the agent has access to live customer data. After launch, test on a cadence and on change. We recommend a full assessment at least annually and a focused assessment whenever you add a new tool, connect a new data source, change the orchestration logic, or swap the underlying model. Each of those changes alters the trust boundaries, and a model upgrade in particular can change how the system responds to injection attempts. If your release pace is fast, a continuous testing model fits better than a once-a-year report that is stale the week after you ship.

Cost and effort

Effort scales with the attack surface, not with the size of the model. A read-only chatbot over public documentation is a small engagement. An agent with several tools, a permissioned retrieval corpus, and connections to internal systems is a larger one, because there are more trust boundaries to cross and more chains to test. The honest drivers of cost are the number of tools the model can call, the sensitivity of the data it reaches, the number of distinct user roles, and whether you want application-tier testing included, which you usually should. We scope against your real architecture rather than quoting a flat number, and the same factors that drive penetration testing cost apply here.

Securing an LLM application is engineering work with a clear method behind it. You map the trust boundaries, you attack the model and the system around it, you scope what the model can do, and you treat every output as untrusted. We do that work for regulated companies and we write it up so both your engineers and your auditors can use it. See the full scope on our AI security service page, and when you are ready to put a real assessment against your system, book a scoping call.

Frequently asked questions

Is prompt injection actually solvable?

Not in the sense of a complete fix at the input layer, and you should be skeptical of any vendor who claims otherwise. The model cannot reliably separate instructions from data, so the practical answer is to assume injection will sometimes succeed and to design so that success does not cause harm. That means least-privilege tools, human confirmation for consequential actions, and treating output as untrusted. You reduce the frequency with input controls and you cap the damage with architecture.

Do we still need a normal penetration test if we test the AI?

Yes. The model sits on top of a web application, APIs, identity, and cloud infrastructure, and most real attack chains start with an ordinary flaw in one of those layers. AI testing covers the model and its orchestration. It does not replace testing the application and infrastructure underneath. The two fit together, and we usually run them as one engagement so the chains that cross between layers get found.

Can we just buy a guardrail product and skip the testing?

Guardrail products help, and we are glad when teams use them. They are one control among several, not a substitute for understanding your own attack surface. A guardrail does not scope your tool permissions, fix your retrieval access control, or tell you which chains an attacker can build through your specific architecture. Testing tells you what the guardrail is missing and where the real exposure lives.

How does this relate to our existing compliance work?

It feeds directly into it. The findings and remediation map to ISO 27001 control families, SOC 2 criteria, and the testing expectations in DORA and NIS2, and they support EU AI Act obligations for AI systems. We write the report so your governance and compliance program can use it as evidence, which means the testing earns its keep twice. Our [GRC work](/services/grc) ties the technical findings to the controls you report on.

What if we are using a third-party model API rather than hosting our own?

Most of the risk is yours regardless. The prompt construction, the tool wiring, the retrieval corpus, the output handling, and the permissions all live in your application, not in the provider's model. Using a hosted API removes some supply chain and model-theft concerns and adds a data-handling relationship you need to govern. The core work of securing prompt injection, excessive agency, and the data layer is the same whether the weights run in your account or someone else's.

Sources

1OWASP. OWASP Top 10 for LLM Applications. Open Worldwide Application Security Project, 2025. Link
2NIST. Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology, 2023. Link

Related service

AI Security

→

Want this tested on your own systems?

Our team will scope it with you on a 30-minute call.

Book a scoping call

Keep reading

All insights →

01AI Security

AI penetration testing: how to test LLM apps, agents, and RAG

Read →8 min read

02AI Security

AI red teaming: a practical guide for security teams

Read →7 min read

03AI Security

Securing AI agents: the new attack surface of agentic AI

Read →6 min read