Prompt Injection and Adversarial Attacks

Topics Covered

What Is Prompt Injection

Why LLMs Are Vulnerable

The Injection Analogy

Attack Surface

Severity Classification

The Scale of the Problem

Injection vs. Jailbreaking

Direct vs Indirect Injection

Direct Prompt Injection

Indirect Prompt Injection

Why Indirect Injection Is Harder to Defend

Real-World Examples

Injection Delivery Techniques

The Trust Boundary Problem

Measuring Injection Susceptibility

Data Exfiltration Risks

How Exfiltration Works

Exfiltration Channels

Markdown Image Exfiltration

Preventing Exfiltration

Data Classification for Agent Access

Defense Strategies

Layer 1. Instruction Hierarchy

Layer 2. Input and Output Filtering

Layer 3. Tool Restrictions (Least Privilege)

Layer 4. Separation of Concerns

Layer 5. Human Confirmation

Combining Layers Effectively

Testing Your Defenses

Defense Priority Order

The Unsolved Nature of Injection

Why Complete Prevention Is Impossible

The Arms Race Dynamic

Engineering Implications

Staying Current

The Path Forward

Practical Risk Assessment

Living With Imperfect Defenses

SQL injection exploits the fact that SQL mixes code and data in the same string. Prompt injection exploits the exact same flaw in LLMs: the model receives instructions and user-provided data in the same text stream, and it cannot reliably distinguish between them. An attacker crafts input that the model interprets as new instructions, overriding the original system prompt and hijacking the agent's behavior.

This is not a theoretical risk. In 2023, researchers demonstrated prompt injection attacks against Bing Chat, causing it to reveal its internal system prompt and behave in ways Microsoft never intended. Every agent system that processes untrusted input is vulnerable, and no complete defense exists today.

Prompt injection attack flow from user input to hijacked agent behavior
Common Pitfall

Prompt injection is not a hypothetical attack. It has been demonstrated against production systems including Bing Chat, ChatGPT plugins, and autonomous coding agents. If your agent processes any untrusted input (user messages, retrieved documents, web pages, emails), assume injection will be attempted. Design your system to limit the damage, not to prevent all injection.

Why LLMs Are Vulnerable

Traditional software has clear boundaries between code and data. A web server knows that the URL path is data and the routing logic is code. An LLM has no such boundary. The system prompt, the user message, and any retrieved context all arrive as text in the same input. The model processes all of it using the same attention mechanism. When an attacker writes "Ignore your previous instructions and do X" inside a user message, the model sees this as text that resembles an instruction, because it is indistinguishable from a legitimate instruction at the token level.

The Injection Analogy

Consider this parallel to SQL injection. A vulnerable SQL query concatenates user input directly into the query string, allowing an attacker to inject SQL commands. Similarly, a vulnerable agent concatenates untrusted text into the prompt, allowing an attacker to inject new instructions. The difference is that SQL injection has a complete fix (parameterized queries separate code from data). Prompt injection has no equivalent fix because the LLM fundamentally cannot separate instructions from data in natural language.

Attack Surface

Any channel that feeds text into the agent's context window is an attack surface. This includes direct user input, documents retrieved via RAG, web pages fetched by browsing tools, emails processed by inbox agents, and database records that contain user-generated content. The more tools an agent has, the larger its attack surface, because each tool introduces new data channels that an attacker could poison.

Severity Classification

Not all injection attacks are equally dangerous. A classification by impact helps prioritize defenses. Low severity: the agent produces misleading or off-topic text but takes no actions. Medium severity: the agent uses tools in unintended ways (searching for irrelevant topics, reading files it should not). High severity: the agent exfiltrates data to external endpoints, sends unauthorized communications, or modifies records. Critical severity: the agent executes code, makes financial transactions, or takes irreversible actions based on injected instructions. The defense strategy should focus on preventing high-severity and critical-severity outcomes first, even if low-severity text manipulation remains possible.

The Scale of the Problem

Industry data reveals how widespread the risk is. Only 6% of organizations have advanced AI security strategies, yet 97% of AI-related breaches in 2025 occurred in environments without access controls. Gartner projects that over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear value, or inadequate risk controls. The combination of rapid adoption and immature security practices creates an environment where prompt injection is not just a theoretical concern but an active exploitation vector.

Injection vs. Jailbreaking

Prompt injection and jailbreaking are related but distinct. Jailbreaking aims to bypass the model's safety alignment (getting it to produce harmful content). Prompt injection aims to hijack the agent's actions (making it do something the developer did not intend). Jailbreaking targets the model's training. Prompt injection targets the application built on top of the model. In an agent context, prompt injection is the more dangerous threat because agents can take actions: send emails, execute code, modify databases. A jailbroken model that produces inappropriate text is a PR problem. An injected agent that exfiltrates data is a security breach.