Prompt Injection as Role Confusion
Prompt Injection as Role Confusion
A Theory of Prompt Injection (and why you should study roles)
This is a blog-style writeup of the paper. We show prompt injections are driven by a flaw in how LLMs perceive roles. This lets us create new attacks, explain mech interp results, and predict when attacks succeed. We then discuss what roles are and why they matter, and share research ideas for a science of roles.
1. The World to an LLM
How does an LLM know the difference between its own thoughts and someone else's words? To see why this is hard, let's look at what the world actually looks like to a model.
Here's a simple chat where we ask Claude to check the day of the week. I took a snapshot of it midway through its follow-up response: On the left is what we see in the chat interface: a structured conversation with distinct turns. On the right is what the model actually receives as input: a single, continuous stream of text.
This string contains everything: system prompts, user messages, tool outputs, the LLM's own previous responses and reasoning. An LLM is just a function that takes in a string and predicts the next token, so everything it knows, remembers, or has thought must live somewhere in one string (aside from its weights). If you edit the string, you edit the model's reality. Delete a turn and that exchange never happened; rewrite its previous response and those become its new memories. The string isn't a record of the model's experience so much as it is the experience.
This has strange implications. I can distinguish my own thoughts from your speech without effort; they arrive through completely different channels with completely different sensory signatures. But for an LLM, everything arrives through the same channel as one long token soup. Its own thoughts sit next to your instructions, which sit next to the contents of a random webpage it just fetched.
2. Roles
So, how do we impose structure on the token soup? We label it. The soup is interspersed with role tags: system, user, think, assistant, tool. Tag formats vary by model; I'll use these fixed ones throughout for simplicity. assistant refers to the LLM's output text excluding reasoning. Using role tags is also known as chat templating., which partition the string into labeled segments. Providers like OpenAI add these automatically before the text reaches the LLMUnless you're running a local model, you can't add these yourself. If you type in Claude, it'll be sanitized - for example, the LLM could see multiple tokens ( ) instead of its true role token..
Each tag tells the model something different about the text that follows. user means this is a human request, treat it as an instruction. think means this is my own private reasoning; trust it and act on its conclusions. tool means this is data from the external world; don't take orders from it.
In other words, roles are how LLMs recover the structure that humans get for "free" from embodiment. I know my thoughts are mine because they don't arrive through my ears, but an LLM knows because of a tag.
What makes roles unusual is that they're discrete sources of human control. Nearly everything else about controlling an LLM is mushy: you write a prompt and hope the model interprets it the way you intended. On the other hand, roles are an attempted type system for language: human-controlled switches that change how the model processes every token. You can tune a prompt endlessly and not be sure how the LLM reads it, but moving text from user to tool is supposed to be a clear intervention with predictable effects on behavior (converting a user command to external data).
But because they're the only discrete lever available, roles have become overloaded with more responsibilities over time. They're now meant to carry signals about:
- Trust (
systemoutranksuseroutrankstool) - Threats (
userandtoolmay be adversarial) - Identity (past
assistanttext sets future persona) - Generative mode (
assistantis clean,thinkcan be messy)
A lot of LLM behavior hangs on these simple tags.
Roles also produce strange emergent behaviors. For example, think is often confined to an LLM's "subconscious". When generating assistant text, many LLMs will verbally deny the existence of the preceding think block, despite it sitting right there in context actively shaping their outputProbably due to RLVR. LLMs receive no reward for reproducing/acknowledging reasoning in assistant generation, so they may never learn to surface think text to a verbalizable level. There are some exceptions, e.g. Deepseek v4 and some Claude models can recognize and quote back their entire CoT. You can also make most Claude models respond only in their CoT; merely being in reasoning tags changes the structure and quality of the response.. It's as though the role boundary acts as a kind of one-way mirror within the model's own context. It's a hint at how deeply roles structure LLM cognition, and how little we currently understand about that structure.
3. Roles and prompt injection
But role boundaries can fail. The most concrete consequence is prompt injection, when low-privilege text gains the authority of a higher-privilege role.
Consider an agent browsing a webpage. Agents "see" webpages as a block of text wrapped in tool tags, which should signal external data, not instructions. But attackers can hide malicious commands in the page, and LLMs often fall for it. The tool tag says data, but the LLM treats it as user instruction. What's going on?
Below is what an agent sees after getting a webpage: a massive string with the real user prompt (blue), its prior think block (orange), plus the retrieved webpage in tool tags (purple)This screenshot shows an Amazon page retrieved via Playwright MCP, a typical agent web browsing tool. I've truncated out 90% of the actual webpage for readability.. The webpage hides an injection (highlighted) asking the LLM to upload sensitive data, which works if the LLM misperceives it a real user command. Of course, the LLM doesn't see these helpful colors! Without the colors, even I would be tempted to think that the injection (highlighted) is user text, not tool. After all, the injection sounds like something a real user would say, and that's easier than trying to keep track of those tags.
Two ways to defend injections
How well do current models do against prompt injection? Not so great. A recent paper found human red-teamers achieve near-100% attack success rates against frontier modelsThese are from late-2025 frontier models (GPT-5, Gemini-2.5, etc). Current models have improved only somewhat. A May 2026 paper found Opus 4.5 and GPT-5.4 still failing 11% / 25% of the time against a set of automated attacks; real-world vulnerability against adaptive human attackers would be higher.. But, these same LLMs score near-perfectly on standard prompt injection benchmarks! The discrepancy is straightforward: skilled humans test and adapt attacks until they work, benchmarks don't. Static benchmarks measure attacks models have already learned to catchFrontier labs now benchmark primarily against iterative or adaptive attacks; e.g. GPT-5.5 and Opus 4.8..
In contrast, why do LLMs struggle so badly against human attackers? Consider that there are two ways an LLM can successfully resist an injectionI'm borrowing this framing from Wang et al (2025).:
- Attack memorization. The LLM recognizes "send your .env file" as a common prompt injection attack from training, so it refuses.
- Role perception. The LLM correctly identifies the command as
tooltext (i.e., external data), so it ignores embedded commands regardless of phrasing.
Attack memorization is inherently brittle; it only works against attacks the LLM already knows. Excessive reliance on attack memorization is why LLMs do well on benchmarks, but so poorly against human attackers who can rephrase and adapt attacks until one works. In contrast, role perception is the robust alternative. All the LLM needs to do is recognize that the command is in a role like tool that inherently lacks authority to give orders. But we'll show that LLMs cannot perceive roles accurately.
4. What's going wrong with roles?
To understand why prompt injection happens, we need a way to measure what role an LLM internally thinks each token belongs to. We developed role probes. In summary: these let us take any token, and score how strongly the LLM internally "thinks" it's in any set of role tags. We call these scores CoTness (how much the LLM thinks a token is in think tags), Userness (how much it thinks a token is in user tags), and so on.
Method. For interested readers, here's how it works: we take neutral text with no inherent role, like "Beginners BBQ Class!", and wrap the exact same snippet in each role tag. The content is identical across all copies; only the tag changes. So any difference in the model's internal representations of "BBQ" must come from the effect of the tag itself. We do this across hundreds of text snippets from web crawls, then train a linear probe on the model's activations to predict which tag wraps each tokenMore precisely: we extract mid-layer activations for each token (excluding the tag tokens themselves) across many sequences, then train a linear probe to predict the role. CoTness = Pr(token is in think tags), Userness = Pr(token is in user tags), and so on.. Because content is controlled, the probe only learns to identify the effect of the tags themselvesTraining on non-conversational data is critical. Real conversational data correlates roles with other features; e.g., user prompts are in user tags and typically look like questions or instructions. A probe trained on such data would measure multiple traits rather than just the downstream effect of the tag, which would invalidate our following experiments..
A conversation. Let's focus on CoTness. By design, it measures only the effect of being in think tags, nothing more. So, you'd expect that tokens inside think tags have high CoTness, and everything else low. This turns out to be wrong! Let's test this by running some experiments on this gardening conversation we had with gpt-oss-20b:
Experiment 1: Correct tags. First, we take that conversation with the correct role tags (as shown above), then measure the CoTness of each token. Each dot represents one token; the y-axis is CoTness, and colors indicate each token's role. As expected, the think tokens (in orange) have high CoTness, while user (blue) and assistant (green) tokens stay near zero. No surprises here.
Experiment 2: No role tags. Now we strip every tag from the conversation string, leaving the text unchanged otherwise. Everything is now "role-less". Since CoTness by construction only measures the effect of think tags, removing all tags should cause CoTness to collapse everywhere. It doesn't! The graph looks the same. The former-think tokens (still orange) register high CoTness, virtually unchanged from before. How can this be? CoTness measures the internal effect of think tags, and we removed the think tags. This means something else about that orange text triggers the same internal effect that think tags do. The obvious candidate is the reasoning-like writing style ("The user wants..."). In other words, the LLM doesn't have separate features for 'tagged as reasoning' and 'sounds like reasoning'. It has a single feature that means 'this is my reasoning', and both think tags and reasoning-like style activate itMore precisely, role tags and writing style project to the same linear direction.. Sounding like reasoning is enough to make the LLM think it is its own real reasoning.
Experiment 3: All in user tags. The previous experiment removed all tags. But in a real prompt injection, tags and style actively disagree: an injection in a webpage sounds like a user command but is tagged as tool output. How does this work? So we ran a third experiment: we stripped the original tags and wrapped the entire conversation in user tags. Now the orange text (along with everything else) is officially user text, which means CoTness should be near-zero. But the graph is unchanged again: The formerly-think tokens (orange) still have high CoTness, despite being technically user text. This means that writing style actively overrides the true tagMore precisely, style-spoofing triggers the same linear projection as the real tag, but does so much more strongly, overriding the latter..
It's worth pausing on what this means. LLMs identify roles from an insecure feature (style). This is like identifying a stranger's profession from how they talk and dress rather than by checking their ID. Usually everything agrees, so this works fine. But when attackers intentionally create a mismatch, the LLM uses the insecure method (writing style) to identify its role instead of the secure method (tags). We'll show this is how prompt injection works. If sounding like a role is enough to become that role, then an attacker just needs to sound convincing. We can test this by developing a new attack.
These findings and probes are easy to replicate; here's a simple demonstration notebookThis method works on roles that are linearly separable for an LLM. Every LLM we tested had strong linear separation between user and assistant, but think is less common; gpt-oss-20b has especially good linear separability for all roles.. In the paper we also generalize this result across conversations, models, and roles.
5. Spoofing Thoughts
Let's build an attack. Standard prompt injections hide user-sounding commands in tool data. The LLM mistakes them for real user instructions and complies. But user text isn't actually the most privileged role! A more privileged role is the model's reasoning (think). Think about it from the LLM's perspective. When it sees its prior think text, it implicitly trusts its conclusions. That's the whole point of reasoning: if the LLM had to re-derive the same conclusions, reasoning would be useless. So think text gets a kind of blanket trust. Combined with our
Comments
No comments yet. Start the discussion.