What Is Context Engineering?
Author: Trix Cyrus
If you spent 2023 and 2024 obsessing over prompt wording, chain-of-thought tricks, and few-shot examples, you were doing prompt engineering. If you're building AI agents in 2026, you've probably noticed that crafting the perfect instruction isn't the bottleneck anymore.
The bottleneck is everything around the instruction: which documents got retrieved, how much conversation history survived, which tools the model can see, and whether any of it actually fits coherently in the model's head at inference time. That broader discipline has a name now: context engineering.
The short definition
Context engineering is the practice of deliberately curating everything a language model sees on a given inference call - the system prompt, user input, retrieved documents, conversation history, tool definitions, and any long-term memory - so that the model has exactly what it needs to do the job well, and nothing more.
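One way to picture this definition is as a single data structure holding everything that reaches the model on one call. The sketch below is illustrative only: the `ContextBundle` class, its field names, and the rough four-characters-per-token estimate are assumptions made for this example, not any framework's API.

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4) if text else 0


@dataclass
class ContextBundle:
    """Everything the model sees on one inference call (hypothetical shape)."""
    system_prompt: str
    user_input: str
    retrieved_docs: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)
    tool_definitions: list[str] = field(default_factory=list)
    long_term_memory: list[str] = field(default_factory=list)

    def token_breakdown(self) -> dict[str, int]:
        """Estimated tokens per component, so bloat is visible at a glance."""
        return {
            "system_prompt": estimate_tokens(self.system_prompt),
            "user_input": estimate_tokens(self.user_input),
            "retrieved_docs": sum(estimate_tokens(d) for d in self.retrieved_docs),
            "history": sum(estimate_tokens(h) for h in self.history),
            "tool_definitions": sum(estimate_tokens(t) for t in self.tool_definitions),
            "long_term_memory": sum(estimate_tokens(m) for m in self.long_term_memory),
        }

    def total_tokens(self) -> int:
        return sum(self.token_breakdown().values())


bundle = ContextBundle(
    system_prompt="You are a support agent.",
    user_input="My invoice is wrong.",
    retrieved_docs=["Billing FAQ: invoices are issued monthly."],
)
breakdown = bundle.token_breakdown()
largest = max(breakdown, key=breakdown.get)
print(breakdown, "largest:", largest)
```

Breaking the total down by component is the useful part: it shows which slice of the context is growing, which is usually the first question to answer when an agent starts misbehaving.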
Prompt engineering asks: "What words should I use to instruct the model?" Context engineering asks a bigger question: "What configuration of information is most likely to produce the behavior I want, across every step of a multi-turn, tool-using task?"
The distinction matters because the failure modes are different. A bad prompt produces a bad single response. A badly engineered context produces an agent that loses track of its own goal halfway through a task, calls the wrong tool because it has thirty similar ones to choose from, or burns 50,000 tokens on stale tool output before it even reads the user's actual request.
Why this became its own discipline
Three things pushed context engineering into existence as a distinct skill, separate from prompt writing.
Agents replaced single-shot chat. Early LLM applications were mostly one-shot: classify this, summarize that, answer this question. The entire interaction lived in one prompt. Agents are different. They run in loops, call tools, observe results, and decide what to do next, sometimes for dozens or hundreds of steps. Every one of those steps re-sends the accumulated context to the model. Managing what's in that accumulating blob became the actual engineering problem, not the wording of the original instruction.
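To see why the accumulating context dominates, consider a back-of-the-envelope model of an agent loop. The numbers used here (a 2,000-token starting context and 500 new tokens per step) are made-up assumptions for illustration, and the function name is hypothetical.

```python
def tokens_sent_over_run(base_tokens: int, tokens_added_per_step: int, steps: int) -> int:
    """Total input tokens sent when every step re-sends the full history.

    The first step sends the base context; each later step sends everything
    so far plus whatever the previous step appended (tool call, observation).
    """
    total = 0
    context_size = base_tokens
    for _ in range(steps):
        total += context_size                  # the whole context is re-sent
        context_size += tokens_added_per_step  # and then it grows
    return total


for steps in (1, 10, 100):
    print(steps, tokens_sent_over_run(2_000, 500, steps))
```

Under these assumptions a 100-step run sends about 2.7 million input tokens in total, even though only about 50,000 new tokens were appended along the way. Total tokens sent grows roughly with the square of the number of steps, which is why trimming what accumulates matters more than polishing the first instruction.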
Bigger context windows didn't solve the problem; they exposed it. It's tempting to think a million-token context window means you can just dump everything in and let the model figure it out. In practice, models exhibit what researchers and practitioners now call "context rot": performance measurably degrades as you fill the window with marginally relevant content, even well within the model's stated limit. Attention is a finite resource. A bigger window gives you more rope, not more focus.
Production agents have real unit economics. Verbose tool outputs, redundant retrieval results, and full conversation histories add up fast. Multi-agent systems can burn several times more tokens than a simple chatbot doing the same task. At scale, an agent that uses 5x the tokens it needs isn't just slower - it's the difference between a viable product and a cost center. Once teams started watching their token bills, "what's actually in this context window and why" stopped being an academic question.
The four things you're actually managing
Most context engineering work falls into one of four buckets.
1. System instructions and tool definitions
This is the part that looks most like classic prompt engineering, but the goal shifts from "write a clever instruction" to "define a minimal, unambiguous action space." A common failure mode is a tool set so bloated, or so overlapping in functionality, that the model genuinely can't tell which tool to call. If a human engineer reading the tool list can't confidently say which one applies in a given situation, the model can't either. Keeping the tool surface small and the boundaries between tools crisp is context engineering, not prompt polish.
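A crude way to spot overlapping tools before the model trips over them is to compare their descriptions. The sketch below uses Jaccard similarity over description words. The tool names, descriptions, and the 0.5 threshold are all hypothetical, and word overlap is only a rough proxy for functional overlap.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two descriptions."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def overlapping_tools(
    tools: dict[str, str], threshold: float = 0.5
) -> list[tuple[str, str, float]]:
    """Return pairs of tools whose descriptions look suspiciously similar."""
    flagged = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(tools.items(), 2):
        score = jaccard(desc_a, desc_b)
        if score >= threshold:
            flagged.append((name_a, name_b, round(score, 2)))
    return flagged


# Hypothetical tool set: two of these are hard to tell apart.
tools = {
    "search_docs": "search the product documentation for a query",
    "find_docs": "search the product documentation for a keyword",
    "escalate_to_human": "hand the ticket to a human support engineer",
}
flagged = overlapping_tools(tools)
print(flagged)
```

A flagged pair is a prompt for a human decision, not a verdict: either merge the two tools or rewrite their descriptions until the rule for choosing between them is obvious.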
2. Retrieval (what gets pulled in, and when)
There are two broad strategies here, and most production systems end up using a mix of both.
- Pre-fetch pulls relevant data in upfront, before the model starts reasoning. It's fast and predictable, but only as good as your retrieval pipeline's ability to guess what's relevant ahead of time.
- Just-in-time retrieval gives the model primitives, like file search, grep, or database queries, and lets it pull information when it decides it needs it. This avoids stale, pre-computed indexes and lets the agent navigate its environment the way a person would, but it's slower and depends on the model having good heuristics for when and how to search.
Claude Code is a useful real-world example of the hybrid approach: project-level instruction files get loaded into context automatically up front, while file contents are fetched just-in-time via search primitives as the agent actually needs them, rather than being pre-indexed and potentially stale.
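The hybrid pattern can be sketched in a few lines. This is a toy illustration and not how Claude Code is implemented: the in-memory file table, the `PROJECT_INSTRUCTIONS.md` file name, and the `prefetch_context` and `grep_tool` functions are all assumptions made for the example.

```python
import re

# Hypothetical in-memory "project" standing in for a real file system.
PROJECT_FILES = {
    "PROJECT_INSTRUCTIONS.md": "Always run tests before committing.",
    "src/billing.py": "def compute_invoice(total):\n    return total * 1.2\n",
    "src/users.py": "def get_user(user_id):\n    return {'id': user_id}\n",
}


def prefetch_context(instruction_file: str = "PROJECT_INSTRUCTIONS.md") -> list[str]:
    """Pre-fetch: load a small, always-relevant file before reasoning starts."""
    return [f"[{instruction_file}]\n{PROJECT_FILES[instruction_file]}"]


def grep_tool(pattern: str) -> list[str]:
    """Just-in-time: a search primitive the model can call when it chooses.

    Returns matching lines with file and line number rather than whole files,
    so only the relevant fragment enters the context.
    """
    regex = re.compile(pattern)
    hits = []
    for path, text in PROJECT_FILES.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append(f"{path}:{line_no}: {line.strip()}")
    return hits


context = prefetch_context()                  # loaded up front
context += grep_tool(r"def compute_invoice")  # pulled in on demand
print("\n".join(context))
```

The design choice worth noticing is what the search primitive returns: a pointer plus one line, not the file. The agent can always ask for more, but it cannot un-read a file that was dumped into its context.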
3. Memory across turns and sessions
Within a single conversation, this means deciding what conversation history actually needs to persist versus what can be summarized or dropped. Across sessions, it means deciding what's worth writing to durable memory at all. Hierarchical memory - short-term working context, medium-term session summaries, and long-term persistent memory - is an active area of both research and tooling right now, because naive "remember everything" approaches scale terribly.
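A minimal sketch of the three tiers might look like the following. The `LayeredMemory` class and its method names are hypothetical, and the summarizer is a deliberate placeholder that just truncates text; a real system would more likely ask a model to write the summary.

```python
from collections import deque


class LayeredMemory:
    """Three-tier memory sketch: working context, session summaries, durable facts."""

    def __init__(self, working_limit: int = 4):
        self.working = deque()       # short-term: recent turns, kept verbatim
        self.session_summaries = []  # medium-term: condensed older turns
        self.persistent = {}         # long-term: survives across sessions
        self.working_limit = working_limit

    def add_turn(self, turn: str) -> None:
        self.working.append(turn)
        # When working memory overflows, demote the oldest turn to a summary.
        while len(self.working) > self.working_limit:
            oldest = self.working.popleft()
            self.session_summaries.append(self._summarize(oldest))

    def remember(self, key: str, fact: str) -> None:
        """Explicitly promote a fact to durable memory."""
        self.persistent[key] = fact

    @staticmethod
    def _summarize(turn: str, max_words: int = 6) -> str:
        # Placeholder summarizer: keeps only the first few words.
        words = turn.split()
        suffix = " ..." if len(words) > max_words else ""
        return " ".join(words[:max_words]) + suffix

    def build_context(self) -> list[str]:
        """Durable facts first, then summaries, then the verbatim recent turns."""
        facts = [f"FACT {k}: {v}" for k, v in self.persistent.items()]
        summaries = [f"SUMMARY: {s}" for s in self.session_summaries]
        return facts + summaries + list(self.working)


memory = LayeredMemory(working_limit=2)
memory.remember("plan", "customer is on the Pro plan")
for turn in ["user: my invoice is wrong", "agent: which invoice?", "user: the March one"]:
    memory.add_turn(turn)
print(memory.build_context())
```

Note that nothing reaches the persistent tier automatically here. Making promotion an explicit call is one way to avoid the "remember everything" trap.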
4. Compaction and pruning
For long-running agent tasks, raw history eventually has to be condensed. Compaction techniques summarize or discard low-value turns while preserving the state that actually matters for the task's continuation. Done badly, this causes an agent to forget a constraint it was given fifteen steps earlier. Done well, it's invisible - the agent just keeps working coherently far longer than its raw context window would otherwise allow.
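The forgotten-constraint failure suggests one simple safeguard: mark the turns that carry goals or constraints and never compact them. The sketch below assumes a hypothetical `Turn` record with a `pinned` flag, and it replaces dropped turns with a placeholder count where a real system would more likely insert a model-written summary.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    text: str
    pinned: bool = False  # constraints and goals that must survive compaction


def compact(history: list[Turn], keep_recent: int = 3) -> list[Turn]:
    """Keep pinned turns and the most recent turns; collapse everything else."""
    split = max(0, len(history) - keep_recent)
    older, recent = history[:split], history[split:]
    pinned = [turn for turn in older if turn.pinned]
    dropped = len(older) - len(pinned)
    summary = [Turn("system", f"[{dropped} earlier turns compacted]")] if dropped else []
    return pinned + summary + recent


history = [Turn("user", "Never issue refunds above $100.", pinned=True)]
history += [Turn("tool", f"search result {i}") for i in range(1, 15)]
history.append(Turn("user", "Can I get a $250 refund?"))

compacted = compact(history)
print([turn.text for turn in compacted])
```

Sixteen turns shrink to five, and the refund limit given at the very start is still in front of the model when the refund question arrives.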
A small concrete example
Say you're building an agent that triages support tickets and drafts replies using your company's docs.
A prompt-engineering mindset optimizes the instruction: "You are a helpful support agent. Read the ticket and write a professional reply using the provided documentation."
A context-engineering mindset asks a longer list of questions:
- Which docs actually get retrieved for this ticket, and how do you keep retrieval precise instead of just keyword-matching the whole knowledge base?
- Does the agent see the full ticket thread or a summarized version once it gets long?
- If the agent has access to "search docs", "search past tickets", and "escalate to human" tools, are those boundaries clear enough that the model won't call the wrong one?
- If this agent runs across a multi-day ticket with twenty back-and-forth messages, what's still in context by message twenty, and what got pruned?
None of those questions are about word choice. All of them determine whether the agent is reliable.
Practical takeaways
A few principles show up consistently in how teams approach this:
- Treat context as a scarce, expensive resource, not a dumping ground. Every token you include has a cost, both in dollars and in the model's attention budget.
- Smaller, well-bounded tool sets beat large, overlapping ones. If you can't articulate a clear rule for when to use tool A versus tool B, the model can't either.
- Prefer retrieval precision over retrieval volume. Pulling in ten highly relevant chunks beats pulling in a hundred loosely relevant ones, even with a huge context window available.
- Plan for compaction from the start on anything that might run long, rather than bolting on a summarization step after you've already hit context limits in production.
- A weaker model with well-engineered context will often outperform a stronger model with a messy one. No amount of raw capability fully compensates for an agent that can't tell what's actually relevant to the task in front of it.
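The "precision over volume" principle above can be made concrete with a small selection step between the retriever and the model. This is a sketch under stated assumptions: the relevance scores are taken to come from some hypothetical retriever, and the score cutoff, chunk cap, token budget, and four-characters-per-token estimate are illustrative values, not recommendations.

```python
def select_chunks(
    scored_chunks: list[tuple[float, str]],
    min_score: float = 0.75,
    max_chunks: int = 10,
    token_budget: int = 2_000,
) -> list[str]:
    """Keep only high-scoring chunks, best first, within a token budget."""
    selected, used = [], 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for score, text in ranked:
        if score < min_score or len(selected) >= max_chunks:
            break  # everything after this point scores lower or exceeds the cap
        cost = max(1, len(text) // 4)  # assumption: about 4 characters per token
        if used + cost > token_budget:
            continue  # this chunk does not fit, but a smaller one still might
        selected.append(text)
        used += cost
    return selected


# Hypothetical retriever output: (relevance score, chunk text).
candidates = [
    (0.91, "Refunds are processed within 5 business days."),
    (0.42, "Our office is closed on public holidays."),
    (0.88, "Refund requests require an order number."),
    (0.77, "Invoices are issued on the first of each month."),
    (0.30, "We were founded in 2012."),
]
chosen = select_chunks(candidates, min_score=0.75, max_chunks=2)
print(chosen)
```

The cutoff does the real work. Without it, a large context window quietly turns "top results" into "all results", and the loosely relevant chunks compete with the good ones for the model's attention.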
Where this is headed
The field is still moving fast. Researchers are exploring more formal, almost information-theoretic ways to decide what belongs in context - treating each candidate chunk as carrying some measurable amount of information about the task, and selecting a minimal sufficient set. Agent frameworks are increasingly letting agents manage and even rewrite their own context proactively, rather than relying entirely on hand-built pipelines. And memory architectures are trending toward layered systems that look more like an operating system's memory hierarchy than a single flat conversation log.
But the core idea is likely to hold regardless of how the tooling evolves: the model's output is only as good as the information landscape you hand it. Getting that landscape right, deliberately, and at every step of a multi-turn task, is the job now.
Further reading
- Anthropic, "Effective context engineering for AI agents"
- Simon Willison, "Context engineering"
- Mei et al., "A Survey of Context Engineering for Large Language Models"
~Trixsec