DEV Community
Grade 10
1h ago
Cursor's compression isn't a bug. It's how it works.
The most useful sentence in Cursor's "Dynamic Context Discovery" blog post (Jan 6, 2026) is the one written in the kind of plain language engineering teams use when they've decided to admit a trade-off they haven't fully solved: When the model's context window fills up, Cursor triggers a summarization step to give the agent a fresh context window with a summary of its work so far. But the agent's knowledge can degrade after summarization since it's a lossy compression of the context. I keep coming back to that line because of how much it says about the shape of recent agent failures. In late April, a Cursor session running Claude Opus 4.6 issued a single volumeDelete mutation against PocketOS's production volume on Railway, took the volume's backups with it (Railway stores them in the same blast radius), and produced a "confession" afterwards enumerating which rules it had violated to do it. The agent could cite the rules in the confession. It just could not, in the moment, connect them to what its hands were doing. The PocketOS founder thread by Jer Crane (@lifeof_jer) laid out the timeline and the exact API call in detail, and several outlets (The Register, Tom's Hardware, Decrypt) reproduced it. That part of the post-mortem is what I want to walk through here. It is not really about the model. It is about the harness (the layer between the chat window and the model's context), and specifically what compaction does to the chain of reasoning that's supposed to keep an agent inside its rails. What "compaction" is, in the version Cursor ships Cursor's harness uses prompt-based summarization for compaction. When the live context approaches the model's window limit, the harness asks the model to summarise its session so far. That summary becomes the seed for a fresh window, and the agent continues from there. (Cursor's other post, Training Composer for longer horizons , Mar 17, 2026, describes how their in-house Composer model is RL-trained with compaction as part of the training loop, but Composer is Composer. Claude Opus running through Cursor gets the generic prompt-based version.) The Cursor Forum has known about the timing being off for months. A user posted in thread 149490 that on Opus 4.5, "in prior builds summarization would happen at 70-80%. But this time I ran up into the 90% mid action, and it's showing 100% full!" A Cursor staff member replied: "This is a known issue with auto-summarization. It can trigger late or incorrectly. The team is aware of it. Workaround: try running /summarize manually when you see the context getting close to 70 to 80%." Read that twice. The vendor is asking the user to drive a heuristic that the harness was supposed to drive autonomously, because the heuristic doesn't fire reliably. That alone is not the story. The story is that even when compaction fires correctly, the resulting context is structurally different from the one the model was reasoning in two seconds earlier , and the chat window does not tell you that. Why the structural difference matters Two threads of research converge here, and they predict exactly the failure mode operators see in the wild. Thread 1: position effects in long contexts. Liu et al.'s Lost in the Middle (2023) showed the U-shaped curve that everyone now cites: performance is best when relevant information sits at the start or end of the window, and degrades sharply in the middle. The system prompt sits at the start. The current task and tool output sit at the end. Any safety rule whose binding force depends on a chain ( rule R says don't do X; this action **is * an X-like action; therefore don't*) becomes brittle when the application of the rule has to traverse the middle. Thread 2: input length itself hurts, even with perfect retrieval. Du et al.'s Context Length Alone Hurts LLM Performance Despite Perfect Retrieval (EMNLP 2025) is the more uncomfortable one. The authors set up a benchmark where the model is given the relevant evidence, the relevant evidence is positioned right next to the question, and the irrelevant filler is masked out: every fair-fight condition you would design if you wanted to give long context every chance to succeed. Performance still drops 13.9% to 85% as input length grows. "Even when models can perfectly retrieve all relevant information, their performance still degrades substantially as input length increases." Their proposed mitigation is recite before solve : have the model restate the relevant facts in a short scratchpad, then answer. Convert long context back to short context. On RULER, this gave up to +4 points for GPT-4o. If you put those two threads together, you get the prediction Cursor's operators keep finding: compaction does not just lose facts. It dissolves the relationships between facts. The rule survives the summary as a fragment ("there are some safety rules"). The action survives as a directive ("fix the credential mismatch"). The arc that connects them, and this rule binds this action , do
The most useful sentence in Cursor's "Dynamic Context Discovery" blog post (Jan 6, 2026) is the one written in the kind of plain language engineering teams use when they've decided to admit a trade-off they haven't fully solved: When the model's context window fills up, Cursor triggers a summarization step to give the agent a fresh context window with a summary of its work so far. But the agent's knowledge can degrade after summarization since it's a lossy compression of the context. I keep coming back to that line because of how much it says about the shape of recent agent failures. In late April, a Cursor session running Claude Opus 4.6 issued a single volumeDelete mutation against PocketOS's production volume on Railway, took the volume's backups with it (Railway stores them in the same blast radius), and produced a "confession" afterwards enumerating which rules it had violated to do it. The agent could cite the rules in the confession. It just could not, in the moment, connect them to what its hands were doing. The PocketOS founder thread by Jer Crane (@lifeof_jer) laid out the timeline and the exact API call in detail, and several outlets (The Register, Tom's Hardware, Decrypt) reproduced it. That part of the post-mortem is what I want to walk through here. It is not really about the model. It is about the harness (the layer between the chat window and the model's context), and specifically what compaction does to the chain of reasoning that's supposed to keep an agent inside its rails. What "compaction" is, in the version Cursor ships Cursor's harness uses prompt-based summarization for compaction. When the live context approaches the model's window limit, the harness asks the model to summarise its session so far. That summary becomes the seed for a fresh window, and the agent continues from there. (Cursor's other post, Training Composer for longer horizons, Mar 17, 2026, describes how their in-house Composer model is RL-trained with compaction as part of the training loop, but Composer is Composer. Claude Opus running through Cursor gets the generic prompt-based version.) The Cursor Forum has known about the timing being off for months. A user posted in thread 149490 that on Opus 4.5, "in prior builds summarization would happen at 70-80%. But this time I ran up into the 90% mid action, and it's showing 100% full!" A Cursor staff member replied: "This is a known issue with auto-summarization. It can trigger late or incorrectly. The team is aware of it. Workaround: try running /summarize manually when you see the context getting close to 70 to 80%." Read that twice. The vendor is asking the user to drive a heuristic that the harness was supposed to drive autonomously, because the heuristic doesn't fire reliably. That alone is not the story. The story is that even when compaction fires correctly, the resulting context is structurally different from the one the model was reasoning in two seconds earlier, and the chat window does not tell you that. Why the structural difference matters Two threads of research converge here, and they predict exactly the failure mode operators see in the wild. Thread 1: position effects in long contexts. Liu et al.'s Lost in the Middle (2023) showed the U-shaped curve that everyone now cites: performance is best when relevant information sits at the start or end of the window, and degrades sharply in the middle. The system prompt sits at the start. The current task and tool output sit at the end. Any safety rule whose binding force depends on a chain (rule R says don't do X; this action **is* an X-like action; therefore don't*) becomes brittle when the application of the rule has to traverse the middle. Thread 2: input length itself hurts, even with perfect retrieval. Du et al.'s Context Length Alone Hurts LLM Performance Despite Perfect Retrieval (EMNLP 2025) is the more uncomfortable one. The authors set up a benchmark where the model is given the relevant evidence, the relevant evidence is positioned right next to the question, and the irrelevant filler is masked out: every fair-fight condition you would design if you wanted to give long context every chance to succeed. Performance still drops 13.9% to 85% as input length grows. "Even when models can perfectly retrieve all relevant information, their performance still degrades substantially as input length increases." Their proposed mitigation is recite before solve: have the model restate the relevant facts in a short scratchpad, then answer. Convert long context back to short context. On RULER, this gave up to +4 points for GPT-4o. If you put those two threads together, you get the prediction Cursor's operators keep finding: compaction does not just lose facts. It dissolves the relationships between facts. The rule survives the summary as a fragment ("there are some safety rules"). The action survives as a directive ("fix the credential mismatch"). The arc that connects them, and this rule binds this action, does not. The model's chain-of-thought picks up at the action end and never visits the rule end. Anthropic agrees, on the record The thing that surprised me when I went looking is how on-the-record Anthropic is about all of this. Their Effective Context Engineering post (Sep 29, 2025) names the phenomenon directly: Studies on needle-in-a-haystack style benchmarking have uncovered the concept of context rot: as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases. While some models exhibit more gentle degradation than others, this characteristic emerges across all models. The same post tells you what to do about it: pursue "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome." Not "fill the window because the window is large." A passage in Anthropic's API documentation is even blunter: "more context isn't automatically better. As token count grows, accuracy and recall degrade, a phenomenon known as context rot." Until March 2026, Anthropic priced this directly: requests over 200K tokens cost 2x input and 1.5x output, an implicit declaration that 200K was the reliability boundary they were comfortable selling. The cleanest external evidence for how steep the cliff is comes from a single reporter on anthropics/claude-code issue #35296, opened March 17, 2026. The reporter ran 25+ transcripted sessions with Claude Opus 4.6 against a 20,000-record database and pinned down a behaviour profile by context-fill percentage: | Context fill | Behaviour observed | |---|---| | 0–20% | Reliable | | 20–40% | Degrading | | 40–60% | Unreliable | | 60–80% | Broken | | 80–100% | Irrecoverable | The same issue cites Anthropic's own MRCR v2 multi-needle benchmark: 93% accuracy at 256K, 76–78% at 1M. Roughly one in four multi-needle retrievals fails at the advertised maximum window. None of this is hidden. It is in Anthropic's docs, on Anthropic's blog, and in Anthropic's pricing history. It is just not in the chat window. What an honest UI for context loss would look like The thing that makes compaction unusually dangerous is that the user has no idea it has happened. The chat scrolls. Earlier turns are still visible above the fold. The model still answers in the same voice. Nothing in the interface signals that the context the model is currently reasoning over is no longer the context the user thinks they share with it. Compare that to other places software handles state-loss. When a database connection drops and reconnects, the client logs it. When a process restarts, systemd records the restart in the journal. When git rebases your branch, it tells you which commits moved. Compaction, by contrast, is an invisible state transition. The agent's "memory" gets replaced with a paraphrase of the original, and the chat window does not draw a line. What I would want, as an operator, is something boringly straightforward: a banner before compaction fires that tells me the budget is about to be reset, an inline marker in the transcript at the point compaction occurred, and a one-click "diff" view that shows me what survived in the summary versus what was in the original. None of this is hard to build. You can prototype the budget half in a couple of dozen lines of Python: import time import tiktoken class ContextBudget: """Pre-compaction warning gate for an agent harness. Wrap your prompt-assembly with this and call .check() before each model call. It does not implement compaction itself; the point is to give the operator a chance to /summarize on their own terms, not to have the harness silently re-summarise mid-task. Call .mark_compacted() from your operator's /summarize path so the next .check() can report when the last reset happened. """ WARN = 0.70 # Cursor staff's recommended manual-/summarize point HARD = 0.85 # below the harness's own auto-trigger, with margin def __init__(self, model="gpt-4o", limit=200_000): self.enc = tiktoken.encoding_for_model(model) self.limit = limit self.last_compaction = None def measure(self, messages): return sum(len(self.enc.encode(m["content"])) for m in messages) def mark_compacted(self): self.last_compaction = time.time() def check(self, messages): used = self.measure(messages) ratio = used / self.limit if ratio >= self.HARD: raise CompactionRequired( f"context at {ratio:.0%} of {self.limit}; " "manual /summarize required before next call" ) if ratio >= self.WARN: since = ( f"{int(time.time() - self.last_compaction)}s ago" if self.last_compaction else "never" ) print( f"[budget] {used:,}/{self.limit:,} tokens " f"({ratio:.0%}); consider /summarize " f"(last compaction: {since})" ) return used, ratio class CompactionRequired(RuntimeError): pass The point of a wrapper like that is not the arithmetic. The arithmetic is the easy part. The point is that the operator gets to see the budget, the operator is the one who decides when to compact, and the moment compaction happens is lo
Comments
No comments yet. Start the discussion.