Proposal: Use semantic compression as input diffusion to read sessions larger than the context window [R]
Problem: Long AI Sessions and Non-Local Information
I've been trying to come up with a solution for keeping extremely long AI sessions coherent. Sometimes there is too much substance to risk compaction. With all the buzz around diffusion, it got me thinking: what if we treat the context like a progressive render, blurry → sharp?
The practical way to make text "blurry" is compression. This is a diffusion-inspired system that borrows the coarse-to-fine process, not the formal math. It uses semantic compression so the overall structure of the session stays intact.
Proposed Workflow
Read the compressed version first to build an outline. Then read progressively less compressed slices until you're reading small verbatim chunks that give full detail. You're basically using compression as noise on the input side, then progressively building an output.
Each slice is compressed to fit within the context window, so the model only ever needs to read the current slice + input + current output. Tell the model what pass it's on, so it knows whether to write an outline or add detail.
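To make the workflow above concrete, here is a minimal sketch of the pass loop. Everything in it is an assumption for illustration, not the repo's implementation: `toy_compress` is a crude stand-in for real semantic compression (it just samples evenly spaced words), the budget is counted in words rather than tokens, the slice count doubles on each pass, and `model` is any hypothetical callable that takes a prompt string and returns the updated output.

```python
from typing import Callable, List


def toy_compress(text: str, budget: int) -> str:
    """Stand-in for semantic compression (assumption): keep evenly
    spaced words so the result fits in `budget` words."""
    words = text.split()
    if len(words) <= budget:
        return text  # already fits, so this slice is read verbatim
    step = len(words) / budget
    return " ".join(words[int(i * step)] for i in range(budget))


def build_passes(session: str, budget: int, levels: int) -> List[List[str]]:
    """Pass k splits the session into 2**k slices and compresses each one
    to the budget. Later passes cover less text per slice, so they are
    less compressed, ending in verbatim chunks once a slice fits."""
    words = session.split()
    passes = []
    for level in range(levels):
        n_slices = 2 ** level
        size = max(1, -(-len(words) // n_slices))  # ceiling division
        slices = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
        passes.append([toy_compress(s, budget) for s in slices])
    return passes


def run_passes(session: str, task: str, model: Callable[[str], str],
               budget: int = 50, levels: int = 3) -> str:
    """Feed one slice at a time, telling the model which pass and slice
    it is on so it knows whether to outline or add detail."""
    output = ""
    passes = build_passes(session, budget, levels)
    for level, slices in enumerate(passes):
        for idx, piece in enumerate(slices):
            prompt = (
                f"PASS {level + 1}/{len(passes)} SLICE {idx + 1}/{len(slices)}\n"
                f"TASK: {task}\n"
                f"CURRENT OUTPUT:\n{output}\n"
                f"SLICE:\n{piece}"
            )
            output = model(prompt)  # the model returns the revised output
    return output
```

With a 400-word session, a 50-word budget, and 4 levels, the first pass is a single 8x-compressed slice and the last pass is eight verbatim 50-word chunks. The prompt never holds more than one slice, the task, and the current output, which is the property the workflow relies on.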
What This Preserves
The thing I'm actually trying to preserve is what you'd call "non-local information." Think of it as stuff that surfaces when looking at the whole session and doesn't survive fragmented retrieval. Retrieval misses it and compaction deletes it, because it only exists in a holistic view.
Visual Demonstration
Here is a visual demonstration to get a general idea of the workflow:
https://dev-boz.github.io/diffusive-semantic-compression/demo/architecture-demo.html
Prior Art and Novelty
There is substantial overlap with lots of prior art. Recursive Language Models is one of the closest (source and output on disk, process recursively). I wrote most of this before I found RLM and nearly gave up before realising there was still a small part that was novel. As far as I can tell there's no exact match for this particular implementation. Please let me know if I've missed one.
The difference from regular masked diffusion is that the "noise" changes the length of the input rather than just masking parts of it. What seems to be new ground is using compression as noise and a position-aware process.
Testing Results
I've done some basic testing with small models like Qwen2.5 7B, mainly to see whether the idea is viable at all. The untrained models show that they can do each part (outline, refine, add detail) but they struggle with the full end-to-end process. There's occasional end-to-end success, but it's nowhere near reliable. On untrained models it also hasn't yet beaten a cheap dense read of the same document.
The main bet is that position-aware training changes that. I haven't been able to test that yet. I've published all the pre-registered failures, parser bugs I found, etc.
Evaluation Notes
Another note: the goal is preserving structure and nuance, but the tests so far measure planted facts and split-up numeric composition, mainly because the experiments needed answers you can actually score. The nuance evaluation is being designed but isn't ready yet.
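For anyone unfamiliar with these two kinds of scoreable test, here is a rough sketch of what they can look like. This is my illustration under stated assumptions, not the harness from the testing repo: `plant_facts`, `fact_recall`, and `composition_correct` are hypothetical names, recall is a plain substring match, and the composition check only asks whether the sum of the separately planted numbers appears as an integer in the answer.

```python
import random
import re
from typing import Dict, List


def plant_facts(filler: List[str], facts: Dict[str, str], seed: int = 0) -> str:
    """Scatter one sentence per fact at random positions in filler text."""
    rng = random.Random(seed)
    doc = list(filler)
    for key, value in facts.items():
        doc.insert(rng.randrange(len(doc) + 1), f"The {key} is {value}.")
    return " ".join(doc)


def fact_recall(answer: str, facts: Dict[str, str]) -> float:
    """Fraction of planted values that appear verbatim in the answer."""
    if not facts:
        return 0.0
    hits = sum(1 for value in facts.values() if value in answer)
    return hits / len(facts)


def composition_correct(answer: str, parts: List[int]) -> bool:
    """True if the answer contains the sum of numbers that were planted
    in separate places, matched as a whole integer."""
    total = str(sum(parts))
    return total in re.findall(r"-?\d+", answer)
```

Both checks give an unambiguous score, which is why they are easier to run than a nuance evaluation, and also why they only approximate the real goal.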
Next Steps and Collaboration
The next step is a small model fine-tune to test whether position-aware training helps. If you have the time to look at the idea, it really needs a prior art check from anyone who knows the diffusion-LM/long-context space. And if anyone wants to help expand the idea or contribute compute or collaboration for the fine-tune, please do.
Here is the repo for the proposal. Links to the testing repo and prior art are inside:
https://github.com/dev-boz/diffusive-semantic-compression