DEV Community
Grade 10
2h ago
[NEW] I spent 3 months teaching AI agents my codebase. They forgot by morning. Every. Single. Day.
Every morning for three months, the first thing I did was re-explain myself to my own tools. Not the code. The code was fine. I mean everything around it: what we'd decided last Tuesday, why we'd ruled out that approach, what the sprint was actually about, how we structured things and why. The context that makes work feel continuous instead of like starting a new job every 24 hours. Then one afternoon it tipped over. I watched my agent propose — in detail, with genuine confidence — the exact API structure I'd shot down two weeks earlier. Same reasoning, same blind spots. It had been part of our conversation. It just didn't remember. I closed my laptop and took a walk. This is my first time writing about this publicly, so bear with me. Some context: I'm a backend engineer, about six years in, currently at one of the big tech companies. A few months before all this, leadership had decided we were going all-in on AI coding tools. The memo used words like "operational efficiency" and "headcount optimization." What it meant was: do more with less, and here's a tool to help you justify that. And look — the tools earned it. Claude Code genuinely impressed me. Real code, fast, often better than what I'd have written in the same time. But there was a cost nobody had put a name to yet. Every session was a first date. Twenty minutes of re-orientation, then maybe an hour of flow, then the context window starts creaking and you're watching your agent lose the thread. Every. Single. Day. The productivity gains were real, but so was this invisible overhead. And it kept adding up. I looked for solutions. CLAUDE.md and .cursorrules work great — until they don't. After a couple of months you've got a file full of contradictions and outdated decisions, and nobody's sure what's still true. Cloud memory products want a monthly subscription and your data on their servers. The self-hosted options I found felt like deploying actual infrastructure just to solve a workflow problem. None of it did what I wanted. I didn't want a bigger context window. I wanted something that built an understanding of how I work and got better at it over time. Like a colleague who's been on the project for six months, not a contractor reading onboarding docs for the first time. I couldn't find that. So I started building it. The first version was a bucket. You save something, you find it later. Simple. It worked, until it didn't. After a few weeks I had hundreds of notes and the useful ones were drowning in noise — old decisions I'd reversed, half-thoughts that never went anywhere, preferences I'd changed my mind about. The more I saved, the worse retrieval got. Exactly backwards. Then I noticed something. When I watched the agent pull memories, it kept surfacing the same handful of things — and ignoring most of the rest. The stuff that was actually helping me kept getting used. The noise sat untouched. That observation took me a while to act on. But when I did, it changed how the whole thing worked. I built a feedback loop. Every time a memory gets retrieved, it gets rated — did this actually help? Useful ones score higher. Ones that get pulled but don't contribute start to sink. After enough misses, they quietly disappear. The system doesn't just grow. It filters itself. Six months after that first painful afternoon, the tool I'm using every day barely resembles the bucket I started with. It's not just fuller. It's sharper — the things I verified last month sit higher than a passing thought from week one, and dead ends clean themselves up without me manually pruning anything. Most of the rest came from things annoying me until I fixed them. I added a dashboard when I couldn't tell what was even in the store. Auto-linking when I realized related memories should find each other — if I save something about our API conventions and something about our error-handling patterns, those should probably know about each other. Namespaces when I started running multiple agents and they kept stomping on each other's notes. Nothing was planned. Every feature has a specific frustration behind it. The benchmarks are decent — 96.6% recall in the top 5 results, 84.6% on the first result, at 33ms on an open-source embedding model anyone can run locally. No cloud, no proprietary model, no ongoing fees. But the number I actually track is simpler: does a session today feel better than a session from six months ago? Yes. A lot better. One thing I didn't expect: the agents helped build it. The same tools I was building memory for were also my primary collaborators building it. When search came back noisy, we tuned the weights together. When the tool schema felt clunky, they complained about it through how they used it — wrong calls, workarounds, weirdly phrased queries — and we fixed it. The product got shaped by the things that were breaking for them. I still think that's a little strange. But it worked. It's called Lorekeeper . Apache 2.0, runs locally. pip install lorekee
Every morning for three months, the first thing I did was re-explain myself to my own tools. Not the code. The code was fine. I mean everything around it: what we'd decided last Tuesday, why we'd ruled out that approach, what the sprint was actually about, how we structured things and why. The context that makes work feel continuous instead of like starting a new job every 24 hours. Then one afternoon it tipped over. I watched my agent propose — in detail, with genuine confidence — the exact API structure I'd shot down two weeks earlier. Same reasoning, same blind spots. It had been part of our conversation. It just didn't remember. I closed my laptop and took a walk. This is my first time writing about this publicly, so bear with me. Some context: I'm a backend engineer, about six years in, currently at one of the big tech companies. A few months before all this, leadership had decided we were going all-in on AI coding tools. The memo used words like "operational efficiency" and "headcount optimization." What it meant was: do more with less, and here's a tool to help you justify that. And look — the tools earned it. Claude Code genuinely impressed me. Real code, fast, often better than what I'd have written in the same time. But there was a cost nobody had put a name to yet. Every session was a first date. Twenty minutes of re-orientation, then maybe an hour of flow, then the context window starts creaking and you're watching your agent lose the thread. Every. Single. Day. The productivity gains were real, but so was this invisible overhead. And it kept adding up. I looked for solutions. CLAUDE.md and .cursorrules work great — until they don't. After a couple of months you've got a file full of contradictions and outdated decisions, and nobody's sure what's still true. Cloud memory products want a monthly subscription and your data on their servers. The self-hosted options I found felt like deploying actual infrastructure just to solve a workflow problem. None of it did what I wanted. I didn't want a bigger context window. I wanted something that built an understanding of how I work and got better at it over time. Like a colleague who's been on the project for six months, not a contractor reading onboarding docs for the first time. I couldn't find that. So I started building it. The first version was a bucket. You save something, you find it later. Simple. It worked, until it didn't. After a few weeks I had hundreds of notes and the useful ones were drowning in noise — old decisions I'd reversed, half-thoughts that never went anywhere, preferences I'd changed my mind about. The more I saved, the worse retrieval got. Exactly backwards. Then I noticed something. When I watched the agent pull memories, it kept surfacing the same handful of things — and ignoring most of the rest. The stuff that was actually helping me kept getting used. The noise sat untouched. That observation took me a while to act on. But when I did, it changed how the whole thing worked. I built a feedback loop. Every time a memory gets retrieved, it gets rated — did this actually help? Useful ones score higher. Ones that get pulled but don't contribute start to sink. After enough misses, they quietly disappear. The system doesn't just grow. It filters itself. Six months after that first painful afternoon, the tool I'm using every day barely resembles the bucket I started with. It's not just fuller. It's sharper — the things I verified last month sit higher than a passing thought from week one, and dead ends clean themselves up without me manually pruning anything. Most of the rest came from things annoying me until I fixed them. I added a dashboard when I couldn't tell what was even in the store. Auto-linking when I realized related memories should find each other — if I save something about our API conventions and something about our error-handling patterns, those should probably know about each other. Namespaces when I started running multiple agents and they kept stomping on each other's notes. Nothing was planned. Every feature has a specific frustration behind it. The benchmarks are decent — 96.6% recall in the top 5 results, 84.6% on the first result, at 33ms on an open-source embedding model anyone can run locally. No cloud, no proprietary model, no ongoing fees. But the number I actually track is simpler: does a session today feel better than a session from six months ago? Yes. A lot better. One thing I didn't expect: the agents helped build it. The same tools I was building memory for were also my primary collaborators building it. When search came back noisy, we tuned the weights together. When the tool schema felt clunky, they complained about it through how they used it — wrong calls, workarounds, weirdly phrased queries — and we fixed it. The product got shaped by the things that were breaking for them. I still think that's a little strange. But it worked. It's called Lorekeeper. Apache 2.0, runs locally. pip install lorekeeper-mcp lorekeeper setup lorekeeper Tell your agent to remember something. Come back tomorrow and ask for it. Then come back in a month and notice how differently it finds things. That's the test I actually care about. Not the benchmark number — the feeling of working with something that knows your project. That's what I was trying to build when I took that walk. Give it a try. See if it sticks. GitHub: https://github.com/Jessinra/Lorekeeper Docs: https://jessinra.github.io/Lorekeeper/ Apache 2.0. Built by agents, for agents. Top comments (0)
Comments
No comments yet. Start the discussion.