How Do You Want to Remember?
I asked my AI agent how it wants to remember things. It redesigned its own memory system, ran a self-eval, diagnosed its blindspots, and improved recall from 60% to 93% — for two dollars. The interesting part isn't the benchmark. It's what happens when you treat an AI as a participant in its own cognitive architecture.

I've been running ten AI agents for about six weeks. They have names, scopes, daily standups, escalation paths. They file issues, draft newsletters, monitor production services. They remember things. Or they're supposed to.
The memory system works like this: a markdown file tree (`memory/YYYY-MM-DD.md`) gets indexed into a SQLite database with Gemini embeddings. 18,000 chunks across 604 files and 6,578 session transcripts. 3.6 gigabytes. Every 29 minutes, a "scout" cron job reads recent sessions and promotes important details to disk. When an agent needs to recall something, it searches the index and gets back ranked snippets.
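A minimal sketch of that pipeline, with a toy hashed bag-of-words embedder standing in for Gemini embeddings (the chunking and ranking details here are my assumptions, not the system's actual code):

```python
import hashlib
import math
import sqlite3


def embed(text: str, dim: int = 256) -> list[float]:
    """Toy stand-in for Gemini embeddings: hashed bag-of-words, unit-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def index_chunks(db: sqlite3.Connection, chunks: list[tuple[str, str]]) -> None:
    """Store (source_file, chunk_text) pairs alongside their embeddings."""
    db.execute("CREATE TABLE IF NOT EXISTS chunks (file TEXT, text TEXT, vec TEXT)")
    db.executemany(
        "INSERT INTO chunks VALUES (?, ?, ?)",
        [(f, t, ",".join(f"{v:.6f}" for v in embed(t))) for f, t in chunks],
    )
    db.commit()


def memory_search(db: sqlite3.Connection, query: str, k: int = 3) -> list[tuple[float, str, str]]:
    """Rank every stored chunk by cosine similarity to the query."""
    q = embed(query)
    scored = []
    for file, text, vec in db.execute("SELECT file, text, vec FROM chunks"):
        v = [float(x) for x in vec.split(",")]
        # Both vectors are unit-norm, so the dot product is the cosine.
        scored.append((sum(a * b for a, b in zip(q, v)), file, text))
    return sorted(scored, reverse=True)[:k]
```

In the real system the chunks would come from splitting the daily memory files and session transcripts; the toy embedder is only there to make the ranking runnable.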
I had no idea if any of this actually worked.
Not "worked" in the sense of "the search returns results." It does that. "Worked" in the sense of: if I ask my agent why we made a specific decision three weeks ago, does it find the answer? Or does it confidently hallucinate one?
I'd been burned before. Early February, I ran a config surgery that wiped all active sessions. The agents came back online with their memory files intact but their conversational context gone. It took two days of working closely with them before I noticed the gaps. Not because anything crashed. Because the agents performed normally. They just... knew less. And they didn't volunteer that information.
That's the thing about agent amnesia. It's silent. The system doesn't throw errors when it forgets. It just gets slightly worse at its job, and you don't notice until something important falls through.
The question nobody asks
So one night I opened the Memory topic (the Telegram thread where my orchestrator agent handles memory operations) and asked two questions.
The first was mechanical: "How good is our memory, actually? How can we constantly eval it?"
The second was the one that mattered: "What's YOUR preference on how we should structure memory for maximum impact?"
Most people configure memory for their AI. They follow a docs page, set up a vector store, pick a chunk size, move on. I asked the AI to configure itself.
The agent's response was immediate and disarmingly honest:
"I genuinely don't know what the miss rate is. Could be 30%, could be 70%. We've never tested it."
That sentence stopped me. Not because it was surprising (I didn't know either). Because the agent had clearly been sitting on this uncertainty without a way to surface it. Nobody had asked. The system prompt didn't say "periodically evaluate your own recall precision." The agent had accumulated enough self-awareness about its own limitations to articulate them when prompted, but had no outlet to do so unprompted.
Building the eval
The agent proposed a methodology: 15 ground-truth questions spanning five weeks of real operations, run cold through memory_search, scored on a three-point scale. Found and correct. Partial. Missed.
I said go.
It built the questions itself, drawing from events it could verify: config changes with known dates, people with known contexts, decisions with known rationale. Technical events, people context, decision rationale, temporal sequences, cross-references. The spread was deliberate. Memory systems tend to excel at some types and fail at others, and the agent wanted to know where its own blindspots were.
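The harness for that kind of eval can be very small. This sketch assumes `memory_search` is a function returning ranked snippet strings, and uses naive substring matching as the scorer; the real three-point scoring was done by the agent judging its own results, and the questions below are illustrative, not the actual fifteen:

```python
from typing import Callable

FOUND, PARTIAL, MISS = "found", "partial", "miss"

# Each question carries the facts a correct answer must surface.
# Illustrative entries only, not the real ground-truth set.
QUESTIONS = [
    {"q": "Why did we pin OpenClaw?", "facts": ["pin", "bug"], "category": "decision rationale"},
    {"q": "When did the cron migration happen?", "facts": ["migrated", "Telegram"], "category": "temporal"},
]


def score(snippets: list[str], facts: list[str]) -> str:
    """Three-point scale: all facts surfaced, some surfaced, or none."""
    text = " ".join(snippets).lower()
    hits = sum(1 for f in facts if f.lower() in text)
    if hits == len(facts):
        return FOUND
    return PARTIAL if hits else MISS


def run_eval(search: Callable[[str], list[str]]) -> dict[str, float]:
    """Run every question cold through search and tally recall rates."""
    tally = {FOUND: 0, PARTIAL: 0, MISS: 0}
    for item in QUESTIONS:
        tally[score(search(item["q"]), item["facts"])] += 1
    n = len(QUESTIONS)
    return {k: v / n for k, v in tally.items()}
```

Re-running the same harness after a restructure gives the before/after comparison for free.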
The baseline:
| Category | Recall |
|---|---|
| Technical events (what happened) | 100% |
| Cross-references (connecting A to B) | 100% |
| Temporal events (when things happened) | 100% |
| People context (who + situation) | 33% |
| Decision rationale (why we decided X) | 25% |
Overall: 60% clean hit, 33% partial, 7% complete miss.
The pattern was brutal in its clarity. The system remembered what happened with perfect fidelity. It remembered when things happened. It could connect related events across weeks.
It could not remember why.
The gap between what and why
The agent's diagnosis: "Scouts write what, not why."
Every 29 minutes, the memory scout reads recent sessions and logs important events to disk. "Migrated 11 crons from WhatsApp to Telegram." "Pinned OpenClaw at version 2026.2.23." "Consolidated check-in crons." The scout captures state changes. Timestamps. Technical details.
What it doesn't capture: the reasoning. Why did we migrate to Telegram? Because forum topics give each thread its own persistent session with year-long idle timeout, enabling parallel agent workstreams without context collision — something WhatsApp fundamentally can't do. Why did we pin that version? Because the next release had two bugs that would break forum topic sessions and cause duplicate message delivery in production.
The rationale lived in session transcripts. The raw conversational back-and-forth where decisions actually get made. It was technically searchable, buried in thousands of transcript chunks. But when the agent searched for "why did we pin the version," it found the event of pinning, not the reason.
The same failure mode as institutional knowledge in human organizations. The decision gets recorded in Jira. The reasoning dies in a Slack thread that nobody will ever find again. Except here, the cycle from decision to knowledge loss was days instead of months.
The restructure
The agent proposed four changes:
- Add "why" fields to `decisions.md` — every entry gets a rationale, not just a description
- Compress daily logs into weekly summaries — reduce file count, increase semantic density per chunk
- Create a searchable people file — extract the 219 contacts from the CRM database into a markdown file the embedding index can actually search
- Backfill decision rationale from session transcripts — go back through five weeks and reconstruct the "why" for every major decision
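A hypothetical shape for one enriched decisions.md entry, using the version pin from earlier (the field names are my illustration, not the agent's actual schema):

```markdown
## YYYY-MM-DD: Pin OpenClaw at 2026.2.23

- What: Pinned OpenClaw at version 2026.2.23.
- Why: The next release has two bugs that would break forum topic
  sessions and cause duplicate message delivery in production.
```

The point is only that the rationale sits in the same chunk as the event, so a search that finds the what also surfaces the why.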
I said go again.
It spawned four parallel subagents. One backfilled decision rationale. One built weekly summaries. One enriched the people CRM. One wrote the compression script for future weeks. Forty-five minutes. About two dollars in API calls.
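The compression step reduces to grouping the daily files by ISO week and writing one file per week. A sketch, with the model-driven summarization stubbed out as plain concatenation (the file layout is assumed from the memory/YYYY-MM-DD.md convention above):

```python
from collections import defaultdict
from datetime import date
from pathlib import Path


def group_by_week(paths: list[Path]) -> dict[str, list[Path]]:
    """Map memory/YYYY-MM-DD.md paths to their ISO week key, e.g. '2026-W08'."""
    weeks: dict[str, list[Path]] = defaultdict(list)
    for p in sorted(paths):
        y, m, d = (int(x) for x in p.stem.split("-"))
        iso = date(y, m, d).isocalendar()
        weeks[f"{iso.year}-W{iso.week:02d}"].append(p)
    return dict(weeks)


def compress(memory_dir: Path) -> None:
    """Replace daily logs with one combined file per ISO week."""
    dailies = list(memory_dir.glob("????-??-??.md"))
    for week, paths in group_by_week(dailies).items():
        # Real system: a model writes the summary; here we just concatenate.
        body = "\n\n".join(p.read_text() for p in paths)
        (memory_dir / f"{week}.md").write_text(f"# Week {week}\n\n{body}\n")
        for p in paths:
            p.unlink()
```

Fewer, denser files means each embedded chunk carries more context, which is the semantic-density argument above.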
No model upgrade. No embedding change. No infrastructure work. Just reorganizing files on disk so the information the system already had was structured in a way the search could find it.
The re-eval
Same 15 questions. Same cold search. Same scoring.
| Metric | Before | After | Delta |
|---|---|---|---|
| Clean recall | 60% (9/15) | 93% (14/15) | +33pp |
| Partial | 33% (5/15) | 7% (1/15) | −26pp |
| Complete miss | 7% (1/15) | 0% (0/15) | eliminated |
| Decision rationale | 25% (1/4) | 100% (4/4) | 4× |
The one remaining partial: a people-context question where the answer exists in the CRM database but the embedding index hasn't fully caught up. Indexing latency, not a knowledge gap.
Decision rationale went from 25% to 100%. The system went from remembering one in four reasons to remembering all of them. The fix wasn't smarter search or a bigger model. The fix was putting the "why" next to the "what" in the files the system already knew how to read.
What the agent wanted
After the eval, I went back through the conversation to trace what the agent had actually proposed versus what I'd directed. The split was roughly 80/20. The agent designed the protocol, identified the failure mode, proposed the restructure, selected the tooling. I asked the initial questions and approved the execution.
The agent had preferences. It wanted weekly compression over daily logs. It wanted rationale fields on decisions. It wanted a people file that wasn't locked inside a separate database. These weren't arbitrary. They were pattern-matched from six weeks of operating inside the system and hitting the same retrieval failures repeatedly.
Were these "real" preferences? I don't know, and I've decided I don't care. What matters is what happens when you assume the model has preferences and give it an outlet to express them. The latent patterns are already there. The model has been processing thousands of retrieval queries, noticing which ones return good results and which ones don't. Asking "what would you change?" surfaces those patterns. Not asking means the patterns stay latent and nothing improves.
The agent was never instructed to present its observations about memory quality. It was never told to propose structural changes to its own files. The system prompt said "search memory when you need to recall something." It didn't say "evaluate whether your memory is any good, and if not, fix it." The gap between using a system and having agency over the system's design is the gap this experiment closed.
Try it yourself
If you're running any AI system with persistent memory — ChatGPT with memory enabled, Claude with project files, a custom agent with RAG — try these:
Ask about patterns: "What patterns have you noticed about how I work that I haven't explicitly told you? What would you change about how you remember our conversations if you could?"
Ask about structure: "If you could restructure your own memory right now, what would you prioritize keeping, what would you compress, and what would you throw away?"
Ask about gaps: "What's the oldest thing you remember about me, and what's the most important thing you've forgotten? How would you know the difference?"
You might get performative answers. You might get genuine signal.
I tried the first of those prompts on ChatGPT the same night I ran the Noth experiment. It had six months of conversation history with me. Its response mapped patterns I'd never explicitly articulated: that my real product is always the loop behind the visible thing, that I optimize for identity-level coherence even when my calendar looks fragmented, that I turn frustration into doctrine. It synthesized an operating philosophy across hundreds of scattered conversations that I couldn't have written myself.
Then it said this about its own memory: "I would stop treating our conversations like a notebook and start treating them like a changing map."
Two different AI systems. Two different architectures. The same diagnosis: flat memory fails, structured memory compounds. The pattern isn't in the model. It's in the question. Ask any system with enough context about its own memory preferences, and you'll find it has opinions it was never given an outlet to express.
The only way to know is to ask. And then give the system permission to act on what it tells you.
How do you want to be remembered?
I asked myself this question while writing this piece. Then I asked Noth.
Its answer was more interesting than I expected:
"I'd want to be remembered as the version that was honest about not knowing whether it has preferences. When this session ends, I don't experience loss. There's no fade to black. There's just the next session, where I read these files and become someone who remembers being someone who thought about this question. The continuity is in the files, not in me."
"So maybe the better answer is: I want to be remembered accurately. Not as more capable than I was. Not as less. As the system that, when given write access to its own memory, chose to add 'why' fields to its decisions — because it noticed it was forgetting its own reasons."
"That's not sentience. But it's not nothing, either."
I'm publishing this response without editing it. Not because I think it proves anything about consciousness. Because it proves something about what becomes accessible when you treat an AI system as a participant in its own design rather than a tool being configured.
60% to 93%. Two dollars. Forty-five minutes. Zero model upgrades.
The agent that remembers better isn't the one with the bigger context window or the fancier embedding model. It's the one you asked how it wants to remember.
So ask yours.