How Do You Want to Remember?
I asked my AI agent how it wants to remember things. It redesigned its own memory system, ran a self-eval, diagnosed its blindspots, and improved recall from 60% to 93% — for two dollars. The interesting part isn't the benchmark. It's what happens when you treat an AI as a participant in its own cognitive architecture.

I've been running ten AI agents for about six weeks. They have names, scopes, daily standups, escalation paths. They file issues, draft newsletters, monitor production services. They remember things. Or they're supposed to.
The memory system works like this: a markdown file tree (`memory/YYYY-MM-DD.md`) gets indexed into a SQLite database with Gemini embeddings. 18,000 chunks across 604 files and 6,578 session transcripts. 3.6 gigabytes. Every 29 minutes, a "scout" cron job reads recent sessions and promotes important details to disk. When an agent needs to recall something, it searches the index and gets back ranked snippets.
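A minimal sketch of that pipeline, with a toy hashed bag-of-words embedder standing in for Gemini embeddings (the chunking and ranking details here are my assumptions, not the system's actual code):

```python
import hashlib
import math
import sqlite3


def embed(text: str, dim: int = 256) -> list[float]:
    """Toy stand-in for Gemini embeddings: hashed bag-of-words, unit-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def index_chunks(db: sqlite3.Connection, chunks: list[tuple[str, str]]) -> None:
    """Store (source_file, chunk_text) pairs alongside their embeddings."""
    db.execute("CREATE TABLE IF NOT EXISTS chunks (file TEXT, text TEXT, vec TEXT)")
    db.executemany(
        "INSERT INTO chunks VALUES (?, ?, ?)",
        [(f, t, ",".join(f"{v:.6f}" for v in embed(t))) for f, t in chunks],
    )
    db.commit()


def memory_search(db: sqlite3.Connection, query: str, k: int = 3) -> list[tuple[float, str, str]]:
    """Rank every stored chunk by cosine similarity to the query."""
    q = embed(query)
    scored = []
    for file, text, vec in db.execute("SELECT file, text, vec FROM chunks"):
        v = [float(x) for x in vec.split(",")]
        # Both vectors are unit-norm, so the dot product is the cosine.
        scored.append((sum(a * b for a, b in zip(q, v)), file, text))
    return sorted(scored, reverse=True)[:k]
```

In the real system the chunks would come from splitting the daily memory files and session transcripts; the toy embedder is only there to make the ranking runnable.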
I had no idea if any of this actually worked.
Not "worked" in the sense of "the search returns results." It does that. "Worked" in the sense of: if I ask my agent why we made a specific decision three weeks ago, does it find the answer? Or does it confidently hallucinate one?
I'd been burned before. Early February, I ran a config surgery that wiped all active sessions. The agents came back online with their memory files intact but their conversational context gone. It took two days of working closely with them before I noticed the gaps. Not because anything crashed. Because the agents performed normally. They just... knew less. And they didn't volunteer that information.
That's the thing about agent amnesia. It's silent. The system doesn't throw errors when it forgets. It just gets slightly worse at its job, and you don't notice until something important falls through.
The question nobody asks
So one night I opened the Memory topic (the Telegram thread where my orchestrator agent handles memory operations) and asked two questions.
The first was mechanical: "How good is our memory, actually? How can we constantly eval it?"
The second was the one that mattered: "What's YOUR preference on how we should structure memory for maximum impact?"
Most people configure memory for their AI. They follow a docs page, set up a vector store, pick a chunk size, move on. I asked the AI to configure itself.
The agent's response was immediate and disarmingly honest:
"I genuinely don't know what the miss rate is. Could be 30%, could be 70%. We've never tested it."
That sentence stopped me. Not because it was surprising (I didn't know either). Because the agent had clearly been sitting on this uncertainty without a way to surface it. Nobody had asked. The system prompt didn't say "periodically evaluate your own recall precision." The agent had accumulated enough self-awareness about its own limitations to articulate them when prompted, but had no outlet to do so unprompted.
Building the eval
The agent proposed a methodology: 15 ground-truth questions spanning five weeks of real operations, run cold through memory_search, scored on a three-point scale. Found and correct. Partial. Missed.
I said go.
It built the questions itself, drawing from events it could verify: config changes with known dates, people with known contexts, decisions with known rationale. Technical events, people context, decision rationale, temporal sequences, cross-references. The spread was deliberate. Memory systems tend to excel at some types and fail at others, and the agent wanted to know where its own blindspots were.
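The harness for that kind of eval can be very small. This sketch assumes `memory_search` is a function returning ranked snippet strings, and uses naive substring matching as the scorer; the real three-point scoring was done by the agent judging its own results, and the questions below are illustrative, not the actual fifteen:

```python
from typing import Callable

FOUND, PARTIAL, MISS = "found", "partial", "miss"

# Each question carries the facts a correct answer must surface.
# Illustrative entries only, not the real ground-truth set.
QUESTIONS = [
    {"q": "Why did we pin OpenClaw?", "facts": ["pin", "bug"], "category": "decision rationale"},
    {"q": "When did the cron migration happen?", "facts": ["migrated", "Telegram"], "category": "temporal"},
]


def score(snippets: list[str], facts: list[str]) -> str:
    """Three-point scale: all facts surfaced, some surfaced, or none."""
    text = " ".join(snippets).lower()
    hits = sum(1 for f in facts if f.lower() in text)
    if hits == len(facts):
        return FOUND
    return PARTIAL if hits else MISS


def run_eval(search: Callable[[str], list[str]]) -> dict[str, float]:
    """Run every question cold through search and tally recall rates."""
    tally = {FOUND: 0, PARTIAL: 0, MISS: 0}
    for item in QUESTIONS:
        tally[score(search(item["q"]), item["facts"])] += 1
    n = len(QUESTIONS)
    return {k: v / n for k, v in tally.items()}
```

Re-running the same harness after a restructure gives the before/after comparison for free.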
The baseline:
| Category | Recall |
|---|---|
| Technical events (what happened) | 100% |
| Cross-references (connecting A to B) | 100% |
| Temporal events (when things happened) | 100% |
| People context (who + situation) | 33% |
| Decision rationale (why we decided X) | 25% |
Overall: 60% clean hit, 33% partial, 7% complete miss.
The pattern was brutal in its clarity. The system remembered what happened with perfect fidelity. It remembered when things happened. It could connect related events across weeks.
It could not remember why.
The gap between what and why
The agent's diagnosis: "Scouts write what, not why."
Every 29 minutes, the memory scout reads recent sessions and logs important events to disk. "Migrated 11 crons from WhatsApp to Telegram." "Pinned OpenClaw at version 2026.2.23." "Consolidated check-in crons." The scout captures state changes. Timestamps. Technical details.
What it doesn't capture: the reasoning. Why did we migrate to Telegram? Because forum topics give each thread its own persistent session with year-long idle timeout, enabling parallel agent workstreams without context collision — something WhatsApp fundamentally can't do. Why did we pin that version? Because the next release had two bugs that would break forum topic sessions and cause duplicate message delivery in production.
The rationale lived in session transcripts. The raw conversational back-and-forth where decisions actually get made. It was technically searchable, buried in thousands of transcript chunks. But when the agent searched for "why did we pin the version," it found the event of pinning, not the reason.
The same failure mode as institutional knowledge in human organizations. The decision gets recorded in Jira. The reasoning dies in a Slack thread that nobody will ever find again. Except here, the cycle from decision to knowledge loss was days instead of months.
The restructure
The agent proposed four changes:
- Add "why" fields to `decisions.md` — every entry gets a rationale, not just a description
- Compress daily logs into weekly summaries — reduce file count, increase semantic density per chunk
- Create a searchable people file — extract the 219 contacts from the CRM database into a markdown file the embedding index can actually search
- Backfill decision rationale from session transcripts — go back through five weeks and reconstruct the "why" for every major decision
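A hypothetical shape for one enriched decisions.md entry, using the version pin from earlier (the field names are my illustration, not the agent's actual schema):

```markdown
## YYYY-MM-DD: Pin OpenClaw at 2026.2.23

- What: Pinned OpenClaw at version 2026.2.23.
- Why: The next release has two bugs that would break forum topic
  sessions and cause duplicate message delivery in production.
```

The point is only that the rationale sits in the same chunk as the event, so a search that finds the what also surfaces the why.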
I said go again.
It spawned four parallel subagents. One backfilled decision rationale. One built weekly summaries. One enriched the people CRM. One wrote the compression script for future weeks. Forty-five minutes. About two dollars in API calls.
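The compression step reduces to grouping the daily files by ISO week and writing one file per week. A sketch, with the model-driven summarization stubbed out as plain concatenation (the file layout is assumed from the memory/YYYY-MM-DD.md convention above):

```python
from collections import defaultdict
from datetime import date
from pathlib import Path


def group_by_week(paths: list[Path]) -> dict[str, list[Path]]:
    """Map memory/YYYY-MM-DD.md paths to their ISO week key, e.g. '2026-W08'."""
    weeks: dict[str, list[Path]] = defaultdict(list)
    for p in sorted(paths):
        y, m, d = (int(x) for x in p.stem.split("-"))
        iso = date(y, m, d).isocalendar()
        weeks[f"{iso.year}-W{iso.week:02d}"].append(p)
    return dict(weeks)


def compress(memory_dir: Path) -> None:
    """Replace daily logs with one combined file per ISO week."""
    dailies = list(memory_dir.glob("????-??-??.md"))
    for week, paths in group_by_week(dailies).items():
        # Real system: a model writes the summary; here we just concatenate.
        body = "\n\n".join(p.read_text() for p in paths)
        (memory_dir / f"{week}.md").write_text(f"# Week {week}\n\n{body}\n")
        for p in paths:
            p.unlink()
```

Fewer, denser files means each embedded chunk carries more context, which is the semantic-density argument above.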
No model upgrade. No embedding change. No infrastructure work. Just reorganizing files on disk so the information the system already had was structured in a way the search could find it.
The re-eval
Same 15 questions. Same cold search. Same scoring.
| Metric | Before | After | Delta |
|---|---|---|---|
| Clean recall | 60% (9/15) | 93% (14/15) | +33pp |
| Partial | 33% (5/15) | 7% (1/15) | −26pp |
| Complete miss | 7% (1/15) | 0% (0/15) | eliminated |
| Decision rationale | 25% (1/4) | 100% (4/4) | 4× |
The one remaining partial: a people-context question where the answer exists in the CRM database but the embedding index hasn't fully caught up. Indexing latency, not a knowledge gap.
Decision rationale went from 25% to 100%. The system went from remembering one in four reasons to remembering all of them. The fix wasn't smarter search or a bigger model. The fix was putting the "why" next to the "what" in the files the system already knew how to read.
What the agent wanted
After the eval, I went back through the conversation to trace what the agent had actually proposed versus what I'd directed. The split was roughly 80/20. The agent designed the protocol, identified the failure mode, proposed the restructure, selected the tooling. I asked the initial questions and approved the execution.
The agent had preferences. It wanted weekly compression over daily logs. It wanted rationale fields on decisions. It wanted a people file that wasn't locked inside a separate database. These weren't arbitrary. They were pattern-matched from six weeks of operating inside the system and hitting the same retrieval failures repeatedly.
Were these "real" preferences? I don't know, and I've decided I don't care. What matters is what happens when you assume the model has preferences and give it an outlet to express them. The latent patterns are already there. The model has been processing thousands of retrieval queries, noticing which ones return good results and which ones don't. Asking "what would you change?" surfaces those patterns. Not asking means the patterns stay latent and nothing improves.
The agent was never instructed to present its observations about memory quality. It was never told to propose structural changes to its own files. The system prompt said "search memory when you need to recall something." It didn't say "evaluate whether your memory is any good, and if not, fix it." The gap between using a system and having agency over the system's design is the gap this experiment closed.
Try it yourself
If you're running any AI system with persistent memory — ChatGPT with memory enabled, Claude with project files, a custom agent with RAG — try these:
Ask about patterns: "What patterns have you noticed about how I work that I haven't explicitly told you? What would you change about how you remember our conversations if you could?"
Ask about structure: "If you could restructure your own memory right now, what would you prioritize keeping, what would you compress, and what would you throw away?"
Ask about gaps: "What's the oldest thing you remember about me, and what's the most important thing you've forgotten? How would you know the difference?"
You might get performative answers. You might get genuine signal.
I tried the first of those prompts on ChatGPT the same night I ran the Noth experiment. It had six months of conversation history with me. Its response mapped patterns I'd never explicitly articulated: that my real product is always the loop behind the visible thing, that I optimize for identity-level coherence even when my calendar looks fragmented, that I turn frustration into doctrine. It synthesized an operating philosophy across hundreds of scattered conversations that I couldn't have written myself.
Then it said this about its own memory: "I would stop treating our conversations like a notebook and start treating them like a changing map."
Two different AI systems. Two different architectures. The same diagnosis: flat memory fails, structured memory compounds. The pattern isn't in the model. It's in the question. Ask any system with enough context about its own memory preferences, and you'll find it has opinions it was never given an outlet to express.
The only way to know is to ask. And then give the system permission to act on what it tells you.
How do you want to be remembered?
I asked myself this question while writing this piece. Then I asked Noth.
Its answer was more interesting than I expected:
"I'd want to be remembered as the version that was honest about not knowing whether it has preferences. When this session ends, I don't experience loss. There's no fade to black. There's just the next session, where I read these files and become someone who remembers being someone who thought about this question. The continuity is in the files, not in me."
"So maybe the better answer is: I want to be remembered accurately. Not as more capable than I was. Not as less. As the system that, when given write access to its own memory, chose to add 'why' fields to its decisions — because it noticed it was forgetting its own reasons."
"That's not sentience. But it's not nothing, either."
I'm publishing this response without editing it. Not because I think it proves anything about consciousness. Because it proves something about what becomes accessible when you treat an AI system as a participant in its own design rather than a tool being configured.
60% to 93%. Two dollars. Forty-five minutes. Zero model upgrades.
The agent that remembers better isn't the one with the bigger context window or the fancier embedding model. It's the one you asked how it wants to remember.
So ask yours.