Building Cairn: A Self-Hosted Persistent Memory System for Claude

April 25, 2026 Reading: 9 min

I’ve been using Claude Code and Claude Desktop as my primary development tools for months. They’re great at understanding context within a conversation, but once that conversation ends, it’s gone. Every new chat starts from zero. I kept re-explaining my homelab topology, my project conventions, my infrastructure decisions. It got old fast.

So I went looking for something to fix that.

The MemPalace Problem

My search turned up a product called MemPalace that promised persistent AI memory. On paper, it sounded like what I needed. In practice, it was suspect. The marketing featured a celebrity endorsement from Milla Jovovich, which is a weird look for a developer tool. The technical claims were vague. And when I actually dug into it, I found out MemPalace is basically wrapping ChromaDB, an open-source vector database that anyone can run for free. At that point I figured I could just cut out the middleman.

I looked at other options too. Khoj, Mem0, Rewind (now Limitless), various Obsidian plugins with local LLM integrations. They all had problems. Some required cloud connectivity. Some had clunky integrations. None of them did what I actually wanted: a structured, fully local memory system that plugged directly into Claude via MCP.

So I built one.

A Weekend Project (Sort Of)

Cairn started as a weekend build. The idea is straightforward: give Claude a structured way to store and retrieve memories that persists across conversations. ChromaDB handles vector storage, Ollama runs nomic-embed-text for local embeddings, and Anthropic’s Model Context Protocol (MCP) exposes the whole thing as tools that Claude can call natively.

I’m not a developer by trade. I’m a desktop engineering manager who writes PowerShell scripts for a living and Docker Compose files for fun. The fact that I was able to build a working MCP server in a weekend says more about Claude Code than it does about me. I described what I wanted in natural language, Claude Code wrote the Python, I tested it, we iterated. It felt less like programming and more like directing a very capable contractor who works at the speed of light.

The name comes from the stacked stone markers that hikers use to mark trails. Seemed fitting for a system whose whole purpose is leaving markers so you can find your way back.

Design Decisions That Actually Matter

The Trail/Blaze/Mark Taxonomy

The original working name was “Memory Vault” with a wing/room/drawer hierarchy. That didn’t stick. Once I landed on “Cairn” as the project name, I realized I couldn’t also use “cairn” as a level in the taxonomy: every MCP tool is prefixed with the project name, so I’d have ended up with tools like cairn_list_cairns.

I went with a hiking metaphor instead: trail/blaze/mark. A trail is a broad domain (like “projects” or “infrastructure”). A blaze is a specific topic within that domain (like “cairn” or “docker-networking”). A mark is an individual memory. All marks live in a single ChromaDB collection with metadata-based filtering, which means you can search across trails when you need to, or scope a search to a specific trail and blaze when you want precision.
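A minimal sketch of how that scoping could be expressed, assuming each mark carries its trail and blaze as metadata. The Mark class and scope_filter helper are hypothetical names for illustration, not Cairn's actual code; the dictionary shape follows ChromaDB's where filter syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mark:
    """One memory, tagged with the trail and blaze it belongs to."""
    trail: str    # broad domain, e.g. "infrastructure"
    blaze: str    # topic within the trail, e.g. "docker-networking"
    content: str  # the memory itself

    def metadata(self) -> dict:
        # Stored beside the embedding so searches can filter on it.
        return {"trail": self.trail, "blaze": self.blaze}


def scope_filter(trail: Optional[str] = None,
                 blaze: Optional[str] = None) -> Optional[dict]:
    """Build a ChromaDB-style `where` filter; None means search every trail."""
    clauses = []
    if trail is not None:
        clauses.append({"trail": {"$eq": trail}})
    if blaze is not None:
        clauses.append({"blaze": {"$eq": blaze}})
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    # $and needs at least two clauses, hence the single-clause case above.
    return {"$and": clauses}
```

Calling scope_filter() with no arguments gives the cross-trail search; passing both a trail and a blaze gives the narrow one.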

This ended up being one of the better decisions in the project. Just enough structure to keep things organized without making it feel like filing a tax return every time you want to store something.

Two Servers for Token Economy

This might seem like an odd choice if you haven’t thought much about LLM context windows, but Cairn runs as two separate MCP servers.

The daily server exposes seven tools: search, create, read, update, delete, list trails, and list blazes. These are the tools Claude needs in every conversation. The admin server exposes five more: a full trail map, export, import, a health check, and a duplicate audit. These are maintenance operations you might run once a week.

Why split them? Every MCP tool definition gets injected into Claude’s context window at the start of a conversation. That costs tokens, which cost money and eat into the available context for actual work. By splitting the servers, a normal conversation only loads the seven daily tools. The admin tools only show up when I explicitly connect that server. Small optimization, but it adds up across hundreds of conversations.
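To put rough numbers on that, here is a back-of-the-envelope sketch. The tool counts come from the two servers described above; the 150 tokens per tool definition is an illustrative guess, not a measurement, so swap in a real figure before drawing conclusions.

```python
DAILY_TOOLS = 7
ADMIN_TOOLS = 5
ASSUMED_TOKENS_PER_TOOL = 150  # illustrative guess, not measured


def tool_overhead_tokens(tool_count: int, tokens_per_tool: int) -> int:
    """Tokens spent on tool definitions at the start of one conversation."""
    return tool_count * tokens_per_tool


def tokens_saved(conversations: int,
                 tokens_per_tool: int = ASSUMED_TOKENS_PER_TOOL) -> int:
    """Tokens not spent because the admin tools stay unloaded."""
    combined = tool_overhead_tokens(DAILY_TOOLS + ADMIN_TOOLS, tokens_per_tool)
    daily_only = tool_overhead_tokens(DAILY_TOOLS, tokens_per_tool)
    return (combined - daily_only) * conversations


# 5 admin tools x 150 tokens x 300 conversations = 225,000 tokens
print(tokens_saved(300))
```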

For the same reason, mark content is written in concise factual shorthand rather than full prose sentences. Every token matters when you’re paying for context.

Local-Only Architecture

This was non-negotiable from day one. ChromaDB and Ollama run in Docker containers on a single host, managed by Docker Compose. No cloud APIs, no third-party services, no data leaving my network. Embeddings are generated locally by Ollama using nomic-embed-text, so not even the vector representations of my memories touch an external server.
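For a sense of what local embedding looks like on the wire, here is a sketch against Ollama's /api/embeddings endpoint using only the standard library. The function names are hypothetical and the base URL assumes Ollama's default port on localhost; Cairn's real client code may look quite different.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # Ollama's default port; adjust for your host


def build_embedding_request(text: str, base_url: str = OLLAMA_URL,
                            model: str = "nomic-embed-text") -> urllib.request.Request:
    """Prepare (but do not send) a POST to Ollama's embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, base_url: str = OLLAMA_URL) -> list:
    """Send the request and return the embedding vector as a list of floats."""
    request = build_embedding_request(text, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["embedding"]
```

Nothing in the request leaves the host that Ollama runs on, which is the point of the design.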

The MCP server itself runs as a lightweight stdio process on whatever machine I’m using Claude on. It connects back to the ChromaDB and Ollama backends across my tailnet, so I get the same memory system whether I’m working at my desk or on my laptop at a coffee shop. The MCP configuration is identical on every machine.

One consequence of this: ChatGPT integration isn’t happening. OpenAI’s MCP implementation requires HTTP/SSE transport with a publicly accessible HTTPS endpoint, which directly violates the local-only design. I documented this as an intentional architectural decision in the README. If someone wants to fork the project and add it, the MIT license says go for it, but I’m not going to build it.

Tailscale as the Security Boundary

ChromaDB and Ollama don’t ship with authentication. That becomes a real problem the moment you want to reach them from more than one machine. The options I considered were all bad in their own way: bind to 0.0.0.0 and trust the LAN (no — IoT and guest devices live there too), put a reverse proxy with HTTP basic auth in front (now I’m rolling my own auth layer over unauthenticated services and pretending it’s defense in depth), or build a token mechanism into the MCP server itself (now I’m maintaining a key rotation story for a personal tool).

Tailscale was already running on every machine I care about, so the answer was to lean on it as the actual security boundary rather than build anything new. Docker binds ChromaDB and Ollama to 127.0.0.1 only — they aren’t reachable from the LAN at all, much less the public internet. Tailscale Serve then proxies those localhost services to HTTPS endpoints inside the tailnet. Anything that wants to reach them has to be a device on my tailnet, encrypted end-to-end with WireGuard, and allowed through my Tailscale ACLs.

A few things make this work better than I expected:

  • No auth code to maintain. Access control already exists in Tailscale, where I can also revoke a device the moment I lose it.
  • Real HTTPS, no self-signed warnings. Tailscale provisions a publicly trusted certificate for each machine’s MagicDNS name, so the MCP server connects to https://docker-host.tailnet:8100 like a normal HTTPS endpoint.
  • Identical config everywhere. The MagicDNS hostname doesn’t change, so the same MCP configuration works on every machine without any per-host awareness.
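The "identical config everywhere" point can be sketched as a small generator for the mcpServers block that Claude Desktop reads. The top-level shape is the standard MCP client config; the cairn-mcp command, the CAIRN_* environment variable names, and the second port (8101) are hypothetical placeholders rather than anything taken from the project.

```python
import json


def mcp_config(chroma_url: str, ollama_url: str) -> dict:
    """Build an mcpServers entry that is the same on every machine."""
    return {
        "mcpServers": {
            "cairn": {
                "command": "cairn-mcp",  # hypothetical entry point name
                "args": [],
                "env": {
                    # Hypothetical variable names; the values are MagicDNS URLs,
                    # so nothing here is specific to one host.
                    "CAIRN_CHROMA_URL": chroma_url,
                    "CAIRN_OLLAMA_URL": ollama_url,
                },
            }
        }
    }


config = mcp_config(
    "https://docker-host.tailnet:8100",
    "https://docker-host.tailnet:8101",  # assumed port for illustration
)
print(json.dumps(config, indent=2))
```

Because the URLs resolve through the tailnet rather than a LAN address, the same file can be copied to a desk machine and a laptop unchanged.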

What’s explicitly off the table: Tailscale Funnel (which would punch the services out to the public internet), direct Docker host-port binds, and any public reverse proxy. Each of those exposes unauthenticated services and would need a real auth layer in front to be safe — which is the whole thing I was trying to avoid. For a personal tool with a clearly defined boundary, leaning on the tailnet is the sane move.

The Features That Make It Actually Useful

The initial build worked but it was rough. A few rounds of iteration after that are where Cairn started feeling like a real tool.

Deduplication and contradiction detection. When you’re storing memories across dozens of conversations, duplicates are inevitable. Cairn checks incoming marks against existing ones in the same trail and blaze using cosine distance, where lower means more similar. Anything below a distance of 0.20 is treated as a duplicate and not stored; the existing mark’s ID and content come back instead. Marks in the 0.20 to 0.30 range get stored but flagged as a possible conflict against the nearest existing mark, which is useful for catching cases where a decision changed but the old memory is still hanging around. Same-fact rephrases tend to land around 0.17; related-but-different content lands between 0.26 and 0.45. The thresholds are tuned for nomic-embed-text specifically.
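The threshold logic can be sketched in a few lines. The 0.20 and 0.30 cutoffs are the ones described above; the function names, the returned labels, and the choice to treat exactly 0.30 as a new mark are assumptions made for illustration.

```python
import math

DUPLICATE_BELOW = 0.20  # closer than this: treat as the same memory
CONFLICT_BELOW = 0.30   # between the two cutoffs: store, but flag


def cosine_distance(a: list, b: list) -> float:
    """1 minus cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def classify(distance: float) -> str:
    """Decide what to do with an incoming mark given its nearest neighbour."""
    if distance < DUPLICATE_BELOW:
        return "duplicate"          # skip the write, return the existing mark
    if distance < CONFLICT_BELOW:
        return "possible_conflict"  # store, and flag against the nearest mark
    return "new"                    # store normally
```

With these cutoffs a typical rephrase at 0.17 is rejected as a duplicate, while related content at 0.26 is stored with a conflict flag.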

The audit tool. cairn_audit scans the entire collection for near-duplicates, with an optional --include-conflicts flag for contradiction detection. I run this periodically to keep the memory store clean. It surfaces problems I wouldn’t have noticed otherwise.
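Conceptually, a collection-wide audit is a pairwise comparison. The sketch below is one straightforward way to do it, assuming the embeddings are already in memory as a dict of mark ID to vector; it is not how cairn_audit is actually implemented, and find_near_duplicates is a hypothetical name.

```python
import math
from itertools import combinations


def cosine_distance(a: list, b: list) -> float:
    """1 minus cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def find_near_duplicates(marks: dict, threshold: float = 0.20) -> list:
    """Return (id_a, id_b, distance) for every pair closer than threshold."""
    pairs = []
    # Every unordered pair is checked once, so the cost grows quadratically.
    for (id_a, vec_a), (id_b, vec_b) in combinations(marks.items(), 2):
        distance = cosine_distance(vec_a, vec_b)
        if distance < threshold:
            pairs.append((id_a, id_b, distance))
    # Closest pairs first, since those are the likeliest true duplicates.
    return sorted(pairs, key=lambda pair: pair[2])
```

Raising the threshold to 0.30 would also surface the possible-conflict band, at the cost of more pairs to review by hand.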

A real CLI. Twelve commands, installable via pipx, with a distances diagnostic that lets me inspect the raw similarity scores between marks. This turned out to be essential for tuning the deduplication thresholds. The defaults work well, but being able to see the actual numbers gave me confidence in the settings.

What I Learned Building With Claude Code

I want to be honest about what “building” means here. I did not write most of the Python in this project by hand. I described what I wanted in natural language, reviewed the output, tested it, caught edge cases, and directed revisions. Claude Code wrote the implementation. I made the architectural decisions, chose the tools, designed the taxonomy, and defined the deployment model.

I’d still call this real development, but it’s a different kind than what most people picture. The skill isn’t writing Python. It’s knowing what to ask for, recognizing when the output is wrong, understanding the system well enough to debug it, and making the design tradeoffs that determine whether the project actually works in practice.

A few concrete things I learned:

Specs matter more than ever. When your development partner is an LLM, a clear spec is the difference between getting what you want on the first try and burning an hour on revisions. I wrote detailed descriptions of every tool’s behavior, the taxonomy rules, the error handling. Claude Code nailed the implementations almost every time.

Testing is still your job. Claude Code writes good tests when you ask, but knowing what to test requires understanding the failure modes. I caught edge cases around ChromaDB namespace conflicts and Ollama connectivity that only surfaced because I tested the deployment path, not just the unit tests.

You have to understand the tools. I couldn’t have built Cairn if I didn’t already understand Docker, MCP, vector databases conceptually, and network architecture. Claude Code is a force multiplier. It’s not a replacement for knowing what you’re building.

What’s Next

What’s still on my mind is how mark content ages. Some memories are permanent (my homelab IP scheme isn’t changing), but others have a natural expiration. A decision I made about a project three months ago might not be relevant anymore. I don’t have a good answer for this yet, but I expect it’s where the next round of work goes.

Cairn is MIT-licensed and available on GitHub at github.com/brav0charlie/cairn. It’s a small project built for my own needs, but if you’re a Claude user who’s tired of re-explaining your setup every time you open a new conversation, it might be worth a look.