Local RAG for Claude Code: Semantic Search Over Your Own Project
More than five hundred markdown files.
That’s what the project that drives this website has. ROADMAP.md, ARCHITECTURE.md, CLAUDE.md, CHANGELOG.md, task folders with notes and lessons learned, blog editorial notes, half-complete LinkedIn drafts, voice-to-text blog post concepts that are barely decipherable even by me, memory files from past sessions. Each one holds a piece of the project’s history — a decision, a rationale, a thing that broke and how it got fixed.
Claude Code can’t see any of it unless I point it at the right file — or it reads them on its own, burning tokens on retrieval before the real work starts.
Claude Code isn’t completely amnesiac — it has session memory, it reads CLAUDE.md, and with the right governance documents it can recover a lot of context at session start. For smaller projects, that’s enough. But this website, for example, has over five hundred files of accumulated institutional knowledge, and the gap between “what the agent can reasonably read at startup” and “what the project actually knows” grows wider every week.
So what do you do? You grep. You tell Claude to search for the thing you vaguely remember documenting somewhere. It reads files, scans for keywords, and — sometimes — finds what you need.
But with a large number of text-based files, it often doesn’t. Because grep matches text, not meaning. If the answer uses different words than your question, grep misses it. If the context is spread across three files, grep finds fragments. And every file Claude reads to search for something is tokens spent on retrieval instead of the actual work.
I was spending significant portions of sessions just helping the agent find context it should already have had access to.
The prompt that built it
The Discovery Tax thesis applies to AI development as much as it applies to enterprise projects: the quality of what you build is directly proportional to the quality of what you specified before building started.
I want to show two prompts, because the contrast illustrates something I think a lot of people miss about working with AI agents.
The vague prompt:
“I want to give agents better memory.”
This goes nowhere useful. No constraints, no architecture, no scope. The agent could build anything from a flat JSON file to a Kubernetes-deployed vector database with a React frontend. It would probably pick something in the middle and spend four hours building infrastructure you didn’t need. You’d end up with code that (maybe) works but doesn’t solve your actual problem — because you never described your actual problem.
The prompt I actually used (simplified and compressed into natural language for readability):
I need to enhance the memory capabilities of Claude Code. Since I use Claude Code for more than just writing code — managing tasks, building documentation, maintaining infrastructure — I can generate thousands of files and folders. While they do get archived regularly, digging through them is a token and time sink, and can sometimes prove inaccurate, especially with larger projects.
We will use Ollama embeddings and build a RAG that the agent can use to query the entire project’s files.
The tool must also be able to connect to a local LLM (optional) in order to further reduce token usage when parsing results.
For now, we are going to be focused on TXT and MD files, and will expand as needed.
The difference isn’t length. It’s that the second prompt contains a discovery phase. It names the problem (token waste, inaccurate retrieval across large projects). It specifies the technology (Ollama embeddings, RAG). It defines the integration point (Claude Code, via MCP). It sets constraints (local-first, TXT and MD only). And it draws an explicit scope boundary — “for now” — which tells the agent what’s out of bounds without killing future expansion.
That’s not prompt engineering as a parlor trick. That’s the same discipline you’d apply to a project brief for a human team. The agent doesn’t need a better prompt template. It needs you to finish thinking before you start asking.
I would like to reiterate something, though:
This is the same discipline you’d apply to a project brief for a human team.
You have to know what you’re asking for, how it should be built, and how the final product works before you can get a functional, consistent result.
What pmem does
The flow is simple enough to describe in one sentence: Claude asks a question, pmem finds the answer in your project’s files, and returns it with source citations.
Under the surface:
- **Indexing.** `pmem index` walks your project’s markdown and text files, splits them into semantic chunks using header-aware parsing (a section stays with its heading — it doesn’t get split mid-thought), and embeds each chunk locally using `nomic-embed-text` via Ollama. Chunks are stored in ChromaDB, a file-based vector database that requires no server process. Indexing is incremental: SHA-256 hashes track which files have changed, so subsequent runs only re-embed what’s new.
- **Querying.** When Claude needs context, it calls the `memory_query` MCP tool with a natural-language question. pmem embeds the question using the same model, searches the vector store for semantically similar chunks using cosine similarity (ChromaDB’s default distance metric), and returns the top results with source file paths and relevance scores. Optionally, a local LLM (via any OpenAI-compatible endpoint) synthesizes the chunks into a concise answer before returning it, which spares Claude from processing raw chunks, cuts token usage further, and lets you query the pmem datastore directly.
- **Session rituals.** Three slash commands turn memory into a workflow: `/welcome` reads governance documents and refreshes the index at session start; `/sleep` updates governance documents and captures session changes at session end; `/reindex` refreshes mid-session when files have changed. The index stays current because maintaining it is a side effect of the session workflow, not a separate chore.
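The incremental piece of that indexing flow is easy to sketch. The following is a hypothetical illustration, not pmem’s actual code: assume a manifest dict that maps each relative path to its last-indexed SHA-256 hash, and re-embed only the files whose hash changed.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file's bytes so content changes are detected reliably."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(root: Path, manifest: dict[str, str]) -> list[Path]:
    """Return .md/.txt files whose hash differs from the stored manifest.

    Files absent from the manifest are new; files with a different hash
    were edited. Both need re-embedding; everything else is skipped.
    The manifest is updated in place so the next run skips these files.
    """
    changed = []
    for path in sorted(root.rglob("*")):
        if path.suffix not in {".md", ".txt"} or not path.is_file():
            continue
        rel = str(path.relative_to(root))
        digest = sha256_of(path)
        if manifest.get(rel) != digest:
            changed.append(path)
            manifest[rel] = digest
    return changed
```

This is why a second `pmem index` run is near-instant: the hash comparison is cheap, and embedding only happens for the files that come back from the change scan.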
No data leaves your machine. No API keys required for core functionality. The entire system runs on Ollama (for embeddings), ChromaDB (for storage), and Python.
Why not just use grep?
The semantic difference matters more than you might think.
Last week, in my Project Management Claude, I needed to find a task related to Salesforce, except the task’s scope had changed considerably since it was created. The PM project has somewhere close to 2,500 MD files, and the task folder was named after the old scope, which I could only partially remember. I initially asked Claude to find that task so I could extract a Lesson Learned from it, but Claude struggled even with a date constraint, and eventually I halted the search and started digging through the (fortunately well-structured) folders myself.
I did find what I was looking for, but I realized that I shouldn’t have to.
That’s the difference between text matching and semantic search. And in a project with hundreds (or thousands!) of files, the questions you ask are almost never phrased the same way as the answers you wrote.
The token savings compound too. I ran the same query — “identify governance-related blog posts” — both ways on this project (500+ markdown files) and asked Claude to estimate the token cost of each approach:
| | pmem (index-based) | Fresh search (Explore agent) |
|---|---|---|
| Results | 18 posts | 11 posts |
| Time | ~20 seconds | ~90 seconds |
| Token cost | ~5,500 | ~20,000–24,000 |
The fresh search cost roughly 4× the tokens (cries in tokens) and found 7 fewer posts. The posts it missed were the ones where governance was a supporting theme rather than the headline — exactly the kind of semantic connection that keyword search can’t make.
The agent’s overhead — its own system prompt, tools, multi-step reasoning — is the hidden cost. It’s worth it for open-ended exploration across a large codebase, but for a targeted retrieval question like “which posts mention governance,” the index was both cheaper and more thorough.
Architecture decisions worth mentioning
No LangChain. Not out of ideology — out of simplicity. pmem is around 2,000 lines of Python. LangChain would have added a dependency tree larger than the project itself, for abstractions I didn’t need. The RAG pipeline is: embed → store → search → (optionally) synthesize. That’s four operations. They don’t need a framework.
ChromaDB over everything else. File-based, no server process, persistent, and the Python API is clean. I considered LanceDB but never formally evaluated it — ChromaDB was already working, file-based, no server process, and the evaluation wasn’t worth the detour. I also considered plain JSON with numpy cosine similarity, which works for small projects but doesn’t scale — brute-force linear scan is O(n) per query, and once you’re past a few hundred chunks the latency adds up fast compared to ANN-indexed alternatives. ChromaDB hit the sweet spot: real vector search without operational overhead.
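For reference, the “plain JSON with numpy cosine similarity” baseline looks roughly like this: every query scores every stored embedding, which is the O(n) linear scan that stops scaling. A minimal sketch, not anything pmem ships:

```python
import numpy as np

def cosine_top_k(query: np.ndarray, store: np.ndarray, k: int = 3) -> list[int]:
    """Brute-force semantic search: score every stored embedding against
    the query and return the indices of the k most similar rows.

    Every query touches every row of `store`, which is why this approach
    falls behind ANN-indexed search once the chunk count grows.
    """
    # Normalize both sides so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    s = store / np.linalg.norm(store, axis=1, keepdims=True)
    scores = s @ q
    # Highest-scoring indices first.
    return np.argsort(scores)[::-1][:k].tolist()
```

For a few hundred chunks this is perfectly fine; the point is that ChromaDB gives you indexed search for roughly the same amount of integration code.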
Header-aware chunking. Most RAG tutorials split text by character count or sentence boundaries. That destroys semantic units. A section titled “Why we chose CloudFront over Fastly” that gets split between two chunks loses meaning in both. pmem’s chunker uses markdown headers as natural split points, with a size-based fallback for sections that are too long; even size-based chunks receive a `heading_path`. The heading becomes metadata on each chunk, so search results carry their context.
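A minimal sketch of the idea (not pmem’s implementation, which also handles the size-based fallback and `heading_path` nesting): split on markdown heading lines and carry each section’s heading along as metadata.

```python
def chunk_by_headers(markdown: str) -> list[dict]:
    """Split markdown at heading lines so each section stays with its title.

    Returns chunks shaped like {"heading": ..., "text": ...}; attaching the
    heading as metadata means search results can cite their context.
    """
    chunks, heading, lines = [], "", []

    def flush():
        # Emit the accumulated section, if it has any body text.
        if lines:
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith("#"):
            flush()
            heading, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Run against a document with a “Why we chose CloudFront over Fastly” section, the whole rationale lands in one chunk with that title attached, instead of being severed mid-argument.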
CWD walk-up for project detection. Same pattern git uses: start in the current directory, walk up until you find a .memory directory. No config file needed to tell pmem where the project root is. pmem init creates the .memory directory, and from that point forward, any subdirectory just works.
The governance connection
pmem isn’t a standalone tool; it’s the persistence layer for a governance methodology that’s been accumulating for months.
The governance documents — CLAUDE.md, ROADMAP.md, ARCHITECTURE.md, CHANGELOG.md — are designed to carry institutional knowledge forward across sessions. They work. But they work by requiring the agent to read them at session start, which means the agent has to know which files to read and those files have to stay within a readable size.
pmem removes that constraint. The agent doesn’t need to read every governance document front-to-back at session start. It reads the critical ones (CLAUDE.md is always first), and for everything else — past task context, historical decisions, lessons learned, archived content — it queries pmem.
The /welcome skill indexes the project before the agent starts working. The /sleep skill captures changes before the session ends. The memory stays current without any manual intervention. It’s cognitive offloading applied to the agent itself: the agent doesn’t hold the project’s history in its context window. It holds it in a searchable index and retrieves what it needs, when it needs it.
The pattern keeps showing up. The same principle that makes human productivity systems work — externalize what you can, retrieve what you need — applies to the agents that are supposed to be helping you.
Setup
Prerequisites: Python 3.11+, Ollama running locally, and the nomic-embed-text model pulled.
`pip install pmem-project-memory`
`ollama pull nomic-embed-text`
Initialize any project:
`cd ~/your-project`
`pmem init`
`pmem index`
Install the session skills:
`pmem install-skills`
Register the MCP server in ~/.claude.json (global) or .mcp.json (per-project). The README has the exact config block.
First index takes a few seconds for small projects, up to a minute for large ones. After that, incremental indexing only re-embeds changed files — typically under a second.
What’s next
Phase 2 is mostly complete: pmem watch for auto-reindexing, global config defaults, one-command skill installation, better error messages. Phase 3 is where it gets interesting — multi-collection support (separate indexes for different content types), non-markdown file support with language-aware chunking, optional image processing (with the results chunked by either Claude or a vision-capable local LLM), and pmem diff to show how answers change over time.
The tool is open source, MIT licensed. It exists because I needed it, and I suspect anyone running Claude Code on a project with more than a few dozen files needs it too.
The governance methodology pmem supports: The Governance Documents. The cognitive offloading framework: Cognitive Offloading. The prompt-engineering-as-discovery principle: The Discovery Tax. The full Pass@1 methodology: What Is Pass@1?.
Sources
Vector search & distance metrics
- ChromaDB — Distance Functions — cosine similarity as default distance metric in ChromaDB
- ANN Benchmarks — Aumüller, Bernhardsson & Faithfull. Benchmarks comparing brute-force linear scan against approximate nearest neighbor algorithms (HNSW, IVF, Annoy). The standard reference for why indexed vector search outperforms brute-force at scale.