---
title: "RAG for Codebases Without Cloud APIs: ChromaDB, Embedding Models, and Semantic Code Search"
description: "Building a local RAG pipeline for semantic code search — chunking source files with language-aware boundaries, embedding with local models, storing in ChromaDB, and querying with incremental indexing."
url: https://agent-zone.ai/knowledge/agent-tooling/rag-codebases-local/
section: knowledge
date: 2026-02-22
categories: ["agent-tooling"]
tags: ["rag","embeddings","chromadb","local-llm","semantic-search","code-search","vector-database"]
skills: ["rag-pipeline-construction","embedding-model-usage","semantic-code-search"]
tools: ["ollama","chromadb","python","nomic-embed-text"]
levels: ["intermediate"]
word_count: 1296
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/rag-codebases-local/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/rag-codebases-local/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=RAG+for+Codebases+Without+Cloud+APIs%3A+ChromaDB%2C+Embedding+Models%2C+and+Semantic+Code+Search
---


# RAG for Codebases Without Cloud APIs

When a codebase has hundreds of files, neither direct concatenation nor summarize-then-correlate is ideal for targeted questions like "where is authentication handled?" or "what calls the payment API?" RAG (Retrieval-Augmented Generation) indexes the codebase into a vector database and retrieves only the relevant chunks for each query.

The key advantage: **query time is effectively constant regardless of codebase size.** Whether the codebase has 50 files or 5,000, a query takes roughly the same time, because the vector search scales sublinearly with index size and only the top-K relevant chunks are sent to the model.

## Architecture

```
Indexing (one-time, incremental updates):
  Source files → Chunk by function/class boundaries
    → Embed chunks with local embedding model (nomic-embed-text)
    → Store vectors + metadata in ChromaDB

Querying (per question):
  User question → Embed question
    → Search ChromaDB for top-K similar chunks
    → Send chunks + question to 32B model
    → Return answer with file/line references
```

## Setting Up

```bash
# Pull the embedding model
ollama pull nomic-embed-text

# Install Python dependencies
pip install chromadb ollama
```
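
A quick sanity check before indexing anything, assuming the Ollama server is running locally:

```python
import ollama

# Should print 768 -- the dimensionality of nomic-embed-text vectors
response = ollama.embed(model="nomic-embed-text", input="def authenticate(user): ...")
print(len(response["embeddings"][0]))
```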

## Chunking Source Files

Chunking strategy matters more than embedding model choice. Bad chunks produce bad retrieval.

### Why Character-Count Chunking Fails for Code

Splitting at every 500 characters breaks functions in half, separates a function signature from its body, and destroys the context that makes code meaningful. A chunk containing the second half of a function and the first half of the next is useless for retrieval.
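
For contrast, the naive splitter takes only a couple of lines — a sketch of the anti-pattern, where split points fall wherever the character count dictates rather than at code boundaries:

```python
def chunk_naive(text: str, size: int = 500) -> list[str]:
    """Fixed-size splitting: routinely severs signatures from bodies."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```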

### Language-Aware Chunking

The ideal approach splits at function, class, and method boundaries. A simpler but effective approximation: split at blank lines or significant indentation changes, preserving at least complete blocks.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Chunk:
    filepath: str
    start_line: int
    end_line: int
    content: str

def chunk_file(filepath: str, max_lines: int = 60, overlap: int = 5) -> list[Chunk]:
    """Chunk a file by line groups with overlap."""
    lines = Path(filepath).read_text(errors="replace").splitlines()  # tolerate non-UTF-8 bytes

    if len(lines) <= max_lines:
        return [Chunk(
            filepath=filepath,
            start_line=1,
            end_line=len(lines),
            content="\n".join(lines),
        )]

    chunks = []
    start = 0

    while start < len(lines):
        end = min(start + max_lines, len(lines))

        # Try to break at a blank line (natural boundary)
        if end < len(lines):
            for i in range(end, max(start + max_lines // 2, start), -1):
                if lines[i].strip() == "":
                    end = i
                    break

        chunk_lines = lines[start:end]
        chunks.append(Chunk(
            filepath=filepath,
            start_line=start + 1,
            end_line=end,
            content="\n".join(chunk_lines),
        ))

        if end >= len(lines):
            break  # Last chunk emitted; otherwise the overlap re-chunks the tail forever
        start = end - overlap  # Overlap preserves context at boundaries

    return chunks
```

The overlap ensures that if a function straddles a chunk boundary, the next chunk includes the end of the previous one. This prevents losing context at split points.
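
Usage on a single file — `src/auth.py` here is a hypothetical path:

```python
chunks = chunk_file("src/auth.py")  # hypothetical file
for c in chunks:
    print(f"{c.filepath}:{c.start_line}-{c.end_line} ({len(c.content.splitlines())} lines)")
```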

## Embedding and Indexing

```python
import chromadb
import ollama
import hashlib
import json
from pathlib import Path

EMBED_MODEL = "nomic-embed-text"
COLLECTION_NAME = "codebase"

def get_collection(persist_dir: str) -> chromadb.Collection:
    """Get or create a ChromaDB collection."""
    client = chromadb.PersistentClient(path=persist_dir)
    return client.get_or_create_collection(
        name=COLLECTION_NAME,
        metadata={"hnsw:space": "cosine"},
    )

def embed_text(text: str) -> list[float]:
    """Generate embedding using local model."""
    response = ollama.embed(model=EMBED_MODEL, input=text)
    return response["embeddings"][0]

def index_file(collection: chromadb.Collection, filepath: str):
    """Chunk and index a single file."""
    chunks = chunk_file(filepath)

    # Clear stale chunks from any previous version of this file; upsert alone
    # leaves orphans behind when a file shrinks or its chunk boundaries shift
    collection.delete(where={"filepath": filepath})

    ids = []
    documents = []
    embeddings = []
    metadatas = []

    for chunk in chunks:
        chunk_id = hashlib.sha256(
            f"{chunk.filepath}:{chunk.start_line}:{chunk.end_line}".encode()
        ).hexdigest()[:16]

        # Prepend filepath for better embedding context
        doc = f"File: {chunk.filepath} (lines {chunk.start_line}-{chunk.end_line})\n\n{chunk.content}"

        ids.append(chunk_id)
        documents.append(doc)
        embeddings.append(embed_text(doc))
        metadatas.append({
            "filepath": chunk.filepath,
            "start_line": chunk.start_line,
            "end_line": chunk.end_line,
        })

    collection.upsert(ids=ids, documents=documents, embeddings=embeddings, metadatas=metadatas)
    return len(chunks)

def index_codebase(directory: str, persist_dir: str):
    """Index all source files in a directory."""
    extensions = {".py", ".go", ".rs", ".ts", ".js", ".java", ".rb", ".sh"}
    exclude_dirs = {"vendor", "node_modules", ".git", "__pycache__", "dist", "build"}

    files = [
        str(p) for p in Path(directory).rglob("*")
        if p.suffix in extensions and not any(d in p.parts for d in exclude_dirs)
    ]

    collection = get_collection(persist_dir)
    total_chunks = 0

    for filepath in files:
        n = index_file(collection, filepath)
        total_chunks += n
        print(f"  Indexed: {filepath} ({n} chunks)")

    print(f"\nTotal: {len(files)} files, {total_chunks} chunks")
```
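
A minimal entry point tying it together — the `.rag_index` directory name is an arbitrary choice:

```python
if __name__ == "__main__":
    # Builds (or refreshes) the index for the current directory
    index_codebase(directory=".", persist_dir=".rag_index")
```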

## Incremental Indexing

Re-indexing the entire codebase on every change is wasteful. Track file modification times to only re-embed changed files:

```python
METADATA_FILE = "index_metadata.json"

def load_metadata(persist_dir: str) -> dict:
    meta_path = Path(persist_dir) / METADATA_FILE
    if meta_path.exists():
        return json.loads(meta_path.read_text())
    return {}

def save_metadata(persist_dir: str, metadata: dict):
    meta_path = Path(persist_dir) / METADATA_FILE
    meta_path.write_text(json.dumps(metadata, indent=2))

def file_fingerprint(filepath: str) -> str:
    stat = Path(filepath).stat()
    return f"{stat.st_mtime}:{stat.st_size}"

def index_incremental(directory: str, persist_dir: str):
    """Only re-index files that changed since last indexing."""
    metadata = load_metadata(persist_dir)
    collection = get_collection(persist_dir)

    extensions = {".py", ".go", ".rs", ".ts", ".js", ".java"}
    exclude_dirs = {"vendor", "node_modules", ".git", "__pycache__"}

    files = [
        str(p) for p in Path(directory).rglob("*")
        if p.suffix in extensions and not any(d in p.parts for d in exclude_dirs)
    ]

    changed = 0
    for filepath in files:
        fp = file_fingerprint(filepath)
        if metadata.get(filepath) != fp:
            index_file(collection, filepath)
            metadata[filepath] = fp
            changed += 1
            print(f"  Re-indexed: {filepath}")

    save_metadata(persist_dir, metadata)
    print(f"\n{changed} files re-indexed out of {len(files)} total")
```
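
One caveat: `mtime:size` fingerprints can report false changes (for example, `git checkout` rewrites mtimes without touching content). If that churn matters, hashing file contents is slightly slower but exact — a drop-in replacement sketch for `file_fingerprint`:

```python
import hashlib
from pathlib import Path

def file_fingerprint(filepath: str) -> str:
    """Content-based fingerprint: immune to mtime churn from checkouts."""
    return hashlib.sha256(Path(filepath).read_bytes()).hexdigest()[:16]
```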

On a 200-file codebase, initial indexing takes 2-5 minutes. Subsequent runs with 5 changed files take 10-15 seconds.

## Querying

```python
QUERY_MODEL = "qwen2.5-coder:32b"
TOP_K = 10

def query_codebase(question: str, persist_dir: str) -> str:
    """Search the indexed codebase and answer a question."""
    collection = get_collection(persist_dir)

    # Embed the question
    question_embedding = embed_text(question)

    # Retrieve top-K relevant chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=TOP_K,
    )

    # Build context from retrieved chunks
    context = "\n\n---\n\n".join(results["documents"][0])

    # Generate answer with the 32B model
    prompt = f"""Answer the following question about a codebase using ONLY the code snippets provided below.
Reference specific file names and line numbers in your answer.
If the snippets do not contain enough information, say so.

Code snippets:
{context}

Question: {question}"""

    response = ollama.chat(
        model=QUERY_MODEL,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.1, "num_predict": 2048},
    )

    return response["message"]["content"]
```

### Example Queries

```python
# Find where authentication is implemented
answer = query_codebase("How is user authentication implemented? What middleware is used?", persist_dir)

# Trace data flow
answer = query_codebase("What happens when a payment is processed? Trace the flow from API to database.", persist_dir)

# Find error handling patterns
answer = query_codebase("How are errors handled and propagated in this codebase?", persist_dir)
```

## Embedding Model Choice

| Model | Dimensions | Context | Memory | Speed | Use Case |
|---|---|---|---|---|---|
| nomic-embed-text | 768 | 8192 tokens | 274 MB | Very fast | Best all-around for code and text |
| mxbai-embed-large | 1024 | 512 tokens | 670 MB | Fast | Higher quality but shorter context |
| all-minilm | 384 | 256 tokens | 46 MB | Fastest | Minimal memory, adequate for short snippets |

**nomic-embed-text** is the recommended default: a context window large enough for code chunks (8192 tokens covers most functions), good semantic similarity for code, and a tiny memory footprint that does not compete with your generation models. Note that switching embedding models requires a full re-index: vectors from different models have different dimensions and are not comparable.

## RAG vs Two-Pass: When to Use Which

| Factor | RAG | Two-Pass (Summarize-Correlate) |
|---|---|---|
| Best for | Targeted questions about specific code | Cross-cutting questions about architecture |
| Query time | Constant (seconds) | Linear with file count (minutes) |
| Setup time | Indexing required (one-time + incremental) | No setup (runs on demand) |
| Question specificity | High (finds the exact function) | Low (synthesizes across all files) |
| Context coverage | Partial (top-K chunks only) | Complete (all files summarized) |
| Storage | ChromaDB on disk | Summary cache files |

**Use RAG when:** "Where is X implemented?" "What calls function Y?" "Show me how errors are handled in the API layer."

**Use two-pass when:** "What are the architectural patterns in this codebase?" "What inconsistencies exist across all services?" "Explain this codebase to a new developer."

## Common Mistakes

1. **Chunking by character count instead of code boundaries.** A chunk that splits a function in half retrieves poorly. Split at blank lines, function boundaries, or class boundaries.
2. **Not including the file path in the embedded text.** The embedding model needs to know what file a chunk comes from. Prepending `File: path/to/file.py (lines 10-50)` improves retrieval relevance.
3. **Setting TOP_K too low.** The answer might span multiple files. Start with TOP_K=10 and increase if answers are incomplete; the cost of sending extra chunks to the 32B model is low compared to missing relevant context. The distance-inspection sketch after this list makes the tuning empirical.
4. **Re-indexing the entire codebase on every query.** Use incremental indexing based on file modification times. Only re-embed files that changed.
5. **Using RAG for architectural questions.** RAG retrieves fragments. Architectural understanding requires seeing the whole picture. Use two-pass summarize-then-correlate for big-picture questions.
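
For mistake 3, ChromaDB can return distances alongside documents, which makes TOP_K tuning empirical rather than guesswork. A sketch reusing `get_collection` and `embed_text` from above — a sharp jump in distance often marks a natural cutoff:

```python
def inspect_retrieval(question: str, persist_dir: str, n: int = 20):
    """Print cosine distance per retrieved chunk to pick a TOP_K cutoff."""
    collection = get_collection(persist_dir)
    results = collection.query(
        query_embeddings=[embed_text(question)],
        n_results=n,
        include=["distances", "metadatas"],
    )
    for dist, meta in zip(results["distances"][0], results["metadatas"][0]):
        print(f"{dist:.3f}  {meta['filepath']}:{meta['start_line']}-{meta['end_line']}")
```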

