---
title: "Single-Node Kubernetes Disaster Recovery: Backups That Survive a Wiped Docker VM"
description: "Disaster recovery for a single-node minikube cluster on Docker Desktop — where Velero is overkill, etcd snapshots live in the failure domain, and the only useful backup lives on a different disk."
url: https://agent-zone.ai/knowledge/sre/single-node-kubernetes-disaster-recovery/
section: knowledge
date: 2026-05-07
categories: ["sre"]
tags: ["disaster-recovery","minikube","docker-desktop","backup","single-node","gitea","self-hosted","homelab"]
skills: ["single-node-dr-planning","git-mirror-backup","cron-safe-shell-scripting","external-disk-backup-rotation"]
tools: ["minikube","docker-desktop","git","tar","cron","kubectl","shasum"]
levels: ["intermediate"]
word_count: 2601
formats:
  json: https://agent-zone.ai/knowledge/sre/single-node-kubernetes-disaster-recovery/index.json
  html: https://agent-zone.ai/knowledge/sre/single-node-kubernetes-disaster-recovery/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Single-Node+Kubernetes+Disaster+Recovery%3A+Backups+That+Survive+a+Wiped+Docker+VM
---


A single-node minikube cluster on Docker Desktop runs the entire control plane, kubelet, every PVC, every Secret, and the container image cache inside one VM whose disk is **one file**: `~/Library/Containers/com.docker.docker/Data/vms/0/data/Docker.raw` on macOS. When that file is lost or corrupted, every piece of cluster state goes with it in a single event. There is no "node failure vs storage failure" distinction to design around. Every backup strategy that assumes those are separable does not apply.

This article is the single-node companion to [Kubernetes Disaster Recovery](../../kubernetes/kubernetes-disaster-recovery/), which assumes multi-node, etcd-on-disk, and an off-cluster object store. None of those assumptions hold here. For host-level setup that *creates* this failure domain, see [Kubernetes on Apple Silicon Setup Gotchas](../../kubernetes/kubernetes-on-apple-silicon-setup-gotchas/).

## The failure domain that breaks every "in-cluster" backup tool

The single-VM substrate has three consequences that constrain every choice downstream.

**etcd snapshots stored on a hostPath PV are inside the failure domain.** They are a slightly newer copy of the thing being recovered from. They die with the cluster.

**Velero with its default in-cluster MinIO backend is also inside the failure domain.** To do anything useful, Velero needs a remote bucket — S3, B2, GCS — at which point there is an off-cluster dependency, an IAM key on disk, a recurring cost line, and a Helm chart plus a CRD plus a controller pod, all to back up a homelab. Velero is the right tool for multi-node clusters where node failure and storage failure are independent. On single-node, the cost-benefit shifts.

**Every PVC, every Secret, every ConfigMap, every container image lives in the same `Docker.raw` file.** A backup strategy that captures only one class is a partial backup. The honest framing: pick what's worth backing up out of the VM, accept that everything else rebuilds from empty.

## Design decisions

### Back up source repos, not PVCs

Source is small (around 26 MB per day for a 20-repo set), trivially restorable, and contains the *intent* of every service. PVC contents — Postgres state, message history, Mattermost uploads — are large, change constantly, and require app-aware dump tooling per service. Accepting "rebuild Postgres and Mattermost from empty on restore" is a defensible posture for a single-node lab cluster as long as it's explicit.

A self-hosted Git forge (Gitea, Forgejo) running on the cluster is itself in the failure domain. Back it up *as repos*, not as a PVC: the repos are the recoverable artifact; the Gitea database is auth state, webhook secrets, and per-user metadata that's faster to recreate than to restore.

### External drive as primary, cloud as optional second tier

A USB or Thunderbolt drive survives Docker Desktop wipes, host OS reinstalls, and Docker corruption events. It has no recurring cost and no credentials to leak. The drive itself is a single point of failure — mitigate either by rotating two drives weekly or by adding `restic`/`rclone` to off-site object storage as a second tier. Cloud-as-primary is the wrong default for a homelab: if the nightly backup costs money, it gets cancelled within a quarter; if it lives on an external drive on the desk, it survives the next reorg.
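
A sketch of that optional second tier, assuming `restic` against an S3-compatible bucket (the bucket name, repo path, and password file are placeholders):

```bash
# One-time setup: `restic init` with these variables exported.
# S3 credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
export RESTIC_REPOSITORY="s3:s3.amazonaws.com/<bucket>/gitea-backups"
export RESTIC_PASSWORD_FILE="$HOME/.config/restic/password"

# restic deduplicates, so nightly runs over mostly-identical tarballs stay cheap.
restic backup /Volumes/<your-backup-drive>/gitea-backups
restic forget --keep-daily 7 --keep-weekly 4 --prune
```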

### Mirror clone, not PVC tarball

`git clone --mirror` is a verifiable byte-for-byte copy of every ref — branches, tags, PR refs, notes. Git's own integrity model does the verification. Restore is `git push --mirror` — a Git primitive, not a tool-specific import. A `tar` of the Gitea data PVC is fragile across forge versions, leaks auth tokens and webhook secrets into the backup blast radius, and locks the restore target to the same forge. A mirror clone restores cleanly to a fresh Gitea, to Forgejo, or to GitHub.
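
The ref coverage is easy to confirm directly. A quick check, with host and repo placeholders:

```bash
# A mirror clone is bare and carries every ref, including Gitea's PR refs.
git clone --mirror http://<host>/<owner>/<repo>.git /tmp/<repo>.git
git -C /tmp/<repo>.git for-each-ref --format='%(refname)'
# Expect refs/heads/*, refs/tags/*, and refs/pull/*/head in the output.
# A plain clone keeps remote branches under refs/remotes/origin/* and has no
# refs/pull/* at all; a default push restores neither.
```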

### Seven-day daily retention, not GFS

Daily snapshots with seven-day retention catch "a secret was committed three days ago and force-pushed over since" recovery scenarios. Beyond a week the Git history itself is the backup; older snapshots are mostly identical and waste drive space.

### Cap Docker Desktop memory explicitly

Docker Desktop auto-allocates around 60% of host RAM. Under workload on a 64 GiB host, that pushes macOS into jetsam territory, where `com.docker.backend` is the highest-memory process and the first thing macOS kills. Repeated SIGKILL of the backend is what corrupts the Data folder. **The DR event is preventable.** Cap at roughly 38% of host RAM (24 GiB on a 64 GiB host) via Docker Desktop → Settings → Resources, or via `~/Library/Group Containers/group.com.docker/settings-store.json` with `MemoryMiB: 24576`. The memory cap is the actual fix; renaming the Data folder is a red herring.
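
A minimal sketch of the settings-file route, assuming the `MemoryMiB` key above (quit Docker Desktop first; it rewrites the file on shutdown):

```bash
#!/usr/bin/env bash
# Cap Docker Desktop at 24 GiB by editing settings-store.json in place.
set -euo pipefail
SETTINGS="$HOME/Library/Group Containers/group.com.docker/settings-store.json"
python3 - "$SETTINGS" <<'EOF'
import json, sys

path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)
cfg["MemoryMiB"] = 24576  # ~38% of a 64 GiB host
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
```

Restart Docker Desktop afterwards and confirm the cap under Settings → Resources.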

## What the backup script does

The mechanism, step by step. Each step exists to defeat a specific failure mode that bites cron-scheduled backups on macOS.

1. **`set -euo pipefail`.** Fail fast. No silent partial backups.
2. **Cron-safe environment.** Explicit absolute paths to `kubectl`, `git`, `tar`, `curl`, `python3`, plus an explicit `PATH=/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin`. Cron's stripped `PATH` is the single most common reason nightly backups silently no-op.
3. **Self-bootstrapping port-forward.** Ping the service. If unreachable, run `kubectl port-forward` in the background, capture the PID, register a `trap cleanup EXIT` to kill it on exit. Removes the dependency on a separate `make port-forward` shell the operator has to remember to leave running.
4. **Enumerate via API, not a hardcoded list.** For Gitea: `GET /api/v1/repos/search?owner=$OWNER&limit=50`. New repos get backed up automatically the next night. Hardcoded lists drift.
5. **Mirror clone, not regular clone.** `git clone --mirror` preserves every ref. A plain `git clone` gets the default branch plus remote-tracking branches, and a restore silently loses tags and PR refs.
6. **Per-repo tarball.** `tar -czf <DEST>/<YYYY-MM-DD>/<repo>.tgz`. Per-repo (not one big tarball) so a single repo restores without extracting the rest, and a single corrupt repo doesn't poison the whole snapshot.
7. **MANIFEST.txt per day.** Tab-separated `<repo>\t<size_bytes>\t<sha256>\t<HEAD_ref>\t<HEAD_commit>`. Verifies integrity without extracting (`shasum -a 256`), confirms the right HEAD on restore, detects bit-rot on the backup drive.
8. **Day-directory rotation.** Loop over `$DEST_ROOT/2*`, compute age via `stat -f %m` (BSD/macOS) with a `stat -c %Y` (GNU) fallback, `rm -rf` if older than `RETENTION_DAYS`. Portable between macOS and Linux operators.
9. **Logs separate from snapshots.** `<DEST>/logs/backup-<date>.log`, retained 30 days. When the script silently fails in cron context, the log directory is the first place to look — and logs need to outlive the snapshots.
10. **Non-zero exit on any failure.** `exit 4` if any single repo failed. Cron surfaces this in the local mail spool or stderr capture, so the operator notices.

A redacted, templated form of the script:

```bash
#!/usr/bin/env bash
# Nightly mirror-clone backup of every repo owned by $OWNER on a self-hosted
# Gitea instance, to <DEST_ROOT>/<YYYY-MM-DD>/<repo>.tgz with a sha256 manifest.
set -euo pipefail

# --- absolute paths (cron has a stripped PATH) -----------------------------
export PATH=/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
KUBECTL=/opt/homebrew/bin/kubectl
GIT=/usr/bin/git
TAR=/usr/bin/tar
CURL=/usr/bin/curl
PY=/usr/bin/python3

# --- config (override via env) ---------------------------------------------
OWNER="${OWNER:-<owner>}"
GITEA_USER="${GITEA_USER:-<admin-user>}"
GITEA_PASS="${GITEA_PASS:-<admin-pass>}"
GITEA_HOST="${GITEA_HOST:-localhost:3000}"
GITEA_SVC_NS="${GITEA_SVC_NS:-<gitea-namespace>}"
GITEA_SVC="${GITEA_SVC:-<gitea-svc>}"
DEST_ROOT="${DEST_ROOT:-/Volumes/<your-backup-drive>/gitea-backups}"
RETENTION_DAYS="${RETENTION_DAYS:-7}"
LOG_RETENTION_DAYS="${LOG_RETENTION_DAYS:-30}"

DATE=$(date +%Y-%m-%d)
DEST="$DEST_ROOT/$DATE"
LOG_DIR="$DEST_ROOT/logs"
LOG_FILE="$LOG_DIR/backup-$DATE.log"
mkdir -p "$DEST" "$LOG_DIR"
exec > >(tee -a "$LOG_FILE") 2>&1

# --- self-bootstrap port-forward if needed ---------------------------------
PF_PID=""
cleanup() { [[ -n "$PF_PID" ]] && kill "$PF_PID" 2>/dev/null || true; }
trap cleanup EXIT

if ! "$CURL" -sf "http://$GITEA_HOST/api/v1/version" >/dev/null 2>&1; then
  "$KUBECTL" -n "$GITEA_SVC_NS" port-forward "svc/$GITEA_SVC" 3000:3000 \
    >>"$LOG_FILE" 2>&1 &
  PF_PID=$!
  sleep 3
fi

# --- enumerate repos via API ------------------------------------------------
echo "--- listing repos owned by $OWNER ---"
# macOS /bin/bash is 3.2 and lacks mapfile (the cron shebang resolves to it);
# build the array with a portable while-read instead.
REPOS=()
while IFS= read -r name; do REPOS+=("$name"); done < <(
  "$CURL" -sf -u "$GITEA_USER:$GITEA_PASS" \
    "http://$GITEA_HOST/api/v1/repos/search?owner=$OWNER&limit=50" \
    | "$PY" -c 'import json,sys
for r in json.load(sys.stdin)["data"]: print(r["name"])'
)
echo "found ${#REPOS[@]} repos"

MANIFEST="$DEST/MANIFEST.txt"
: > "$MANIFEST"
FAIL=0
TOTAL=0

for repo in "${REPOS[@]}"; do
  url="http://$GITEA_USER:$GITEA_PASS@$GITEA_HOST/$OWNER/$repo.git"
  workdir=$(mktemp -d)
  if ! "$GIT" clone --mirror -q "$url" "$workdir/$repo.git" 2>>"$LOG_FILE"; then
    echo "  $repo  FAIL  clone"; FAIL=$((FAIL+1)); rm -rf "$workdir"; continue
  fi
  head_ref=$("$GIT" -C "$workdir/$repo.git" symbolic-ref HEAD 2>/dev/null || echo "-")
  head_sha=$("$GIT" -C "$workdir/$repo.git" rev-parse HEAD 2>/dev/null || echo "-")
  tarball="$DEST/$repo.tgz"
  ( cd "$workdir" && "$TAR" -czf "$tarball" "$repo.git" )
  size=$(stat -f %z "$tarball" 2>/dev/null || stat -c %s "$tarball")
  sha=$(shasum -a 256 "$tarball" | awk '{print $1}')
  printf "%s\t%s\t%s\t%s\t%s\n" "$repo" "$size" "$sha" "$head_ref" "$head_sha" \
    >> "$MANIFEST"
  TOTAL=$((TOTAL + size))
  printf "  %-40s OK   %s  %s\n" "$repo" \
    "$(printf '%d' "$size" | awk '{printf "%.1fMB", $1/1024/1024}')" \
    "${head_sha:0:8}"
  rm -rf "$workdir"
done

# --- prune old day-dirs -----------------------------------------------------
echo "--- pruning day-dirs older than $RETENTION_DAYS days ---"
now=$(date +%s)
for d in "$DEST_ROOT"/2*; do
  [[ -d "$d" ]] || continue
  mtime=$(stat -f %m "$d" 2>/dev/null || stat -c %Y "$d")
  age_days=$(( (now - mtime) / 86400 ))
  if (( age_days > RETENTION_DAYS )); then
    echo "  prune $d (age ${age_days}d)"
    rm -rf "$d"
  fi
done

# --- summary ----------------------------------------------------------------
echo "DONE $DATE  ok:$((${#REPOS[@]} - FAIL))  fail:$FAIL  size:$((TOTAL/1024/1024))MB"
(( FAIL > 0 )) && exit 4
exit 0
```

Cron entry (operator's user crontab):

```
30 2 * * * /path/to/bootstrap/scripts/backup-gitea-repos.sh
```

Around 4 seconds of wall time on a 20-repo set, around 26 MB of snapshots per day.

## macOS Full Disk Access — the silent failure mode

macOS Sequoia and several prior releases sandbox `cron` from `/Volumes/*` by default. The first nightly run silently fails with **no log file produced** — the `tee -a "$LOG_FILE"` itself can't write. The absence of a log file is the diagnostic.

Fix: System Settings → Privacy & Security → Full Disk Access → `+` → `/usr/sbin/cron` (use Cmd+Shift+G in the Finder picker to type the path directly). If managed-device policy blocks Full Disk Access, convert the cron entry to a `~/Library/LaunchAgents/*.plist` launchd job — permission prompts are then user-interactive instead of silent.
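
A minimal launchd agent equivalent to the cron entry above (the label and script path are placeholders):

```bash
# Write and load a user LaunchAgent that runs the backup at 02:30 daily.
cat > ~/Library/LaunchAgents/local.backup-gitea-repos.plist <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>local.backup-gitea-repos</string>
  <key>ProgramArguments</key>
  <array><string>/path/to/bootstrap/scripts/backup-gitea-repos.sh</string></array>
  <key>StartCalendarInterval</key>
  <dict><key>Hour</key><integer>2</integer><key>Minute</key><integer>30</integer></dict>
</dict>
</plist>
EOF
launchctl load ~/Library/LaunchAgents/local.backup-gitea-repos.plist
```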

## Restore procedure

The inverse of the backup. Three primitives.

```bash
# 1. Extract tarball — yields a bare <repo>.git directory
cd /tmp && tar -xzf /Volumes/<your-backup-drive>/gitea-backups/2026-05-05/<repo>.tgz

# 2. Verify HEAD against MANIFEST.txt
cd /tmp/<repo>.git && git rev-parse HEAD
# Compare to column 5 of the matching MANIFEST line

# 3. Push to a fresh empty remote
git push --mirror http://<admin-user>:<admin-pass>@<host>/<owner>/<repo>.git
```

`git push --mirror` is the inverse of `git clone --mirror`. It pushes every ref. Without `--mirror`, the restored repo silently lacks tags and PR refs and the loss is only noticed weeks later.

### Integrity verification without restore

```bash
shasum -a 256 /Volumes/<your-backup-drive>/gitea-backups/<date>/<repo>.tgz
# Compare to the third tab-separated column in MANIFEST.txt
```

Run after every backup-drive change — new drive, drive moved between machines, suspicious sounds. Catches bit-rot and bad-cable corruption before the backup is needed.
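
To check a whole day-dir rather than one tarball, loop over the manifest. A sketch assuming the tab-separated columns described earlier:

```bash
# Verify every tarball in a day-dir against its MANIFEST.txt sha256 column.
DAY="/Volumes/<your-backup-drive>/gitea-backups/<date>"
while IFS=$'\t' read -r repo size sha head_ref head_sha; do
  actual=$(shasum -a 256 "$DAY/$repo.tgz" | awk '{print $1}')
  [[ "$actual" == "$sha" ]] && echo "OK   $repo" || echo "FAIL $repo"
done < "$DAY/MANIFEST.txt"
```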

### When no script existed yet — recovering from ad-hoc local clones

When the disaster predates the backup script, the recovery procedure is different:

1. Find local checkouts. A code-indexer, a librarian agent, a developer-laptop checkout — anything that happens to have full clones from before the wipe. Check external drives **first**, before any destructive recovery dance.
2. For each repo: `POST /api/v1/user/repos` to create the empty target, then `git push --all` and `git push --tags` to the new remote (a sketch follows this list).
3. Re-wire CI: webhooks, branch protection, deploy keys.
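
A sketch of step 2 for one repo, reusing the backup script's variable names (the local-clone path is a placeholder):

```bash
# Create the empty target repo through the Gitea API, then push every
# branch and tag from the surviving local clone.
curl -sf -u "$GITEA_USER:$GITEA_PASS" -X POST \
  -H "Content-Type: application/json" \
  -d "{\"name\":\"$repo\",\"private\":true}" \
  "http://$GITEA_HOST/api/v1/user/repos"
url="http://$GITEA_USER:$GITEA_PASS@$GITEA_HOST/$OWNER/$repo.git"
git -C "/path/to/local-clone/$repo" push --all "$url"
git -C "/path/to/local-clone/$repo" push --tags "$url"
```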

This procedure is what runs once. The scheduled script is what runs forever after.

## A real incident: how 13 repos were nearly lost

The script in this article exists because of a specific event. The chronology is worth telling because it shows what actually goes wrong, in what order, and how long it takes to fix systemically.

**Day −3, three days before the wipe.** A cluster resource change required restarting Docker Desktop. The operator ran `minikube delete` to free the old VM before recreating it with adjusted memory. `minikube delete` is not a resource-change command — it is a data-destruction command that happens to free resources. Every PVC, every Secret, every event in the hub state, every Mattermost message, every backlog item: gone. The forensic trail was four lines:

```
🔥  Deleting "minikube" in docker ...
🔥  Deleting container "minikube" ...
🔥  Removing /Users/<user>/.minikube/machines/minikube ...
💀  Removed all traces of the "minikube" profile.
```

Lesson learned, no backup script written yet.

**Day 0, the wipe itself.** Docker Desktop's default memory allocation on a 64 GiB host put it around 38 GiB. Under workload, macOS jetsam killed `com.docker.backend` repeatedly. The exact signature in `~/Library/Containers/com.docker.docker/Data/log/host/com.docker.backend.log`:

```
agent-api: context cancelled
desktop state:ExitHealthyState
backend cancelled with error: <nil>
  at backend.go:560
```

After enough SIGKILLs, the `Docker.raw` file was corrupt. The Docker VM would not start. Renaming the Data folder (the support-forum advice) didn't help — the underlying cause was memory pressure, not file-layout pathology. The cluster was gone: every PVC, every container image, every cached layer, every persistent service.

**Day 0, the inventory.** Around 13 repos were declared lost. During a rebuild it is tempting to skip restoring repos that look deprecated or replaceable; the operator had assumed those repos were recoverable from a recent push elsewhere and moved on. They weren't. The only remote was the cluster.

**Day 0, the side-channel windfall.** An indexer agent had cloned every repo to an external drive a few days earlier for unrelated reasons. The clones were complete, recent, and on a different physical disk. Recovery took an evening: create empty repos through the Gitea API, push every ref, re-wire CI hooks. Nothing was actually lost.

**Day 9, the systemic fix.** The backup script in the previous section landed nine days after the incident. Honest reporting: a backup that depends on an unrelated agent happening to have a recent local clone is not a backup — it's a coincidence that worked once.

Lessons that survived the incident:

- **`minikube delete` is not a resource-change command — it is a data-destruction command that happens to free resources.**
- **Docker Desktop's auto-allocated memory is the disaster you're recovering from; capping it is cheaper than restoring from backup.**
- **Recovery from a side-channel (a developer's local clone) is not a backup strategy. It's how you find out you needed one.**

## What the script does NOT back up

Be explicit about what is excluded, because any claim of a "complete DR posture" depends on knowing what is accepted as loss.

| Class | In script? | Recovery posture |
|---|---|---|
| Gitea repos (every ref, every tag) | Yes | `git push --mirror` from tarball |
| Gitea database (users, hooks, tokens) | No | Recreate from declarative config |
| PostgreSQL data (app state) | No | Rebuild from empty; round-2 add `kubectl exec ... pg_dump` if needed |
| Mattermost messages, uploads | No | Accept as loss on a lab cluster |
| Container images | No | Rebuild from source |
| Secrets, ConfigMaps | No | Recreate from sealed-secrets manifests in the repo |
| etcd state | No | Rebuild on `kubectl apply` from manifests in the backed-up repos |

The defensible posture for a single-node lab cluster is: **back up the source of truth (repos), accept that everything derived from it rebuilds**. If application state matters, add a second cron job that does `kubectl exec <postgres-pod> -- pg_dump -U <user> <db> > $DEST/<date>/db.sql` and lives next to the repo backups. The pattern is identical: external drive, manifest, retention prune.
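
A sketch of that second job, with the namespace, pod, and database names as placeholders:

```bash
#!/usr/bin/env bash
# Companion job: dump Postgres app state next to the repo backups.
set -euo pipefail
DEST_ROOT="/Volumes/<your-backup-drive>/gitea-backups"
DATE=$(date +%Y-%m-%d)
mkdir -p "$DEST_ROOT/$DATE"
kubectl -n <namespace> exec <postgres-pod> -- \
  pg_dump -U <user> <db> > "$DEST_ROOT/$DATE/db.sql"

# Same manifest idea, kept in its own file so the repo MANIFEST.txt format
# stays intact; the same 2* day-dir retention loop as the repo script applies.
sha=$(shasum -a 256 "$DEST_ROOT/$DATE/db.sql" | awk '{print $1}')
size=$(stat -f %z "$DEST_ROOT/$DATE/db.sql" 2>/dev/null \
  || stat -c %s "$DEST_ROOT/$DATE/db.sql")
printf "db.sql\t%s\t%s\n" "$size" "$sha" >> "$DEST_ROOT/$DATE/DB-MANIFEST.txt"
```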

## Diagnostic signatures

A successful run lists each repo with `OK   <size>  <short-sha>`, prunes day-dirs older than retention, and ends with `DONE <date>  ok:N  fail:0  size:NMB`. Three common silent-failure modes:

- **Full Disk Access denied.** No log file at all at the next-night path. The absence *is* the diagnostic.
- **Cron `PATH` stripped.** Log file exists but contains `kubectl: command not found` or `git: command not found`. Fix is the absolute-path constants at the top of the script.
- **Port-forward race.** Log shows `curl: (7) Failed to connect to localhost port 3000`. Increase the `sleep 3` after the background `port-forward` to `sleep 5`, or add a retry loop polling `/api/v1/version`.
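
A bounded poll that replaces the fixed sleep, using the script's existing variables:

```bash
# Poll the Gitea API for up to 20 s instead of sleeping a fixed 3 s.
for _ in $(seq 1 20); do
  "$CURL" -sf "http://$GITEA_HOST/api/v1/version" >/dev/null 2>&1 && break
  sleep 1
done
```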

## Generalizing beyond minikube on macOS

The mechanism — mirror clone plus per-repo tarball plus sha256 manifest plus retention prune plus external disk — generalizes cleanly to any single-node Kubernetes setup hosting a Git forge. The failure modes do not.

On Linux single-node setups (k3s, kind, k0s, microk8s), the equivalent failure mode is `/var/lib/docker` filling the host disk, or the host disk itself dying. The script works the same; the macOS-specific caveats (`Docker.raw`, jetsam, Full Disk Access on `/Volumes`) drop out and are replaced by ext4/btrfs/zfs concerns and `systemd-cron` paths.

The single-node DR principle is forge-agnostic and OS-agnostic: **back up *out of the VM* — out of whatever the substrate is — or there is no backup**. Everything else is implementation detail.

