petersweb-infra/nixos/CLAUDE.md
2026-06-23 02:40:51 -07:00


# petersweb-infra/nixos — CLAUDE.md
## What this repo is
NixOS configuration for a single Hetzner server ("mainframe") running Philip Peterson's personal/Quine Foundation infrastructure. One machine, one flake configuration: `nixosConfigurations.mainframe`.
## Applying changes
```bash
./apply.sh # git pull + nixos-rebuild switch --flake .#mainframe
# or manually:
nixos-rebuild switch --flake /root/petersweb-infra/nixos#mainframe
```
## File layout
| Path | Purpose |
|---|---|
| `flake.nix` | Single flake, defines `nixosConfigurations.mainframe` |
| `hetzner.nix` | Hardware config: GRUB on `/dev/sda`, static networking, openssh |
| `linux.nix` | Main system config: services, secrets, docker containers, ACME certs |
| `nginx.nix` | Nginx virtual hosts and reverse proxies |
| `firewall.nix` | Open TCP ports |
| `disk-config.nix` | disko disk layout |
| `cloned_repos/` | `pullomatic` configs for auto-pulling git repos to `/etc/pullomatic/` |
| `arion/` | Arion (docker-compose-like) for Forgejo |
| `arion-riverside/` | Arion for the Riverside service |
| `arion-openclaw/` | Arion for OpenClaw (gateway and control center) |
| `pullomatic/` | Rust tool that watches git remotes and pulls on a schedule |
| `invoke-ddns/` | Python DDNS updater for NearlyFreeSpeech DNS |
| `secrets/` | agenix-encrypted secrets |
| `keys/` | SSH public keys used as age recipients |
| `system/` | User definitions and home-manager config |
| `pdxdestiny/` | Static site files for pdxdestiny.com |
| `vnc-desktop/` | Dockerfile + build scripts for the KDE Plasma VNC desktop container |
## Secrets (agenix)
Secrets live in `secrets/*.age`. They are encrypted with the key in `keys/mainframe.pub` (which is identical to `/root/.ssh/id_rsa_nix.pub` on the server).
**Important:** Agenix uses three identity paths for decryption (see activation script):
1. `/etc/ssh/ssh_host_rsa_key`
2. `/etc/ssh/ssh_host_ed25519_key`
3. `/root/.ssh/id_rsa_nix` (**this is the actual working key**)
The decrypted secrets land at `/run/agenix/<name>` at boot.
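If it is ever unclear which of the three identities can actually decrypt a given secret, a loop like the following can answer it. This is a minimal sketch: the function name `first_working_identity` and the `AGE_BIN` override are hypothetical, and it assumes an `age` binary is reachable (for example from `nix shell nixpkgs#age`).

```bash
# Print the first identity file that can decrypt the given .age secret.
# Usage: first_working_identity SECRET IDENTITY...
first_working_identity() {
  local secret identity
  secret="$1"; shift
  for identity in "$@"; do
    # Skip identities that are not present or not readable on this machine.
    [ -r "$identity" ] || continue
    # AGE_BIN is a hypothetical override; it defaults to `age` on PATH.
    # Output is discarded so the plaintext never reaches the terminal.
    if "${AGE_BIN:-age}" -d -i "$identity" "$secret" >/dev/null 2>&1; then
      printf '%s\n' "$identity"
      return 0
    fi
  done
  return 1
}
```

Example call on the server: `first_working_identity secrets/webdav.age /etc/ssh/ssh_host_rsa_key /etc/ssh/ssh_host_ed25519_key /root/.ssh/id_rsa_nix`.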
### Secret format matters
The NixOS `gitea-actions-runner` module reads the token via `EnvironmentFile=`, so the secret file must be in `KEY=VALUE` format:
- `forgejo-runner-token.age` → must contain `TOKEN=<raw_token>` (not just the raw token)
- `nearlyfreespeech.age` → contains `NEARLYFREESPEECH_API_KEY=...` and `NEARLYFREESPEECH_LOGIN=...`
- `webdav.age` → contains `WEBDAV_PASSWORD=...`
- `anthropic-api-key.age` → contains `ANTHROPIC_API_KEY=...`
- `postmark.age` → contains `POSTMARK_SERVER_TOKEN=...`
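Before committing a re-encrypted secret, it can help to confirm the plaintext really is in `KEY=VALUE` form. The helper below is a sketch, not part of the repo: `is_env_format` is a hypothetical name, and the check is deliberately simpler than systemd's full `EnvironmentFile=` parsing (it accepts assignments, `#` comments, and blank lines only).

```bash
# Succeed only if every line on stdin is KEY=VALUE, a # comment, or blank.
is_env_format() {
  # grep -Ev prints the lines that do NOT match the allowed shapes;
  # the second grep succeeds if any such line exists, and `!` inverts that.
  ! grep -Ev '^([A-Za-z_][A-Za-z0-9_]*=.*|#.*|[[:space:]]*)$' | grep -q .
}
```

For example, piping the output of the `age -d` verification command into `is_env_format` would fail for a file that contains only a raw token.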
### Re-encrypting a secret
```bash
# Encrypt new content for the mainframe key
printf "TOKEN=newvalue\n" | nix run nixpkgs#age -- \
-r "$(cat /root/petersweb-infra/nixos/keys/mainframe.pub)" \
-o /root/petersweb-infra/nixos/secrets/forgejo-runner-token.age
# Verify it decrypts correctly
nix run nixpkgs#age -- -d -i /root/.ssh/id_rsa_nix \
/root/petersweb-infra/nixos/secrets/forgejo-runner-token.age
```
Note: `secrets/default.nix` is the agenix recipients file. The agenix CLI looks for `secrets.nix` by default, so using it with this repo's `default.nix` requires a symlink or passing the path manually. Using `age` directly (as above) is simpler.
## Key services
| Service | Description |
|---|---|
| `gitea-runner-ubuntu.service` | Forgejo (Gitea) Actions CI runner, uses docker images |
| `forgejo-arion.service` | Forgejo itself, run via Arion/Podman |
| `riverside-arion.service` | Riverside app, run via Arion/Docker |
| `podman-coldairnetworks-postgres.service` | PostgreSQL 16 on port 5432 (publicly exposed) |
| `podman-coldairnetworks-pgadmin.service` | pgAdmin 4 on port 5050 (localhost only) |
| `podman-navidrome.service` | Navidrome music server on port 4533 |
| `podman-nextcloud.service` | Nextcloud/SSH container on port 8087 |
| `podman-sync.io.service` | sync.io app on port 9090 |
| `podman-blog-quine.service` | Blog on port 3010 |
| `podman-coldairnetworks.service` | Cold Air Networks site on port 3012 |
| `podman-vnc-desktop.service` | KDE Plasma desktop, noVNC on port 6080 (localhost only) |
| `build-vnc-image.service` | Builds the VNC desktop image from `vnc-desktop/`; runs before `podman-vnc-desktop` |
| nginx | Reverse proxy + ACME certs for multiple domains |
## Virtualisation
- **Podman** is used for all OCI containers (`virtualisation.oci-containers.backend = "podman"`) — navidrome, nextcloud, blog, VNC desktop, etc. — and for Forgejo via Arion.
- **Docker** is still present for the Riverside Arion stack.
- `DOCKER_HOST` for the gitea-runner is set to `unix:///run/podman/podman.sock`.
- The gitea-runner runs docker images for CI jobs, so the `gitea-runner` user is in the `docker` and `podman` supplementary groups.
## PostgreSQL / pgAdmin (coldairnetworks)
Two Podman containers defined in `linux.nix` under `virtualisation.oci-containers`.
| Container | Image | Port | Role |
|---|---|---|---|
| `coldairnetworks-postgres` | `postgres:16` | 5432 (public) | PostgreSQL database |
| `coldairnetworks-pgadmin` | `dpage/pgadmin4` | 5050 (localhost) | pgAdmin 4 web UI |
### Credential files (not in git — create manually on server)
| Path | Contents |
|---|---|
| `/var/coldairnetworks-db/postgres.env` | `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB` |
| `/var/coldairnetworks-db/pgadmin.env` | `PGADMIN_DEFAULT_EMAIL`, `PGADMIN_DEFAULT_PASSWORD` |
| `/var/coldairnetworks-db/htpasswd` | nginx basic auth — generate with `htpasswd -c /var/coldairnetworks-db/htpasswd <user>` |
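Since the two `.env` files hold passwords and are created by hand, one way to write them with owner-only permissions is sketched below. `write_env_file` is a hypothetical helper, and the values in the example call are placeholders, not the real credentials.

```bash
# Write KEY=VALUE lines to a file readable by its owner only.
# Usage: write_env_file PATH KEY=VALUE...
write_env_file() {
  local target
  target="$1"; shift
  # The subshell keeps the umask change from leaking into the caller.
  ( umask 077; printf '%s\n' "$@" > "$target" ) || return 1
  # Tighten permissions even if the file already existed with a wider mode.
  chmod 600 "$target"
}
```

Example call, assuming `/var/coldairnetworks-db/` already exists: `write_env_file /var/coldairnetworks-db/postgres.env "POSTGRES_USER=admin" "POSTGRES_PASSWORD=change-me" "POSTGRES_DB=coldairnetworks"`.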
### Data directories
| Host path | Purpose |
|---|---|
| `/var/coldairnetworks-db/postgres` | PostgreSQL data (owned root:root) |
| `/var/coldairnetworks-db/pgadmin` | pgAdmin state (owned uid 5050 — the pgAdmin container user) |
### Access
- **Web UI**: `https://db.coldairnetworks.com` — nginx basic auth first, then pgAdmin login
- **Direct connection**: `psql -h mainframe.philippeterson.com -U admin -d coldairnetworks` (port 5432 open in firewall)
- **pgAdmin → PostgreSQL**: when adding a server in pgAdmin, use `host.containers.internal` as the hostname (Podman host gateway), port 5432
## VNC desktop
`podman-vnc-desktop.service` runs a KDE Plasma desktop inside a container, accessible via noVNC at `localhost:6080` (reverse-proxied by nginx). The image is built locally — no registry involved.
- **Image source**: `vnc-desktop/Dockerfile` (Ubuntu 24.04, TigerVNC, KDE, Firefox, patched Discover)
- **Auto-rebuild**: `build-vnc-image.service` runs on boot and on `nixos-rebuild switch` whenever `vnc-desktop/` changes. The trigger is `vncContext = builtins.path { path = ./vnc-desktop; }` — a Nix store path that invalidates when any file in the directory changes.
- **Auto-restart**: `podman-vnc-desktop.service` has `restartTriggers = [ vncContext ]`, so the container restarts automatically after a rebuild during `nixos-rebuild switch`.
- **Secrets**: `VNC_PASSWORD` and `ROOT_PASSWORD` come from `age.secrets.vnc-password`.
- **Discover logging**: `vnc-desktop/discover-logging/` contains a build-time patch (`patch.py`) that instruments `PKTransaction.cpp` with `qWarning` calls to diagnose hanging installs. Logs visible via `podman logs vnc-desktop`.
## Networking / DNS
- Dynamic DNS via `invoke-ddns` (NearlyFreeSpeech provider).
- ACME certs issued via DNS challenge for `philippeterson.com` and `webdav.philippeterson.com`.
- Forgejo accessible on ports 3000 (HTTP) and 2200 (SSH).
## OpenClaw
OpenClaw runs as two Arion/Podman containers defined in `arion-openclaw/arion-compose.nix`, both using `network_mode = "host"` so they share the host's `127.0.0.1`.
| Container | Image | Port | Role |
|---|---|---|---|
| `openclaw-gateway` | `node:22-alpine` | 18789 (WebSocket) | OpenClaw Gateway (`openclaw@latest`) |
| `openclaw` | `node:22-alpine` | 4310 (HTTP) | OpenClaw Control Center (SSR UI) |
### Volumes and paths
| Host path | Container path | Notes |
|---|---|---|
| `/var/openclaw/gateway` | `/app` (gateway), `/gateway` (app) | npm install location for `openclaw` package |
| `/var/openclaw/app` | `/app` | Control center git clone + runtime files |
| `/root/.openclaw` | `/root/.openclaw` | OpenClaw home; shared **read-write** by both containers |
`/root/.openclaw` must be **writable** in the app container (not `:ro`) — the CLI writes state files at startup and connection probes fail with EROFS otherwise.
The CLI's effective state dir is `/root/.openclaw/.openclaw/` (double-nested: the CLI treats `OPENCLAW_HOME` as HOME and appends `.openclaw/` internally).
### Auth and connectivity
- Gateway runs with `--auth none --dev`. In `--auth none` mode, clients must still present either a device identity (challenge-response) or any token via `OPENCLAW_GATEWAY_TOKEN`.
- `OPENCLAW_GATEWAY_TOKEN=openclaw-local-dev` is set in the app container — this lets the CLI probes connect immediately without waiting for device auto-approval.
- Device identity lives at `/root/.openclaw/.openclaw/identity/device.json`. In `--dev` mode the gateway auto-approves the local device after first contact.
- The control center calls `openclaw status --json` and `openclaw gateway status --json` as CLI subprocesses (not via WebSocket directly). The binary path is set via `OPENCLAW_BIN_PATH=/gateway/node_modules/.bin/openclaw`.
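When the CLI probes fail, the two conditions described above (a writable OpenClaw home and a device identity in the double-nested state directory) can be checked quickly. The function below is a sketch with a hypothetical name; it only inspects paths mentioned in this section and does not talk to the gateway.

```bash
# Check the OpenClaw home for the two preconditions the CLI probes rely on.
# Usage: check_openclaw_state [HOME_DIR]   (defaults to /root/.openclaw)
check_openclaw_state() {
  local home state_dir identity
  home="${1:-/root/.openclaw}"
  state_dir="$home/.openclaw"            # double-nested effective state dir
  identity="$state_dir/identity/device.json"
  if [ ! -w "$home" ]; then
    echo "not writable: $home (CLI probes would fail with EROFS)"
    return 1
  fi
  if [ ! -f "$identity" ]; then
    echo "missing device identity: $identity"
    return 2
  fi
  echo "ok: $identity"
}
```

Running it inside the app container (for example via `podman exec openclaw sh`) reflects what the CLI subprocesses see, which may differ from the host view if the volume is mounted `:ro`.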
### nginx
`claw.quineglobal.com` is proxied to `127.0.0.1:4310`. Key settings:
- `forceSSL = false; addSSL = true` — Cloudflare Flexible SSL sends plain HTTP to origin; `forceSSL = true` would create a redirect loop.
- `basicAuthFile = "/var/openclaw/htpasswd"` — credentials: `ironmagma / Nargism333`.
- WebSocket upgrade headers are set (`Upgrade`, `Connection: upgrade`) so the control center's live-update SSE works through the proxy.
### Control center startup sequence
The app container startup script (in `arion-compose.nix`):
1. `apk add git`
2. Clones `https://github.com/TianyiDataScience/openclaw-control-center.git` to `/app/repo` (once)
3. Patches `src/ui/server.ts` and `src/runtime/ui-preferences.ts` via `sed` to default language to `"en"` instead of `"zh"`
4. `npm install && npm run build && npm run dev:ui`
### Usage connector sources
The Settings → Usage panel tracks 6 data sources. Current status:
| Source | Status | How to connect |
|---|---|---|
| Context capacity | Connected | `runtime/model-context-catalog.json` exists at `/var/openclaw/app/repo/runtime/` |
| Provider attribution | Connected | Derived from context catalog |
| Digest history | Partial (auto) | Builds up as the monitor runs over time |
| Request counts | Not connected | Needs real AI requests through the gateway |
| Budget limit | Not connected | Add cost thresholds to agent config |
| Subscription usage | Not connected | Add `runtime/subscription-snapshot.json` or provider billing snapshot |
The `model-context-catalog.json` format:
```json
{ "models": [{ "match": "gpt-5.5", "contextWindowTokens": 200000, "provider": "openai" }, ...] }
```
`match` is compared case-insensitively against the model name reported by the runtime.
### Restarting / rebuilding
After changing `arion-compose.nix`, a `nixos-rebuild switch` regenerates the compose YAML but **does not recreate running containers**. You must force recreation:
```bash
podman rm -f openclaw # or openclaw-gateway
systemctl restart arion-openclaw
```
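The two commands above can be wrapped so that the container name is validated and the commands can be previewed first. This is a sketch only: `recreate_openclaw`, `run`, and the `DRY_RUN` variable are hypothetical names, not something the repo provides.

```bash
# Echo the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Force-recreate one OpenClaw container after a nixos-rebuild switch.
# Usage: recreate_openclaw [openclaw|openclaw-gateway]
recreate_openclaw() {
  local container
  container="${1:-openclaw}"
  case "$container" in
    openclaw|openclaw-gateway) ;;
    *) echo "unknown container: $container" >&2; return 2 ;;
  esac
  # Remove the stale container, then let the Arion unit create a fresh one.
  run podman rm -f "$container"
  run systemctl restart arion-openclaw
}
```

`DRY_RUN=1 recreate_openclaw openclaw-gateway` prints the two commands without running them.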
### Cloudflare SSL gotcha
This server sits behind Cloudflare in **Flexible** mode (Cloudflare → origin over plain HTTP). Any `nginx.nix` virtualHost for a Cloudflare-proxied domain must use `forceSSL = false; addSSL = true`, not `forceSSL = true`. The latter causes an infinite redirect loop because Cloudflare sends HTTP but nginx redirects to HTTPS, which Cloudflare re-proxies as HTTP again.
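To spot a virtualHost that would loop, one option is to request it over plain HTTP at the origin and look for an HTTP-to-HTTPS redirect. The filter below is a sketch with a hypothetical name; it only classifies response headers read from stdin.

```bash
# Read HTTP response headers on stdin; succeed if they redirect to https://.
redirects_to_https() {
  awk '
    NR == 1 { status = $2 }                      # e.g. "HTTP/1.1 301 Moved..."
    tolower($1) == "location:" && tolower($2) ~ /^https:\/\// { https = 1 }
    END {
      redirect = (status == 301 || status == 302 || status == 307 || status == 308)
      exit !(redirect && https)
    }
  '
}
```

On the server, something like `curl -sI -H 'Host: claw.quineglobal.com' http://127.0.0.1/ | redirects_to_https && echo "would loop behind Flexible SSL"` exercises it. A `401` from basic auth is the expected healthy answer for this host.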
## Known gotchas
- `gitea-runner` is a `DynamicUser` in the systemd service, so it has no persistent uid. Setting `age.secrets.forgejo-runner-token.owner = "gitea-runner"` causes a chown error at activation; use `owner = "root"` instead. This works because systemd reads the `EnvironmentFile` as root before dropping privileges for the service.
- `secrets/default.nix` must have the public key from `keys/mainframe.pub` as the recipient — if the host SSH keys change, you must also update `mainframe.pub` and re-key all secrets.
- `pullomatic` uses `/root/.ssh/id_rsa.pem` (a PEM-format SSH key) to pull private git repos.
- **ACME cyclic dependency list**: `linux.nix` has a `systemd.services.nginx.after = lib.mkForce [...]` list that breaks a systemd cycle between nginx and ACME services. Every new domain added with `enableACME = true` in `nginx.nix` **must** also have its `acme-selfsigned-<domain>.service` added to this list in `linux.nix`, otherwise nixos-rebuild will fail with a cyclic dependency error.
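The ACME rule in the last item is easy to forget, so a textual pre-check before `nixos-rebuild` can help. The sketch below uses a hypothetical function name and only greps for the unit name anywhere in the file; it does not verify that the entry sits inside the `systemd.services.nginx.after` list.

```bash
# Print every domain whose acme-selfsigned unit is not mentioned in the file.
# Usage: missing_acme_units LINUX_NIX DOMAIN...
missing_acme_units() {
  local nix_file domain unit
  nix_file="$1"; shift
  for domain in "$@"; do
    unit="acme-selfsigned-$domain.service"
    # -F matches a fixed string, so dots in the domain are not regex wildcards.
    grep -qF -- "$unit" "$nix_file" || printf '%s\n' "$domain"
  done
}
```

For example, `missing_acme_units linux.nix new.example.org` (a placeholder domain) prints the domain if its unit still needs to be added.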