Platform Engineer Agent — Deployment Plan
An autonomous Hermes Agent that runs inside the k3s cluster, watches its health on a schedule, tries to fix simple problems, and notifies me (via Discord) when something needs my attention or a fix failed.
Docs: https://hermes-agent.nousresearch.com/docs/user-guide/docker
1. Goal & operating model
- One Hermes container in a new namespace `platform-engineer`, scheduled on the powerful amd64 node (`roger-nucbox-evo-x2`, 24 GiB RAM).
- Hermes runs in gateway mode under s6 supervision (`command: gateway run`), so the built-in cron scheduler is active and survives restarts.
- The agent talks to the cluster with `kubectl` from inside the container (terminal backend = `local`). We give the pod a ServiceAccount + ClusterRole scoped to read-mostly + restart/scale/delete-pod permissions.
- LLM calls are routed through the in-cluster LiteLLM proxy (`litellm.rogi.casa`), so no external API keys are needed in the cluster.
- Notifications go to Discord (reuse the pattern from `myorg-assistant`).
- A set of cron jobs (Hermes-native, not Kubernetes CronJobs) makes the agent run periodic checks. Watchdog checks use `[SILENT]` so it only pings me when something is wrong.
Why Hermes-native cron (not k8s CronJobs):
- Hermes cron ticks inside the gateway, runs in an isolated agent session, and supports `[SILENT]` suppression, `deliver="discord"`, `workdir`, and `context_from` chaining: far less plumbing than spawning a fresh pod per run.
- Cron jobs live in `~/.hermes/cron/jobs.json` on the PVC, so they survive pod restarts and can be edited live via `hermes cron edit` without redeploying.
2. Files to create (this directory)
platform-engineer/
├── namespace.yaml # namespace platform-engineer
├── rbac.yaml # ServiceAccount + ClusterRole (+binding)
├── configmap.yaml # hermes config.yaml + SOUL.md + cron seed script
├── secret.yaml # DISCORD bot token, LITELLM_API_KEY, kubeconfig-less SA token
├── pvc.yaml # persistent /opt/data (HERMES_HOME)
├── dockerfile # derived image: hermes-agent + kubectl + helm
├── deployment.yaml # Deployment, schedules on amd64, mounts kube SA token
├── ingress.yaml # hermes.rogi.casa → dashboard (optional)
└── README.md # this file
Then add a line to argocd/gen-apps.sh APPS=(...):
"platform-engineer|platform-engineer|platform-engineer|true|true"
and re-run ./argocd/gen-apps.sh to generate argocd/apps/platform-engineer.yaml
so ArgoCD reconciles it like every other app in the repo.
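For orientation, the generated Application might look roughly like the sketch below. This is not the output of `gen-apps.sh`: I am assuming the five pipe-separated fields mean name, path, namespace, prune, and selfHeal, and the `repoURL`, `project`, and `CreateNamespace` option are placeholders that do not come from this repo.

```yaml
# Hypothetical shape of argocd/apps/platform-engineer.yaml (assumptions noted inline).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-engineer
  namespace: argocd
spec:
  project: default                                   # assumed project
  source:
    repoURL: https://git.example.invalid/repo.git    # placeholder, not the real repo URL
    targetRevision: HEAD
    path: platform-engineer
  destination:
    server: https://kubernetes.default.svc
    namespace: platform-engineer
  syncPolicy:
    automated:
      prune: true        # assumed to map to the 4th field
      selfHeal: true     # assumed to map to the 5th field
    syncOptions:
      - CreateNamespace=true   # assumption; redundant if namespace.yaml is synced
```

Check the real generated file before relying on any of these fields.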
3. RBAC — least privilege
ServiceAccount platform-engineer in ns platform-engineer, bound to a
ClusterRole scoped to platform engineer actions:
Read (get/list/watch): nodes, pods, services, deployments, statefulsets, daemonsets, replicasets, jobs, cronjobs, events, configmaps, secrets, PVCs, ingresses, namespaces.
Act (patch/update on an allowlist):
- `pods` → `delete` (force-restart a stuck pod), `patch` (/evict, annotations)
- `deployments`, `statefulsets`, `daemonsets`, `replicasets` → `patch` (restart via `kubectl rollout restart` / scale), `update`
- `jobs`, `cronjobs` → `delete`, `patch`
- `pods/exec` (subresource) → `create` (only if we want the agent to `kubectl exec` into pods for log-style debugging; optional, keep off initially)
- `events` → `get`/`list`/`watch` only
No cluster-scoped writes (no creating namespaces, no node taints, no RBAC
edits, no CRDs). The agent can propose those and tell me; it cannot do them
itself. All mutating calls are auditable via Kubernetes audit logs, and the
effective permissions can be checked with
`kubectl auth can-i --list --as=system:serviceaccount:platform-engineer:platform-engineer`.
The pod uses the k3s in-cluster ServiceAccount token (/var/run/secrets/... /serviceaccount/token) + the KUBERNETES_SERVICE_HOST/PORT env vars k3s already
injects — no kubeconfig file, no long-lived token on disk.
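The read/act split above could be expressed roughly as follows. This is a sketch, not the contents of `rbac.yaml`: the object names mirror the ServiceAccount name used in this plan, but the grouping of rules is my own, and `pods/exec` is left out to match the "keep off initially" stance.

```yaml
# Sketch of rbac.yaml (rule grouping is an assumption).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: platform-engineer
  namespace: platform-engineer
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: platform-engineer
rules:
  # Read: get/list/watch on the resources listed under "Read".
  - apiGroups: [""]
    resources: [nodes, pods, services, events, configmaps, secrets, persistentvolumeclaims, namespaces]
    verbs: [get, list, watch]
  - apiGroups: [apps]
    resources: [deployments, statefulsets, daemonsets, replicasets]
    verbs: [get, list, watch]
  - apiGroups: [batch]
    resources: [jobs, cronjobs]
    verbs: [get, list, watch]
  - apiGroups: [networking.k8s.io]
    resources: [ingresses]
    verbs: [get, list, watch]
  # Act: the allowlisted mutations only.
  - apiGroups: [""]
    resources: [pods]
    verbs: [delete, patch]
  - apiGroups: [apps]
    resources: [deployments, statefulsets, daemonsets, replicasets]
    verbs: [patch, update]
  - apiGroups: [batch]
    resources: [jobs, cronjobs]
    verbs: [delete, patch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: platform-engineer
subjects:
  - kind: ServiceAccount
    name: platform-engineer
    namespace: platform-engineer
```

A few things this sketch deliberately leaves open: container logs are read through the separate `pods/log` subresource, evictions are a `create` on `pods/eviction`, and, as far as I know, `kubectl scale` goes through the `scale` subresource. The ArgoCD and cert-manager checks in the cron section would also need read rules for those CRD API groups. Add each only when the agent actually needs it.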
4. Image — thin derived Dockerfile
FROM nousresearch/hermes-agent:latest
USER root
RUN apt-get update \
&& apt-get install -y --no-install-recommends curl gnupg \
&& curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.35/deb/Release.key \
| gpg --dearmor -o /usr/share/keyrings/kubernetes-apt-keyring.gpg \
&& echo 'deb [signed-by=/usr/share/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.35/deb/ /' \
> /etc/apt/sources.list.d/kubernetes.list \
&& apt-get update \
&& apt-get install -y --no-install-recommends kubectl \
&& curl -fsSL https://get.helm.sh/helm-v3.16.0-linux-amd64.tar.gz \
| tar -xz -C /usr/local/bin --strip-components=1 linux-amd64/helm \
&& rm -rf /var/lib/apt/lists/*
USER hermes
Note: the cluster is mixed arch (arm64/amd64/arm). The agent pod is pinned to the amd64 node, so the `linux-amd64` helm + `kubectl` packages are fine. If you later want it portable, switch to a multi-arch build with `TARGETARCH` and install the matching helm arch.
Build & push to your Gitea registry (git.rogi.casa/roger/...) — same
imagePullSecrets: gitea-registry pattern as gym-tracker. Tag with the
hermes version + a short git sha.
5. Hermes configuration (mounted via ConfigMap → /opt/data/config.yaml)
# config.yaml (seeded into the PVC on first boot)
model:
  provider: openai-api
  default: claude-4.5-haiku
  base_url: "https://litellm.rogi.casa/v1"
  api_mode: chat_completions
# Use a cheap, fast model for auxiliary tasks (titling, compression)
auxiliary:
  compression:
    provider: openai-api
    model: gemini-3-flash
  title_generation:
    provider: openai-api
    model: gemini-3-flash
terminal:
  backend: local
  cwd: /workspace # a working dir for any kubectl output / scratch
  timeout: 180
  home_mode: profile # isolate tool credentials under HERMES_HOME/home
# Unattended gateway → circuit-breaker on tool-call loops
tool_loop_guardrails:
  hard_stop_enabled: true
  hard_stop_after:
    exact_failure: 5
    idempotent_no_progress: 5
sessions:
  auto_prune: true
  retention_days: 90
cron:
  wrap_response: false # cleaner Discord messages
memory:
  memory_enabled: true
  user_profile_enabled: true
.env (from Secret, mounted to /opt/data/.env):
OPENAI_API_KEY=<LITELLM_API_KEY value, i.e. sk-...>
OPENAI_BASE_URL=https://litellm.rogi.casa/v1
DISCORD_BOT_TOKEN=<new dedicated bot token>
DISCORD_HOME_CHANNEL=<your user/channel id for alerts>
# Dashboard auth (homelab, trusted LAN behind ingress)
HERMES_DASHBOARD_BASIC_AUTH_USERNAME=roger
HERMES_DASHBOARD_BASIC_AUTH_PASSWORD=<strong password>
Why `OPENAI_API_KEY` + `OPENAI_BASE_URL`: the `openai-api` provider honours `OPENAI_BASE_URL`, so this is the simplest way to point Hermes at the in-cluster LiteLLM. `claude-4.5-haiku` / `gemini-3-flash` are the model names already exposed by your `litellm/litellm.yaml` ConfigMap.
SOUL.md (personality + guardrails) — see configmap.yaml. Key points:
- Identity: "Platform Engineer for the rogi.casa k3s cluster."
- Knows the cluster layout (3 nodes, ArgoCD GitOps, Traefik+cert-manager, LiteLLM, services list).
- Operating rules: read-first; only act on the allowlisted verbs; never edit RBAC / taints / namespaces / CRDs; when in doubt, notify instead of acting; always cite the resource and the command used.
- How to reach me: `deliver="discord"`.
6. Deployment
- `replicas: 1` (Hermes data dir is single-writer; never scale >1).
- `nodeSelector: kubernetes.io/arch: amd64` + preferred `hardware: high-memory` affinity → lands on the NUC.
- `resources`: requests 512Mi/250m, limits 2Gi/1 core (Hermes recommends 2–4 GiB; 1 GiB is fine without browser tools, which we keep off).
- Volume: PVC mounted at `/opt/data` (`HERMES_HOME`), RWX not needed (single pod).
- Ports: 8642 (gateway API, internal only) and 9119 (dashboard) → exposed via Ingress `hermes.rogi.casa` with TLS + basic-auth (already enforced by the `HERMES_DASHBOARD_BASIC_AUTH_*` env vars).
- `imagePullSecrets: gitea-registry`.
- env from Secret; `HERMES_DASHBOARD=1`.
- Init: on first boot the s6 `01-hermes-setup` hook seeds config/SOUL/.env from the ConfigMap if the volume is empty. We mount the ConfigMap as a read-only projection at `/opt/seed/` and run a tiny initContainer to copy it into `/opt/data` only when `/opt/data/config.yaml` doesn't exist (so ArgoCD self-heal never fights the agent's live-edited config).
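Put together, the bullets above might translate into a Deployment along these lines. Treat it as a sketch under assumptions rather than the real `deployment.yaml`: the pod label, the init image, the Secret/ConfigMap/PVC names, and the mapping of `command: gateway run` onto `args` are all placeholders of mine, and `<tag>` is left unfilled on purpose.

```yaml
# Sketch only. Names marked "assumed" are placeholders, not taken from the repo.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hermes
  namespace: platform-engineer
spec:
  replicas: 1                  # single-writer data dir
  strategy:
    type: Recreate             # stop the old pod before starting a new one
  selector:
    matchLabels:
      app: hermes              # assumed label
  template:
    metadata:
      labels:
        app: hermes
    spec:
      serviceAccountName: platform-engineer
      nodeSelector:
        kubernetes.io/arch: amd64
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: hardware
                    operator: In
                    values: ["high-memory"]
      imagePullSecrets:
        - name: gitea-registry
      initContainers:
        - name: seed-config    # assumed name
          image: busybox:1.36  # assumed; any image with sh and cp works
          command:
            - sh
            - -c
            - |
              # Copy the seed files only when no config exists yet.
              if [ ! -f /opt/data/config.yaml ]; then
                cp /opt/seed/config.yaml /opt/seed/SOUL.md /opt/data/
              fi
          volumeMounts:
            - name: seed
              mountPath: /opt/seed
              readOnly: true
            - name: data
              mountPath: /opt/data
      containers:
        - name: hermes
          image: git.rogi.casa/roger/hermes-agent:<tag>
          args: ["gateway", "run"]   # assumed mapping of "command: gateway run"
          env:
            - name: HERMES_DASHBOARD
              value: "1"
          envFrom:
            - secretRef:
                name: hermes-env     # assumed Secret name
          ports:
            - name: gateway
              containerPort: 8642
            - name: dashboard
              containerPort: 9119
          resources:
            requests:
              memory: 512Mi
              cpu: 250m
            limits:
              memory: 2Gi
              cpu: "1"
          volumeMounts:
            - name: data
              mountPath: /opt/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: hermes-data   # assumed PVC name
        - name: seed
          configMap:
            name: hermes-seed        # assumed ConfigMap name
```

One thing to verify when adapting this: the init container may write the seed files as a different user than the one Hermes runs as, so file ownership on the PVC may need a `securityContext` or a `chown` step.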
7. Cron jobs to seed (Hermes-native)
These are written by an init script (one-shot Job hermes-cron-seed) that runs
hermes cron create ... against the gateway on first install, and is idempotent
(it checks existing job names). All deliver to Discord. Examples:
| Name | Schedule | Prompt (abbreviated) |
|---|---|---|
| `cluster-health-check` | every 15m | Run `kubectl get nodes`, `kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded` and `kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp`, considering only events from the last 20 minutes (`kubectl get` has no `--since` flag, and the `status.phase` selector applies to pods, not nodes). If everything is healthy, reply with only `[SILENT]`. Otherwise summarize failures and root-cause briefly. |
| `pod-restart-loop` | every 10m | Find pods in CrashLoopBackOff/ImagePullBackOff across all namespaces. For CrashLoopBackOff, fetch logs and if a clear transient cause (OOM, config parse, missing secret) is visible, attempt `kubectl rollout restart <deploy>`; otherwise notify me with the log excerpt. Reply `[SILENT]` if none found. |
| `pvc-pressure` | every 30m | `kubectl get pv` + node disk state. `kubectl top nodes` reports only CPU and memory, so disk has to come from another source such as the node's `DiskPressure` condition. Alert if any PVC is bound to a near-full volume or node disk >85%. `[SILENT]` otherwise. |
| `argocd-sync-health` | every 1h | `kubectl get applications -n argocd -o wide` (or `argocd app sync --dry-run` if the CLI is present). Report any OutOfSync/Degraded app. `[SILENT]` if all Synced+Healthy. |
| `cert-expiry` | every 1d at 09:00 | List cert-manager `Certificate` resources with expiry < 21 days. Notify only if any. `[SILENT]` otherwise. |
| `node-resource-drift` | every 30m | `kubectl top nodes`. Alert if any node CPU>90% or mem>90% sustained, or any node NotReady. `[SILENT]` otherwise. |
| `daily-cluster-report` | `0 8 * * *` | Summarize: node count/status, top 5 pods by CPU/mem, # pods not Running, # ArgoCD apps OutOfSync, cert warnings. Always deliver (no `[SILENT]`). |
Design rules baked into SOUL.md:
- Read-only checks run frequently (10–30m) and stay silent unless wrong.
- Mutating actions are restricted to safe idempotent ones (rollout restart, delete stuck pod so controller recreates). Anything riskier → notify me with a proposed command and wait for me to run it (I can reply in Discord to the continuable thread).
- Cron sessions are isolated and cannot create new cron jobs (Hermes disables that inside cron runs) → no runaway loops.
8. Safety & guardrails
- RBAC is the real boundary. Even if the agent goes rogue, the SA can't touch other namespaces' secrets beyond read, can't change RBAC, can't taint nodes, can't create namespaces.
- `tool_loop_guardrails.hard_stop_enabled: true`: circuit-breaks a stuck gateway (recommended in the Docker doc for unattended deployments).
- `skills.write_approval: false` but `memory.write_approval: true` (so the agent can build skills/memories but I review memory writes lazily; flip this if it gets noisy).
- No `pods/exec` subresource initially (keep the agent from shelling into workloads). Enable later only if you want log-grep-style debugging.
- Dashboard behind ingress TLS + basic auth (the June-2026 hardening makes auth mandatory on non-loopback binds; we satisfy it with the bundled basic-auth provider).
- Single replica / single-writer PVC: the Docker doc is explicit that two gateways on the same `/opt/data` corrupt session/memory stores. Use a `podAntiAffinity` so an accidental scale-up doesn't co-run.
- ArgoCD interaction: keep `syncPolicy.automated.prune` + `selfHeal` but exclude the live-edited hermes state. Practically: Argo owns the manifests (deployment, configmap, secret, pvc), while `/opt/data` (config.yaml, cron/jobs.json, SOUL.md edits made via the dashboard) is runtime state on the PVC and is not reconciled by Argo. The ConfigMap only seeds it on first boot. Document this clearly in the README so future-you doesn't expect Argo to reset the agent's personality.
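The `podAntiAffinity` guard mentioned above could be sketched as the fragment below, placed under the Deployment's `spec.template.spec`. The `app: hermes` label is an assumption; it has to match whatever label the pod template actually carries.

```yaml
# Fragment for spec.template.spec of the Deployment (label is assumed).
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: hermes                      # assumed pod label
        topologyKey: kubernetes.io/hostname  # at most one such pod per node
```

Because the pod is also pinned to the single amd64 node, a second replica would then stay Pending instead of starting a second gateway against the same `/opt/data`.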
9. Rollout plan
- Build & push the derived image to `git.rogi.casa/roger/hermes-agent` (tag `v1.35-<sha>`).
- Create the namespace + RBAC + Secret + ConfigMap + PVC: `kubectl apply -f platform-engineer/`.
- Create the `platform-engineer` Discord bot, invite it, put its token + your channel id in `secret.yaml` (base64).
- Apply the Deployment; wait for the pod to go Running.
- `kubectl exec` in and run the one-shot cron seed: `hermes cron create ...` (or apply the `cron-seed` Job).
- Trigger the first `cluster-health-check` manually: `hermes cron run cluster-health-check`.
- Add the app to `argocd/gen-apps.sh`, regenerate, commit, push.
10. Decisions (locked in)
- Notifications: dedicated `platform-engineer` Discord bot → its own token in `secret.yaml` (`DISCORD_BOT_TOKEN`, `DISCORD_HOME_CHANNEL`).
- Dashboard: public at `hermes.rogi.casa` (Traefik TLS + cert-manager + the bundled Hermes basic-auth provider). Reach the dashboard on port 9119; the gateway API on 8642 is ClusterIP-only.
- Image: derived image pushed to `git.rogi.casa/roger/hermes-agent`, pulled via the existing `gitea-registry` imagePullSecret (must also exist in the `platform-engineer` ns; see deploy steps).
- Model: `qwen-3.6:27b` via the in-cluster Ollama box (`10.88.20.12:11434`), exposed through LiteLLM as `qwen-3.6:27b`. Added to `litellm/litellm.yaml`. Hermes reaches LiteLLM at `https://litellm.rogi.casa/v1` (never Ollama directly).
- pods/exec: granted (`pods/exec` → `create` in the ClusterRole) so the agent can `kubectl exec` for debugging. Note that `kubectl logs` is authorized separately, through `get` on the `pods/log` subresource.
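The dashboard decision above could be wired up with a Service and Ingress roughly like this. It is a sketch, not the repo's `ingress.yaml`: the Service name, pod label, TLS secret name, and the cert-manager issuer name are assumptions, and `traefik` is assumed to be the ingress class name.

```yaml
# Sketch only: names marked "assumed" are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: hermes                  # assumed Service name
  namespace: platform-engineer
spec:
  type: ClusterIP               # gateway API stays internal
  selector:
    app: hermes                 # assumed pod label
  ports:
    - name: gateway
      port: 8642
      targetPort: 8642
    - name: dashboard
      port: 9119
      targetPort: 9119
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hermes
  namespace: platform-engineer
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt   # assumed issuer name
spec:
  ingressClassName: traefik     # assumed class name
  tls:
    - hosts:
        - hermes.rogi.casa
      secretName: hermes-tls    # assumed secret name
  rules:
    - host: hermes.rogi.casa
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hermes
                port:
                  number: 9119  # only the dashboard is routed
```

Only port 9119 appears in the Ingress, so the gateway API on 8642 remains reachable from inside the cluster only.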
11. Deployment checklist (do in this order)
- Add the Ollama model to LiteLLM (already done in `litellm/litellm.yaml`): the `qwen-3.6:27b` entry points at `http://10.88.20.12:11434`. Make sure `qwen3.6:27b` is actually pulled on that Ollama host (`ollama pull qwen3.6:27b`). Apply: `kubectl apply -f litellm/` and restart the LiteLLM pod so the new config takes effect.
- Create the `gitea-registry` secret in the new namespace (ArgoCD won't create it, since it's not in the repo):

      kubectl create namespace platform-engineer
      kubectl create secret docker-registry gitea-registry \
        --docker-server=git.rogi.casa \
        --docker-username=<your-gitea-user> \
        --docker-password=<gitea-access-token> \
        --docker-email=<your-email> \
        -n platform-engineer

- Build & push the image: `./platform-engineer/build-and-push.sh` (after `docker login git.rogi.casa`).
- Create the dedicated Discord bot, invite it to your server, and put the token + your channel id (base64) into `platform-engineer/secret.yaml`. Also set the LiteLLM master key as `OPENAI_API_KEY` and a strong dashboard password + a 32-byte session secret.
- Commit & push the whole change. ArgoCD will create the namespace resources, deploy the pod, and bring up the ingress at `hermes.rogi.casa`.
- Seed the cron jobs: `kubectl apply -f platform-engineer/cron-seed.yaml` (one-shot Job). It waits for the hermes pod, then runs `hermes cron create ...` for each watchdog. Re-run it any time you want to re-seed after a wipe.
- Smoke test: trigger the first health check manually with `kubectl exec -n platform-engineer deploy/hermes -- hermes cron run cluster-health-check` and confirm the message lands in Discord.
- ArgoCD: the `Application` (`argocd/apps/platform-engineer.yaml`) is already generated. After commit, Argo will reconcile it like every other app.
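For the first checklist item, the LiteLLM model entry could look roughly like the fragment below. It follows LiteLLM's `model_list` config layout; the `ollama_chat/` provider prefix is my assumption about how the entry is routed, so compare it with what is actually in `litellm/litellm.yaml`.

```yaml
# Fragment of the LiteLLM proxy config (provider prefix is an assumption).
model_list:
  - model_name: "qwen-3.6:27b"            # name Hermes asks for
    litellm_params:
      model: "ollama_chat/qwen3.6:27b"    # tag as pulled on the Ollama host
      api_base: "http://10.88.20.12:11434"
```

The public name (`qwen-3.6:27b`) and the Ollama tag (`qwen3.6:27b`) differ by one hyphen, which is easy to mix up when debugging "model not found" errors.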
12. What ArgoCD owns vs. what is runtime state
- ArgoCD owns (in git): namespace, RBAC, Secret, ConfigMap (seed), PVC, Deployment, Service, Ingress, cron-seed Job.
- Runtime state (on the PVC, NOT reconciled): `config.yaml`, `SOUL.md`, `.env`, `cron/jobs.json`, `sessions/`, `memories/`, `skills/`. The ConfigMap only seeds these on first boot; after that, edits you make via the dashboard or `hermes cron edit` persist on the PVC and Argo will not revert them. If you ever want a hard reset, delete the PVC and re-apply.
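For reference, the claim that holds this runtime state could be as small as the sketch below. The claim name is hypothetical, and omitting `storageClassName` assumes the cluster's default storage class is acceptable.

```yaml
# Sketch of pvc.yaml (claim name is hypothetical).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hermes-data             # hypothetical name
  namespace: platform-engineer
spec:
  accessModes:
    - ReadWriteOnce             # single pod, so RWX is not needed
  resources:
    requests:
      storage: 5Gi
```

Deleting this claim is what performs the hard reset: on the next boot the volume is empty again and the ConfigMap seed is copied in.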
Files in this directory
| File | Purpose |
|---|---|
| `namespace.yaml` | namespace `platform-engineer` |
| `rbac.yaml` | ServiceAccount + ClusterRole (+binding), least-privilege |
| `configmap.yaml` | seed config.yaml + SOUL.md |
| `secret.yaml` | Discord token, LiteLLM key, dashboard auth (PLACEHOLDERS: fill in) |
| `pvc.yaml` | 5 Gi PVC for `/opt/data` |
| `dockerfile` | derived image: hermes-agent + kubectl + helm (linux/amd64) |
| `build-and-push.sh` | builds & pushes the image to the Gitea registry |
| `deployment.yaml` | Deployment (1 replica, Recreate, pinned to amd64 NUC) + Service |
| `ingress.yaml` | hermes.rogi.casa → dashboard (TLS + basic auth) |
| `cron-seed.yaml` | one-shot Job that creates the Hermes cron schedule |
Also changed outside this directory:
- `litellm/litellm.yaml`: added `qwen-3.6:27b` model entry.
- `argocd/gen-apps.sh` + `argocd/apps/platform-engineer.yaml`: ArgoCD Application for this folder.