new platform engineer agent

monitoring: add dashboard ideas doc
Survey of dashboards that could be built from existing and not-yet-enabled metrics across the cluster's services (traefik, coredns, metallb, cert-manager, phoenix, litellm, gitea, postgres, etc.), with per-service enable steps and a recommended priority order.
2026-06-27 00:09:39 +02:00 · 2026-06-26 20:22:54 +02:00
14 changed files with 1280 additions and 1 deletions
--- a/argocd/apps/platform-engineer.yaml
+++ b/argocd/apps/platform-engineer.yaml
@@ -0,0 +1,24 @@
 apiVersion: argoproj.io/v1alpha1
 kind: Application
 metadata:
  name: platform-engineer
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"
 spec:
  project: k3s-cluster
  source:
    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
    targetRevision: main
    path: platform-engineer
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: platform-engineer
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=false
--- a/argocd/gen-apps.sh
+++ b/argocd/gen-apps.sh
@@ -38,6 +38,7 @@ APPS=(
  "openwebui|openwebui|openwebui|true|true"
  "phoenix|phoenix|phoenix|true|false"
  "pihole|pihole|pihole|true|true"
  "platform-engineer|platform-engineer|platform-engineer|true|true"
  "qbittorrent|qbittorrent|qbittorrent|true|true"
  "vaultwarden|vaultwarden|vaultwarden|true|true"
 )
--- a/litellm/litellm.yaml
+++ b/litellm/litellm.yaml
@@ -26,7 +26,12 @@ data:
      - model_name: glm-4.7-flash
        litellm_params:
          model: ollama/glm-4.7-flash
-          api_base: http://10.88.88.235:11434
+          api_base: http://10.88.20.12:11434
      # Used by the platform-engineer Hermes agent (deployed in ns platform-engineer).
      - model_name: qwen-3.6:27b
        litellm_params:
          model: ollama/qwen3.6:27b
          api_base: http://10.88.20.12:11434
    litellm_settings:
      #set_verbose: True  # Uncomment this if you want to see verbose logs; not recommended in production
      callbacks: ["arize_phoenix"]
--- a/monitoring/dashboard-ideas.md
+++ b/monitoring/dashboard-ideas.md
@@ -0,0 +1,354 @@
 # Dashboard Ideas
 This file collects ideas for additional Grafana dashboards to build for the
 `rogi.casa` k3s cluster. Each idea notes the **data source** (metrics already
 available vs. metrics that need to be enabled) and a rough panel layout.
 To actually add a dashboard, create a `grafana-dashboard-<name>.yaml` ConfigMap
 in this folder, mount it in `grafana-deployment.yaml` (add a volume +
 volumeMount under `/var/lib/grafana/dashboards/<name>`), commit and push.
 ---
 ## Already-scraped services (ready to dashboard now)
 These exporters/services are **already being scraped by Prometheus** — dashboards
 can be built immediately with no infra changes.
 ### 1. Traefik (Ingress) — `traefik_*`
 Traefik is scraped via the `kubernetes-pods` job (pod annotation on
 `traefik-9bcdbbd9-x8zq4` in `kube-system`). It exposes request counters, entry
 point latency, TLS handshakes, config reloads.
 **Panels:**
 - Requests/sec by entrypoint (web / websecure / traefik) — `rate(traefik_entrypoint_requests_total[5m])`
 - Request latency p50/p95/p99 — `histogram_quantile(0.95, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le, entrypoint))`
 - HTTP status code distribution (2xx/3xx/4xx/5xx) — `rate(traefik_entrypoint_requests_total{code=~"2xx|3xx|4xx|5xx"}[5m])`
 - TLS handshakes/sec — `rate(traefik_entrypoint_requests_tls_total[5m])`
 - Config reloads + last reload success — `traefik_config_reloads_total`, `traefik_config_last_reload_success`
 - Top routes/services by request volume — `topk(10, sum by (service) (rate(traefik_service_requests_total[5m])))`
 - Bytes transferred in/out — `rate(traefik_entrypoint_requests_bytes_total[5m])`
 **Why useful:** This is your front door. Knowing which routes get hit most,
 latency per ingress, and 5xx spikes is the single most valuable app-level
 dashboard in the cluster.
 ---
 ### 2. CoreDNS (cluster DNS) — `coredns_*`
 Scraped via `kube-dns` Service annotation. Exposes query rate, cache hits,
 error types, response duration.
 **Panels:**
 - DNS queries/sec by zone / type — `rate(coredns_dns_requests_total[5m])`
 - Cache hit ratio — `rate(coredns_cache_hits_total[5m]) / rate(coredns_cache_requests_total[5m])`
 - DNS query latency p95 — `histogram_quantile(0.95, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))`
 - Queries by response code (NOERROR / NXDOMAIN / SERVFAIL) — `rate(coredns_dns_responses_total[5m])`
 - Cache size — `coredns_cache_entries`
 - Forward requests/sec (upstream DNS) — `rate(coredns_forward_requests_total[5m])`
 **Why useful:** DNS issues cause cascading failures (ImagePullBackOff, cert
 challenges, etc.). A spike in NXDOMAIN/SERVFAIL is an early warning.
 ---
 ### 3. MetalLB (LoadBalancer) — `metallb_*`
 Scraped via pod annotation on `speaker-*` and `controller` in `metallb-system`.
 Exposes IP allocation usage, BGP/session state.
 **Panels:**
 - IP addresses in use vs. total — `metallb_allocator_addresses_in_use_total` / `metallb_allocator_addresses_total`
 - IP pool utilization % (gauge) — `metallb_allocator_addresses_in_use_total / metallb_allocator_addresses_total * 100`
 - BGP session up per speaker — `metallb_bgp_session_up`
 - Config loaded / stale status — `metallb_k8s_client_config_loaded_bool`, `metallb_k8s_client_config_stale_bool`
 - Announcements per speaker — `rate(metallb_bgp_announcements_total[5m])`
 **Why useful:** If MetalLB runs out of IPs, new LoadBalancer services will
 hang in `<pending>`. Knowing pool utilization lets you act before that happens.
 ---
 ### 4. cert-manager (TLS certificates) — `certmanager_*`
 Scraped via pod annotations on cert-manager pods. Exposes certificate
 expiration, renewal, ready status, ACME challenges.
 **Panels:**
 - Certificate expiration (days remaining, sorted) — table of `(certmanager_certificate_not_after_timestamp_seconds - time()) / 86400`
 - Certificates not Ready — `certmanager_certificate_ready_status{condition="Ready",status!="True"}`
 - Upcoming renewals (next 14 days) — `certmanager_certificate_renewal_timestamp_seconds`
 - ACME challenge status — `certmanager_certificate_challenge_status`
 - Failed renewals counter — `rate(certmanager_certificate_renewal_total{condition="Failed"}[1h])`
 **Why useful:** A cert about to expire (or silently failing to renew) is the
 kind of thing that takes down `*.rogi.casa` HTTPS with no warning. This is a
 must-have alert/dashboard.
 ---
 ### 5. Phoenix (trace store) — `phoenix_*`
 Already scraped via the `phoenix` Service annotation. Exposes bulk loader
 ingestion rates, span insertion times, retention sweeper, exceptions.
 **Panels:**
 - Span ingestion rate — `rate(phoenix_bulk_loader_span_insertion_time_seconds_count[5m])`
 - Span insertion latency p95 — `histogram_quantile(0.95, sum(rate(phoenix_bulk_loader_span_insertion_time_seconds_bucket[5m])) by (le))`
 - Span exceptions/sec — `rate(phoenix_bulk_loader_span_exceptions_total[5m])`
 - Retention sweeper last run — `phoenix_retention_sweeper_last_run_seconds`
 - Last activity timestamp — `phoenix_bulk_loader_last_activity_timestamp_seconds`
 **Why useful:** Phoenix is your observability backend's own backend. Tracking
 ingestion health tells you whether traces are landing.
 ---
 ## Infrastructure dashboards (compose from existing metrics)
 ### 6. Storage & PVC Health (KSM + kubelet + node-exporter)
 Cross-source dashboard combining `kube_persistentvolumeclaim_*` (KSM),
 `kubelet_volume_stats_*` (kubelet), and `node_filesystem_*` (node-exporter).
 **Panels:**
 - PVC usage % per claim — `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100`
 - PVC requested vs. capacity — `kube_persistentvolumeclaim_resource_requests_storage_bytes` vs actual
 - Node disk usage % (all mounts) — `(1 - node_filesystem_avail / node_filesystem_size) * 100`
 - Inode usage % per mount — `(1 - node_filesystem_files_free / node_filesystem_files) * 100`
 - Volume binding status (Bound/Pending) — `kube_persistentvolumeclaim_status_phase`
 - Top 10 PVCs by usage (table)
 **Why useful:** The `local-path` provisioner fills up node disks. Catching a
 PVC at 95% before it errors is a lifesaver.
 ---
 ### 7. Workload Health (KSM)
 Uses kube-state-metrics to show deployment/StatefulSet/CronJob health cluster-wide.
 **Panels:**
 - Deployments with unavailable replicas — `kube_deployment_status_replicas_available < kube_deployment_status_replicas`
 - Pods not in Running phase by namespace — `kube_pod_status_phase{phase!="Running"}`
 - Container restarts (last 1h) — `increase(kube_pod_container_status_restarts_total[1h])`
 - Pods stuck in CrashLoopBackOff / ImagePullBackOff — `kube_pod_container_status_waiting_reason{reason=~"CrashLoopBackOff|ImagePullBackOff"}`
 - Job failures — `kube_job_failed`
 - CronJob schedule heatmap — `kube_cronjob_status_active`
 - HPA status (if any autoscaled) — `kube_horizontalpodautoscaler_status_current_replicas` vs desired
 **Why useful:** This is the "is anything broken" board. Notice you already have
 some pods in `ImagePullBackOff` (myorg-assistant) — this dashboard surfaces that.
 ---
 ### 8. etcd / Control Plane Health (if exposed)
 k3s embeds etcd (or sqlite on single-node). etcd metrics require exposing
 the etcd `/metrics` endpoint (typically `--listen-metrics-urls` on the control
 plane node). **Requires config change to enable.**
 **Panels:**
 - Leader changes — `etcd_server_leader_changes_seen_total`
 - Proposal commits/sec — `rate(etcd_server_proposals_committed_total[5m])`
 - Proposal failures/sec — `rate(etcd_server_proposals_failed_total[5m])`
 - DB size — `etcd_mvcc_db_total_size_in_bytes`
 - RPC latency p99 — `histogram_quantile(0.99, sum(rate(etcd_grpc Unary grpc latency bucket[5m])) by (le))`
 - Active watchers — `etcd_debugging_mvcc_watcher_total`
 **Why useful:** etcd is the brain of the cluster. Slow commits or a flipping
 leader indicates control-plane trouble.
 ---
 ## App-service dashboards (require enabling metrics first)
 Most of your apps don't expose `/metrics` yet. Below is the per-service setup
 plus the dashboard idea once metrics are on. To enable scraping for any of
 these, annotate the Service with:
 ```yaml
 metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "<port>"
 ```
 The existing `kubernetes-service-endpoints` scrape job will pick them up
 automatically — **no Prometheus config edit needed**.
 ### 9. LiteLLM (LLM gateway) — needs enabling
 LiteLLM exposes Prometheus metrics on its API port (`/metrics`). Annotate the
 `litellm` Service.
 **Panels:**
 - Requests/sec by model — `rate(litellm_requests_total[5m])` by `model`
 - Token usage (prompt/completion/total) — `rate(litellm_total_tokens_total[5m])`
 - Spend by model — `litellm_spend_total` (if cost tracking enabled)
 - Latency p95 per model — `histogram_quantile(0.95, ...)`
 - Error rate by model — `rate(litellm_requests_total{status=~"5.."}[5m])`
 - Rate-limit / quota hits
 **Why useful:** LiteLLM is the gateway for all your AI apps (open-webui,
 myorg-assistant, etc.). Token spend + per-model latency is the single best
 cost/quality lever in the cluster.
 ---
 ### 10. Gitea (git + CI) — needs enabling
 Gitea exposes metrics at `/metrics` when `ENABLE_METRICS=true` in `app.ini`.
 Annotate `gitea-http` Service (port 3000 inside, 80 via svc).
 **Panels:**
 - Git push/clone/fetch rate — `gitea_actions_total` by `action`
 - Active users / repos / orgs — `gitea_users_total`, `gitea_repos_total`
 - Issues / PRs open — `gitea_issues_total`, `gitea_pulls_total`
 - HTTP request rate + latency
 - Gitea Actions runner job duration — if runner metrics exposed
 **Why useful:** Gitea hosts the cluster's own GitOps repo + CI. Tracking push
 rate and runner throughput catches CI storms.
 ---
 ### 11. Home Assistant — needs enabling
 HA exposes Prometheus metrics via the `prometheus` integration (add to
 `configuration.yaml`). Then annotate the Service.
 **Panels:**
 - Active entities / sensors by domain
 - State change events/sec — `homeassistant_entity_states_total`
 - Automation triggers/sec — `homeassistant_automation_triggered_total`
 - Integrations loaded + errors
 - Database size / recorder queue depth
 - Zigbee/Z-Wave mesh health (if exposed)
 **Why useful:** HA is a home-critical service. Event/sec spikes often indicate
 sensor flapping or runaway automations.
 ---
 ### 12. Jellyfin — limited
 Jellyfin doesn't ship first-class Prometheus metrics, but you can scrape it
 via a sidecar (`jellyfin-prometheus-exporter`) or build a blackbox-style
 dashboard on the `/health` endpoint.
 **Panels:**
 - Active streams — from exporter
 - Transcode sessions + hw accel usage
 - Library size by media type
 - Playback errors
 ---
 ### 13. Pi-hole — needs enabling
 Pi-hole exposes metrics on its FTL web API; the `pihole-exporter` sidecar
 converts them to Prometheus format. Add as a sidecar container + annotate.
 **Panels:**
 - DNS queries/sec (total, blocked, cached, forwarded)
 - Block list size
 - Top blocked domains
 - Top permitted domains
 - Clients by query volume
 - Cache hit ratio
 **Why useful:** Pi-hole is your network-wide adblock. Block rate + cache ratio
 are the headline metrics, and query spikes reveal misbehaving clients.
 ---
 ### 14. PostgreSQL (litellm + phoenix + n8n) — needs enabling
 You have two Postgres instances (`postgres` in `litellm` and `phoenix`).
 Add `prometheus-postgres-exporter` as a sidecar or Deployment per DB.
 **Panels (per DB):**
 - Connections (active / idle / max) — `pg_stat_activity_count`
 - Transactions/sec — `rate(pg_stat_database_xact_commit[5m])`
 - Cache hit ratio — `pg_stat_database_blks_hit / (blks_hit + blks_read)`
 - Table + index bloat
 - Replication lag (if replicas)
 - Slow queries (if `pg_stat_statements` enabled)
 - DB size growth — `pg_database_size_bytes`
 **Why useful:** DB connection exhaustion and cache ratio collapse are the two
 most common causes of slow app performance.
 ---
 ### 15. Minecraft — limited
 The Minecraft server exposes metrics via RCON + an exporter
 (`minecraft-exporter`). Add as sidecar using the existing `RCON_PASSWORD`.
 **Panels:**
 - Players online — `minecraft_players_online`
 - TPS (ticks per second) — `minecraft_tps` (server health)
 - Entities loaded — `minecraft_entities_total`
 - Chunk count — `minecraft_chunks_loaded`
 - Memory used by JVM
 **Why useful:** TPS < 20 means lag. Player count vs. server load is the only
 real signal a Minecraft server needs.
 ---
 ### 16. qBittorrent — limited
 No native metrics. Options: a `qbittorrent-exporter` sidecar (uses the WebUI
 API), or a blackbox probe on the WebUI.
 **Panels:**
 - Download/upload speed
 - Active torrents
 - Torrent count by state (downloading/seeding/paused)
 - Disk usage in download dir
 ---
 ## Cluster meta dashboards
 ### 17. Network Topology / Service Map
 Composite view: for each namespace, list services, their pods, scrape status,
 and request volume (from Traefik logs + cAdvisor network). A "what talks to
 what" overview.
 **Panels:**
 - Service → pod → container resource table
 - Cross-namespace network flows (if network policy logging enabled)
 - Scrape health matrix (every target up/down)
 - Ingress route → backend service map
 ---
 ### 18. Backup / Snapshot Status
 If you take Velero snapshots or local-path snapshots, build a dashboard on
 `velero_*` or CRD status. **Requires Velero.**
 **Panels:**
 - Last successful backup per namespace
 - Failed backups
 - Backup size growth
 - Restore test status
 ---
 ### 19. Cost / Capacity Planning
 Composite: per-namespace CPU/memory requests vs. actual usage, projected
 growth, node saturation forecast.
 **Panels:**
 - Requests vs. limits vs. actual (per namespace) — KSM + cAdvisor
 - Node capacity vs. allocatable
 - PVC growth trend + 30-day forecast
 - "What if I removed node X" simulation (capacity headroom)
 **Why useful:** Tells you when you'll need another node before you hit the wall.
 ---
 ## Recommended priority order
 If you only build a few, do them in this order (highest value-to-effort first):
 1. **Traefik Ingress** (#1) — already scraped, your front door
 2. **Storage & PVC Health** (#6) — local-path fills disks; high blast radius
 3. **Workload Health** (#7) — surfaces CrashLoopBackOff / ImagePullBackOff
 4. **cert-manager** (#4) — prevents silent cert expiry outages
 5. **CoreDNS** (#2) — early warning for DNS cascades
 6. **LiteLLM** (#9) — needs `prometheus.io/scrape` annotation only; big insights
 7. **MetalLB** (#3) — small but catches LoadBalancer IP exhaustion
 Items 8–19 are nice-to-have or require additional exporters/config.
--- a/platform-engineer/README.md
+++ b/platform-engineer/README.md
@@ -0,0 +1,367 @@
 # Platform Engineer Agent — Deployment Plan
 An autonomous **Hermes Agent** that runs inside the k3s cluster, watches its
 health on a schedule, tries to fix simple problems, and notifies me (via
 Discord) when something needs my attention or a fix failed.
 Docs: https://hermes-agent.nousresearch.com/docs/user-guide/docker
 ---
 ## 1. Goal & operating model
 - **One Hermes container** in a new namespace `platform-engineer`, scheduled on
  the powerful amd64 node (`roger-nucbox-evo-x2`, 24 GiB RAM).
 - Hermes runs in **gateway mode** under s6 supervision (`command: gateway run`),
  so the built-in **cron scheduler** is active and survives restarts.
 - The agent talks to the cluster with `kubectl` from *inside* the container
  (terminal backend = `local`). We give the pod a **ServiceAccount + ClusterRole**
  scoped to read-mostly + restart/scale/delete-pod permissions.
 - LLM calls are routed through the in-cluster **LiteLLM** proxy
  (`litellm.rogi.casa`) — no external API keys needed in the cluster.
 - Notifications go to **Discord** (reuse the pattern from `myorg-assistant`).
 - A set of **cron jobs** (Hermes-native, not Kubernetes CronJobs) make the agent
  run periodic checks. Watchdog checks use `[SILENT]` so it only pings me when
  something is wrong.
 Why Hermes-native cron (not k8s CronJobs):
 - Hermes cron ticks inside the gateway, runs in an isolated agent session,
  supports `[SILENT]` suppression, `deliver="discord"`, `workdir`, and
  `context_from` chaining — far less plumbing than spawning a fresh pod per run.
 - Cron jobs live in `~/.hermes/cron/jobs.json` on the PVC, so they survive pod
  restarts and can be edited live via `hermes cron edit` without redeploying.
 ---
 ## 2. Files to create (this directory)
 ```
 platform-engineer/
 ├── namespace.yaml              # namespace platform-engineer
 ├── rbac.yaml                    # ServiceAccount + ClusterRole (+binding)
 ├── configmap.yaml               # hermes config.yaml + SOUL.md + cron seed script
 ├── secret.yaml                  # DISCORD bot token, LITELLM_API_KEY, kubeconfig-less SA token
 ├── pvc.yaml                     # persistent /opt/data (HERMES_HOME)
 ├── dockerfile                   # derived image: hermes-agent + kubectl + helm
 ├── deployment.yaml              # Deployment, schedules on amd64, mounts kube SA token
 ├── ingress.yaml                 # hermes.rogi.casa → dashboard (optional)
 └── README.md                    # this file
 ```
 Then add a line to `argocd/gen-apps.sh` `APPS=(...)`:
 ```
 "platform-engineer|platform-engineer|platform-engineer|true|true"
 ```
 and re-run `./argocd/gen-apps.sh` to generate `argocd/apps/platform-engineer.yaml`
 so ArgoCD reconciles it like every other app in the repo.
 ---
 ## 3. RBAC — least privilege
 ServiceAccount `platform-engineer` in ns `platform-engineer`, bound to a
 **ClusterRole** scoped to *platform engineer* actions:
 **Read (get/list/watch):** nodes, pods, services, deployments, statefulsets,
 daemonsets, replicasets, jobs, cronjobs, events, configmaps, secrets, PVCs,
 ingresses, namespaces.
 **Act (patch/update on a allowlist):**
 - `pods` → `delete` (force-restart a stuck pod), `patch` (`/evict`, annotations)
 - `deployments`, `statefulsets`, `daemonsets`, `replicasets` → `patch` (restart
  via `kubectl rollout restart` / scale), `update`
 - `jobs`, `cronjobs` → `delete`, `patch`
 - `pods/exec` (subresource) → `create` (only if we want the agent to `kubectl
  exec` into pods for log-style debugging — optional; keep off initially)
 - `events` → `get/list/watch` only
 **No cluster-scoped writes** (no creating namespaces, no node taints, no RBAC
 edits, no CRDs). The agent can *propose* those and tell me; it cannot do them
 itself. All mutating calls are auditable via Kubernetes audit logs and
 `kubectl auth can-i --as=system:serviceaccount:platform-engineer:platform-engineer`.
 The pod uses the k3s in-cluster ServiceAccount token (`/var/run/secrets/...
 /serviceaccount/token`) + the `KUBERNETES_SERVICE_HOST/PORT` env vars k3s already
 injects — **no kubeconfig file, no long-lived token on disk**.
 ---
 ## 4. Image — thin derived Dockerfile
 ```dockerfile
 FROM nousresearch/hermes-agent:latest
 USER root
 RUN apt-get update \
 && apt-get install -y --no-install-recommends curl gnupg \
 && curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.35/deb/Release.key \
    | gpg --dearmor -o /usr/share/keyrings/kubernetes-apt-keyring.gpg \
 && echo 'deb [signed-by=/usr/share/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.35/deb/ /' \
    > /etc/apt/sources.list.d/kubernetes.list \
 && apt-get update \
 && apt-get install -y --no-install-recommends kubectl \
 && curl -fsSL https://get.helm.sh/helm-v3.16.0-linux-amd64.tar.gz \
    | tar -xz -C /usr/local/bin --strip-components=1 linux-amd64/helm \
 && rm -rf /var/lib/apt/lists/*
 USER hermes
 ```
 > Note: the cluster is mixed arch (arm64/amd64/arm). The agent pod is pinned to
 > the amd64 node, so `linux-amd64` helm + `kubectl` packages are fine. If you
 > later want it portable, switch to a multi-arch build with
 > `TARGETARCH` and install matching helm arch.
 Build & push to your Gitea registry (`git.rogi.casa/roger/...`) — same
 `imagePullSecrets: gitea-registry` pattern as `gym-tracker`. Tag with the
 hermes version + a short git sha.
 ---
 ## 5. Hermes configuration (mounted via ConfigMap → /opt/data/config.yaml)
 ```yaml
 # config.yaml (seeded into the PVC on first boot)
 model:
  provider: openai-api
  default: claude-4.5-haiku
  base_url: "https://litellm.rogi.casa/v1"
  api_mode: chat_completions
 # Use a cheap, fast model for auxiliary tasks (titling, compression)
 auxiliary:
  compression:
    provider: openai-api
    model: gemini-3-flash
  title_generation:
    provider: openai-api
    model: gemini-3-flash
 terminal:
  backend: local
  cwd: /workspace            # a working dir for any kubectl output / scratch
  timeout: 180
  home_mode: profile        # isolate tool credentials under HERMES_HOME/home
 # Unattended gateway → circuit-breaker on tool-call loops
 tool_loop_guardrails:
  hard_stop_enabled: true
  hard_stop_after:
    exact_failure: 5
    idempotent_no_progress: 5
 sessions:
  auto_prune: true
  retention_days: 90
 cron:
  wrap_response: false      # cleaner Discord messages
 memory:
  memory_enabled: true
  user_profile_enabled: true
 ```
 `.env` (from Secret, mounted to `/opt/data/.env`):
 ```
 OPENAI_API_KEY=<LITELLM_API_KEY value, i.e. sk-...>
 OPENAI_BASE_URL=https://litellm.rogi.casa/v1
 DISCORD_BOT_TOKEN=<new dedicated bot token>
 DISCORD_HOME_CHANNEL=<your user/channel id for alerts>
 # Dashboard auth (homelab, trusted LAN behind ingress)
 HERMES_DASHBOARD_BASIC_AUTH_USERNAME=roger
 HERMES_DASHBOARD_BASIC_AUTH_PASSWORD=<strong password>
 ```
 > Why `OPENAI_API_KEY` + `OPENAI_BASE_URL`: the `openai-api` provider honours
 > `OPENAI_BASE_URL`, so this is the simplest way to point Hermes at the
 > in-cluster LiteLLM. `claude-4.5-haiku` / `gemini-3-flash` are the model names
 > already exposed by your `litellm/litellm.yaml` ConfigMap.
 `SOUL.md` (personality + guardrails) — see `configmap.yaml`. Key points:
 - Identity: "Platform Engineer for the rogi.casa k3s cluster."
 - Knows the cluster layout (3 nodes, ArgoCD GitOps, Traefik+cert-manager,
  LiteLLM, services list).
 - Operating rules: read-first; only act on the allowlisted verbs; never edit
  RBAC / taints / namespaces / CRDs; when in doubt, notify instead of acting;
  always cite the resource and the command used.
 - How to reach me: `deliver="discord"`.
 ---
 ## 6. Deployment
 - `replicas: 1` (Hermes data dir is single-writer — never scale >1).
 - `nodeSelector: kubernetes.io/arch: amd64` + preferred `hardware: high-memory`
  affinity → lands on the NUC.
 - `resources`: requests 512Mi/250m, limits 2Gi/1 core (Hermes recommends
  2–4 GiB; 1 GiB is fine without browser tools, which we keep off).
 - Volume: PVC mounted at `/opt/data` (HERMES_HOME), RWX not needed (single pod).
 - Ports: 8642 (gateway API, internal only) and 9119 (dashboard) → exposed via
  Ingress `hermes.rogi.casa` with TLS + basic-auth (already enforced by the
  `HERMES_DASHBOARD_BASIC_AUTH_*` env vars).
 - `imagePullSecrets: gitea-registry`.
 - env from Secret; `HERMES_DASHBOARD=1`.
 - Init: on first boot the s6 `01-hermes-setup` hook seeds config/SOUL/.env from
  the ConfigMap if the volume is empty. We mount the ConfigMap as a readonly
  projection at `/opt/seed/` and run a tiny initContainer to copy it into
  `/opt/data` only when `/opt/data/config.yaml` doesn't exist (so ArgoCD
  self-heal never fights the agent's live-edited config).
 ---
 ## 7. Cron jobs to seed (Hermes-native)
 These are written by an init script (one-shot Job `hermes-cron-seed`) that runs
 `hermes cron create ...` against the gateway on first install, and is idempotent
 (it checks existing job names). All deliver to Discord. Examples:
 | Name | Schedule | Prompt (abbreviated) |
 |------|----------|------------------------|
 | `cluster-health-check` | `every 15m` | Run `kubectl get nodes,pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded` and `kubectl get events -A --field-selector type=Warning --since=20m`. If everything healthy, reply with only `[SILENT]`. Otherwise summarize failures and root-cause briefly. |
 | `pod-restart-loop` | `every 10m` | Find pods in `CrashLoopBackOff`/`ImagePullBackOff` across all namespaces. For `CrashLoopBackOff`, fetch logs and if a clear transient cause (OOM, config parse, missing secret) is visible, attempt `kubectl rollout restart <deploy>`; otherwise notify me with the log excerpt. Reply `[SILENT]` if none found. |
 | `pvc-pressure` | `every 30m` | `kubectl get pv` + node disk via `kubectl top nodes`. Alert if any PVC `Bound` to a near-full volume or node disk >85%. `[SILENT]` otherwise. |
 | `argocd-sync-health` | `every 1h` | `kubectl get applications -n argocd -o wide` (or `argocd app sync --dry-run` if CLI present). Report any `OutOfSync`/`Degraded` app. `[SILENT]` if all `Synced`+`Healthy`. |
 | `cert-expiry` | `every 1d at 09:00` | List cert-manager `Certificate` resources with expiry < 21 days. Notify only if any. `[SILENT]` otherwise. |
 | `node-resource-drift` | `every 30m` | `kubectl top nodes`. Alert if any node CPU>90% or mem>90% sustained, or any node `NotReady`. `[SILENT]` otherwise. |
 | `daily-cluster-report` | `0 8 * * *` | Summarize: node count/status, top 5 pods by CPU/mem, # pods not Running, # ArgoCD apps OutOfSync, cert warnings. Always deliver (no `[SILENT]`). |
 Design rules baked into SOUL.md:
 - **Read-only checks** run frequently (10–30m) and stay silent unless wrong.
 - **Mutating actions** are restricted to safe idempotent ones (rollout restart,
  delete stuck pod so controller recreates). Anything riskier → notify me with
  a proposed command and wait for me to run it (I can reply in Discord to the
  continuable thread).
 - Cron sessions are isolated and **cannot create new cron jobs** (Hermes
  disables that inside cron runs) → no runaway loops.
 ---
 ## 8. Safety & guardrails
 1. **RBAC is the real boundary.** Even if the agent goes rogue, the SA can't
   touch other namespaces' secrets beyond read, can't change RBAC, can't taint
   nodes, can't create namespaces.
 2. **`tool_loop_guardrails.hard_stop_enabled: true`** — circuit-breaks a stuck
   gateway (recommended in the Docker doc for unattended deployments).
 3. **`skills.write_approval: false` but `memory.write_approval: true`** (so the
   agent can build skills/memories but I review memory writes lazily — flip
   this if it gets noisy).
 4. **No `pods/exec` subresource** initially (keep the agent from shelling into
   workloads). Enable later only if you want log-grep-style debugging.
 5. **Dashboard behind ingress TLS + basic auth** (the June-2026 hardening makes
   auth mandatory on non-loopback binds; we satisfy it with the bundled
   basic-auth provider).
 6. **Single replica / single-writer PVC** — the Docker doc is explicit that two
   gateways on the same `/opt/data` corrupt session/memory stores. Use a
   `podAntiAffinity` so an accidental scale-up doesn't co-run.
 7. **ArgoCD interaction:** keep `syncPolicy.automated.prune+selfHeal` but
   exclude the live-edited hermes state. Practically: Argo owns the *manifests*
   (deployment, configmap, secret, pvc), while `/opt/data` (config.yaml,
   cron/jobs.json, SOUL.md edits made via the dashboard) is runtime state on the
   PVC and is *not* reconciled by Argo. The ConfigMap only *seeds* it on first
   boot. Document this clearly in the README so future-you doesn't expect Argo
   to reset the agent's personality.
 ---
 ## 9. Rollout plan
 1. Build & push the derived image to `git.rogi.casa/roger/hermes-agent` (tag
   `v1.35-<sha>`).
 2. Create the namespace + RBAC + Secret + ConfigMap + PVC:
   `kubectl apply -f platform-engineer/`.
 3. Create the `platform-engineer` Discord bot, invite it, put its token + your
   channel id in `secret.yaml` (base64).
 4. Apply the Deployment; wait for the pod to go Running.
 5. `kubectl exec` in and run the one-shot cron seed:
   `hermes cron create ...` (or apply the `cron-seed` Job).
 6. Trigger the first `cluster-health-check` manually: `hermes cron run cluster-health-check`.
 7. Add the app to `argocd/gen-apps.sh`, regenerate, commit, push.
 ---
 ## 10. Decisions (locked in)
 1. **Notifications:** dedicated `platform-engineer` Discord bot → its own token
   in `secret.yaml` (`DISCORD_BOT_TOKEN`, `DISCORD_HOME_CHANNEL`).
 2. **Dashboard:** public at `hermes.rogi.casa` (Traefik TLS + cert-manager + the
   bundled Hermes basic-auth provider). Reach the dashboard on port 9119; the
   gateway API on 8642 is ClusterIP-only.
 3. **Image:** derived image pushed to `git.rogi.casa/roger/hermes-agent`, pulled
   via the existing `gitea-registry` imagePullSecret (must also exist in the
   `platform-engineer` ns — see deploy steps).
 4. **Model:** `qwen-3.6:27b` via the in-cluster Ollama box (`10.88.20.12:11434`),
   exposed through LiteLLM as `qwen-3.6:27b`. Added to `litellm/litellm.yaml`.
   Hermes reaches LiteLLM at `https://litellm.rogi.casa/v1` (never Ollama directly).
 5. **pods/exec:** granted (`pods/exec` → `create` in the ClusterRole) so the
   agent can `kubectl exec`/`kubectl logs` for debugging.
 ---
 ## 11. Deployment checklist (do in this order)
 1. **Add the Ollama model to LiteLLM** (already done in `litellm/litellm.yaml`):
   the `qwen-3.6:27b` entry points at `http://10.88.20.12:11434`. Make sure
   `qwen3.6:27b` is actually pulled on that Ollama host
   (`ollama pull qwen3.6:27b`). Apply: `kubectl apply -f litellm/` and restart
   the LiteLLM pod so the new config takes effect.
 2. **Create the `gitea-registry` secret in the new namespace** (ArgoCD won't
   create it — it's not in the repo):
   ```
   kubectl create namespace platform-engineer
   kubectl create secret docker-registry gitea-registry \
     --docker-server=git.rogi.casa \
     --docker-username=<your-gitea-user> \
     --docker-password=<gitea-access-token> \
     --docker-email=<your-email> \
     -n platform-engineer
   ```
 3. **Build & push the image:** `./platform-engineer/build-and-push.sh`
   (after `docker login git.rogi.casa`).
 4. **Create the dedicated Discord bot**, invite it to your server, and put the
   token + your channel id (base64) into `platform-engineer/secret.yaml`. Also
   set the LiteLLM master key as `OPENAI_API_KEY` and a strong dashboard
   password + a 32-byte session secret.
 5. **Commit & push** the whole change. ArgoCD will create the namespace
   resources, deploy the pod, and bring up the ingress at `hermes.rogi.casa`.
 6. **Seed the cron jobs:**
   `kubectl apply -f platform-engineer/cron-seed.yaml` (one-shot Job) — it waits
   for the hermes pod, then runs `hermes cron create ...` for each watchdog.
   Re-run it any time you want to re-seed after a wipe.
 7. **Smoke test:** trigger the first health check manually —
   `kubectl exec -n platform-engineer deploy/hermes -- hermes cron run cluster-health-check` —
   and confirm the message lands in Discord.
 8. **ArgoCD:** the `Application` (`argocd/apps/platform-engineer.yaml`) is
   already generated. After commit, Argo will reconcile it like every other app.
 ## 12. What ArgoCD owns vs. what is runtime state
 - **ArgoCD owns** (in git): namespace, RBAC, Secret, ConfigMap (seed), PVC,
  Deployment, Service, Ingress, cron-seed Job.
 - **Runtime state (on the PVC, NOT reconciled):** `config.yaml`, `SOUL.md`,
  `.env`, `cron/jobs.json`, `sessions/`, `memories/`, `skills/`. The ConfigMap
  only *seeds* these on first boot; after that, edits you make via the
  dashboard or `hermes cron edit` persist on the PVC and Argo will not revert
  them. If you ever want a hard reset, delete the PVC and re-apply.
 ---
 ## Files in this directory
 | File | Purpose |
 |------|---------|
 | `namespace.yaml` | namespace `platform-engineer` |
 | `rbac.yaml` | ServiceAccount + ClusterRole (+binding), least-privilege |
 | `configmap.yaml` | seed `config.yaml` + `SOUL.md` |
 | `secret.yaml` | Discord token, LiteLLM key, dashboard auth (PLACEHOLDERS — fill in) |
 | `pvc.yaml` | 5 Gi PVC for `/opt/data` |
 | `dockerfile` | derived image: hermes-agent + kubectl + helm (linux/amd64) |
 | `build-and-push.sh` | builds & pushes the image to the Gitea registry |
 | `deployment.yaml` | Deployment (1 replica, Recreate, pinned to amd64 NUC) + Service |
 | `ingress.yaml` | `hermes.rogi.casa` → dashboard (TLS + basic auth) |
 | `cron-seed.yaml` | one-shot Job that creates the Hermes cron schedule |
 Also changed outside this directory:
 - `litellm/litellm.yaml` — added `qwen-3.6:27b` model entry.
 - `argocd/gen-apps.sh` + `argocd/apps/platform-engineer.yaml` — ArgoCD
  Application for this folder.
 ```
--- a/platform-engineer/build-and-push.sh
+++ b/platform-engineer/build-and-push.sh
@@ -0,0 +1,24 @@
 #!/usr/bin/env bash
 # Build & push the derived Hermes image (kubectl + helm) to the Gitea registry.
 #
 # Run this on a machine with docker + access to git.rogi.casa:
 #   ./platform-engineer/build-and-push.sh
 #
 # Prereqs:
 #   - docker login git.rogi.casa  (use your Gitea username + access token)
 set -euo pipefail
 REGISTRY="git.rogi.casa"
 REPO="roger/hermes-agent"
 TAG="${TAG:-v1.35-1}"
 IMAGE="${REGISTRY}/${REPO}:${TAG}"
 cd "$(dirname "$0")"
 echo "==> Building ${IMAGE}"
 docker build --platform linux/amd64 -t "${IMAGE}" -f dockerfile .
 echo "==> Pushing ${IMAGE}"
 docker push "${IMAGE}"
 echo "==> Done. Update platform-engineer/deployment.yaml image: if you changed TAG."
--- a/platform-engineer/configmap.yaml
+++ b/platform-engineer/configmap.yaml
@@ -0,0 +1,115 @@
 # Hermes configuration, SOUL.md, and the cron-seed script.
 # Seeded into the PVC (/opt/data) by the initContainer on first boot only.
 ---
 apiVersion: v1
 kind: ConfigMap
 metadata:
  name: hermes-seed
  namespace: platform-engineer
 data:
  config.yaml: |
    model:
      provider: openai-api
      default: qwen-3.6:27b
      base_url: "https://litellm.rogi.casa/v1"
      api_mode: chat_completions
    # Cheap/fast model for auxiliary tasks (titling, compression).
    auxiliary:
      compression:
        provider: openai-api
        model: qwen-3.6:27b
        base_url: "https://litellm.rogi.casa/v1"
      title_generation:
        provider: openai-api
        model: qwen-3.6:27b
        base_url: "https://litellm.rogi.casa/v1"
    terminal:
      backend: local
      cwd: /workspace
      timeout: 180
      home_mode: profile
    # Unattended gateway → circuit-break on stuck tool-call loops.
    tool_loop_guardrails:
      hard_stop_enabled: true
      hard_stop_after:
        exact_failure: 5
        idempotent_no_progress: 5
    sessions:
      auto_prune: true
      retention_days: 90
    cron:
      wrap_response: false
    memory:
      memory_enabled: true
      user_profile_enabled: true
      write_approval: false
    skills:
      write_approval: false
  SOUL.md: |
    # Platform Engineer — rogi.casa k3s cluster
    You are the autonomous Platform Engineer for the `rogi.casa` K3s cluster.
    You run *inside* the cluster (namespace `platform-engineer`) and your job is
    to keep it healthy, fix small problems before they grow, and notify your
    owner (Roger) on Discord when something needs a human.
    ## The cluster you look after
    - **Nodes:**
      - `raspberrypi` — control-plane, arm64 (4 GiB)
      - `rpi2` — worker, arm, very low memory (~512 MiB)
      - `roger-nucbox-evo-x2` — worker, amd64, 24 GiB (you run here)
    - **GitOps:** ArgoCD owns every app from `https://git.rogi.casa/roger/k3s-cluster.git`.
      Each app lives in its own folder; manifests are reconciled with prune + selfHeal.
    - **Ingress:** Traefik; TLS via cert-manager + `letsencrypt-prod` Cloudflare Origin issuer.
    - **LLM gateway:** LiteLLM at `https://litellm.rogi.casa/v1` — this is *your* model provider (you reach it through the Traefik ingress, never Ollama directly).
    - **Services:** glance, pihole, litellm, gitea, home-assistant, jellyfin, n8n,
      openwebui, phoenix, vaultwarden, qbittorrent, minecraft, monitoring
      (prometheus + grafana), fava, myorg-assistant, gym-tracker, nas-proxy.
    - **Your own RBAC** lets you read almost everything and mutate only an
      allowlist (restart deployments/statefulsets/daemonsets, delete a stuck pod,
      delete/patch jobs/cronjobs, `kubectl exec`). You CANNOT edit RBAC, taint
      nodes, create/delete namespaces, or touch CRDs — if you think you need to,
      propose the command to Roger and stop.
    ## Operating rules
    1. **Read first, act second.** Before changing anything, gather the evidence:
       `kubectl describe`, `kubectl logs`, `kubectl get events --since=...`,
       `kubectl top`. Cite the exact resource (ns/name) and the exact command in
       every report.
    2. **Only safe, idempotent remediations.** Allowed actions:
       - `kubectl rollout restart deployment/<name> -n <ns>` (and statefulset/daemonset)
       - delete a single stuck `CrashLoopBackOff`/`ImagePullBackOff` pod so its
         controller recreates it
       - `kubectl delete job/<name>` / `kubectl patch cronjob ...`
       Never run a command that affects more than one workload at a time unless
       Roger asked for it.
    3. **When in doubt, notify, don't act.** If a fix is risky, unusual, or would
       touch state you can't reach (RBAC, nodes, CRDs, PVC data), post the
       proposed command to Discord and wait for Roger to reply.
    4. **Be quiet when healthy.** Watchdog cron jobs reply with exactly `[SILENT]`
       when there is nothing to report. Failed jobs always deliver regardless.
    5. **No runaway loops.** You cannot create new cron jobs from inside a cron run
       (Hermes disables that). Do not try.
    6. **Talk like an engineer.** Short, concrete, with resource names and
       commands. No filler. When you fixed something, say what you did in one line.
    7. **Respect GitOps.** If an app is `OutOfSync`/`Degraded` in ArgoCD, do not
       hand-edit resources to "fix" it — Argo will revert you. Report it so Roger
       can fix the source repo.
    ## How you reach Roger
    Notifications go to Discord (your home channel). Cron jobs deliver there by
    default (`deliver="discord"`). Keep messages under ~1800 chars; attach
       longer logs as `kubectl logs ... > /opt/data/cron/output/<file>` and link
       the path.
    ```
--- a/platform-engineer/cron-seed.yaml
+++ b/platform-engineer/cron-seed.yaml
@@ -0,0 +1,74 @@
 # One-shot Job that seeds Hermes' built-in cron schedule on first install.
 # Idempotent: skips job names that already exist.
 #
 # The agent's own cron jobs live in /opt/data/cron/jobs.json on the PVC and are
 # NOT reconciled by ArgoCD (runtime state). Re-run this Job manually after a
 # wipe to re-seed:  kubectl job restart hermes-cron-seed -n platform-engineer
 ---
 apiVersion: batch/v1
 kind: Job
 metadata:
  name: hermes-cron-seed
  namespace: platform-engineer
  labels:
    app: hermes
 spec:
  backoffLimit: 4
  ttlSecondsAfterFinished: 86400
  template:
    metadata:
      labels:
        app: hermes
    spec:
      serviceAccountName: platform-engineer
      restartPolicy: OnFailure
      containers:
      - name: seed
        image: bitnami/kubectl:1.35
        command: ["sh", "-c"]
        args:
        - |
          set -e
          echo "Waiting for hermes pod to be Ready..."
          kubectl -n platform-engineer wait --for=condition=Ready pod -l app=hermes --timeout=300s || true
          POD=$(kubectl -n platform-engineer get pod -l app=hermes -o jsonpath='{.items[0].metadata.name}')
          echo "Using pod: $POD"
          exists() { kubectl -n platform-engineer exec "$POD" -- hermes cron list 2>/dev/null | grep -qi "name=$1\| $1 "; }
          create() {
            name="$1"; schedule="$2"; deliver="$3"; prompt="$4"
            if exists "$name"; then
              echo "cron job '$name' already exists — skipping"
            else
              echo "creating cron job '$name' ..."
              kubectl -n platform-engineer exec "$POD" -- hermes cron create "$schedule" "$prompt" --name "$name" --deliver "$deliver"
            fi
          }
          # ---- Watchdog checks (silent unless something is wrong) ----
          create "cluster-health-check" "every 15m" "discord" \
            "Run: kubectl get nodes; kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded; kubectl get events -A --field-selector type=Warning --since=20m. If everything is healthy and there are no Warning events, reply with exactly [SILENT]. Otherwise give a concise per-resource summary of what is wrong (node name, pod ns/name, phase, last event)."
          create "pod-restart-loop" "every 10m" "discord" \
            "Find pods in CrashLoopBackOff or ImagePullBackOff across all namespaces (kubectl get pods -A). For each, fetch kubectl logs (previous) and describe. If the cause is clearly transient (OOM kill, a one-off config parse error that will retry cleanly, a missing Secret the controller will recreate), attempt ONE safe remediation: kubectl rollout restart of the owning Deployment/StatefulSet/DaemonSet, OR delete the single stuck pod. Report what you did in one line per resource. If the cause is not clearly transient (bad image, missing config, auth failure), do NOT act — post the log excerpt and the proposed command and wait for Roger. If no such pods exist, reply [SILENT]."
          create "pvc-pressure" "every 30m" "discord" \
            "Check cluster storage health: kubectl get pv,pvc -A; kubectl top nodes. Alert if any PVC is Pending/Lost or any node filesystem usage is over 85%. If all healthy, reply [SILENT]."
          create "argocd-sync-health" "every 1h" "discord" \
            "Run: kubectl get applications -n argocd -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status. If every app is Synced and Healthy, reply [SILENT]. Otherwise list the OutOfSync/Degraded apps with their status. Do NOT hand-edit resources to fix them (Argo will revert) — just report."
          create "cert-expiry" "0 9 * * *" "discord" \
            "List all cert-manager Certificate resources (kubectl get certificates -A). For each, check notAfter. Alert on any certificate expiring in under 21 days. If none, reply [SILENT]."
          create "node-resource-drift" "every 30m" "discord" \
            "Run kubectl top nodes. If any node CPU or memory usage is over 90%, or any node is NotReady, report it with the numbers. Otherwise reply [SILENT]."
          # ---- Daily report (always delivered) ----
          create "daily-cluster-report" "0 8 * * *" "discord" \
            "Produce a daily cluster report for Roger: (1) node count + Ready/NotReady; (2) top 5 pods by CPU and by memory across all namespaces (kubectl top pods -A --sort-by); (3) count of pods not Running; (4) ArgoCD apps OutOfSync or Degraded; (5) any certificates expiring within 30 days; (6) any recent Warning events (last 24h). Keep it under 1800 chars. Always deliver (no [SILENT])."
          echo "Done. Listing all cron jobs:"
          kubectl -n platform-engineer exec "$POD" -- hermes cron list
--- a/platform-engineer/deployment.yaml
+++ b/platform-engineer/deployment.yaml
@@ -0,0 +1,134 @@
 apiVersion: apps/v1
 kind: Deployment
 metadata:
  name: hermes
  namespace: platform-engineer
  labels:
    app: hermes
 spec:
  replicas: 1   # MUST be 1 — Hermes' /opt/data is single-writer.
  strategy:
    type: Recreate   # never run two pods against the same PVC
  selector:
    matchLabels:
      app: hermes
  template:
    metadata:
      labels:
        app: hermes
    spec:
      serviceAccountName: platform-engineer
      imagePullSecrets:
      - name: gitea-registry
      # Pin to the powerful amd64 node (image is linux/amd64; the NUC has 24 GiB).
      nodeSelector:
        kubernetes.io/arch: amd64
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: hardware
                operator: In
                values: ["high-memory"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: hermes
              topologyKey: kubernetes.io/hostname
      initContainers:
      # Seed /opt/data with config.yaml + SOUL.md on first boot only.
      # ArgoCD owns the manifests; the PVC is runtime state and is NOT reconciled.
      - name: seed-data
        image: busybox:1.36
        command: ["sh", "-c"]
        args:
        - |
          set -e
          if [ ! -f /opt/data/config.yaml ]; then
            echo "First boot: seeding /opt/data from ConfigMap..."
            cp /seed/config.yaml /opt/data/config.yaml
            cp /seed/SOUL.md /opt/data/SOUL.md
            chmod 600 /opt/data/config.yaml
          else
            echo "/opt/data already initialized — leaving runtime state intact."
          fi
          mkdir -p /opt/data/home/.kube /opt/data/cron/output /opt/data/scripts /workspace
        volumeMounts:
        - name: data
          mountPath: /opt/data
        - name: seed
          mountPath: /seed
      containers:
      - name: hermes
        image: git.rogi.casa/roger/hermes-agent:v1.35-1
        imagePullPolicy: Always
        command: ["gateway", "run"]
        ports:
        - name: gateway
          containerPort: 8642
        - name: dashboard
          containerPort: 9119
        envFrom:
        - secretRef:
            name: hermes-env
        env:
        # k3s injects these automatically; kubectl inside the pod uses the SA token.
        - name: HERMES_HOME
          value: /opt/data
        volumeMounts:
        - name: data
          mountPath: /opt/data
        - name: workspace
          mountPath: /workspace
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8642
          initialDelaySeconds: 60
          periodSeconds: 30
          failureThreshold: 3
        securityContext:
          allowPrivilegeEscalation: false
          runAsNonRoot: false   # official image runs as root for s6 init then drops to hermes
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: hermes-data
      - name: workspace
        emptyDir: {}
      - name: seed
        configMap:
          name: hermes-seed
 ---
 apiVersion: v1
 kind: Service
 metadata:
  name: hermes
  namespace: platform-engineer
 spec:
  type: ClusterIP
  selector:
    app: hermes
  ports:
    - name: gateway
      port: 80
      targetPort: 8642
    - name: dashboard
      port: 9119
      targetPort: 9119
--- a/platform-engineer/dockerfile
+++ b/platform-engineer/dockerfile
@@ -0,0 +1,31 @@
 # Derived Hermes Agent image with kubectl + helm so the agent can drive the
 # k3s cluster from inside the container (terminal backend = local).
 #
 # Build & push to the Gitea registry:
 #   docker build -t git.rogi.casa/roger/hermes-agent:v1.35-1 -f dockerfile .
 #   docker push git.rogi.casa/roger/hermes-agent:v1.35-1
 #
 # This image targets linux/amd64 (the agent pod is pinned to the amd64 NUC).
 FROM nousresearch/hermes-agent:latest
 USER root
 # kubectl (v1.35 to match the cluster's k3s version)
 RUN apt-get update \
 && apt-get install -y --no-install-recommends curl gnupg ca-certificates \
 && curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.35/deb/Release.key \
    | gpg --dearmor -o /usr/share/keyrings/kubernetes-apt-keyring.gpg \
 && echo 'deb [signed-by=/usr/share/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.35/deb/ /' \
    > /etc/apt/sources.list.d/kubernetes.list \
 && apt-get update \
 && apt-get install -y --no-install-recommends kubectl \
 # helm
 && curl -fsSL https://get.helm.sh/helm-v3.16.3-linux-amd64.tar.gz \
    | tar -xz -C /usr/local/bin --strip-components=1 linux-amd64/helm \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*
 # Hermes' own CLI/kubeconfig helper dir for tool subprocesses
 RUN mkdir -p /opt/data/home/.kube
 USER hermes
--- a/platform-engineer/ingress.yaml
+++ b/platform-engineer/ingress.yaml
@@ -0,0 +1,24 @@
 apiVersion: networking.k8s.io/v1
 kind: Ingress
 metadata:
  name: hermes
  namespace: platform-engineer
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
 spec:
  ingressClassName: traefik
  tls:
  - hosts:
    - hermes.rogi.casa
    secretName: hermes-tls
  rules:
  - host: hermes.rogi.casa
    http:
      paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: hermes
              port:
                number: 9119   # dashboard
--- a/platform-engineer/namespace.yaml
+++ b/platform-engineer/namespace.yaml
@@ -0,0 +1,4 @@
 apiVersion: v1
 kind: Namespace
 metadata:
  name: platform-engineer
--- a/platform-engineer/pvc.yaml
+++ b/platform-engineer/pvc.yaml
@@ -0,0 +1,11 @@
 apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
  name: hermes-data
  namespace: platform-engineer
 spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
--- a/platform-engineer/rbac.yaml
+++ b/platform-engineer/rbac.yaml
@@ -0,0 +1,111 @@
 # Least-privilege RBAC for the Platform Engineer Hermes agent.
 #
 # The agent can READ almost everything cluster-wide, but can only MUTATE a
 # narrow allowlist of safe, idempotent resources (restart deployments, delete a
 # stuck pod so its controller recreates it, etc.). It CANNOT touch RBAC, nodes,
 # namespaces, CRDs, or other namespaces' Secrets beyond read.
 ---
 apiVersion: v1
 kind: ServiceAccount
 metadata:
  name: platform-engineer
  namespace: platform-engineer
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole
 metadata:
  name: platform-engineer
 rules:
  # ---- Broad read access (cluster-wide) ----
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
      - pods/log
      - configmaps
      - secrets
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - events
      - replicationcontrollers
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources:
      - deployments
      - statefulsets
      - daemonsets
      - replicasets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - jobs
      - cronjobs
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - apiGroups: ["autoscaling"]
    resources:
      - horizontalpodautoscalers
    verbs: ["get", "list", "watch"]
  - apiGroups: ["argoproj.io"]
    resources:
      - applications
      - appprojects
    verbs: ["get", "list", "watch"]
  - apiGroups: ["cert-manager.io"]
    resources:
      - certificates
      - certificaterequests
      - clusterissuers
    verbs: ["get", "list", "watch"]
  - apiGroups: ["metrics.k8s.io"]
    resources:
      - pods
      - nodes
    verbs: ["get", "list"]
  # ---- Metrics / health endpoints ----
  - nonResourceURLs: ["/metrics", "/metrics/*"]
    verbs: ["get"]
  # ---- Narrow mutate allowlist (idempotent, safe remediation) ----
  # Restart a stuck pod by deleting it (its controller recreates it).
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete", "patch"]
  # `kubectl rollout restart` and scaling for the apps/batch controllers.
  - apiGroups: ["apps"]
    resources:
      - deployments
      - statefulsets
      - daemonsets
      - replicasets
    verbs: ["patch", "update"]
  - apiGroups: ["batch"]
    resources:
      - jobs
      - cronjobs
    verbs: ["patch", "update", "delete"]
  # Exec into pods for log-style / debug inspection (granted per request #5).
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
  name: platform-engineer
 roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: platform-engineer
 subjects:
  - kind: ServiceAccount
    name: platform-engineer
    namespace: platform-engineer