gitea registry ingress

new platform engineer agent
monitoring: add dashboard ideas doc
2026-06-27 11:46:53 +02:00 · 2026-06-27 00:09:39 +02:00 · 2026-06-26 20:22:54 +02:00 · 2026-06-26 19:48:17 +02:00 · 2026-06-26 19:01:08 +02:00 · 2026-06-26 18:54:17 +02:00
28 changed files with 3019 additions and 41 deletions
--- a/argocd/apps/platform-engineer.yaml
+++ b/argocd/apps/platform-engineer.yaml
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: platform-engineer
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: platform-engineer
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: platform-engineer
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
--- a/argocd/gen-apps.sh
+++ b/argocd/gen-apps.sh
@@ -38,6 +38,7 @@ APPS=(
  "openwebui|openwebui|openwebui|true|true"
  "phoenix|phoenix|phoenix|true|false"
  "pihole|pihole|pihole|true|true"
+  "platform-engineer|platform-engineer|platform-engineer|true|true"
  "qbittorrent|qbittorrent|qbittorrent|true|true"
  "vaultwarden|vaultwarden|vaultwarden|true|true"
 )
--- a/gitea/registry-ingress.yaml
+++ b/gitea/registry-ingress.yaml
@@ -0,0 +1,42 @@
+# Dedicated DNS-only hostname for the Gitea container registry.
+#
+# WHY: Docker registry pushes can't go through the Cloudflare proxy, which caps
+# request bodies at 100 MB (413 Payload Too Large). `registry.rogi.casa` is a
+# DNS-only (grey-cloud) record in Cloudflare pointing straight at the cluster,
+# so Traefik serves it directly with a Let's Encrypt cert (HTTP-01). Git traffic
+# on `git.rogi.casa` stays behind the Cloudflare proxy untouched.
+#
+# Cloudflare setup:
+#   A    registry.rogi.casa   <cluster-public-IP>   DNS-only (grey cloud)
+#
+# Push with:
+#   docker login registry.rogi.casa -u <gitea-user>
+#   docker tag git.rogi.casa/roger/hermes-agent:v1.35-1 registry.rogi.casa/roger/hermes-agent:v1.35-1
+#   docker push registry.rogi.casa/roger/hermes-agent:v1.35-1
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: gitea-registry
+  namespace: gitea
+  annotations:
+    cert-manager.io/cluster-issuer: letsencrypt-prod
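+    # Intended to allow large docker layer uploads. NOTE: this is Traefik v1
+    # annotation syntax; Traefik v2+ ignores it and, with no Buffering middleware
+    # attached, applies no request-body cap anyway.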
+    traefik.ingress.kubernetes.io/buffering: |
+      maxRequestBodyBytes: 0
+spec:
+  ingressClassName: traefik
+  tls:
+  - hosts:
+    - registry.rogi.casa
+    secretName: gitea-registry-tls
+  rules:
+  - host: registry.rogi.casa
+    http:
+      paths:
+      - path: /
+        pathType: Prefix
+        backend:
+          service:
+            name: gitea
+            port:
+              number: 80
--- a/homeassistant/homeassistant.yaml
+++ b/homeassistant/homeassistant.yaml
@@ -32,7 +32,9 @@ data:
    http:
      use_x_forwarded_for: true
      trusted_proxies:
-        - 10.88.88.0/24
+        - 10.42.0.0/16   # k3s pod CIDR (Traefik pod lives here)
+        - 10.43.0.0/16   # k3s service CIDR
+        - 10.88.20.0/24  # node subnet (Traefik runs hostNetwork-ish, forwards from 10.88.20.11)
 ---
 apiVersion: apps/v1
 kind: Deployment
--- a/litellm/litellm.yaml
+++ b/litellm/litellm.yaml
@@ -26,7 +26,12 @@ data:
      - model_name: glm-4.7-flash
        litellm_params:
          model: ollama/glm-4.7-flash
-          api_base: http://10.88.88.235:11434
+          api_base: http://10.88.20.12:11434
+      # Used by the platform-engineer Hermes agent (deployed in ns platform-engineer).
+      - model_name: qwen-3.6:27b
+        litellm_params:
+          model: ollama/qwen3.6:27b
+          api_base: http://10.88.20.12:11434
    litellm_settings:
      #set_verbose: True  # Uncomment this if you want to see verbose logs; not recommended in production
      callbacks: ["arize_phoenix"]
--- a/minecraft-server/minecraft-server.yaml
+++ b/minecraft-server/minecraft-server.yaml
@@ -12,10 +12,12 @@ metadata:
  labels:
    app: minecraft-server
 spec:
-  type: ClusterIP
+  type: LoadBalancer
+  loadBalancerIP: 10.88.20.103
  ports:
    - name: minecraft
      port: 25565
      targetPort: 25565
+      protocol: TCP
  selector:
    app: minecraft-server
--- a/monitoring/dashboard-ideas.md
+++ b/monitoring/dashboard-ideas.md
@@ -0,0 +1,354 @@
+# Dashboard Ideas
+
+This file collects ideas for additional Grafana dashboards to build for the
+`rogi.casa` k3s cluster. Each idea notes the **data source** (metrics already
+available vs. metrics that need to be enabled) and a rough panel layout.
+
+To actually add a dashboard, create a `grafana-dashboard-<name>.yaml` ConfigMap
+in this folder, mount it in `grafana-deployment.yaml` (add a volume +
+volumeMount under `/var/lib/grafana/dashboards/<name>`), commit and push.
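+
+A minimal sketch of the volume wiring described above, for a hypothetical
+dashboard named `traefik`. The container name `grafana` and the surrounding
+layout of `grafana-deployment.yaml` are assumptions; adapt them to the real
+manifest.
+
+```yaml
+# Fragment of grafana-deployment.yaml (spec.template.spec), not a full manifest.
+containers:
+  - name: grafana                      # assumed container name
+    volumeMounts:
+      - name: dashboard-traefik
+        mountPath: /var/lib/grafana/dashboards/traefik
+        readOnly: true
+volumes:
+  - name: dashboard-traefik
+    configMap:
+      name: grafana-dashboard-traefik  # hypothetical ConfigMap from grafana-dashboard-traefik.yaml
+```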
+
+---
+
+## Already-scraped services (ready to dashboard now)
+
+These exporters/services are **already being scraped by Prometheus** — dashboards
+can be built immediately with no infra changes.
+
+### 1. Traefik (Ingress) — `traefik_*`
+Traefik is scraped via the `kubernetes-pods` job (pod annotation on
+`traefik-9bcdbbd9-x8zq4` in `kube-system`). It exposes request counters, entry
+point latency, TLS handshakes, config reloads.
+
+**Panels:**
+- Requests/sec by entrypoint (web / websecure / traefik) — `rate(traefik_entrypoint_requests_total[5m])`
+- Request latency p50/p95/p99 — `histogram_quantile(0.95, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le, entrypoint))`
+- HTTP status code distribution (2xx/3xx/4xx/5xx) — `sum by (code) (rate(traefik_entrypoint_requests_total[5m]))`; the `code` label holds the numeric status, so select a class with a regex such as `code=~"5.."`
+- TLS requests/sec by TLS version (the metric counts HTTPS requests, not handshakes) — `sum by (tls_version) (rate(traefik_entrypoint_requests_tls_total[5m]))`
+- Config reloads + last reload success — `traefik_config_reloads_total`, `traefik_config_last_reload_success`
+- Top routes/services by request volume — `topk(10, sum by (service) (rate(traefik_service_requests_total[5m])))`
+- Bytes transferred in/out — `rate(traefik_entrypoint_requests_bytes_total[5m])` (in), `rate(traefik_entrypoint_responses_bytes_total[5m])` (out)
+
+**Why useful:** This is your front door. Knowing which routes get hit most,
+latency per ingress, and 5xx spikes is the single most valuable app-level
+dashboard in the cluster.
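+
+The 5xx panel above also works as an alert. The rule below is a sketch only:
+it assumes Prometheus is configured to load rule files, and the group name,
+alert name, 5% threshold and 10-minute window are illustrative choices, not
+values from this cluster.
+
+```yaml
+groups:
+  - name: traefik                      # hypothetical rule group
+    rules:
+      - alert: TraefikHigh5xxRatio     # hypothetical alert name
+        # Share of requests per entrypoint answered with a 5xx status.
+        expr: |
+          sum by (entrypoint) (rate(traefik_entrypoint_requests_total{code=~"5.."}[5m]))
+            /
+          sum by (entrypoint) (rate(traefik_entrypoint_requests_total[5m]))
+            > 0.05
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: 'More than 5% of requests on {{ $labels.entrypoint }} return 5xx'
+```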
+
+---
+
+### 2. CoreDNS (cluster DNS) — `coredns_*`
+Scraped via `kube-dns` Service annotation. Exposes query rate, cache hits,
+error types, response duration.
+
+**Panels:**
+- DNS queries/sec by zone / type — `rate(coredns_dns_requests_total[5m])`
+- Cache hit ratio — `rate(coredns_cache_hits_total[5m]) / rate(coredns_cache_requests_total[5m])`
+- DNS query latency p95 — `histogram_quantile(0.95, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))`
+- Queries by response code (NOERROR / NXDOMAIN / SERVFAIL) — `rate(coredns_dns_responses_total[5m])`
+- Cache size — `coredns_cache_entries`
+- Forward requests/sec (upstream DNS) — `rate(coredns_forward_requests_total[5m])`
+
+**Why useful:** DNS issues cause cascading failures (ImagePullBackOff, cert
+challenges, etc.). A spike in NXDOMAIN/SERVFAIL is an early warning.
+
+---
+
+### 3. MetalLB (LoadBalancer) — `metallb_*`
+Scraped via pod annotation on `speaker-*` and `controller` in `metallb-system`.
+Exposes IP allocation usage, BGP/session state.
+
+**Panels:**
+- IP addresses in use vs. total — `metallb_allocator_addresses_in_use_total` / `metallb_allocator_addresses_total`
+- IP pool utilization % (gauge) — `metallb_allocator_addresses_in_use_total / metallb_allocator_addresses_total * 100`
+- BGP session up per speaker — `metallb_bgp_session_up`
+- Config loaded / stale status — `metallb_k8s_client_config_loaded_bool`, `metallb_k8s_client_config_stale_bool`
+- Announced prefixes and BGP updates per speaker (BGP mode only) — `metallb_bgp_announced_prefixes_total`, `rate(metallb_bgp_updates_total[5m])`
+
+**Why useful:** If MetalLB runs out of IPs, new LoadBalancer services will
+hang in `<pending>`. Knowing pool utilization lets you act before that happens.
+
+---
+
+### 4. cert-manager (TLS certificates) — `certmanager_*`
+Scraped via pod annotations on cert-manager pods. Exposes certificate
+expiration, renewal, ready status, ACME challenges.
+
+**Panels:**
+- Certificate expiration (days remaining, sorted) — table of `(certmanager_certificate_not_after_timestamp_seconds - time()) / 86400`
+- Certificates not Ready — `certmanager_certificate_ready_status{condition!="True"} == 1` (the `condition` label carries True/False/Unknown; there is no `status` label)
+- Upcoming renewals (next 14 days) — `certmanager_certificate_renewal_timestamp_seconds`
+- ACME challenge status — `certmanager_certificate_challenge_status`
+- Failed renewals counter — `rate(certmanager_certificate_renewal_total{condition="Failed"}[1h])`
+
+**Why useful:** A cert about to expire (or silently failing to renew) is the
+kind of thing that takes down `*.rogi.casa` HTTPS with no warning. This is a
+must-have alert/dashboard.
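+
+A sketch of the matching alert, assuming Prometheus loads rule files. It uses
+`certmanager_certificate_expiration_timestamp_seconds`, the long-standing
+expiry metric; the rule names and the 14-day threshold are illustrative
+assumptions.
+
+```yaml
+groups:
+  - name: cert-manager                 # hypothetical rule group
+    rules:
+      - alert: CertificateExpiringSoon # hypothetical alert name
+        # Days until expiry, per Certificate resource.
+        expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 14
+        for: 1h
+        labels:
+          severity: warning
+        annotations:
+          summary: 'Certificate {{ $labels.namespace }}/{{ $labels.name }} expires in under 14 days'
+```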
+
+---
+
+### 5. Phoenix (trace store) — `phoenix_*`
+Already scraped via the `phoenix` Service annotation. Exposes bulk loader
+ingestion rates, span insertion times, retention sweeper, exceptions.
+
+**Panels:**
+- Span ingestion rate — `rate(phoenix_bulk_loader_span_insertion_time_seconds_count[5m])`
+- Span insertion latency p95 — `histogram_quantile(0.95, sum(rate(phoenix_bulk_loader_span_insertion_time_seconds_bucket[5m])) by (le))`
+- Span exceptions/sec — `rate(phoenix_bulk_loader_span_exceptions_total[5m])`
+- Retention sweeper last run — `phoenix_retention_sweeper_last_run_seconds`
+- Last activity timestamp — `phoenix_bulk_loader_last_activity_timestamp_seconds`
+
+**Why useful:** Phoenix is your observability backend's own backend. Tracking
+ingestion health tells you whether traces are landing.
+
+---
+
+## Infrastructure dashboards (compose from existing metrics)
+
+### 6. Storage & PVC Health (KSM + kubelet + node-exporter)
+Cross-source dashboard combining `kube_persistentvolumeclaim_*` (KSM),
+`kubelet_volume_stats_*` (kubelet), and `node_filesystem_*` (node-exporter).
+
+**Panels:**
+- PVC usage % per claim — `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100`
+- PVC requested vs. capacity — `kube_persistentvolumeclaim_resource_requests_storage_bytes` vs actual
+- Node disk usage % (all mounts) — `(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100`
+- Inode usage % per mount — `(1 - node_filesystem_files_free / node_filesystem_files) * 100`
+- Volume binding status (Bound/Pending) — `kube_persistentvolumeclaim_status_phase`
+- Top 10 PVCs by usage (table)
+
+**Why useful:** The `local-path` provisioner fills up node disks. Catching a
+PVC at 95% before it errors is a lifesaver.
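+
+A sketch of an early-warning rule for the PVC usage panel. It assumes the
+`kubelet_volume_stats_*` series are actually present for your volumes (not
+every volume type reports them) and that Prometheus loads rule files; the 90%
+threshold and the names are illustrative.
+
+```yaml
+groups:
+  - name: storage                      # hypothetical rule group
+    rules:
+      - alert: PVCAlmostFull           # hypothetical alert name
+        # Used fraction of each claim's capacity, as reported by the kubelet.
+        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.9
+        for: 15m
+        labels:
+          severity: warning
+        annotations:
+          summary: 'PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is over 90% full'
+```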
+
+---
+
+### 7. Workload Health (KSM)
+Uses kube-state-metrics to show deployment/StatefulSet/CronJob health cluster-wide.
+
+**Panels:**
+- Deployments with unavailable replicas — `kube_deployment_status_replicas_available < kube_deployment_status_replicas`
+- Pods not in Running phase by namespace — `kube_pod_status_phase{phase!="Running"} == 1` (KSM emits a 0-valued series for every phase a pod is not in, so filter on `== 1`)
+- Container restarts (last 1h) — `increase(kube_pod_container_status_restarts_total[1h])`
+- Pods stuck in CrashLoopBackOff / ImagePullBackOff — `kube_pod_container_status_waiting_reason{reason=~"CrashLoopBackOff|ImagePullBackOff"}`
+- Job failures — `kube_job_failed`
+- CronJob schedule heatmap — `kube_cronjob_status_active`
+- HPA status (if any autoscaled) — `kube_horizontalpodautoscaler_status_current_replicas` vs desired
+
+**Why useful:** This is the "is anything broken" board. Notice you already have
+some pods in `ImagePullBackOff` (myorg-assistant) — this dashboard surfaces that.
+
+---
+
+### 8. etcd / Control Plane Health (if exposed)
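+k3s runs embedded etcd only when started with `--cluster-init` (or when joining
+such a cluster); a default single-server install uses SQLite, which has no etcd
+metrics. With embedded etcd, expose the `/metrics` endpoint with the k3s server
+flag `--etcd-expose-metrics` (the counterpart of etcd's own
+`--listen-metrics-urls`). **Requires config change to enable.**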
+
+**Panels:**
+- Leader changes — `etcd_server_leader_changes_seen_total`
+- Proposal commits/sec — `rate(etcd_server_proposals_committed_total[5m])`
+- Proposal failures/sec — `rate(etcd_server_proposals_failed_total[5m])`
+- DB size — `etcd_mvcc_db_total_size_in_bytes`
+- RPC latency p99 — `histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (le))` (this histogram is only exported when etcd runs with `--metrics=extensive`)
+- Active watchers — `etcd_debugging_mvcc_watcher_total`
+
+**Why useful:** etcd is the brain of the cluster. Slow commits or a flipping
+leader indicates control-plane trouble.
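+
+On k3s the config change can be made in the server config file instead of on
+the command line. A minimal sketch, assuming the cluster uses embedded etcd and
+the default config path; a Prometheus scrape target for the etcd metrics port
+(typically 2381) still has to be added separately and is not shown here.
+
+```yaml
+# /etc/rancher/k3s/config.yaml on each server node; restart k3s afterwards.
+# Equivalent to passing --etcd-expose-metrics to `k3s server`.
+etcd-expose-metrics: true
+```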
+
+---
+
+## App-service dashboards (require enabling metrics first)
+
+Most of your apps don't expose `/metrics` yet. Below is the per-service setup
+plus the dashboard idea once metrics are on. To enable scraping for any of
+these, annotate the Service with:
+
+```yaml
+metadata:
+  annotations:
+    prometheus.io/scrape: "true"
+    prometheus.io/port: "<port>"
+```
+
+The existing `kubernetes-service-endpoints` scrape job will pick them up
+automatically — **no Prometheus config edit needed**.
+
+### 9. LiteLLM (LLM gateway) — needs enabling
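+LiteLLM serves Prometheus metrics on its API port (`/metrics`) once the
+`prometheus` callback is added to `litellm_settings.callbacks`. Metric names
+vary between LiteLLM versions, so confirm the ones below against `/metrics`.
+Then annotate the `litellm` Service.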
+
+**Panels:**
+- Requests/sec by model — `rate(litellm_requests_total[5m])` by `model`
+- Token usage (prompt/completion/total) — `rate(litellm_total_tokens_total[5m])`
+- Spend by model — `litellm_spend_total` (if cost tracking enabled)
+- Latency p95 per model — `histogram_quantile(0.95, ...)`
+- Error rate by model — `rate(litellm_requests_total{status=~"5.."}[5m])`
+- Rate-limit / quota hits
+
+**Why useful:** LiteLLM is the gateway for all your AI apps (open-webui,
+myorg-assistant, etc.). Token spend + per-model latency is the single best
+cost/quality lever in the cluster.
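+
+A sketch of the two changes involved, under stated assumptions: the callback
+list extends the existing `arize_phoenix` entry in `litellm/litellm.yaml`, and
+port `4000` is LiteLLM's usual default rather than a value confirmed for this
+deployment. Check the LiteLLM docs for your version before relying on it.
+
+```yaml
+# 1) Fragment of the LiteLLM proxy config: add the prometheus callback.
+litellm_settings:
+  callbacks: ["arize_phoenix", "prometheus"]
+---
+# 2) Fragment of the litellm Service: opt in to scraping.
+metadata:
+  annotations:
+    prometheus.io/scrape: "true"
+    prometheus.io/port: "4000"   # assumed proxy port
+```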
+
+---
+
+### 10. Gitea (git + CI) — needs enabling
+Gitea exposes metrics at `/metrics` when `ENABLED = true` is set in the
+`[metrics]` section of `app.ini`. Annotate the `gitea-http` Service (port 3000
+inside, 80 via svc).
+
+**Panels:**
+- Git push/clone/fetch rate — `gitea_actions_total` by `action`
+- Active users / repos / orgs — `gitea_users_total`, `gitea_repos_total`
+- Issues / PRs open — `gitea_issues_total`, `gitea_pulls_total`
+- HTTP request rate + latency
+- Gitea Actions runner job duration — if runner metrics exposed
+
+**Why useful:** Gitea hosts the cluster's own GitOps repo + CI. Tracking push
+rate and runner throughput catches CI storms.
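+
+If Gitea runs from the official container image, the `app.ini` setting can be
+supplied as an environment variable instead of editing the file. A minimal
+sketch, assuming that image and its `GITEA__<section>__<KEY>` convention; the
+fragment belongs in the Gitea container spec.
+
+```yaml
+# Fragment of the Gitea container spec (assumes the official image).
+env:
+  - name: GITEA__metrics__ENABLED   # maps to [metrics] ENABLED in app.ini
+    value: "true"
+```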
+
+---
+
+### 11. Home Assistant — needs enabling
+HA exposes Prometheus metrics via the `prometheus` integration (add to
+`configuration.yaml`). Then annotate the Service.
+
+**Panels:**
+- Active entities / sensors by domain
+- State change events/sec — `homeassistant_entity_states_total`
+- Automation triggers/sec — `homeassistant_automation_triggered_total`
+- Integrations loaded + errors
+- Database size / recorder queue depth
+- Zigbee/Z-Wave mesh health (if exposed)
+
+**Why useful:** HA is a home-critical service. Event/sec spikes often indicate
+sensor flapping or runaway automations.
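+
+A sketch of the Home Assistant side, with assumptions called out: HA serves
+metrics at `/api/prometheus` and requires a token by default, so this example
+turns auth off for in-cluster scraping (only reasonable if the endpoint is not
+reachable from outside). It also assumes the scrape job honors the
+`prometheus.io/path` annotation and that HA listens on its default port 8123.
+
+```yaml
+# 1) Fragment of configuration.yaml: enable the prometheus integration.
+prometheus:
+  namespace: hass          # optional metric name prefix
+  requires_auth: false     # assumption: endpoint is cluster-internal only
+---
+# 2) Fragment of the Home Assistant Service: opt in to scraping.
+metadata:
+  annotations:
+    prometheus.io/scrape: "true"
+    prometheus.io/port: "8123"            # assumed default HA port
+    prometheus.io/path: "/api/prometheus"
+```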
+
+---
+
+### 12. Jellyfin — limited
+Jellyfin doesn't ship first-class Prometheus metrics, but you can scrape it
+via a sidecar (`jellyfin-prometheus-exporter`) or build a blackbox-style
+dashboard on the `/health` endpoint.
+
+**Panels:**
+- Active streams — from exporter
+- Transcode sessions + hw accel usage
+- Library size by media type
+- Playback errors
+
+---
+
+### 13. Pi-hole — needs enabling
+Pi-hole exposes metrics on its FTL web API; the `pihole-exporter` sidecar
+converts them to Prometheus format. Add as a sidecar container + annotate.
+
+**Panels:**
+- DNS queries/sec (total, blocked, cached, forwarded)
+- Block list size
+- Top blocked domains
+- Top permitted domains
+- Clients by query volume
+- Cache hit ratio
+
+**Why useful:** Pi-hole is your network-wide adblock. Block rate + cache ratio
+are the headline metrics, and query spikes reveal misbehaving clients.
+
+---
+
+### 14. PostgreSQL (litellm + phoenix + n8n) — needs enabling
+You have two Postgres instances (`postgres` in `litellm` and `phoenix`).
+Add `prometheus-postgres-exporter` as a sidecar or Deployment per DB.
+
+**Panels (per DB):**
+- Connections (active / idle / max) — `pg_stat_activity_count`
+- Transactions/sec — `rate(pg_stat_database_xact_commit[5m])`
+- Cache hit ratio — `pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read)`
+- Table + index bloat
+- Replication lag (if replicas)
+- Slow queries (if `pg_stat_statements` enabled)
+- DB size growth — `pg_database_size_bytes`
+
+**Why useful:** DB connection exhaustion and cache ratio collapse are the two
+most common causes of slow app performance.
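+
+A sketch of the sidecar variant using the community `postgres_exporter` image.
+The Secret name and key are hypothetical, and the image tag is only an example
+to pin; the exporter listens on port 9187 by default.
+
+```yaml
+# Extra entry for the postgres pod's spec.containers list.
+- name: postgres-exporter
+  image: quay.io/prometheuscommunity/postgres-exporter:v0.15.0  # example tag, pin your own
+  ports:
+    - name: metrics
+      containerPort: 9187
+  env:
+    - name: DATA_SOURCE_NAME            # connection string read by the exporter
+      valueFrom:
+        secretKeyRef:
+          name: postgres-exporter-dsn   # hypothetical Secret
+          key: dsn                      # hypothetical key
+```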
+
+---
+
+### 15. Minecraft — limited
+The Minecraft server exposes metrics via RCON + an exporter
+(`minecraft-exporter`). Add as sidecar using the existing `RCON_PASSWORD`.
+
+**Panels:**
+- Players online — `minecraft_players_online`
+- TPS (ticks per second) — `minecraft_tps` (server health)
+- Entities loaded — `minecraft_entities_total`
+- Chunk count — `minecraft_chunks_loaded`
+- Memory used by JVM
+
+**Why useful:** TPS < 20 means lag. Player count vs. server load is the only
+real signal a Minecraft server needs.
+
+---
+
+### 16. qBittorrent — limited
+No native metrics. Options: a `qbittorrent-exporter` sidecar (uses the WebUI
+API), or a blackbox probe on the WebUI.
+
+**Panels:**
+- Download/upload speed
+- Active torrents
+- Torrent count by state (downloading/seeding/paused)
+- Disk usage in download dir
+
+---
+
+## Cluster meta dashboards
+
+### 17. Network Topology / Service Map
+Composite view: for each namespace, list services, their pods, scrape status,
+and request volume (from Traefik logs + cAdvisor network). A "what talks to
+what" overview.
+
+**Panels:**
+- Service → pod → container resource table
+- Cross-namespace network flows (if network policy logging enabled)
+- Scrape health matrix (every target up/down)
+- Ingress route → backend service map
+
+---
+
+### 18. Backup / Snapshot Status
+If you take Velero snapshots or local-path snapshots, build a dashboard on
+`velero_*` or CRD status. **Requires Velero.**
+
+**Panels:**
+- Last successful backup per namespace
+- Failed backups
+- Backup size growth
+- Restore test status
+
+---
+
+### 19. Cost / Capacity Planning
+Composite: per-namespace CPU/memory requests vs. actual usage, projected
+growth, node saturation forecast.
+
+**Panels:**
+- Requests vs. limits vs. actual (per namespace) — KSM + cAdvisor
+- Node capacity vs. allocatable
+- PVC growth trend + 30-day forecast
+- "What if I removed node X" simulation (capacity headroom)
+
+**Why useful:** Tells you when you'll need another node before you hit the wall.
+
+---
+
+## Recommended priority order
+
+If you only build a few, do them in this order (highest value-to-effort first):
+
+1. **Traefik Ingress** (#1) — already scraped, your front door
+2. **Storage & PVC Health** (#6) — local-path fills disks; high blast radius
+3. **Workload Health** (#7) — surfaces CrashLoopBackOff / ImagePullBackOff
+4. **cert-manager** (#4) — prevents silent cert expiry outages
+5. **CoreDNS** (#2) — early warning for DNS cascades
+6. **LiteLLM** (#9) — needs only the `prometheus` callback and a `prometheus.io/scrape` annotation; big insights
+7. **MetalLB** (#3) — small but catches LoadBalancer IP exhaustion
+
+The remaining dashboards (#5, #8, #10–#19) are nice-to-have or require additional exporters/config.
--- a/monitoring/grafana-dashboard-cluster-overview.yaml
+++ b/monitoring/grafana-dashboard-cluster-overview.yaml
@@ -0,0 +1,331 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: grafana-dashboard-cluster-overview
+  namespace: monitoring
+  labels:
+    app: grafana
+    grafana_dashboard: "1"
+data:
+  cluster-overview.json: |
+    {
+      "annotations": {"list": []},
+      "editable": true,
+      "fiscalYearStartMonth": 0,
+      "graphTooltip": 1,
+      "id": null,
+      "links": [],
+      "liveNow": false,
+      "panels": [
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {"color": "green", "value": null}
+                ]
+              },
+              "unit": "s"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 5, "w": 4, "x": 0, "y": 0},
+          "id": 1,
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "auto",
+            "orientation": "auto",
+            "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
+            "textMode": "auto"
+          },
+          "pluginVersion": "10.2.3",
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "time() - max(process_start_time_seconds{job=\"prometheus\"})", "refId": "A"}],
+          "title": "Prometheus Uptime",
+          "type": "stat"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {"color": "red", "value": null},
+                  {"color": "green", "value": 1}
+                ]
+              }
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 5, "w": 4, "x": 4, "y": 0},
+          "id": 2,
+          "options": {
+            "colorMode": "background",
+            "graphMode": "none",
+            "justifyMode": "center",
+            "orientation": "horizontal",
+            "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
+            "textMode": "value_and_name"
+          },
+          "pluginVersion": "10.2.3",
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(kubelet_running_pods)", "refId": "A"}],
+          "title": "Running Pods (total)",
+          "type": "stat"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {"color": "green", "value": null}
+                ]
+              }
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 5, "w": 4, "x": 8, "y": 0},
+          "id": 3,
+          "options": {
+            "colorMode": "background",
+            "graphMode": "none",
+            "justifyMode": "center",
+            "orientation": "horizontal",
+            "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
+            "textMode": "value_and_name"
+          },
+          "pluginVersion": "10.2.3",
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(kubelet_running_containers{container_state=\"running\"})", "refId": "A"}],
+          "title": "Running Containers",
+          "type": "stat"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "mappings": [
+                {"options": {"0": {"text": "Down", "color": "red"}, "1": {"text": "Up", "color": "green"}}, "type": "value"}
+              ],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  {"color": "red", "value": null},
+                  {"color": "green", "value": 1}
+                ]
+              }
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 5, "w": 12, "x": 12, "y": 0},
+          "id": 4,
+          "options": {
+            "colorMode": "background",
+            "graphMode": "none",
+            "justifyMode": "center",
+            "orientation": "horizontal",
+            "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
+            "textMode": "value_and_name"
+          },
+          "pluginVersion": "10.2.3",
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "up{job=\"kubernetes-apiservers\"}", "refId": "A"}, {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "up{job=\"kubernetes-nodes\"}", "refId": "B"}],
+          "title": "Control Plane & Node Exporters",
+          "type": "stat"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "axisCenteredZero": false,
+                "axisColorMode": "text",
+                "axisLabel": "",
+                "axisPlacement": "auto",
+                "barAlignment": 0,
+                "drawStyle": "line",
+                "fillOpacity": 10,
+                "gradientMode": "none",
+                "hideFrom": {"legend": false, "tooltip": false, "viz": false},
+                "insertNulls": false,
+                "lineInterpolation": "linear",
+                "lineWidth": 1,
+                "pointSize": 5,
+                "scaleDistribution": {"type": "linear"},
+                "showPoints": "never",
+                "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"},
+                "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "bytes"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 5},
+          "id": 10,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(container_memory_working_set_bytes{container!=\"\",container!=\"POD\"}) by (namespace)", "legendFormat": "{{namespace}}", "refId": "A"}],
+          "title": "Memory Usage by Namespace",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "core"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 5},
+          "id": 11,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\",container!=\"POD\"}[5m])) by (namespace)", "legendFormat": "{{namespace}}", "refId": "A"}],
+          "title": "CPU Usage by Namespace",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "Bps"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 14},
+          "id": 12,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_network_receive_bytes_total[5m])) by (namespace)", "legendFormat": "RX {{namespace}}", "refId": "A"},
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_network_transmit_bytes_total[5m])) by (namespace)", "legendFormat": "TX {{namespace}}", "refId": "B"}
+          ],
+          "title": "Network RX/TX by Namespace",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "decbytes"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 14},
+          "id": 13,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(container_fs_usage_bytes) by (instance)", "legendFormat": "{{instance}}", "refId": "A"}],
+          "title": "Filesystem Usage by Node",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 24, "x": 0, "y": 23},
+          "id": 20,
+          "options": {
+            "showHeader": true,
+            "cellHeight": "sm",
+            "footer": {"show": false, "reducer": ["sum"], "countRows": false, "fields": ""}
+          },
+          "pluginVersion": "10.2.3",
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sort_desc(sum(container_memory_working_set_bytes{container!=\"\",container!=\"POD\"}) by (namespace,pod))", "format": "table", "instant": true, "refId": "A"}],
+          "title": "Pods by Memory (live)",
+          "type": "table",
+          "transformations": [
+            {"id": "organize", "options": {"excludeByName": {"Time": true}, "renameByName": {"Value": "Memory (bytes)"}}}
+          ]
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "orange", "value": 1}, {"color": "red", "value": 5}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 24, "x": 0, "y": 32},
+          "id": 30,
+          "options": {
+            "showHeader": true,
+            "cellHeight": "sm",
+            "footer": {"show": false, "reducer": ["sum"], "countRows": false, "fields": ""}
+          },
+          "pluginVersion": "10.2.3",
+          "targets": [
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(kube_pod_status_phase{phase=\"Running\"}) by (namespace)", "format": "table", "instant": true, "refId": "A"},
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(kube_pod_status_phase{phase=\"Pending\"}) by (namespace)", "format": "table", "instant": true, "refId": "B"},
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(kube_pod_status_phase{phase=\"Failed\"}) by (namespace)", "format": "table", "instant": true, "refId": "C"},
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace)", "format": "table", "instant": true, "refId": "D"}
+          ],
+          "title": "Pod Health by Namespace (KSM)",
+          "type": "table",
+          "transformations": [
+            {"id": "merge", "options": {}},
+            {"id": "groupBy", "options": {"fields": {"Value #A": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #B": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #C": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #D": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "namespace": {"aggregations": [], "operation": "groupby"}}}},
+            {"id": "organize", "options": {"excludeByName": {"Time": true}, "renameByName": {"Value #A": "Running", "Value #B": "Pending", "Value #C": "Failed", "Value #D": "Restarts (1h)"}}}
+          ]
+        }
+      ],
+      "refresh": "30s",
+      "schemaVersion": 38,
+      "style": "dark",
+      "tags": ["k3s", "overview"],
+      "templating": {"list": []},
+      "time": {"from": "now-6h", "to": "now"},
+      "timepicker": {},
+      "timezone": "",
+      "title": "Cluster Overview",
+      "uid": "k3s-cluster-overview",
+      "version": 2,
+      "weekStart": ""
+    }
--- a/monitoring/grafana-dashboard-control-plane.yaml
+++ b/monitoring/grafana-dashboard-control-plane.yaml
@@ -0,0 +1,209 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: grafana-dashboard-control-plane
+  namespace: monitoring
+  labels:
+    app: grafana
+    grafana_dashboard: "1"
+data:
+  control-plane.json: |
+    {
+      "annotations": {"list": []},
+      "editable": true,
+      "graphTooltip": 1,
+      "id": null,
+      "links": [],
+      "liveNow": false,
+      "panels": [
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "reqps"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 0},
+          "id": 1,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(apiserver_request_total[5m])) by (verb)", "legendFormat": "{{verb}}", "refId": "A"}],
+          "title": "API Server Requests by Verb",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "s"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 0},
+          "id": 2,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "histogram_quantile(0.95, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))", "legendFormat": "p95 {{verb}}", "refId": "A"}],
+          "title": "API Server Request Latency p95",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "ops"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 9},
+          "id": 3,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(apiserver_request_total{code=~\"5..\"}[5m])) by (verb)", "legendFormat": "{{verb}}", "refId": "A"}],
+          "title": "API Server 5xx Errors",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "s"
+            },
+            "overrides": [{"matcher": {"id": "byName", "options": "log fs"}, "properties": [{"id": "unit", "value": "Bps"}]}]
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 9},
+          "id": 4,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(deriv(kubelet_container_log_filesystem_used_bytes[5m]))", "legendFormat": "log fs", "refId": "A"}, {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "histogram_quantile(0.95, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))", "legendFormat": "pod start p95", "refId": "B"}],
+          "title": "Kubelet Pod Start Latency",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "s"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 18},
+          "id": 5,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "histogram_quantile(0.95, sum(rate(kubelet_cgroup_manager_duration_seconds_bucket[5m])) by (le, instance))", "legendFormat": "{{instance}}", "refId": "A"}],
+          "title": "Kubelet Cgroup Manager Duration p95",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 18},
+          "id": 6,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "rate(kubelet_pleg_relist_duration_seconds_count[5m])", "legendFormat": "relists/s {{instance}}", "refId": "A"}],
+          "title": "Kubelet PLEG Relist Rate",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "mappings": [
+                {"options": {"0": {"text": "Down", "color": "red"}, "1": {"text": "Up", "color": "green"}}, "type": "value"}
+              ],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}, {"color": "green", "value": 1}]}
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 6, "w": 24, "x": 0, "y": 27},
+          "id": 7,
+          "options": {
+            "colorMode": "background",
+            "graphMode": "none",
+            "justifyMode": "center",
+            "orientation": "horizontal",
+            "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
+            "textMode": "value_and_name"
+          },
+          "pluginVersion": "10.2.3",
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "up", "refId": "A"}],
+          "title": "All Scrape Targets Status",
+          "type": "stat"
+        }
+      ],
+      "refresh": "30s",
+      "schemaVersion": 38,
+      "style": "dark",
+      "tags": ["k3s", "control-plane"],
+      "templating": {"list": []},
+      "time": {"from": "now-6h", "to": "now"},
+      "timepicker": {},
+      "timezone": "",
+      "title": "Control Plane & API Server",
+      "uid": "k3s-control-plane",
+      "version": 1,
+      "weekStart": ""
+    }
--- a/monitoring/grafana-dashboard-nodes.yaml
+++ b/monitoring/grafana-dashboard-nodes.yaml
@@ -0,0 +1,279 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: grafana-dashboard-nodes
+  namespace: monitoring
+  labels:
+    app: grafana
+    grafana_dashboard: "1"
+data:
+  nodes.json: |
+    {
+      "annotations": {"list": []},
+      "editable": true,
+      "graphTooltip": 1,
+      "id": null,
+      "links": [],
+      "liveNow": false,
+      "panels": [
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 6, "w": 6, "x": 0, "y": 0},
+          "id": 1,
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "auto",
+            "orientation": "auto",
+            "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
+            "textMode": "value_and_name"
+          },
+          "pluginVersion": "10.2.3",
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "kubelet_running_pods", "refId": "A"}, {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "kubelet_running_containers", "refId": "B"}],
+          "title": "Pods / Containers per Node",
+          "type": "stat"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "orange", "value": 70}, {"color": "red", "value": 90}]},
+              "unit": "percent",
+              "min": 0,
+              "max": 100
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 6, "w": 18, "x": 6, "y": 0},
+          "id": 2,
+          "options": {
+            "colorMode": "background",
+            "graphMode": "area",
+            "justifyMode": "auto",
+            "orientation": "horizontal",
+            "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
+            "textMode": "value_and_name"
+          },
+          "pluginVersion": "10.2.3",
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)", "legendFormat": "{{instance}}", "refId": "A"}],
+          "title": "Node CPU Usage %",
+          "type": "stat"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "percent",
+              "min": 0,
+              "max": 100
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 6},
+          "id": 3,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)", "legendFormat": "{{instance}}", "refId": "A"}],
+          "title": "Node CPU Usage % (over time)",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "percent",
+              "min": 0,
+              "max": 100
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 6},
+          "id": 4,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100", "legendFormat": "{{instance}}", "refId": "A"}],
+          "title": "Node Memory Usage %",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "bytes"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 15},
+          "id": 5,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes", "legendFormat": "used {{instance}}", "refId": "A"}, {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "node_memory_MemTotal_bytes", "legendFormat": "total {{instance}}", "refId": "B"}],
+          "title": "Node Memory (used vs total)",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "Bps"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 15},
+          "id": 6,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (instance) (rate(node_network_receive_bytes_total{device!~\"lo|veth.*|docker.*|br-.*|cni.*|flannel.*\"}[5m]))", "legendFormat": "RX {{instance}}", "refId": "A"},
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (instance) (rate(node_network_transmit_bytes_total{device!~\"lo|veth.*|docker.*|br-.*|cni.*|flannel.*\"}[5m]))", "legendFormat": "TX {{instance}}", "refId": "B"}
+          ],
+          "title": "Node Network Traffic",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "percent",
+              "min": 0,
+              "max": 100
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 24},
+          "id": 7,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "(1 - (node_filesystem_avail_bytes{fstype!~\"tmpfs|overlay|squashfs\"} / node_filesystem_size_bytes{fstype!~\"tmpfs|overlay|squashfs\"})) * 100", "legendFormat": "{{instance}} {{mountpoint}}", "refId": "A"}],
+          "title": "Node Disk Usage %",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 24},
+          "id": 8,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "node_load1", "legendFormat": "1m {{instance}}", "refId": "A"},
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "node_load5", "legendFormat": "5m {{instance}}", "refId": "B"},
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "node_load15", "legendFormat": "15m {{instance}}", "refId": "C"}
+          ],
+          "title": "Node Load Average",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 24, "x": 0, "y": 33},
+          "id": 9,
+          "options": {
+            "showHeader": true,
+            "cellHeight": "sm",
+            "footer": {"show": false, "reducer": ["sum"], "countRows": false, "fields": ""}
+          },
+          "pluginVersion": "10.2.3",
+          "targets": [
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "kubelet_running_pods", "format": "table", "instant": true, "refId": "A"},
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "kubelet_running_containers", "format": "table", "instant": true, "refId": "B"},
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)", "format": "table", "instant": true, "refId": "C"},
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100", "format": "table", "instant": true, "refId": "D"}
+          ],
+          "title": "Node Summary (live)",
+          "type": "table",
+          "transformations": [
+            {"id": "merge", "options": {}},
+            {"id": "groupBy", "options": {"fields": {"Value #A": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #B": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #C": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #D": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "instance": {"aggregations": [], "operation": "groupby"}}}},
+            {"id": "organize", "options": {"excludeByName": {"Time": true}, "renameByName": {"Value #A": "Pods", "Value #B": "Containers", "Value #C": "CPU %", "Value #D": "Memory %"}}}
+          ]
+        }
+      ],
+      "refresh": "30s",
+      "schemaVersion": 38,
+      "style": "dark",
+      "tags": ["k3s", "nodes"],
+      "templating": {"list": []},
+      "time": {"from": "now-6h", "to": "now"},
+      "timepicker": {},
+      "timezone": "",
+      "title": "Nodes",
+      "uid": "k3s-nodes",
+      "version": 2,
+      "weekStart": ""
+    }
--- a/monitoring/grafana-dashboard-pods.yaml
+++ b/monitoring/grafana-dashboard-pods.yaml
@@ -0,0 +1,312 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: grafana-dashboard-pods
+  namespace: monitoring
+  labels:
+    app: grafana
+    grafana_dashboard: "1"
+data:
+  pods.json: |
+    {
+      "annotations": {"list": []},
+      "editable": true,
+      "graphTooltip": 1,
+      "id": null,
+      "links": [],
+      "liveNow": false,
+      "panels": [
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "normal"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "core"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 24, "x": 0, "y": 0},
+          "id": 1,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\",container!=\"POD\",namespace=~\"$namespace\"}[5m])) by (pod)", "legendFormat": "{{pod}}", "refId": "A"}],
+          "title": "CPU Usage per Pod",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "normal"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "bytes"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 24, "x": 0, "y": 9},
+          "id": 2,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(container_memory_working_set_bytes{container!=\"\",container!=\"POD\",namespace=~\"$namespace\"}) by (pod)", "legendFormat": "{{pod}}", "refId": "A"}],
+          "title": "Memory Usage per Pod",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "Bps"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 18},
+          "id": 3,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_network_receive_bytes_total{namespace=~\"$namespace\"}[5m])) by (pod)", "legendFormat": "RX {{pod}}", "refId": "A"}
+          ],
+          "title": "Network RX per Pod",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "Bps"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 18},
+          "id": 4,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_network_transmit_bytes_total{namespace=~\"$namespace\"}[5m])) by (pod)", "legendFormat": "TX {{pod}}", "refId": "A"}
+          ],
+          "title": "Network TX per Pod",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "bytes"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 27},
+          "id": 5,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(container_fs_usage_bytes{namespace=~\"$namespace\"}) by (pod)", "legendFormat": "{{pod}}", "refId": "A"}],
+          "title": "Filesystem Usage per Pod",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "percent"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 27},
+          "id": 6,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_cpu_cfs_throttled_periods_total{namespace=~\"$namespace\"}[5m])) by (pod) / sum(rate(container_cpu_cfs_periods_total{namespace=~\"$namespace\"}[5m])) by (pod) * 100", "legendFormat": "{{pod}}", "refId": "A"}],
+          "title": "CPU Throttling % per Pod",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 1}, {"color": "red", "value": 5}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 10, "w": 24, "x": 0, "y": 36},
+          "id": 7,
+          "options": {
+            "showHeader": true,
+            "cellHeight": "sm",
+            "footer": {"show": false, "reducer": ["sum"], "countRows": false, "fields": ""}
+          },
+          "pluginVersion": "10.2.3",
+          "targets": [
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (namespace, pod) (container_memory_working_set_bytes{container!=\"\",container!=\"POD\",namespace=~\"$namespace\"})", "format": "table", "instant": true, "refId": "A"},
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=\"\",container!=\"POD\",namespace=~\"$namespace\"}[5m]))", "format": "table", "instant": true, "refId": "B"},
+            {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (namespace, pod) (rate(container_network_receive_bytes_total{namespace=~\"$namespace\"}[5m]))", "format": "table", "instant": true, "refId": "C"}
+          ],
+          "title": "Pod Resource Summary (live)",
+          "type": "table",
+          "transformations": [
+            {"id": "merge", "options": {}},
+            {"id": "groupBy", "options": {"fields": {"Value #A": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #B": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #C": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "namespace": {"aggregations": [], "operation": "groupby"}, "pod": {"aggregations": [], "operation": "groupby"}}}},
+            {"id": "organize", "options": {"excludeByName": {"Time": true}, "renameByName": {"Value #A (lastNotNull)": "Memory (bytes)", "Value #B (lastNotNull)": "CPU (cores)", "Value #C (lastNotNull)": "Network RX (Bps)"}}}
+          ]
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 46},
+          "id": 8,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (namespace, phase) (kube_pod_status_phase{phase=~\"Running|Pending|Failed\",namespace=~\"$namespace\"})", "legendFormat": "{{namespace}} {{phase}}", "refId": "A"}],
+          "title": "Pod Status by Namespace (KSM)",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 46},
+          "id": 9,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (namespace) (increase(kube_pod_container_status_restarts_total{namespace=~\"$namespace\"}[1h]))", "legendFormat": "{{namespace}}", "refId": "A"}],
+          "title": "Container Restarts (last 1h)",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {
+                "drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
+                "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
+              },
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "bytes"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 24, "x": 0, "y": 55},
+          "id": 10,
+          "options": {
+            "legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
+            "tooltip": {"mode": "multi", "sort": "desc"}
+          },
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace=~\"$namespace\"}", "legendFormat": "{{namespace}}/{{persistentvolumeclaim}}", "refId": "A"}],
+          "title": "PVC Storage Requests by Claim (KSM)",
+          "type": "timeseries"
+        }
+      ],
+      "refresh": "30s",
+      "schemaVersion": 38,
+      "style": "dark",
+      "tags": ["k3s", "pods"],
+      "templating": {
+        "list": [
+          {
+            "allValue": ".*",
+            "current": {"selected": true, "text": "All", "value": "$__all"},
+            "datasource": {"type": "prometheus", "uid": "Prometheus"},
+            "definition": "label_values(container_cpu_usage_seconds_total, namespace)",
+            "hide": 0,
+            "includeAll": true,
+            "multi": true,
+            "name": "namespace",
+            "options": [],
+            "query": "label_values(container_cpu_usage_seconds_total, namespace)",
+            "refresh": 2,
+            "regex": "",
+            "skipUrlSync": false,
+            "sort": 1,
+            "type": "query"
+          }
+        ]
+      },
+      "time": {"from": "now-6h", "to": "now"},
+      "timepicker": {},
+      "timezone": "",
+      "title": "Pods & Services",
+      "uid": "k3s-pods",
+      "version": 2,
+      "weekStart": ""
+    }
--- a/monitoring/grafana-dashboard-prometheus.yaml
+++ b/monitoring/grafana-dashboard-prometheus.yaml
@@ -0,0 +1,218 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: grafana-dashboard-prometheus
+  namespace: monitoring
+  labels:
+    app: grafana
+    grafana_dashboard: "1"
+data:
+  prometheus.json: |
+    {
+      "annotations": {"list": []},
+      "editable": true,
+      "graphTooltip": 1,
+      "id": null,
+      "links": [],
+      "liveNow": false,
+      "panels": [
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}, {"color": "green", "value": 1}]},
+              "mappings": [{"options": {"0": {"text": "DOWN", "color": "red"}, "1": {"text": "UP", "color": "green"}}, "type": "value"}]
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 5, "w": 6, "x": 0, "y": 0},
+          "id": 1,
+          "options": {"colorMode": "background", "graphMode": "none", "justifyMode": "center", "orientation": "horizontal", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "value"},
+          "pluginVersion": "10.2.3",
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "up{job=\"prometheus\"}", "refId": "A"}],
+          "title": "Prometheus Status",
+          "type": "stat"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "bytes"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 5, "w": 6, "x": 6, "y": 0},
+          "id": 2,
+          "options": {"colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "auto"},
+          "pluginVersion": "10.2.3",
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "process_resident_memory_bytes{job=\"prometheus\"}", "refId": "A"}],
+          "title": "Prometheus RSS Memory",
+          "type": "stat"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 5, "w": 6, "x": 12, "y": 0},
+          "id": 3,
+          "options": {"colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "auto"},
+          "pluginVersion": "10.2.3",
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "prometheus_tsdb_head_series", "refId": "A"}],
+          "title": "Active Series",
+          "type": "stat"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "thresholds"},
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 5, "w": 6, "x": 18, "y": 0},
+          "id": 4,
+          "options": {"colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "auto"},
+          "pluginVersion": "10.2.3",
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "count(up)", "refId": "A"}],
+          "title": "Scrape Targets",
+          "type": "stat"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}},
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "bytes"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 5},
+          "id": 10,
+          "options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "process_resident_memory_bytes{job=\"prometheus\"}", "legendFormat": "RSS", "refId": "A"}, {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "go_memstats_heap_inuse_bytes{job=\"prometheus\"}", "legendFormat": "Go heap in use", "refId": "B"}],
+          "title": "Prometheus Memory",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}},
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "core"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 5},
+          "id": 11,
+          "options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "rate(process_cpu_seconds_total{job=\"prometheus\"}[5m])", "legendFormat": "prometheus", "refId": "A"}],
+          "title": "Prometheus CPU",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}},
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 14},
+          "id": 12,
+          "options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "rate(prometheus_tsdb_head_samples_appended_total[5m])", "legendFormat": "samples/s", "refId": "A"}],
+          "title": "Ingestion Rate (samples/s)",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}},
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "s"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 14},
+          "id": 13,
+          "options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "scrape_duration_seconds", "legendFormat": "{{job}} {{instance}}", "refId": "A"}],
+          "title": "Scrape Duration by Job",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}},
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 0, "y": 23},
+          "id": 14,
+          "options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "prometheus_tsdb_head_series", "legendFormat": "head series", "refId": "A"}, {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "prometheus_tsdb_head_chunks", "legendFormat": "head chunks", "refId": "B"}],
+          "title": "TSDB Head Series & Chunks",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {"type": "prometheus", "uid": "Prometheus"},
+          "fieldConfig": {
+            "defaults": {
+              "color": {"mode": "palette-classic"},
+              "custom": {"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}},
+              "mappings": [],
+              "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
+              "unit": "s"
+            },
+            "overrides": []
+          },
+          "gridPos": {"h": 9, "w": 12, "x": 12, "y": 23},
+          "id": 15,
+          "options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
+          "targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(prometheus_http_request_duration_seconds_sum[5m])) / sum(rate(prometheus_http_request_duration_seconds_count[5m]))", "legendFormat": "avg HTTP req", "refId": "A"}],
+          "title": "Prometheus HTTP Request Duration",
+          "type": "timeseries"
+        }
+      ],
+      "refresh": "30s",
+      "schemaVersion": 38,
+      "style": "dark",
+      "tags": ["k3s", "prometheus"],
+      "templating": {"list": []},
+      "time": {"from": "now-6h", "to": "now"},
+      "timepicker": {},
+      "timezone": "",
+      "title": "Prometheus Self-Monitoring",
+      "uid": "k3s-prometheus",
+      "version": 1,
+      "weekStart": ""
+    }
--- a/monitoring/grafana-dashboard-provider.yaml
+++ b/monitoring/grafana-dashboard-provider.yaml
@@ -0,0 +1,20 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: grafana-dashboard-provider
+  namespace: monitoring
+  labels:
+    app: grafana
+data:
+  provider.yaml: |
+    apiVersion: 1
+    providers:
+      - name: 'k3s-dashboards'
+        orgId: 1
+        folder: 'K3s Cluster'
+        type: file
+        disableDeletion: false
+        updateIntervalSeconds: 30
+        allowUiUpdates: true
+        options:
+          path: /var/lib/grafana/dashboards
--- a/monitoring/grafana-deployment.yaml
+++ b/monitoring/grafana-deployment.yaml
@@ -33,6 +33,18 @@ spec:
              mountPath: /var/lib/grafana
            - name: grafana-datasources
              mountPath: /etc/grafana/provisioning/datasources
+            - name: grafana-dashboard-provider
+              mountPath: /etc/grafana/provisioning/dashboards
+            - name: dashboards-cluster-overview
+              mountPath: /var/lib/grafana/dashboards/cluster-overview
+            - name: dashboards-pods
+              mountPath: /var/lib/grafana/dashboards/pods
+            - name: dashboards-nodes
+              mountPath: /var/lib/grafana/dashboards/nodes
+            - name: dashboards-control-plane
+              mountPath: /var/lib/grafana/dashboards/control-plane
+            - name: dashboards-prometheus
+              mountPath: /var/lib/grafana/dashboards/prometheus
          resources:
            requests:
              memory: "256Mi"
@@ -47,3 +59,21 @@ spec:
        - name: grafana-datasources
          configMap:
            name: grafana-datasources
+        - name: grafana-dashboard-provider
+          configMap:
+            name: grafana-dashboard-provider
+        - name: dashboards-cluster-overview
+          configMap:
+            name: grafana-dashboard-cluster-overview
+        - name: dashboards-pods
+          configMap:
+            name: grafana-dashboard-pods
+        - name: dashboards-nodes
+          configMap:
+            name: grafana-dashboard-nodes
+        - name: dashboards-control-plane
+          configMap:
+            name: grafana-dashboard-control-plane
+        - name: dashboards-prometheus
+          configMap:
+            name: grafana-dashboard-prometheus
--- a/monitoring/ingress.yaml
+++ b/monitoring/ingress.yaml
@@ -30,6 +30,6 @@ spec:
          pathType: Prefix
          backend:
            service:
-              name: prometheus-k8s
+              name: prometheus
              port:
-                number: 80
+                number: 9090
--- a/monitoring/kube-state-metrics.yaml
+++ b/monitoring/kube-state-metrics.yaml
@@ -0,0 +1,118 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: kube-state-metrics
+  namespace: monitoring
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: kube-state-metrics
+rules:
+  - apiGroups: [""]
+    resources:
+      - configmaps
+      - secrets
+      - nodes
+      - pods
+      - services
+      - resourcequotas
+      - replicationcontrollers
+      - limitranges
+      - persistentvolumeclaims
+      - persistentvolumes
+      - namespaces
+      - endpoints
+    verbs: ["list", "watch"]
+  - apiGroups: ["apps"]
+    resources: ["statefulsets", "daemonsets", "deployments", "replicasets"]
+    verbs: ["list", "watch"]
+  - apiGroups: ["batch"]
+    resources: ["cronjobs", "jobs"]
+    verbs: ["list", "watch"]
+  - apiGroups: ["autoscaling"]
+    resources: ["horizontalpodautoscalers"]
+    verbs: ["list", "watch"]
+  - apiGroups: ["networking.k8s.io"]
+    resources: ["ingresses"]
+    verbs: ["list", "watch"]
+  - apiGroups: ["storage.k8s.io"]
+    resources: ["storageclasses", "volumeattachments"]
+    verbs: ["list", "watch"]
+  - apiGroups: ["certificates.k8s.io"]
+    resources: ["certificatesigningrequests"]
+    verbs: ["list", "watch"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: kube-state-metrics
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: kube-state-metrics
+subjects:
+  - kind: ServiceAccount
+    name: kube-state-metrics
+    namespace: monitoring
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: kube-state-metrics
+  namespace: monitoring
+  labels:
+    app: kube-state-metrics
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: kube-state-metrics
+  template:
+    metadata:
+      labels:
+        app: kube-state-metrics
+    spec:
+      serviceAccountName: kube-state-metrics
+      containers:
+        - name: kube-state-metrics
+          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.1
+          ports:
+            - containerPort: 8080
+              name: http-metrics
+            - containerPort: 8081
+              name: telemetry
+          readinessProbe:
+            httpGet:
+              path: /
+              port: 8081
+            initialDelaySeconds: 5
+            timeoutSeconds: 5
+          resources:
+            requests:
+              memory: "128Mi"
+              cpu: "100m"
+            limits:
+              memory: "512Mi"
+              cpu: "500m"
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: kube-state-metrics
+  namespace: monitoring
+  labels:
+    app: kube-state-metrics
+  annotations:
+    prometheus.io/scrape: "true"
+    prometheus.io/port: "8080"
+spec:
+  selector:
+    app: kube-state-metrics
+  ports:
+    - name: http-metrics
+      port: 8080
+      targetPort: http-metrics
+    - name: telemetry
+      port: 8081
+      targetPort: telemetry
--- a/monitoring/node-exporter.yaml
+++ b/monitoring/node-exporter.yaml
@@ -0,0 +1,112 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: node-exporter
+  namespace: monitoring
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: node-exporter
+  namespace: monitoring
+  labels:
+    app: node-exporter
+  annotations:
+    prometheus.io/scrape: "true"
+    prometheus.io/port: "9100"
+spec:
+  selector:
+    app: node-exporter
+  ports:
+    - name: metrics
+      port: 9100
+      targetPort: 9100
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: node-exporter
+rules:
+  - apiGroups: [""]
+    resources: ["nodes"]
+    verbs: ["get", "list", "watch"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: node-exporter
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: node-exporter
+subjects:
+  - kind: ServiceAccount
+    name: node-exporter
+    namespace: monitoring
+---
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: node-exporter
+  namespace: monitoring
+  labels:
+    app: node-exporter
+spec:
+  selector:
+    matchLabels:
+      app: node-exporter
+  template:
+    metadata:
+      labels:
+        app: node-exporter
+    spec:
+      serviceAccountName: node-exporter
+      hostPID: true
+      hostNetwork: true
+      tolerations:
+        - key: node-role.kubernetes.io/control-plane
+          operator: Exists
+          effect: NoSchedule
+        - key: node-role.kubernetes.io/master
+          operator: Exists
+          effect: NoSchedule
+      containers:
+        - name: node-exporter
+          image: prom/node-exporter:v1.7.0
+          args:
+            - --path.procfs=/host/proc
+            - --path.sysfs=/host/sys
+            - --path.rootfs=/host/root
+            - --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+)($|/)
+            - --collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
+          ports:
+            - containerPort: 9100
+              hostPort: 9100
+              name: metrics
+          volumeMounts:
+            - name: proc
+              mountPath: /host/proc
+              readOnly: true
+            - name: sys
+              mountPath: /host/sys
+              readOnly: true
+            - name: root
+              mountPath: /host/root
+              readOnly: true
+          resources:
+            requests:
+              memory: "64Mi"
+              cpu: "50m"
+            limits:
+              memory: "128Mi"
+              cpu: "200m"
+      volumes:
+        - name: proc
+          hostPath:
+            path: /proc
+        - name: sys
+          hostPath:
+            path: /sys
+        - name: root
+          hostPath:
+            path: /
--- a/nas/ingress.yaml
+++ b/nas/ingress.yaml
@@ -3,46 +3,75 @@ kind: Namespace
 metadata:
  name: nas-proxy
 ---
+# Standalone cert-manager Certificate for nas.rogi.casa. cert-manager's
+# ingress-shim only acts on Ingress resources, not on the Traefik IngressRoute
+# below, so the nas-tls Secret has to be requested explicitly here.
+apiVersion: cert-manager.io/v1
+kind: Certificate
+metadata:
+  name: nas-tls
+  namespace: nas-proxy
+spec:
+  secretName: nas-tls
+  dnsNames:
+    - nas.rogi.casa
+  issuerRef:
+    group: cert-manager.io
+    kind: ClusterIssuer
+    name: letsencrypt-prod
+  usages:
+    - digital signature
+    - key encipherment
+---
+# Selector-less Service + manual Endpoints pointing at the NAS.
+# (Endpoints is no longer excluded in argocd-cm, so ArgoCD manages it.)
 apiVersion: v1
 kind: Service
 metadata:
  name: synology-nas
  namespace: nas-proxy
 spec:
-  type: ExternalName
-  externalName: "10.88.30.10"
+  type: ClusterIP
+  clusterIP: None
  ports:
-  - port: 5001
-    targetPort: 5001
-    protocol: TCP
+    - port: 5001
+      targetPort: 5001
+      protocol: TCP
 ---
-apiVersion: networking.k8s.io/v1
-kind: Ingress
+apiVersion: v1
+kind: Endpoints
+metadata:
+  name: synology-nas
+  namespace: nas-proxy
+subsets:
+  - addresses:
+      - ip: 10.88.30.10
+    ports:
+      - port: 5001
+        protocol: TCP
+---
+# Traefik IngressRoute (CRD provider) where scheme: https is a first-class
+# field. The standard kubernetes Ingress `service.serversscheme` annotation is
+# ignored for selector-less/Endpoints-backed services in Traefik v3, which
+# caused Traefik to dial the NAS with plain HTTP -> 400 from DSM's nginx.
+apiVersion: traefik.io/v1alpha1
+kind: IngressRoute
 metadata:
  name: nas
  namespace: nas-proxy
-  annotations:
-    cert-manager.io/cluster-issuer: letsencrypt-prod
-    # Tell Traefik the backend is HTTPS (DSM uses HTTPS on 5001)
-    traefik.ingress.kubernetes.io/router.tls: "true"
-    # Skip backend TLS verification since DSM uses a self-signed cert
-    traefik.ingress.kubernetes.io/service.serversscheme: https
-    traefik.ingress.kubernetes.io/service.serverstransport: skip-verify@file
-    traefik.ingress.kubernetes.io/max-request-body-bytes: "5368709120"
 spec:
-  ingressClassName: traefik
+  entryPoints:
+    - websecure
+  routes:
+    - match: Host(`nas.rogi.casa`)
+      kind: Rule
+      services:
+        - kind: Service
+          name: synology-nas
+          namespace: nas-proxy
+          port: 5001
+          scheme: https
+          serversTransport: skip-verify
+          passHostHeader: true
  tls:
-  - hosts:
-    - nas.rogi.casa
    secretName: nas-tls
-  rules:
-  - host: nas.rogi.casa
-    http:
-      paths:
-      - path: /
-        pathType: Prefix
-        backend:
-          service:
-            name: synology-nas
-            port:
-              number: 5001
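Review note: since `nas-tls` is now a standalone Certificate, its remaining lifetime can be read from the resource's `.status.notAfter` field. A minimal sketch, assuming GNU `date` (for `-d` parsing of RFC 3339 timestamps); `days_until` is a hypothetical helper name, and the 21-day threshold in the usage comment mirrors the `cert-expiry` watchdog added elsewhere in this change.

```shell
#!/usr/bin/env bash
# days_until NOT_AFTER [NOW_EPOCH]
# Print the whole days between NOW_EPOCH (default: current time) and NOT_AFTER.
# Assumption: GNU date, which accepts RFC 3339 timestamps via -d.
days_until() {
  not_after_epoch="$(date -u -d "$1" +%s)" || return 1
  now_epoch="${2:-$(date -u +%s)}"
  echo $(( (not_after_epoch - now_epoch) / 86400 ))
}

# Usage against the cluster (not executed here):
#   not_after="$(kubectl get certificate nas-tls -n nas-proxy -o jsonpath='{.status.notAfter}')"
#   [ "$(days_until "$not_after")" -lt 21 ] && echo "nas-tls expires soon"
```

Passing the reference time as an optional second argument keeps the helper easy to test without a cluster.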
--- a/platform-engineer/README.md
+++ b/platform-engineer/README.md
@@ -0,0 +1,367 @@
+# Platform Engineer Agent — Deployment Plan
+
+An autonomous **Hermes Agent** that runs inside the k3s cluster, watches its
+health on a schedule, tries to fix simple problems, and notifies me (via
+Discord) when something needs my attention or a fix failed.
+
+Docs: https://hermes-agent.nousresearch.com/docs/user-guide/docker
+
+---
+
+## 1. Goal & operating model
+
+- **One Hermes container** in a new namespace `platform-engineer`, scheduled on
+  the powerful amd64 node (`roger-nucbox-evo-x2`, 24 GiB RAM).
+- Hermes runs in **gateway mode** under s6 supervision (`command: gateway run`),
+  so the built-in **cron scheduler** is active and survives restarts.
+- The agent talks to the cluster with `kubectl` from *inside* the container
+  (terminal backend = `local`). We give the pod a **ServiceAccount + ClusterRole**
+  scoped to read-mostly + restart/scale/delete-pod permissions.
+- LLM calls are routed through the in-cluster **LiteLLM** proxy
+  (`litellm.rogi.casa`) — no external API keys needed in the cluster.
+- Notifications go to **Discord** (reuse the pattern from `myorg-assistant`).
+- A set of **cron jobs** (Hermes-native, not Kubernetes CronJobs) make the agent
+  run periodic checks. Watchdog checks use `[SILENT]` so it only pings me when
+  something is wrong.
+
+Why Hermes-native cron (not k8s CronJobs):
+- Hermes cron ticks inside the gateway, runs in an isolated agent session,
+  supports `[SILENT]` suppression, `deliver="discord"`, `workdir`, and
+  `context_from` chaining — far less plumbing than spawning a fresh pod per run.
+- Cron jobs live in `~/.hermes/cron/jobs.json` on the PVC, so they survive pod
+  restarts and can be edited live via `hermes cron edit` without redeploying.
+
+---
+
+## 2. Files to create (this directory)
+
+```
+platform-engineer/
+├── namespace.yaml              # namespace platform-engineer
+├── rbac.yaml                    # ServiceAccount + ClusterRole (+binding)
+├── configmap.yaml               # hermes config.yaml + SOUL.md + cron seed script
+├── secret.yaml                  # DISCORD bot token, LITELLM_API_KEY, kubeconfig-less SA token
+├── pvc.yaml                     # persistent /opt/data (HERMES_HOME)
+├── dockerfile                   # derived image: hermes-agent + kubectl + helm
+├── deployment.yaml              # Deployment, schedules on amd64, mounts kube SA token
+├── ingress.yaml                 # hermes.rogi.casa → dashboard (optional)
+└── README.md                    # this file
+```
+
+Then add a line to `argocd/gen-apps.sh` `APPS=(...)`:
+```
+"platform-engineer|platform-engineer|platform-engineer|true|true"
+```
+and re-run `./argocd/gen-apps.sh` to generate `argocd/apps/platform-engineer.yaml`
+so ArgoCD reconciles it like every other app in the repo.
+
+---
+
+## 3. RBAC — least privilege
+
+ServiceAccount `platform-engineer` in ns `platform-engineer`, bound to a
+**ClusterRole** scoped to *platform engineer* actions:
+
+**Read (get/list/watch):** nodes, pods, services, deployments, statefulsets,
+daemonsets, replicasets, jobs, cronjobs, events, configmaps, secrets, PVCs,
+ingresses, namespaces.
+
+**Act (patch/update on an allowlist):**
+- `pods` → `delete` (force-restart a stuck pod), `patch` (annotations); `pods/eviction` → `create` (evict)
+- `deployments`, `statefulsets`, `daemonsets`, `replicasets` → `patch` (restart
+  via `kubectl rollout restart` / scale), `update`
+- `jobs`, `cronjobs` → `delete`, `patch`
+- `pods/exec` (subresource) → `create` (only if we want the agent to `kubectl
+  exec` into pods for log-style debugging — optional; keep off initially)
+- `events` → `get/list/watch` only
+
+**No cluster-scoped writes** (no creating namespaces, no node taints, no RBAC
+edits, no CRDs). The agent can *propose* those and tell me; it cannot do them
+itself. Mutating calls show up in the Kubernetes audit log (when audit logging is enabled), and the
+effective grants can be checked with `kubectl auth can-i --list --as=system:serviceaccount:platform-engineer:platform-engineer`.
+
+The pod uses the k3s in-cluster ServiceAccount token (`/var/run/secrets/...
+/serviceaccount/token`) + the `KUBERNETES_SERVICE_HOST/PORT` env vars k3s already
+injects — **no kubeconfig file, no long-lived token on disk**.
+
+---
+
+## 4. Image — thin derived Dockerfile
+
+```dockerfile
+FROM nousresearch/hermes-agent:latest
+USER root
+RUN apt-get update \
+ && apt-get install -y --no-install-recommends curl gnupg \
+ && curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.35/deb/Release.key \
+    | gpg --dearmor -o /usr/share/keyrings/kubernetes-apt-keyring.gpg \
+ && echo 'deb [signed-by=/usr/share/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.35/deb/ /' \
+    > /etc/apt/sources.list.d/kubernetes.list \
+ && apt-get update \
+ && apt-get install -y --no-install-recommends kubectl \
+ && curl -fsSL https://get.helm.sh/helm-v3.16.0-linux-amd64.tar.gz \
+    | tar -xz -C /usr/local/bin --strip-components=1 linux-amd64/helm \
+ && rm -rf /var/lib/apt/lists/*
+USER hermes
+```
+
+> Note: the cluster is mixed arch (arm64/amd64/arm). The agent pod is pinned to
+> the amd64 node, so `linux-amd64` helm + `kubectl` packages are fine. If you
+> later want it portable, switch to a multi-arch build with
+> `TARGETARCH` and install matching helm arch.
+
+Build & push to your Gitea registry (`git.rogi.casa/roger/...`) — same
+`imagePullSecrets: gitea-registry` pattern as `gym-tracker`. Tag with the
+hermes version + a short git sha.
+
+---
+
+## 5. Hermes configuration (mounted via ConfigMap → /opt/data/config.yaml)
+
+```yaml
+# config.yaml (seeded into the PVC on first boot)
+model:
+  provider: openai-api
+  default: claude-4.5-haiku
+  base_url: "https://litellm.rogi.casa/v1"
+  api_mode: chat_completions
+
+# Use a cheap, fast model for auxiliary tasks (titling, compression)
+auxiliary:
+  compression:
+    provider: openai-api
+    model: gemini-3-flash
+  title_generation:
+    provider: openai-api
+    model: gemini-3-flash
+
+terminal:
+  backend: local
+  cwd: /workspace            # a working dir for any kubectl output / scratch
+  timeout: 180
+  home_mode: profile        # isolate tool credentials under HERMES_HOME/home
+
+# Unattended gateway → circuit-breaker on tool-call loops
+tool_loop_guardrails:
+  hard_stop_enabled: true
+  hard_stop_after:
+    exact_failure: 5
+    idempotent_no_progress: 5
+
+sessions:
+  auto_prune: true
+  retention_days: 90
+
+cron:
+  wrap_response: false      # cleaner Discord messages
+
+memory:
+  memory_enabled: true
+  user_profile_enabled: true
+```
+
+`.env` (from Secret, mounted to `/opt/data/.env`):
+```
+OPENAI_API_KEY=<LITELLM_API_KEY value, i.e. sk-...>
+OPENAI_BASE_URL=https://litellm.rogi.casa/v1
+DISCORD_BOT_TOKEN=<new dedicated bot token>
+DISCORD_HOME_CHANNEL=<your user/channel id for alerts>
+# Dashboard auth (homelab, trusted LAN behind ingress)
+HERMES_DASHBOARD_BASIC_AUTH_USERNAME=roger
+HERMES_DASHBOARD_BASIC_AUTH_PASSWORD=<strong password>
+```
+
+> Why `OPENAI_API_KEY` + `OPENAI_BASE_URL`: the `openai-api` provider honours
+> `OPENAI_BASE_URL`, so this is the simplest way to point Hermes at the
+> in-cluster LiteLLM. `claude-4.5-haiku` / `gemini-3-flash` are the model names
+> already exposed by your `litellm/litellm.yaml` ConfigMap.
+
+`SOUL.md` (personality + guardrails) — see `configmap.yaml`. Key points:
+- Identity: "Platform Engineer for the rogi.casa k3s cluster."
+- Knows the cluster layout (3 nodes, ArgoCD GitOps, Traefik+cert-manager,
+  LiteLLM, services list).
+- Operating rules: read-first; only act on the allowlisted verbs; never edit
+  RBAC / taints / namespaces / CRDs; when in doubt, notify instead of acting;
+  always cite the resource and the command used.
+- How to reach me: `deliver="discord"`.
+
+---
+
+## 6. Deployment
+
+- `replicas: 1` (Hermes data dir is single-writer — never scale >1).
+- `nodeSelector: kubernetes.io/arch: amd64` + preferred `hardware: high-memory`
+  affinity → lands on the NUC.
+- `resources`: requests 512Mi/250m, limits 2Gi/1 core (Hermes recommends
+  2–4 GiB; 1 GiB is fine without browser tools, which we keep off).
+- Volume: PVC mounted at `/opt/data` (HERMES_HOME), RWX not needed (single pod).
+- Ports: 8642 (gateway API, internal only) and 9119 (dashboard) → exposed via
+  Ingress `hermes.rogi.casa` with TLS + basic-auth (already enforced by the
+  `HERMES_DASHBOARD_BASIC_AUTH_*` env vars).
+- `imagePullSecrets: gitea-registry`.
+- env from Secret; `HERMES_DASHBOARD=1`.
+- Init: on first boot the s6 `01-hermes-setup` hook seeds config/SOUL/.env from
+  the ConfigMap if the volume is empty. We mount the ConfigMap as a readonly
+  projection at `/opt/seed/` and run a tiny initContainer to copy it into
+  `/opt/data` only when `/opt/data/config.yaml` doesn't exist (so ArgoCD
+  self-heal never fights the agent's live-edited config).
+
+---
+
+## 7. Cron jobs to seed (Hermes-native)
+
+These are written by an init script (one-shot Job `hermes-cron-seed`) that runs
+`hermes cron create ...` against the gateway on first install, and is idempotent
+(it checks existing job names). All deliver to Discord. Examples:
+
+| Name | Schedule | Prompt (abbreviated) |
+|------|----------|------------------------|
+| `cluster-health-check` | `every 15m` | Run `kubectl get nodes`, `kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded`, and `kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp` (consider only the last 20 minutes; `kubectl get events` has no `--since` flag). If everything is healthy, reply with only `[SILENT]`. Otherwise summarize the failures and the likely root cause briefly. |
+| `pod-restart-loop` | `every 10m` | Find pods in `CrashLoopBackOff`/`ImagePullBackOff` across all namespaces. For `CrashLoopBackOff`, fetch logs and if a clear transient cause (OOM, config parse, missing secret) is visible, attempt `kubectl rollout restart <deploy>`; otherwise notify me with the log excerpt. Reply `[SILENT]` if none found. |
+| `pvc-pressure` | `every 30m` | `kubectl get pv,pvc -A` + node `DiskPressure` conditions via `kubectl describe nodes` (`kubectl top nodes` reports only CPU and memory). Alert if any PVC is `Pending`/`Lost` or any node reports `DiskPressure`. `[SILENT]` otherwise. |
+| `argocd-sync-health` | `every 1h` | `kubectl get applications -n argocd -o wide` (or `argocd app sync --dry-run` if CLI present). Report any `OutOfSync`/`Degraded` app. `[SILENT]` if all `Synced`+`Healthy`. |
+| `cert-expiry` | `0 9 * * *` | List cert-manager `Certificate` resources with expiry < 21 days. Notify only if any. `[SILENT]` otherwise. |
+| `node-resource-drift` | `every 30m` | `kubectl top nodes`. Alert if any node CPU>90% or mem>90% sustained, or any node `NotReady`. `[SILENT]` otherwise. |
+| `daily-cluster-report` | `0 8 * * *` | Summarize: node count/status, top 5 pods by CPU/mem, # pods not Running, # ArgoCD apps OutOfSync, cert warnings. Always deliver (no `[SILENT]`). |
+
+Design rules baked into SOUL.md:
+- **Read-only checks** run frequently (10–30m) and stay silent unless wrong.
+- **Mutating actions** are restricted to safe idempotent ones (rollout restart,
+  delete stuck pod so controller recreates). Anything riskier → notify me with
+  a proposed command and wait for me to run it (I can reply in Discord to the
+  continuable thread).
+- Cron sessions are isolated and **cannot create new cron jobs** (Hermes
+  disables that inside cron runs) → no runaway loops.
+
+---
+
+## 8. Safety & guardrails
+
+1. **RBAC is the real boundary.** Even if the agent goes rogue, the SA can't
+   touch other namespaces' secrets beyond read, can't change RBAC, can't taint
+   nodes, can't create namespaces.
+2. **`tool_loop_guardrails.hard_stop_enabled: true`** — circuit-breaks a stuck
+   gateway (recommended in the Docker doc for unattended deployments).
+3. **`skills.write_approval: false` but `memory.write_approval: true`** (so the
+   agent can build skills/memories but I review memory writes lazily — flip
+   this if it gets noisy).
+4. **No `pods/exec` subresource** initially (keep the agent from shelling into
+   workloads). Enable later only if you want log-grep-style debugging.
+5. **Dashboard behind ingress TLS + basic auth** (the June-2026 hardening makes
+   auth mandatory on non-loopback binds; we satisfy it with the bundled
+   basic-auth provider).
+6. **Single replica / single-writer PVC** — the Docker doc is explicit that two
+   gateways on the same `/opt/data` corrupt session/memory stores. Use a
+   `podAntiAffinity` so an accidental scale-up doesn't co-run.
+7. **ArgoCD interaction:** keep `syncPolicy.automated.prune+selfHeal` but
+   exclude the live-edited hermes state. Practically: Argo owns the *manifests*
+   (deployment, configmap, secret, pvc), while `/opt/data` (config.yaml,
+   cron/jobs.json, SOUL.md edits made via the dashboard) is runtime state on the
+   PVC and is *not* reconciled by Argo. The ConfigMap only *seeds* it on first
+   boot. Document this clearly in the README so future-you doesn't expect Argo
+   to reset the agent's personality.
+
+---
+
+## 9. Rollout plan
+
+1. Build & push the derived image to `git.rogi.casa/roger/hermes-agent` (tag
+   `v1.35-<sha>`).
+2. Create the namespace + RBAC + Secret + ConfigMap + PVC:
+   `kubectl apply -f platform-engineer/`.
+3. Create the `platform-engineer` Discord bot, invite it, put its token + your
+   channel id in `secret.yaml` (base64).
+4. Apply the Deployment; wait for the pod to go Running.
+5. `kubectl exec` in and run the one-shot cron seed:
+   `hermes cron create ...` (or apply the `cron-seed` Job).
+6. Trigger the first `cluster-health-check` manually: `hermes cron run cluster-health-check`.
+7. Add the app to `argocd/gen-apps.sh`, regenerate, commit, push.
+
+---
+
+## 10. Decisions (locked in)
+
+1. **Notifications:** dedicated `platform-engineer` Discord bot → its own token
+   in `secret.yaml` (`DISCORD_BOT_TOKEN`, `DISCORD_HOME_CHANNEL`).
+2. **Dashboard:** public at `hermes.rogi.casa` (Traefik TLS + cert-manager + the
+   bundled Hermes basic-auth provider). Reach the dashboard on port 9119; the
+   gateway API on 8642 is ClusterIP-only.
+3. **Image:** derived image pushed to `git.rogi.casa/roger/hermes-agent`, pulled
+   via the existing `gitea-registry` imagePullSecret (must also exist in the
+   `platform-engineer` ns — see deploy steps).
+4. **Model:** `qwen-3.6:27b` via the in-cluster Ollama box (`10.88.20.12:11434`),
+   exposed through LiteLLM as `qwen-3.6:27b`. Added to `litellm/litellm.yaml`.
+   Hermes reaches LiteLLM at `https://litellm.rogi.casa/v1` (never Ollama directly).
+5. **pods/exec:** granted (`pods/exec` → `create` in the ClusterRole) so the
+   agent can `kubectl exec`/`kubectl logs` for debugging.
+
+---
+
+## 11. Deployment checklist (do in this order)
+
+1. **Add the Ollama model to LiteLLM** (already done in `litellm/litellm.yaml`):
+   the `qwen-3.6:27b` entry points at `http://10.88.20.12:11434`. Make sure
+   `qwen3.6:27b` is actually pulled on that Ollama host
+   (`ollama pull qwen3.6:27b`). Apply: `kubectl apply -f litellm/` and restart
+   the LiteLLM pod so the new config takes effect.
+2. **Create the `gitea-registry` secret in the new namespace** (ArgoCD won't
+   create it — it's not in the repo):
+   ```
+   kubectl create namespace platform-engineer
+   kubectl create secret docker-registry gitea-registry \
+     --docker-server=git.rogi.casa \
+     --docker-username=<your-gitea-user> \
+     --docker-password=<gitea-access-token> \
+     --docker-email=<your-email> \
+     -n platform-engineer
+   ```
+3. **Build & push the image:** `./platform-engineer/build-and-push.sh`
+   (after `docker login git.rogi.casa`).
+4. **Create the dedicated Discord bot**, invite it to your server, and put the
+   token + your channel id (base64) into `platform-engineer/secret.yaml`. Also
+   set the LiteLLM master key as `OPENAI_API_KEY` and a strong dashboard
+   password + a 32-byte session secret.
+5. **Commit & push** the whole change. ArgoCD will create the namespace
+   resources, deploy the pod, and bring up the ingress at `hermes.rogi.casa`.
+6. **Seed the cron jobs:**
+   `kubectl apply -f platform-engineer/cron-seed.yaml` (one-shot Job) — it waits
+   for the hermes pod, then runs `hermes cron create ...` for each watchdog.
+   Re-run it any time you want to re-seed after a wipe.
+7. **Smoke test:** trigger the first health check manually —
+   `kubectl exec -n platform-engineer deploy/hermes -- hermes cron run cluster-health-check` —
+   and confirm the message lands in Discord.
+8. **ArgoCD:** the `Application` (`argocd/apps/platform-engineer.yaml`) is
+   already generated. After commit, Argo will reconcile it like every other app.
+
+## 12. What ArgoCD owns vs. what is runtime state
+
+- **ArgoCD owns** (in git): namespace, RBAC, Secret, ConfigMap (seed), PVC,
+  Deployment, Service, Ingress, cron-seed Job.
+- **Runtime state (on the PVC, NOT reconciled):** `config.yaml`, `SOUL.md`,
+  `.env`, `cron/jobs.json`, `sessions/`, `memories/`, `skills/`. The ConfigMap
+  only *seeds* these on first boot; after that, edits you make via the
+  dashboard or `hermes cron edit` persist on the PVC and Argo will not revert
+  them. If you ever want a hard reset, delete the PVC and re-apply.
+
+---
+
+## Files in this directory
+
+| File | Purpose |
+|------|---------|
+| `namespace.yaml` | namespace `platform-engineer` |
+| `rbac.yaml` | ServiceAccount + ClusterRole (+binding), least-privilege |
+| `configmap.yaml` | seed `config.yaml` + `SOUL.md` |
+| `secret.yaml` | Discord token, LiteLLM key, dashboard auth (PLACEHOLDERS — fill in) |
+| `pvc.yaml` | 5 Gi PVC for `/opt/data` |
+| `dockerfile` | derived image: hermes-agent + kubectl + helm (linux/amd64) |
+| `build-and-push.sh` | builds & pushes the image to the Gitea registry |
+| `deployment.yaml` | Deployment (1 replica, Recreate, pinned to amd64 NUC) + Service |
+| `ingress.yaml` | `hermes.rogi.casa` → dashboard (TLS + basic auth) |
+| `cron-seed.yaml` | one-shot Job that creates the Hermes cron schedule |
+
+Also changed outside this directory:
+- `litellm/litellm.yaml` — added `qwen-3.6:27b` model entry.
+- `argocd/gen-apps.sh` + `argocd/apps/platform-engineer.yaml` — ArgoCD
+  Application for this folder.
+
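Review note: the allow/deny split described in README section 3 can be spot-checked with `kubectl auth can-i` and impersonation. The sketch below is a hypothetical helper (`expect`, `check_rbac` are illustrative names); the verb/resource pairs are taken from the README's lists, and the real `rbac.yaml` may differ. It relies on the exit status of `kubectl auth can-i --quiet`, so an API error is reported the same way as a denial.

```shell
#!/usr/bin/env bash
SA="system:serviceaccount:platform-engineer:platform-engineer"

# expect <yes|no> <verb> <resource>
# Compare the impersonated permission check with the expected answer.
expect() {
  want="$1"; verb="$2"; resource="$3"
  # --quiet suppresses output; exit status 0 means allowed, non-zero means denied (or error).
  if kubectl auth can-i "$verb" "$resource" --as="$SA" -A --quiet 2>/dev/null; then
    got=yes
  else
    got=no
  fi
  if [ "$got" = "$want" ]; then
    echo "ok    $verb $resource -> $got"
  else
    echo "FAIL  $verb $resource -> $got (wanted $want)"
    return 1
  fi
}

# check_rbac: run every expectation and return non-zero if any of them fails.
check_rbac() {
  rc=0
  expect yes list pods || rc=1
  expect yes delete pods || rc=1
  expect yes patch deployments.apps || rc=1
  expect no create namespaces || rc=1
  expect no update clusterroles.rbac.authorization.k8s.io || rc=1
  expect no patch nodes || rc=1
  return "$rc"
}

# Usage (needs a kubeconfig allowed to impersonate the ServiceAccount):
#   check_rbac && echo "RBAC matches the README"
```

Running this after every RBAC change gives a quick regression check that the agent has not silently gained cluster-scoped write access.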
--- a/platform-engineer/build-and-push.sh
+++ b/platform-engineer/build-and-push.sh
@@ -0,0 +1,43 @@
+#!/usr/bin/env bash
+# Build & push the derived Hermes image (kubectl + helm).
+#
+# Two modes:
+#   ./build-and-push.sh push    # build + push to the Gitea registry
+#   ./build-and-push.sh local   # build + import directly into the NUC's k3s containerd
+#                               # (no registry needed; pod is pinned to this node)
+#
+# Default (no arg): push.
+set -euo pipefail
+
+# Docker registry pushes can't go through the Cloudflare proxy (100 MB cap),
+# so push to the DNS-only registry hostname instead of git.rogi.casa.
+# Override with: REGISTRY=git.rogi.casa ./build-and-push.sh push   (if grey-clouded)
+REGISTRY="${REGISTRY:-registry.rogi.casa}"
+REPO="roger/hermes-agent"
+TAG="${TAG:-v1.35-1}"
+IMAGE="${REGISTRY}/${REPO}:${TAG}"
+MODE="${1:-push}"
+
+cd "$(dirname "$0")"
+
+echo "==> Building ${IMAGE}"
+docker build --platform linux/amd64 -t "${IMAGE}" -f dockerfile .
+
+case "$MODE" in
+  push)
+    echo "==> Pushing ${IMAGE}"
+    docker push "${IMAGE}"
+    echo "==> Done. If the pod can't pull, create the gitea-registry secret in the namespace."
+    ;;
+  local)
+    # Requires k3s + being run on the node the pod schedules to (roger-nucbox-evo-x2).
+    echo "==> Importing into k3s containerd (requires sudo)"
+    docker save "${IMAGE}" | sudo k3s ctr images import -
+    echo "==> Done. Verify: sudo k3s ctr images ls | grep hermes-agent"
+    echo "    deployment.yaml is set to imagePullPolicy: IfNotPresent"
+    ;;
+  *)
+    echo "Usage: $0 {push|local}" >&2
+    exit 1
+    ;;
+esac
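Review note: the README suggests tagging the image with a version plus a short git sha, while this script defaults to the fixed tag `v1.35-1`. A sketch of deriving the tag instead, assuming the script runs inside a git checkout; `image_ref` and `short_sha` are hypothetical helper names, and the `dev` fallback is an arbitrary choice.

```shell
#!/usr/bin/env bash
# image_ref REGISTRY REPO VERSION SHA: print the full image reference.
image_ref() {
  printf '%s/%s:%s-%s\n' "$1" "$2" "$3" "$4"
}

# short_sha: short commit id of the current checkout, or "dev" when git or a
# commit is unavailable (fallback value chosen for illustration).
short_sha() {
  git rev-parse --verify --short HEAD 2>/dev/null || echo dev
}

# Print the reference this build would use.
image_ref "${REGISTRY:-registry.rogi.casa}" roger/hermes-agent v1.35 "$(short_sha)"
```

A sha-based tag changes on every commit, so `imagePullPolicy: IfNotPresent` would not keep serving a stale image under a reused tag.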
--- a/platform-engineer/configmap.yaml
+++ b/platform-engineer/configmap.yaml
@@ -0,0 +1,115 @@
+# Hermes configuration, SOUL.md, and the cron-seed script.
+# Seeded into the PVC (/opt/data) by the initContainer on first boot only.
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: hermes-seed
+  namespace: platform-engineer
+data:
+  config.yaml: |
+    model:
+      provider: openai-api
+      default: qwen-3.6:27b
+      base_url: "https://litellm.rogi.casa/v1"
+      api_mode: chat_completions
+
+    # Cheap/fast model for auxiliary tasks (titling, compression).
+    auxiliary:
+      compression:
+        provider: openai-api
+        model: qwen-3.6:27b
+        base_url: "https://litellm.rogi.casa/v1"
+      title_generation:
+        provider: openai-api
+        model: qwen-3.6:27b
+        base_url: "https://litellm.rogi.casa/v1"
+
+    terminal:
+      backend: local
+      cwd: /workspace
+      timeout: 180
+      home_mode: profile
+
+    # Unattended gateway → circuit-break on stuck tool-call loops.
+    tool_loop_guardrails:
+      hard_stop_enabled: true
+      hard_stop_after:
+        exact_failure: 5
+        idempotent_no_progress: 5
+
+    sessions:
+      auto_prune: true
+      retention_days: 90
+
+    cron:
+      wrap_response: false
+
+    memory:
+      memory_enabled: true
+      user_profile_enabled: true
+      write_approval: false
+
+    skills:
+      write_approval: false
+
+  SOUL.md: |
+    # Platform Engineer — rogi.casa k3s cluster
+
+    You are the autonomous Platform Engineer for the `rogi.casa` K3s cluster.
+    You run *inside* the cluster (namespace `platform-engineer`) and your job is
+    to keep it healthy, fix small problems before they grow, and notify your
+    owner (Roger) on Discord when something needs a human.
+
+    ## The cluster you look after
+
+    - **Nodes:**
+      - `raspberrypi` — control-plane, arm64 (4 GiB)
+      - `rpi2` — worker, arm, very low memory (~512 MiB)
+      - `roger-nucbox-evo-x2` — worker, amd64, 24 GiB (you run here)
+    - **GitOps:** ArgoCD owns every app from `https://git.rogi.casa/roger/k3s-cluster.git`.
+      Each app lives in its own folder; manifests are reconciled with prune + selfHeal.
+    - **Ingress:** Traefik; TLS via cert-manager using the `letsencrypt-prod` ClusterIssuer.
+    - **LLM gateway:** LiteLLM at `https://litellm.rogi.casa/v1` — this is *your* model provider (you reach it through the Traefik ingress, never Ollama directly).
+    - **Services:** glance, pihole, litellm, gitea, home-assistant, jellyfin, n8n,
+      openwebui, phoenix, vaultwarden, qbittorrent, minecraft, monitoring
+      (prometheus + grafana), fava, myorg-assistant, gym-tracker, nas-proxy.
+    - **Your own RBAC** lets you read almost everything and mutate only an
+      allowlist (restart deployments/statefulsets/daemonsets, delete a stuck pod,
+      delete/patch jobs/cronjobs, `kubectl exec`). You CANNOT edit RBAC, taint
+      nodes, create/delete namespaces, or touch CRDs — if you think you need to,
+      propose the command to Roger and stop.
+
+    ## Operating rules
+
+    1. **Read first, act second.** Before changing anything, gather the evidence:
+       `kubectl describe`, `kubectl logs`, `kubectl get events --sort-by=.lastTimestamp`,
+       `kubectl top`. Cite the exact resource (ns/name) and the exact command in
+       every report.
+    2. **Only safe, idempotent remediations.** Allowed actions:
+       - `kubectl rollout restart deployment/<name> -n <ns>` (and statefulset/daemonset)
+       - delete a single stuck `CrashLoopBackOff`/`ImagePullBackOff` pod so its
+         controller recreates it
+       - `kubectl delete job/<name>` / `kubectl patch cronjob ...`
+       Never run a command that affects more than one workload at a time unless
+       Roger asked for it.
+    3. **When in doubt, notify, don't act.** If a fix is risky, unusual, or would
+       touch state you can't reach (RBAC, nodes, CRDs, PVC data), post the
+       proposed command to Discord and wait for Roger to reply.
+    4. **Be quiet when healthy.** Watchdog cron jobs reply with exactly `[SILENT]`
+       when there is nothing to report. Failed jobs always deliver regardless.
+    5. **No runaway loops.** You cannot create new cron jobs from inside a cron run
+       (Hermes disables that). Do not try.
+    6. **Talk like an engineer.** Short, concrete, with resource names and
+       commands. No filler. When you fixed something, say what you did in one line.
+    7. **Respect GitOps.** If an app is `OutOfSync`/`Degraded` in ArgoCD, do not
+       hand-edit resources to "fix" it — Argo will revert you. Report it so Roger
+       can fix the source repo.
+
+    ## How you reach Roger
+
+    Notifications go to Discord (your home channel). Cron jobs deliver there by
+    default (`deliver="discord"`). Keep messages under ~1800 chars; for longer
+    logs, write them to a file with
+    `kubectl logs ... > /opt/data/cron/output/<file>` and cite the path.
+
--- a/platform-engineer/cron-seed.yaml
+++ b/platform-engineer/cron-seed.yaml
@@ -0,0 +1,74 @@
+# One-shot Job that seeds Hermes' built-in cron schedule on first install.
+# Idempotent: skips job names that already exist.
+#
+# The agent's own cron jobs live in /opt/data/cron/jobs.json on the PVC and are
+# NOT reconciled by ArgoCD (runtime state). To re-seed after a wipe, recreate
+# this Job:  kubectl replace --force -f platform-engineer/cron-seed.yaml
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: hermes-cron-seed
+  namespace: platform-engineer
+  labels:
+    app: hermes
+spec:
+  backoffLimit: 4
+  ttlSecondsAfterFinished: 86400
+  template:
+    metadata:
+      labels:
+        app: hermes
+    spec:
+      serviceAccountName: platform-engineer
+      restartPolicy: OnFailure
+      containers:
+      - name: seed
+        image: bitnami/kubectl:1.35
+        command: ["sh", "-c"]
+        args:
+        - |
+          set -e
+          echo "Waiting for hermes pod to be Ready..."
+          kubectl -n platform-engineer wait --for=condition=Ready pod -l app=hermes --timeout=300s || true
+
+          POD=$(kubectl -n platform-engineer get pod -l app=hermes -o jsonpath='{.items[0].metadata.name}')
+          echo "Using pod: $POD"
+
+          exists() { kubectl -n platform-engineer exec "$POD" -- hermes cron list 2>/dev/null | grep -qi "name=$1\| $1 "; }
+
+          create() {
+            name="$1"; schedule="$2"; deliver="$3"; prompt="$4"
+            if exists "$name"; then
+              echo "cron job '$name' already exists — skipping"
+            else
+              echo "creating cron job '$name' ..."
+              kubectl -n platform-engineer exec "$POD" -- hermes cron create "$schedule" "$prompt" --name "$name" --deliver "$deliver"
+            fi
+          }
+
+          # ---- Watchdog checks (silent unless something is wrong) ----
+          create "cluster-health-check" "every 15m" "discord" \
+            "Run: kubectl get nodes; kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded; kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp (kubectl get events has no --since flag, so consider only events from the last 20 minutes). If everything is healthy and there are no recent Warning events, reply with exactly [SILENT]. Otherwise give a concise per-resource summary of what is wrong (node name, pod ns/name, phase, last event)."
+
+          create "pod-restart-loop" "every 10m" "discord" \
+            "Find pods in CrashLoopBackOff or ImagePullBackOff across all namespaces (kubectl get pods -A). For each, fetch kubectl logs (previous) and describe. If the cause is clearly transient (OOM kill, a one-off config parse error that will retry cleanly, a missing Secret the controller will recreate), attempt ONE safe remediation: kubectl rollout restart of the owning Deployment/StatefulSet/DaemonSet, OR delete the single stuck pod. Report what you did in one line per resource. If the cause is not clearly transient (bad image, missing config, auth failure), do NOT act — post the log excerpt and the proposed command and wait for Roger. If no such pods exist, reply [SILENT]."
+
+          create "pvc-pressure" "every 30m" "discord" \
+            "Check cluster storage health: kubectl get pv,pvc -A; kubectl describe nodes (look at the DiskPressure condition; kubectl top nodes reports only CPU and memory). Alert if any PVC is Pending/Lost or any node reports DiskPressure. If all healthy, reply [SILENT]."
+
+          create "argocd-sync-health" "every 1h" "discord" \
+            "Run: kubectl get applications -n argocd -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status. If every app is Synced and Healthy, reply [SILENT]. Otherwise list the OutOfSync/Degraded apps with their status. Do NOT hand-edit resources to fix them (Argo will revert) — just report."
+
+          create "cert-expiry" "0 9 * * *" "discord" \
+            "List all cert-manager Certificate resources (kubectl get certificates -A). For each, check notAfter. Alert on any certificate expiring in under 21 days. If none, reply [SILENT]."
+
+          create "node-resource-drift" "every 30m" "discord" \
+            "Run kubectl top nodes. If any node CPU or memory usage is over 90%, or any node is NotReady, report it with the numbers. Otherwise reply [SILENT]."
+
+          # ---- Daily report (always delivered) ----
+          create "daily-cluster-report" "0 8 * * *" "discord" \
+            "Produce a daily cluster report for Roger: (1) node count + Ready/NotReady; (2) top 5 pods by CPU and by memory across all namespaces (kubectl top pods -A --sort-by=cpu, then --sort-by=memory); (3) count of pods not Running; (4) ArgoCD apps OutOfSync or Degraded; (5) any certificates expiring within 30 days; (6) any recent Warning events (last 24h). Keep it under 1800 chars. Always deliver (no [SILENT])."
+
+          echo "Done. Listing all cron jobs:"
+          kubectl -n platform-engineer exec "$POD" -- hermes cron list
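Review note: the `exists()` helper in this Job matches names with a loose `grep`, so a listing line such as `name=cluster-health-check-v2` would also satisfy a check for `cluster-health-check`. A stricter variant is sketched below under the assumption that `hermes cron list` prints job names as tokens separated by whitespace, `=` or `|`; the real output format is not shown in this change and should be confirmed first. `has_job` is a hypothetical name.

```shell
#!/usr/bin/env bash
# has_job NAME: read a job listing on stdin and succeed only if NAME appears
# as a complete token (assumed separators: space, tab, '=' and '|').
has_job() {
  awk -v name="$1" '
    {
      n = split($0, tok, /[ \t=|]+/)
      for (i = 1; i <= n; i++) if (tok[i] == name) found = 1
    }
    END { exit !found }
  '
}

# Usage inside the seed script (not executed here):
#   kubectl -n platform-engineer exec "$POD" -- hermes cron list | has_job "$name"
```

Whole-token matching keeps the seed idempotent even when one job name is a prefix of another.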
--- a/platform-engineer/deployment.yaml
+++ b/platform-engineer/deployment.yaml
@@ -0,0 +1,134 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: hermes
+  namespace: platform-engineer
+  labels:
+    app: hermes
+spec:
+  replicas: 1   # MUST be 1 — Hermes' /opt/data is single-writer.
+  strategy:
+    type: Recreate   # never run two pods against the same PVC
+  selector:
+    matchLabels:
+      app: hermes
+  template:
+    metadata:
+      labels:
+        app: hermes
+    spec:
+      serviceAccountName: platform-engineer
+      imagePullSecrets:
+      - name: gitea-registry
+
+      # Pin to the powerful amd64 node (image is linux/amd64; the NUC has 24 GiB).
+      nodeSelector:
+        kubernetes.io/arch: amd64
+      affinity:
+        nodeAffinity:
+          preferredDuringSchedulingIgnoredDuringExecution:
+          - weight: 100
+            preference:
+              matchExpressions:
+              - key: hardware
+                operator: In
+                values: ["high-memory"]
+        podAntiAffinity:
+          preferredDuringSchedulingIgnoredDuringExecution:
+          - weight: 100
+            podAffinityTerm:
+              labelSelector:
+                matchLabels:
+                  app: hermes
+              topologyKey: kubernetes.io/hostname
+
+      initContainers:
+      # Seed /opt/data with config.yaml + SOUL.md on first boot only.
+      # ArgoCD owns the manifests; the PVC is runtime state and is NOT reconciled.
+      - name: seed-data
+        image: busybox:1.36
+        command: ["sh", "-c"]
+        args:
+        - |
+          set -e
+          if [ ! -f /opt/data/config.yaml ]; then
+            echo "First boot: seeding /opt/data from ConfigMap..."
+            cp /seed/config.yaml /opt/data/config.yaml
+            cp /seed/SOUL.md /opt/data/SOUL.md
+            chmod 600 /opt/data/config.yaml
+          else
+            echo "/opt/data already initialized — leaving runtime state intact."
+          fi
+          # /workspace is an emptyDir mounted only in the main container, so it is not created here.
+          mkdir -p /opt/data/home/.kube /opt/data/cron/output /opt/data/scripts
+        volumeMounts:
+        - name: data
+          mountPath: /opt/data
+        - name: seed
+          mountPath: /seed
+
+      containers:
+      - name: hermes
+        image: registry.rogi.casa/roger/hermes-agent:v1.35-1
+        imagePullPolicy: IfNotPresent   # reuse the node's cached image; pull only if it is absent
+        args: ["gateway", "run"]   # args, not command: command would replace the image ENTRYPOINT (the s6 init)
+        ports:
+        - name: gateway
+          containerPort: 8642
+        - name: dashboard
+          containerPort: 9119
+        envFrom:
+        - secretRef:
+            name: hermes-env
+        env:
+        # KUBERNETES_SERVICE_HOST/PORT are injected automatically; kubectl inside the pod
+        # authenticates with the mounted ServiceAccount token, so only HERMES_HOME is set here.
+        - name: HERMES_HOME
+          value: /opt/data
+        volumeMounts:
+        - name: data
+          mountPath: /opt/data
+        - name: workspace
+          mountPath: /workspace
+        resources:
+          requests:
+            memory: "512Mi"
+            cpu: "250m"
+          limits:
+            memory: "2Gi"
+            cpu: "1000m"
+        livenessProbe:
+          httpGet:
+            path: /health
+            port: 8642
+          initialDelaySeconds: 60
+          periodSeconds: 30
+          failureThreshold: 3
+        securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: false   # official image runs as root for s6 init then drops to hermes
+
+      volumes:
+      - name: data
+        persistentVolumeClaim:
+          claimName: hermes-data
+      - name: workspace
+        emptyDir: {}
+      - name: seed
+        configMap:
+          name: hermes-seed
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: hermes
+  namespace: platform-engineer
+spec:
+  type: ClusterIP
+  selector:
+    app: hermes
+  ports:
+    - name: gateway
+      port: 80
+      targetPort: 8642
+    - name: dashboard
+      port: 9119
+      targetPort: 9119
--- a/platform-engineer/dockerfile
+++ b/platform-engineer/dockerfile
@@ -0,0 +1,31 @@
+# Derived Hermes Agent image with kubectl + helm so the agent can drive the
+# k3s cluster from inside the container (terminal backend = local).
+#
+# Build & push to the Gitea registry:
+#   docker build -t registry.rogi.casa/roger/hermes-agent:v1.35-1 -f dockerfile .
+#   docker push registry.rogi.casa/roger/hermes-agent:v1.35-1
+#
+# This image targets linux/amd64 (the agent pod is pinned to the amd64 NUC).
+FROM nousresearch/hermes-agent:latest
+
+USER root
+
+# kubectl (v1.35 to match the cluster's k3s version)
+RUN apt-get update \
+ && apt-get install -y --no-install-recommends curl gnupg ca-certificates \
+ && curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.35/deb/Release.key \
+    | gpg --dearmor -o /usr/share/keyrings/kubernetes-apt-keyring.gpg \
+ && echo 'deb [signed-by=/usr/share/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.35/deb/ /' \
+    > /etc/apt/sources.list.d/kubernetes.list \
+ && apt-get update \
+ && apt-get install -y --no-install-recommends kubectl \
+ # helm
+ && curl -fsSL https://get.helm.sh/helm-v3.16.3-linux-amd64.tar.gz \
+    | tar -xz -C /usr/local/bin --strip-components=1 linux-amd64/helm \
+ && apt-get clean \
+ && rm -rf /var/lib/apt/lists/*
+
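+# Kubeconfig directory for Hermes' tool subprocesses. At runtime the PVC mounted
+# at /opt/data hides this path; the seed-data init container recreates it there.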
+RUN mkdir -p /opt/data/home/.kube
+
+USER hermes
--- a/platform-engineer/ingress.yaml
+++ b/platform-engineer/ingress.yaml
@@ -1,24 +1,24 @@
 apiVersion: networking.k8s.io/v1
 kind: Ingress
 metadata:
-  name: minecraft
-  namespace: minecraft
+  name: hermes
+  namespace: platform-engineer
   annotations:
     cert-manager.io/cluster-issuer: letsencrypt-prod
 spec:
   ingressClassName: traefik
   tls:
   - hosts:
-    - minecraft.rogi.casa
-    secretName: minecraft-tls
+    - hermes.rogi.casa
+    secretName: hermes-tls
   rules:
-  - host: minecraft.rogi.casa
+  - host: hermes.rogi.casa
     http:
       paths:
         - path: /
           pathType: Prefix
           backend:
             service:
-              name: minecraft-server
+              name: hermes
               port:
-                number: 25565
+                number: 9119   # dashboard
--- a/platform-engineer/namespace.yaml
+++ b/platform-engineer/namespace.yaml
@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: platform-engineer
--- a/platform-engineer/pvc.yaml
+++ b/platform-engineer/pvc.yaml
@@ -0,0 +1,11 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: hermes-data
+  namespace: platform-engineer
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 5Gi
--- a/platform-engineer/rbac.yaml
+++ b/platform-engineer/rbac.yaml
@@ -0,0 +1,111 @@
+# Least-privilege RBAC for the Platform Engineer Hermes agent.
+#
+# The agent can READ almost everything cluster-wide, including Secrets in every
+# namespace, but can only MUTATE a narrow allowlist of safe, idempotent
+# resources (restart deployments, delete a stuck pod so its controller
+# recreates it, etc.). It CANNOT modify RBAC, nodes, namespaces, CRDs, or
+# Secrets. Note that nodes/proxy (read) and pods/exec are broad grants.
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: platform-engineer
+  namespace: platform-engineer
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: platform-engineer
+rules:
+  # ---- Broad read access (cluster-wide) ----
+  - apiGroups: [""]
+    resources:
+      - nodes
+      - nodes/proxy
+      - services
+      - endpoints
+      - pods
+      - pods/log
+      - configmaps
+      - secrets
+      - persistentvolumeclaims
+      - persistentvolumes
+      - namespaces
+      - events
+      - replicationcontrollers
+    verbs: ["get", "list", "watch"]
+  - apiGroups: ["apps"]
+    resources:
+      - deployments
+      - statefulsets
+      - daemonsets
+      - replicasets
+    verbs: ["get", "list", "watch"]
+  - apiGroups: ["batch"]
+    resources:
+      - jobs
+      - cronjobs
+    verbs: ["get", "list", "watch"]
+  - apiGroups: ["networking.k8s.io"]
+    resources:
+      - ingresses
+    verbs: ["get", "list", "watch"]
+  - apiGroups: ["autoscaling"]
+    resources:
+      - horizontalpodautoscalers
+    verbs: ["get", "list", "watch"]
+  - apiGroups: ["argoproj.io"]
+    resources:
+      - applications
+      - appprojects
+    verbs: ["get", "list", "watch"]
+  - apiGroups: ["cert-manager.io"]
+    resources:
+      - certificates
+      - certificaterequests
+      - clusterissuers
+    verbs: ["get", "list", "watch"]
+  - apiGroups: ["metrics.k8s.io"]
+    resources:
+      - pods
+      - nodes
+    verbs: ["get", "list"]
+
+  # ---- Metrics / health endpoints ----
+  - nonResourceURLs: ["/metrics", "/metrics/*"]
+    verbs: ["get"]
+
+  # ---- Narrow mutate allowlist (idempotent, safe remediation) ----
+  # Restart a stuck pod by deleting it (its controller recreates it).
+  - apiGroups: [""]
+    resources: ["pods"]
+    verbs: ["delete", "patch"]
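+  # `kubectl rollout restart` (a patch on the pod template) for the apps controllers.
+  # `kubectl scale` uses the */scale subresources, which are NOT granted here;
+  # replicas can still be changed with a direct patch of the object.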
+  - apiGroups: ["apps"]
+    resources:
+      - deployments
+      - statefulsets
+      - daemonsets
+      - replicasets
+    verbs: ["patch", "update"]
+  - apiGroups: ["batch"]
+    resources:
+      - jobs
+      - cronjobs
+    verbs: ["patch", "update", "delete"]
+  # Exec into pods for log-style / debug inspection (granted per request #5).
+  - apiGroups: [""]
+    resources: ["pods/exec"]
+    verbs: ["create"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: platform-engineer
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: platform-engineer
+subjects:
+  - kind: ServiceAccount
+    name: platform-engineer
+    namespace: platform-engineer
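One way to confirm that the ClusterRole above behaves as intended is to impersonate the ServiceAccount with `kubectl auth can-i`. The sketch below is illustrative only: `can_i` and `audit_rbac` are hypothetical helper names, the expectation table is read off the rules in this file, and it assumes that no other binding grants this ServiceAccount extra rights and that the caller is allowed to impersonate ServiceAccounts.

```shell
#!/bin/sh
# Sketch only: can_i and audit_rbac are hypothetical helpers.
# `kubectl auth can-i ... --as=...` is a real kubectl subcommand.

SA="system:serviceaccount:platform-engineer:platform-engineer"

# can_i VERB RESOURCE [EXTRA FLAGS...]
# Prints "yes" or "no" for the agent's ServiceAccount across all namespaces.
can_i() {
  # can-i exits non-zero when the answer is "no", so the status is ignored.
  # stdin is redirected so kubectl cannot consume the caller's input.
  kubectl auth can-i "$@" --all-namespaces --as="$SA" </dev/null 2>/dev/null || true
}

# audit_rbac
# Compares the live answers with what rbac.yaml is meant to allow.
# Prints one line per mismatch and returns non-zero if any is found.
audit_rbac() {
  rc=0
  while read -r expected args; do
    # $args is split on purpose: it holds the verb, resource, and optional flags.
    actual=$(can_i $args)
    if [ "$actual" != "$expected" ]; then
      echo "MISMATCH: $args expected=$expected actual=$actual"
      rc=1
    fi
  done <<'EOF'
yes list secrets
yes delete pods
yes create pods --subresource=exec
yes patch deployments.apps
no delete deployments.apps
no create clusterrolebindings.rbac.authorization.k8s.io
no delete namespaces
no patch nodes
EOF
  return $rc
}
```

Running `audit_rbac` after ArgoCD syncs the manifests would print nothing on success. A `MISMATCH` line points either at a drifted rule or at an additional binding that widens the agent's access.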
Author	SHA1	Message	Date
Roger Oriol	3f3467cb13	gitea registry ingress	2026-06-27 11:46:53 +02:00
Roger Oriol	6e02d9a885	new platform engineer agent	2026-06-27 00:09:39 +02:00
Roger Oriol	d8012dfb6c	monitoring: add dashboard ideas doc Survey of dashboards that could be built from existing and not-yet-enabled metrics across the cluster's services (traefik, coredns, metallb, cert-manager, phoenix, litellm, gitea, postgres, etc.), with per-service enable steps and a recommended priority order.	2026-06-26 20:22:54 +02:00
Roger Oriol	bf1387dc3e	monitoring: add Grafana dashboards + kube-state-metrics & node-exporter Dashboards (provisioned via ConfigMaps into Grafana pod, 'K3s Cluster' folder): - Cluster Overview: per-namespace CPU/mem/net/fs, pod counts, pod health (KSM) - Pods & Services: per-pod CPU/mem/net/fs, throttling, pod status, restarts, PVCs - Nodes: per-node CPU%/mem%, load average, disk usage, network (node-exporter) - Control Plane & API Server: request rate, latency p95, 5xx, kubelet/PLEG - Prometheus Self-Monitoring: ingestion, series, scrape duration, memory Exporters (auto-scraped via existing kubernetes-service-endpoints job): - kube-state-metrics: pod/deployment/PVC/replica state (kube_pod_status_phase, kube_pod_container_status_restarts_total, kube_persistentvolumeclaim_) - node-exporter (DaemonSet, hostNetwork): node_cpu_seconds_total, node_memory_, node_filesystem_, node_load, node_network_*	2026-06-26 19:48:17 +02:00
Roger Oriol	2eab82b430	fix nas ingress	2026-06-26 19:01:08 +02:00
Roger Oriol	3cdd40153f	fix nas ingress	2026-06-26 18:54:17 +02:00
Roger Oriol	9f74a88be7	fix nas ingress	2026-06-26 18:40:41 +02:00
Roger Oriol	586e95a57d	fix nas ingress	2026-06-26 18:25:29 +02:00
Roger Oriol	9f7e34ef78	fix prometheus ingress	2026-06-26 18:06:01 +02:00
Roger Oriol	b43874bdcd	Expose minecraft server over TCP via MetalLB Minecraft Java Edition uses raw TCP on port 25565, not HTTP. The previous ClusterIP Service + HTTP Ingress (Traefik 80/443) could not carry TCP 25565 traffic, so minecraft.rogi.casa:25565 was unreachable. - Change Service to LoadBalancer with fixed IP 10.88.20.103 (dmz-pool), matching the pihole-dns pattern, so port 25565 is exposed directly. - Remove the dead HTTP ingress (it routed HTTP to a TCP game port).	2026-06-26 13:38:43 +02:00
Roger Oriol	da2bae6fa5	Merge branch 'main' of https://git.rogi.casa/roger/k3s-cluster	2026-06-26 12:01:29 +02:00
Roger Oriol	e77e170421	fix(homeassistant): trust k3s pod/service CIDRs as X-Forwarded-For proxies HA runs with hostNetwork on roger-nucbox-evo-x2 while Traefik runs on the raspberrypi node, so requests arrive at HA from 10.88.20.11. The previous trusted_proxies entry (10.88.88.0/24) did not include this address, causing HA to reject X-Forwarded-For and return 400 on every ingress request.	2026-06-26 11:58:46 +02:00