diff --git a/monitoring/dashboard-ideas.md b/monitoring/dashboard-ideas.md new file mode 100644 index 0000000..6d573c4 --- /dev/null +++ b/monitoring/dashboard-ideas.md @@ -0,0 +1,354 @@ +# Dashboard Ideas + +This file collects ideas for additional Grafana dashboards to build for the +`rogi.casa` k3s cluster. Each idea notes the **data source** (metrics already +available vs. metrics that need to be enabled) and a rough panel layout. + +To actually add a dashboard, create a `grafana-dashboard-.yaml` ConfigMap +in this folder, mount it in `grafana-deployment.yaml` (add a volume + +volumeMount under `/var/lib/grafana/dashboards/`), then commit and push. + +--- + +## Already-scraped services (ready to dashboard now) + +These exporters/services are **already being scraped by Prometheus** — dashboards +can be built immediately with no infra changes. + +### 1. Traefik (Ingress) — `traefik_*` +Traefik is scraped via the `kubernetes-pods` job (pod annotation on +`traefik-9bcdbbd9-x8zq4` in `kube-system`). It exposes request counters, entry +point latency, TLS handshakes, config reloads. + +**Panels:** +- Requests/sec by entrypoint (web / websecure / traefik) — `rate(traefik_entrypoint_requests_total[5m])` +- Request latency p50/p95/p99 — `histogram_quantile(0.95, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le, entrypoint))` +- HTTP status code distribution (2xx/3xx/4xx/5xx) — `sum by (code) (rate(traefik_entrypoint_requests_total{code=~"[2-5].."}[5m]))` (the `code` label holds the numeric status, e.g. `200`, not a class like `2xx`) +- TLS handshakes/sec — `rate(traefik_entrypoint_requests_tls_total[5m])` +- Config reloads + last reload success — `traefik_config_reloads_total`, `traefik_config_last_reload_success` +- Top routes/services by request volume — `topk(10, sum by (service) (rate(traefik_service_requests_total[5m])))` +- Bytes transferred in/out — `rate(traefik_entrypoint_requests_bytes_total[5m])` + +**Why useful:** This is your front door. 
Knowing which routes get hit most, +latency per ingress, and 5xx spikes is the single most valuable app-level +dashboard in the cluster. + +--- + +### 2. CoreDNS (cluster DNS) — `coredns_*` +Scraped via `kube-dns` Service annotation. Exposes query rate, cache hits, +error types, response duration. + +**Panels:** +- DNS queries/sec by zone / type — `rate(coredns_dns_requests_total[5m])` +- Cache hit ratio — `sum(rate(coredns_cache_hits_total[5m])) / sum(rate(coredns_cache_requests_total[5m]))` (aggregate both sides first; the hits counter carries an extra `type` label, so the raw series do not match one-to-one) +- DNS query latency p95 — `histogram_quantile(0.95, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))` +- Queries by response code (NOERROR / NXDOMAIN / SERVFAIL) — `sum by (rcode) (rate(coredns_dns_responses_total[5m]))` +- Cache size — `coredns_cache_entries` +- Forward requests/sec (upstream DNS) — `rate(coredns_forward_requests_total[5m])` + +**Why useful:** DNS issues cause cascading failures (ImagePullBackOff, cert +challenges, etc.). A spike in NXDOMAIN/SERVFAIL is an early warning. + +--- + +### 3. MetalLB (LoadBalancer) — `metallb_*` +Scraped via pod annotation on `speaker-*` and `controller` in `metallb-system`. +Exposes IP allocation usage, BGP/session state. + +**Panels:** +- IP addresses in use vs. total — `metallb_allocator_addresses_in_use_total` / `metallb_allocator_addresses_total` +- IP pool utilization % (gauge) — `metallb_allocator_addresses_in_use_total / metallb_allocator_addresses_total * 100` +- BGP session up per speaker — `metallb_bgp_session_up` +- Config loaded / stale status — `metallb_k8s_client_config_loaded_bool`, `metallb_k8s_client_config_stale_bool` +- Announcements per speaker — `rate(metallb_bgp_announcements_total[5m])` + +**Why useful:** If MetalLB runs out of IPs, new LoadBalancer services will +hang in ``. Knowing pool utilization lets you act before that happens. + +--- + +### 4. cert-manager (TLS certificates) — `certmanager_*` +Scraped via pod annotations on cert-manager pods. 
Exposes certificate +expiration, renewal, ready status, ACME challenges. + +**Panels:** +- Certificate expiration (days remaining, sorted) — table of `(certmanager_certificate_not_after_timestamp_seconds - time()) / 86400` +- Certificates not Ready — `certmanager_certificate_ready_status{condition!="True"} == 1` (the `condition` label holds `True` / `False` / `Unknown`; there is no separate `status` label) +- Upcoming renewals (next 14 days) — `certmanager_certificate_renewal_timestamp_seconds - time() < 14 * 86400` +- ACME challenge status — `certmanager_certificate_challenge_status` +- Failed renewals counter — `rate(certmanager_certificate_renewal_total{condition="Failed"}[1h])` + +**Why useful:** A cert about to expire (or silently failing to renew) is the +kind of thing that takes down `*.rogi.casa` HTTPS with no warning. This is a +must-have alert/dashboard. + +--- + +### 5. Phoenix (trace store) — `phoenix_*` +Already scraped via the `phoenix` Service annotation. Exposes bulk loader +ingestion rates, span insertion times, retention sweeper, exceptions. + +**Panels:** +- Span ingestion rate — `rate(phoenix_bulk_loader_span_insertion_time_seconds_count[5m])` +- Span insertion latency p95 — `histogram_quantile(0.95, sum(rate(phoenix_bulk_loader_span_insertion_time_seconds_bucket[5m])) by (le))` +- Span exceptions/sec — `rate(phoenix_bulk_loader_span_exceptions_total[5m])` +- Retention sweeper last run — `phoenix_retention_sweeper_last_run_seconds` +- Last activity timestamp — `phoenix_bulk_loader_last_activity_timestamp_seconds` + +**Why useful:** Phoenix is your observability backend's own backend. Tracking +ingestion health tells you whether traces are landing. + +--- + +## Infrastructure dashboards (compose from existing metrics) + +### 6. Storage & PVC Health (KSM + kubelet + node-exporter) +Cross-source dashboard combining `kube_persistentvolumeclaim_*` (KSM), +`kubelet_volume_stats_*` (kubelet), and `node_filesystem_*` (node-exporter). 
+ +**Panels:** +- PVC usage % per claim — `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100` +- PVC requested vs. capacity — `kube_persistentvolumeclaim_resource_requests_storage_bytes` vs. `kubelet_volume_stats_capacity_bytes` +- Node disk usage % (all mounts) — `(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100` +- Inode usage % per mount — `(1 - node_filesystem_files_free / node_filesystem_files) * 100` +- Volume binding status (Bound/Pending) — `kube_persistentvolumeclaim_status_phase` +- Top 10 PVCs by usage (table) + +**Why useful:** The `local-path` provisioner fills up node disks. Catching a +PVC at 95% before it errors is a lifesaver. + +--- + +### 7. Workload Health (KSM) +Uses kube-state-metrics to show deployment/StatefulSet/CronJob health cluster-wide. + +**Panels:** +- Deployments with unavailable replicas — `kube_deployment_status_replicas_available < kube_deployment_status_replicas` +- Pods not in Running phase by namespace — `sum by (namespace, phase) (kube_pod_status_phase{phase!="Running"} == 1)` (KSM emits a 0-valued series for every phase a pod is not in, hence the `== 1` filter) +- Container restarts (last 1h) — `increase(kube_pod_container_status_restarts_total[1h])` +- Pods stuck in CrashLoopBackOff / ImagePullBackOff — `kube_pod_container_status_waiting_reason{reason=~"CrashLoopBackOff|ImagePullBackOff"}` +- Job failures — `kube_job_failed` +- CronJob schedule heatmap — `kube_cronjob_status_active` +- HPA status (if any autoscaled) — `kube_horizontalpodautoscaler_status_current_replicas` vs desired + +**Why useful:** This is the "is anything broken" board. Notice you already have +some pods in `ImagePullBackOff` (myorg-assistant) — this dashboard surfaces that. + +--- + +### 8. etcd / Control Plane Health (if exposed) +k3s embeds etcd (or sqlite on single-node). etcd metrics require exposing +the etcd `/metrics` endpoint (on k3s, the `--etcd-expose-metrics` server flag; +upstream etcd uses `--listen-metrics-urls` on the control plane node). 
**Requires config change to enable.** + +**Panels:** +- Leader changes — `etcd_server_leader_changes_seen_total` +- Proposal commits/sec — `rate(etcd_server_proposals_committed_total[5m])` +- Proposal failures/sec — `rate(etcd_server_proposals_failed_total[5m])` +- DB size — `etcd_mvcc_db_total_size_in_bytes` +- RPC latency p99 — `histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (le))` +- Active watchers — `etcd_debugging_mvcc_watcher_total` + +**Why useful:** etcd is the brain of the cluster. Slow commits or a flipping +leader indicate control-plane trouble. + +--- + +## App-service dashboards (require enabling metrics first) + +Most of your apps don't expose `/metrics` yet. Below is the per-service setup +plus the dashboard idea once metrics are on. To enable scraping for any of +these, annotate the Service with: + +```yaml +metadata: + annotations: + prometheus.io/scrape: "true" + prometheus.io/port: "" +``` + +The existing `kubernetes-service-endpoints` scrape job will pick them up +automatically — **no Prometheus config edit needed**. + +### 9. LiteLLM (LLM gateway) — needs enabling +LiteLLM exposes Prometheus metrics on its API port (`/metrics`). Annotate the +`litellm` Service. + +**Panels:** +- Requests/sec by model — `rate(litellm_requests_total[5m])` by `model` +- Token usage (prompt/completion/total) — `rate(litellm_total_tokens_total[5m])` +- Spend by model — `litellm_spend_total` (if cost tracking enabled) +- Latency p95 per model — `histogram_quantile(0.95, ...)` +- Error rate by model — `rate(litellm_requests_total{status=~"5.."}[5m])` +- Rate-limit / quota hits + +**Why useful:** LiteLLM is the gateway for all your AI apps (open-webui, +myorg-assistant, etc.). Token spend + per-model latency is the single best +cost/quality lever in the cluster. + +--- + +### 10. Gitea (git + CI) — needs enabling +Gitea exposes metrics at `/metrics` when `ENABLED = true` is set under `[metrics]` in `app.ini`. 
+Annotate `gitea-http` Service (port 3000 inside, 80 via svc). + +**Panels:** +- Git push/clone/fetch rate — `gitea_actions_total` by `action` +- Active users / repos / orgs — `gitea_users_total`, `gitea_repos_total` +- Issues / PRs open — `gitea_issues_total`, `gitea_pulls_total` +- HTTP request rate + latency +- Gitea Actions runner job duration — if runner metrics exposed + +**Why useful:** Gitea hosts the cluster's own GitOps repo + CI. Tracking push +rate and runner throughput catches CI storms. + +--- + +### 11. Home Assistant — needs enabling +HA exposes Prometheus metrics via the `prometheus` integration (add to +`configuration.yaml`). Then annotate the Service. + +**Panels:** +- Active entities / sensors by domain +- State change events/sec — `homeassistant_entity_states_total` +- Automation triggers/sec — `homeassistant_automation_triggered_total` +- Integrations loaded + errors +- Database size / recorder queue depth +- Zigbee/Z-Wave mesh health (if exposed) + +**Why useful:** HA is a home-critical service. Event/sec spikes often indicate +sensor flapping or runaway automations. + +--- + +### 12. Jellyfin — limited +Jellyfin doesn't ship first-class Prometheus metrics, but you can scrape it +via a sidecar (`jellyfin-prometheus-exporter`) or build a blackbox-style +dashboard on the `/health` endpoint. + +**Panels:** +- Active streams — from exporter +- Transcode sessions + hw accel usage +- Library size by media type +- Playback errors + +--- + +### 13. Pi-hole — needs enabling +Pi-hole exposes metrics on its FTL web API; the `pihole-exporter` sidecar +converts them to Prometheus format. Add as a sidecar container + annotate. + +**Panels:** +- DNS queries/sec (total, blocked, cached, forwarded) +- Block list size +- Top blocked domains +- Top permitted domains +- Clients by query volume +- Cache hit ratio + +**Why useful:** Pi-hole is your network-wide adblock. 
Block rate + cache ratio +are the headline metrics, and query spikes reveal misbehaving clients. + +--- + +### 14. PostgreSQL (litellm + phoenix + n8n) — needs enabling +You have two Postgres instances (`postgres` in `litellm` and `phoenix`). +Add `prometheus-postgres-exporter` as a sidecar or Deployment per DB. + +**Panels (per DB):** +- Connections (active / idle / max) — `pg_stat_activity_count` +- Transactions/sec — `rate(pg_stat_database_xact_commit[5m])` +- Cache hit ratio — `pg_stat_database_blks_hit / (blks_hit + blks_read)` +- Table + index bloat +- Replication lag (if replicas) +- Slow queries (if `pg_stat_statements` enabled) +- DB size growth — `pg_database_size_bytes` + +**Why useful:** DB connection exhaustion and cache ratio collapse are the two +most common causes of slow app performance. + +--- + +### 15. Minecraft — limited +The Minecraft server exposes metrics via RCON + an exporter +(`minecraft-exporter`). Add as sidecar using the existing `RCON_PASSWORD`. + +**Panels:** +- Players online — `minecraft_players_online` +- TPS (ticks per second) — `minecraft_tps` (server health) +- Entities loaded — `minecraft_entities_total` +- Chunk count — `minecraft_chunks_loaded` +- Memory used by JVM + +**Why useful:** TPS < 20 means lag. Player count vs. server load is the only +real signal a Minecraft server needs. + +--- + +### 16. qBittorrent — limited +No native metrics. Options: a `qbittorrent-exporter` sidecar (uses the WebUI +API), or a blackbox probe on the WebUI. + +**Panels:** +- Download/upload speed +- Active torrents +- Torrent count by state (downloading/seeding/paused) +- Disk usage in download dir + +--- + +## Cluster meta dashboards + +### 17. Network Topology / Service Map +Composite view: for each namespace, list services, their pods, scrape status, +and request volume (from Traefik logs + cAdvisor network). A "what talks to +what" overview. 
+ +**Panels:** +- Service → pod → container resource table +- Cross-namespace network flows (if network policy logging enabled) +- Scrape health matrix (every target up/down) +- Ingress route → backend service map + +--- + +### 18. Backup / Snapshot Status +If you take Velero snapshots or local-path snapshots, build a dashboard on +`velero_*` or CRD status. **Requires Velero.** + +**Panels:** +- Last successful backup per namespace +- Failed backups +- Backup size growth +- Restore test status + +--- + +### 19. Cost / Capacity Planning +Composite: per-namespace CPU/memory requests vs. actual usage, projected +growth, node saturation forecast. + +**Panels:** +- Requests vs. limits vs. actual (per namespace) — KSM + cAdvisor +- Node capacity vs. allocatable +- PVC growth trend + 30-day forecast +- "What if I removed node X" simulation (capacity headroom) + +**Why useful:** Tells you when you'll need another node before you hit the wall. + +--- + +## Recommended priority order + +If you only build a few, do them in this order (highest value-to-effort first): + +1. **Traefik Ingress** (#1) — already scraped, your front door +2. **Storage & PVC Health** (#6) — local-path fills disks; high blast radius +3. **Workload Health** (#7) — surfaces CrashLoopBackOff / ImagePullBackOff +4. **cert-manager** (#4) — prevents silent cert expiry outages +5. **CoreDNS** (#2) — early warning for DNS cascades +6. **LiteLLM** (#9) — needs `prometheus.io/scrape` annotation only; big insights +7. **MetalLB** (#3) — small but catches LoadBalancer IP exhaustion + +Items 8–19 are nice-to-have or require additional exporters/config.
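
As a starting point for the first item in the priority list, the ConfigMap-and-mount step described at the top of this file might look like the sketch below. The dashboard name, the `monitoring` namespace, the `grafana` container name, and the JSON body are illustrative assumptions rather than values taken from this repo, and the sketch assumes Grafana already has a file-based dashboard provider watching `/var/lib/grafana/dashboards/`.

```yaml
# Sketch only: names, namespace, and JSON body are illustrative assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-traefik      # hypothetical dashboard name
  namespace: monitoring                # assumed namespace
data:
  traefik.json: |
    {
      "title": "Traefik Ingress",
      "uid": "traefik-ingress",
      "panels": []
    }
---
# Fragment to merge into grafana-deployment.yaml (not a complete Deployment).
spec:
  template:
    spec:
      containers:
        - name: grafana                # assumed container name
          volumeMounts:
            - name: dashboard-traefik
              mountPath: /var/lib/grafana/dashboards/traefik.json
              subPath: traefik.json    # mount one file without shadowing the directory
              readOnly: true
      volumes:
        - name: dashboard-traefik
          configMap:
            name: grafana-dashboard-traefik
```

Mounting with `subPath` keeps other dashboards in the same directory visible. The trade-off is that `subPath` mounts do not pick up ConfigMap updates until the pod restarts.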