Roger Oriol aa4793dd51 memory fixes

2026-02-02 20:47:09 +01:00

Raspberry Pi Node Scheduling Fix - Implementation Guide

Problem Summary

Your Raspberry Pi node (4GB RAM) keeps crashing because high-resource applications are scheduling on it instead of on nodes with more capacity.

Root Causes Identified

1. High-memory applications without node targeting:
   - n8n PostgreSQL: 2-4Gi memory requirements
   - Minecraft server: 1-4Gi memory requirements
   - OpenWebUI: 1-2Gi memory requirements
   - Phoenix services: 512Mi-2Gi memory requirements
   - Jellyfin: 512Mi-2Gi memory requirements
2. Missing node selectors: only the Gitea services target the ARM64 architecture
3. No taints/tolerations: the Raspberry Pi node isn't protected from heavy workloads
4. Missing resource limits: some applications can consume memory and CPU without any cap

Solution Applied

Modified Files with Node Selectors (Prevent RPi Scheduling)

✅ These manifests now carry a nodeSelector of hardware=high-memory:

- /n8n/postgres-deployment.yaml - PostgreSQL (2-4Gi memory)
- /minecraft-server/ss.yaml - Minecraft server (1-4Gi memory)
- /openwebui/openwebui.yaml - OpenWebUI (1-2Gi memory)
- /phoenix/phoenix-statefulset.yaml - Phoenix app (512Mi-2Gi memory)
- /phoenix/postgres-statefulset.yaml - Phoenix PostgreSQL (256Mi-1Gi memory)
- /jellyfin/jellyfin.yaml - Jellyfin media server (512Mi-2Gi memory)
- /monitoring/prometheus-deployment.yaml - Prometheus (512Mi-1Gi memory)
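
Before applying anything, it can help to confirm that each manifest in this list really contains the selector. The sketch below greps for it. It assumes the stanza is written in block style (nodeSelector: on one line, hardware: high-memory within the next two lines), and check_selector is a hypothetical helper name, not something that exists in this repository.

```shell
# Hypothetical helper: report which manifests contain the stanza
#   nodeSelector:
#     hardware: high-memory
# Assumes block-style YAML; an inline map such as
# "nodeSelector: {hardware: high-memory}" is not detected.
check_selector() {
  for f in "$@"; do
    if grep -A2 'nodeSelector:' "$f" 2>/dev/null | grep -q 'hardware: high-memory'; then
      printf 'ok       %s\n' "$f"
    else
      # Also reported when the file does not exist or cannot be read.
      printf 'MISSING  %s\n' "$f"
    fi
  done
}
```

Run from the repository root it might be called as check_selector n8n/postgres-deployment.yaml minecraft-server/ss.yaml, and so on for the other files. A text match like this only shows that the stanza is present somewhere in the file, not that it sits under the pod template.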

Implementation Steps

Step 1: Label and Taint Your Nodes

# 1. Identify your nodes
kubectl get nodes -o wide

# 2. Label your powerful nodes
kubectl label nodes <powerful-node-1> hardware=high-memory
kubectl label nodes <powerful-node-2> hardware=high-memory

# 3. Label your Raspberry Pi node
kubectl label nodes <raspberry-pi-node> hardware=low-memory
kubectl label nodes <raspberry-pi-node> node-type=raspberry-pi

# 4. Taint the Raspberry Pi to prevent most workloads
kubectl taint nodes <raspberry-pi-node> node-type=raspberry-pi:NoSchedule
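
To confirm that the labels and the taint landed on the intended nodes, something like the following sketch can be used. show_node_constraints is a hypothetical helper name; the kubectl flags are standard (-L adds one column per label key, custom-columns prints the taint keys per node).

```shell
# Hypothetical helper: show every node with its hardware and node-type
# labels, followed by the keys of any taints set on it.
show_node_constraints() {
  kubectl get nodes -L hardware,node-type
  kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}
```

Keep in mind that a NoSchedule taint only affects new scheduling decisions. Pods already running on the Pi stay there until they are deleted, which is what Step 3 takes care of. If the Pi is your only ARM64 node, workloads pinned to ARM64 (such as the Gitea services) will presumably need the toleration shown in the optional section to be scheduled there again after a restart.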

Step 2: Apply Updated Manifests

# Apply all updated manifests
kubectl apply -f n8n/postgres-deployment.yaml
kubectl apply -f minecraft-server/ss.yaml
kubectl apply -f openwebui/openwebui.yaml
kubectl apply -f phoenix/phoenix-statefulset.yaml
kubectl apply -f phoenix/postgres-statefulset.yaml
kubectl apply -f jellyfin/jellyfin.yaml
kubectl apply -f monitoring/prometheus-deployment.yaml
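
If you would rather validate the manifests against the API server before changing anything, a server-side dry run is one option. This is a minimal sketch; preview_manifests is a hypothetical helper name, and it assumes a kubectl version that supports --dry-run=server.

```shell
# Hypothetical helper: validate each manifest with a server-side dry run.
# Nothing is persisted; the loop stops at the first manifest that fails.
preview_manifests() {
  for f in "$@"; do
    kubectl apply --dry-run=server -f "$f" || return 1
  done
}
```

For example: preview_manifests n8n/postgres-deployment.yaml jellyfin/jellyfin.yaml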

Step 3: Force Reschedule Existing Pods

# Delete existing pods to force rescheduling on correct nodes
kubectl delete pods -n n8n -l service=postgres-n8n
kubectl delete pods -n minecraft -l app=minecraft-server
kubectl delete pods -l app=open-webui
kubectl delete pods -n phoenix -l app=phoenix
kubectl delete pods -n phoenix -l app=postgres
kubectl delete pods -n jellyfin -l app=jellyfin
kubectl delete pods -n monitoring -l app=prometheus

Step 4: Verify Pod Scheduling

# Check where pods are scheduled
kubectl get pods -o wide --all-namespaces | grep -E "(n8n|minecraft|open-?webui|phoenix|jellyfin|prometheus)"

# Verify node resource usage
kubectl top nodes

# Check events for scheduling issues
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
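
The check above still needs to be read by eye. As a sketch, the filter below prints only the heavy applications that are still on a given node, so empty output means the Pi is clean. heavy_on_node is a hypothetical helper name, and the name pattern is an assumption based on the namespaces and labels used in this guide; adjust it to your actual pod names.

```shell
# Hypothetical helper: read "NAMESPACE POD NODE" lines on stdin and print
# namespace/pod for every heavy application scheduled on the node given as $1.
# The pattern is matched against "namespace/pod", so a pod such as
# phoenix/postgres-0 is caught through its namespace.
heavy_on_node() {
  awk -v node="$1" '
    $3 == node && ($1 "/" $2) ~ /n8n|minecraft|open-?webui|phoenix|jellyfin|prometheus/ {
      print $1 "/" $2
    }'
}

# Usage (replace the placeholder with your node name):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns=NS:.metadata.namespace,POD:.metadata.name,NODE:.spec.nodeName \
#     | heavy_on_node <raspberry-pi-node>
```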

Optional: Add Tolerations for Lightweight Services

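For services that can run on the Raspberry Pi, add a toleration for the taint. The toleration only permits scheduling on the Pi; the preferred node affinity in this example additionally steers the pod toward it: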

# Example for Pi-hole (good candidate for RPi)
spec:
  template:
    spec:
      tolerations:
        - key: "node-type"
          operator: "Equal"
          value: "raspberry-pi"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values: ["raspberry-pi"]

Good candidates for Raspberry Pi:

- Pi-hole (DNS filtering)
- Home Assistant (IoT hub)
- Fava (lightweight accounting)
- Vaultwarden (password manager)
- Glance (dashboard)

Monitoring and Validation

Check Resource Usage

# Monitor node resource consumption
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory

# Check pod distribution across nodes
kubectl get pods --all-namespaces --no-headers -o custom-columns=NODE:.spec.nodeName | sort | uniq -c
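
kubectl top reports live usage and needs metrics-server, while the scheduler places pods based on their resource requests. To see how much of the Pi is already committed in requests and limits, the "Allocated resources" section of kubectl describe node is useful. The sketch below extracts just that section; allocated_on is a hypothetical helper name.

```shell
# Hypothetical helper: print only the "Allocated resources" section of
# `kubectl describe node` for the node given as $1.
# Printing starts at the section header and stops before "Events:".
allocated_on() {
  kubectl describe node "$1" | awk '
    /^Allocated resources:/ { show = 1 }
    /^Events:/              { show = 0 }
    show'
}
```

For example: allocated_on <raspberry-pi-node>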

Verify Scheduling Constraints

# Check node labels and taints
kubectl get nodes --show-labels
kubectl describe nodes | grep -E "(Name:|Taints:|Labels:)"

# List everything running on the RPi; none of the high-memory apps should appear
kubectl get pods -o wide --all-namespaces --field-selector spec.nodeName=<raspberry-pi-node>

Troubleshooting

If Pods Stay Pending

# Check why pods can't be scheduled
kubectl describe pod <pending-pod-name> -n <namespace>

# Common issues:
# - Node doesn't have required labels
# - Resource requests too high for available nodes
# - Every node that matches the selector is tainted and the pod has no matching toleration
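
When you don't yet know which pod is stuck, a cluster-wide view is a reasonable starting point. This sketch lists all Pending pods and the scheduler's FailedScheduling events, whose messages usually name the unmet constraint (selector mismatch, untolerated taint, insufficient memory). pending_report is a hypothetical helper name.

```shell
# Hypothetical helper: list Pending pods in all namespaces, then the
# FailedScheduling events, oldest first.
pending_report() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
  kubectl get events --all-namespaces --field-selector=reason=FailedScheduling \
    --sort-by=.lastTimestamp
}
```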

If You Need to Rollback

# Remove node selectors from manifests and reapply
# Remove taints from Raspberry Pi
kubectl taint nodes <raspberry-pi-node> node-type=raspberry-pi:NoSchedule-

# Remove labels if needed
kubectl label nodes <node-name> hardware-
kubectl label nodes <node-name> node-type-

Expected Results

After implementation:

- High-resource applications will only schedule on powerful nodes
- Raspberry Pi node will be protected from resource-heavy workloads
- Cluster stability will improve with proper resource distribution
- Pi node crashes should stop occurring
- Lightweight services can still run on Pi (with tolerations)

Architecture Summary

┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│   Powerful      │  │   Powerful      │  │  Raspberry Pi   │
│   Node 1        │  │   Node 2        │  │   Node (4GB)    │
│                 │  │                 │  │                 │
│ • n8n Postgres  │  │ • Minecraft     │  │ • Pi-hole       │
│ • Phoenix       │  │ • OpenWebUI     │  │ • Glance        │
│ • Jellyfin      │  │ • Prometheus    │  │ • Fava          │
│ • Grafana       │  │ • Other apps    │  │ • Vaultwarden   │
│                 │  │                 │  │ • Home Asst     │
└─────────────────┘  └─────────────────┘  └─────────────────┘
  hardware=high-mem    hardware=high-mem    hardware=low-mem
                                            TAINTED (protected)

The Raspberry Pi is now protected from heavy workloads while remaining available to lightweight services that carry the matching toleration.
