Troubleshooting¶
Flux Issues¶
Resources Not Syncing¶
# Check Flux health
flux check
# Check for failed reconciliations
flux get ks -A --status-selector ready=false
flux get hr -A --status-selector ready=false
# View Flux logs
flux logs --all-namespaces
# Force sync
task reconcile
HelmRelease Stuck¶
# Suspend and resume
flux suspend hr <name> -n <namespace>
flux resume hr <name> -n <namespace>
# Force reconciliation
flux reconcile hr <name> -n <namespace> --force
Template Issues¶
Templates Not Rendering¶
task template:validate-schemas # Check cluster.yaml & nodes.yaml
task template:render-configs # Force re-render
Secret Issues¶
Secrets Not Decrypting¶
# Verify age key exists
test -f age.key && echo "Key exists" || echo "Missing key"
# Verify SOPS can decrypt
sops --decrypt bootstrap/sops-age.sops.yaml
# Check SOPS_AGE_KEY_FILE is set
echo "$SOPS_AGE_KEY_FILE"
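The key checks above can be combined into one helper. This is a minimal sketch, assuming a Bash-compatible shell; the function name `check_age_key` is hypothetical and not part of the repository's tooling.

```shell
# Hypothetical helper (not part of the repo): report whether an age key is usable.
check_age_key() {
  # Prefer SOPS_AGE_KEY_FILE; fall back to ./age.key as used in the checks above.
  local key_file="${SOPS_AGE_KEY_FILE:-age.key}"
  if [ ! -r "$key_file" ]; then
    echo "Missing or unreadable key: $key_file"
    return 1
  fi
  # age private keys are stored on a line starting with AGE-SECRET-KEY-
  if ! grep -q '^AGE-SECRET-KEY-' "$key_file"; then
    echo "No age secret key found in: $key_file"
    return 1
  fi
  echo "Key OK: $key_file"
}
```

Run `check_age_key` from the repository root. If it reports a valid key but `sops --decrypt` still fails, the key on disk is probably not one of the recipients the file was encrypted for.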
Verifying Encryption¶
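One way to verify that secrets are encrypted before committing is to look for the metadata SOPS adds to every file it encrypts. The sketch below assumes secrets follow the `*.sops.yaml` naming seen in `bootstrap/sops-age.sops.yaml`; both function names are hypothetical.

```shell
# Hypothetical helper: succeed only if a YAML file carries SOPS metadata
# (a top-level `sops:` key) and at least one ENC[...] encrypted value.
is_sops_encrypted() {
  grep -q '^sops:' "$1" && grep -q 'ENC\[' "$1"
}

# Hypothetical helper: list every *.sops.yaml file under a directory
# (default: current directory) that does NOT look encrypted.
find_unencrypted_secrets() {
  local root="${1:-.}" file
  find "$root" -type f -name '*.sops.yaml' | while IFS= read -r file; do
    is_sops_encrypted "$file" || echo "$file"
  done
}
```

An empty result from `find_unencrypted_secrets .` means every matching file looks encrypted. This is a text heuristic only; `sops --decrypt <file>` remains the authoritative check.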
Node Issues¶
Nodes Not Joining¶
Node Health¶
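A quick health pass can start from `kubectl` alone. The sketch below assumes `kubectl` already points at the cluster; the helper name `not_ready_nodes` is hypothetical, and it does not cover OS-level checks on the nodes themselves.

```shell
# Hypothetical helper: print nodes whose STATUS column is not exactly Ready
# (for example NotReady, or Ready,SchedulingDisabled after a cordon).
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'
}

# Typical follow-up commands for a node reported above:
#   kubectl describe node <node-name>   # conditions, taints, recent events
#   kubectl top nodes                   # CPU and memory usage (needs a metrics API)
```

No output from `not_ready_nodes` means every node reports `Ready` and is schedulable.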
Pod Issues¶
General Debugging¶
# List pods in namespace
kubectl -n <namespace> get pods -o wide
# Check pod logs
kubectl -n <namespace> logs <pod-name> -f
# Describe pod for events
kubectl -n <namespace> describe pod <pod-name>
# Check namespace events
kubectl -n <namespace> get events --sort-by='.metadata.creationTimestamp'
CrashLoopBackOff¶
- Check logs:
kubectl -n <ns> logs <pod> --previous
- Check resource limits:
kubectl -n <ns> describe pod <pod>
- Check if NFS-dependent; add the NFS-scaler component if so
- Check if secret is missing:
kubectl -n <ns> get secrets
Pending Pods¶
- Check events:
kubectl -n <ns> describe pod <pod>
- Check node resources:
kubectl top nodes
- Check PVC binding:
kubectl -n <ns> get pvc
- Check node affinity/taints
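For the affinity/taints check, taints can be listed per node, and pending pods can be found across all namespaces in one call. This is a sketch using standard `kubectl` options; both helper names are hypothetical.

```shell
# Hypothetical helper: list Pending pods in every namespace.
pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
}

# Hypothetical helper: show each node with the keys of its taints.
node_taints() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}
```

Compare the taint keys against the tolerations and node affinity in the pod spec; a pod stays Pending when no node satisfies both.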
Network Issues¶
Cilium¶
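A minimal sketch for a first look at Cilium, assuming it is installed in the `kube-system` namespace with the upstream default `k8s-app=cilium` pod label; adjust both if this cluster differs. The helper names are hypothetical.

```shell
# Assumption: Cilium runs in kube-system with the upstream default labels.
CILIUM_NS="${CILIUM_NS:-kube-system}"

# Hypothetical helper: show the Cilium agent pods and the node each runs on.
cilium_pods() {
  kubectl -n "$CILIUM_NS" get pods -l k8s-app=cilium -o wide
}

# Hypothetical helper: print recent agent logs from the Cilium pods,
# prefixing each line with the pod it came from.
cilium_logs() {
  kubectl -n "$CILIUM_NS" logs -l k8s-app=cilium --tail=50 --prefix
}
```

An agent pod that is missing or not `Running` on a node usually explains networking failures for pods scheduled there.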
DNS¶
# Test cluster DNS
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default
# Test external DNS resolution
dig @<k8s-gateway-ip> <app>.<domain>
Storage Issues¶
NFS Unavailable¶
If NFS is down, pods using NFS volumes will crash-loop. The NFS-scaler component handles this automatically for apps that include it.
Check NFS availability:
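A minimal sketch of a reachability check, assuming NFS is served over TCP on the standard port 2049 and that Bash and the coreutils `timeout` command are available. The function name `nfs_reachable` is hypothetical.

```shell
# Hypothetical helper: succeed if a TCP connection to the NFS port opens
# within 3 seconds. Uses Bash's built-in /dev/tcp, so no extra tools are needed.
nfs_reachable() {
  local host="$1" port="${2:-2049}"
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example:
#   nfs_reachable <nfs-server> && echo "NFS reachable" || echo "NFS unreachable"
```

A successful connection only shows that the port is open; it does not prove that a given export can be mounted.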
PVC Issues¶
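A sketch for finding claims that have not bound, using only standard `kubectl` output; the helper name `unbound_pvcs` is hypothetical.

```shell
# Hypothetical helper: list PVCs in all namespaces that are not Bound.
# With --all-namespaces the columns are NAMESPACE NAME STATUS ..., so STATUS is $3.
unbound_pvcs() {
  kubectl get pvc --all-namespaces --no-headers |
    awk '$3 != "Bound" { print $1 "/" $2, $3 }'
}

# Follow-up for a claim reported above:
#   kubectl -n <namespace> describe pvc <pvc-name>   # events explain why binding failed
#   kubectl get storageclass                          # confirm the requested class exists
```

Each output line has the form `<namespace>/<name> <status>`; no output means every claim is bound.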
Reset Cluster¶
Danger
This destroys everything. Use it only as a last resort.
After reset, re-bootstrap: