Troubleshooting¶
Flux Issues¶
Resources Not Syncing¶
# Check Flux health
flux check
# Check for failed reconciliations
flux get ks -A --status-selector ready=false
flux get hr -A --status-selector ready=false
# View Flux logs
flux logs --all-namespaces
# Force sync
task reconcile
HelmRelease Stuck¶
# Suspend and resume
flux suspend hr <name> -n <namespace>
flux resume hr <name> -n <namespace>
# Force reconciliation
flux reconcile hr <name> -n <namespace> --force
Template Issues¶
Templates Not Rendering¶
task template:validate-schemas # Check cluster.yaml & nodes.yaml
task template:render-configs # Force re-render
Secret Issues¶
Secrets Not Decrypting¶
# Verify age key exists
test -f age.key && echo "Key exists" || echo "Missing key"
# Verify SOPS can decrypt
sops --decrypt bootstrap/sops-age.sops.yaml
# Check SOPS_AGE_KEY_FILE is set
echo $SOPS_AGE_KEY_FILE
Verifying Encryption¶
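A file can be spot-checked for encryption without decrypting it: SOPS wraps every value in an ENC[AES256_GCM,...] envelope and appends a sops: metadata block. A minimal sketch (the file path is illustrative):

```shell
# Spot-check that a file is SOPS-encrypted without decrypting it:
# encrypted values are wrapped in ENC[AES256_GCM,...] markers.
f=bootstrap/sops-age.sops.yaml
if grep -q 'ENC\[AES256_GCM' "$f" 2>/dev/null; then
  echo "$f is encrypted"
else
  echo "$f is NOT encrypted -- do not commit"
fi
```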
Node Issues¶
Nodes Not Joining¶
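As a starting point, these generic checks show whether a node registered with the API server at all; the talosctl lines assume a Talos-managed node (suggested by nodes.yaml elsewhere in this setup) and use a placeholder node IP:

```shell
# Has the node registered with the API server at all?
kubectl get nodes -o wide

# If it registered but is NotReady, inspect its conditions and events
kubectl describe node <node-name>

# For Talos-managed nodes: boot logs and cluster membership
# (assumes talosctl is configured; <node-ip> is a placeholder)
talosctl -n <node-ip> dmesg
talosctl -n <node-ip> get members
```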
Node Health¶
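Ongoing node health comes down to conditions and resource pressure; a sketch of the standard checks:

```shell
# Status, roles, and versions at a glance
kubectl get nodes

# Resource pressure (requires metrics-server)
kubectl top nodes

# Conditions: Ready, MemoryPressure, DiskPressure, PIDPressure
kubectl describe node <node-name> | grep -A 8 'Conditions:'
```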
Pod Issues¶
General Debugging¶
# List pods in namespace
kubectl -n <namespace> get pods -o wide
# Check pod logs
kubectl -n <namespace> logs <pod-name> -f
# Describe pod for events
kubectl -n <namespace> describe pod <pod-name>
# Check namespace events
kubectl -n <namespace> get events --sort-by='.metadata.creationTimestamp'
CrashLoopBackOff¶
- Check logs: kubectl -n <ns> logs <pod> --previous
- Check resource limits: kubectl -n <ns> describe pod <pod>
- Check if NFS-dependent -- add the NFS-scaler component if so
- Check if a secret is missing: kubectl -n <ns> get secrets
Pending Pods¶
- Check events: kubectl -n <ns> describe pod <pod>
- Check node resources: kubectl top nodes
- Check PVC binding: kubectl -n <ns> get pvc
- Check node affinity/taints
Network Issues¶
Cilium¶
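If the Cilium CLI is installed locally, these standard health checks apply; the last line works without the CLI by querying an agent pod directly:

```shell
# Agent and operator health across the cluster
cilium status --wait

# Built-in connectivity test suite (creates test pods in the cluster)
cilium connectivity test

# Without the CLI: ask an agent pod for its own status
kubectl -n kube-system exec ds/cilium -- cilium status --brief
```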
DNS¶
# Test cluster DNS
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default
# Test external DNS resolution
dig @<k8s-gateway-ip> <app>.<domain>
Storage Issues¶
NFS Unavailable¶
If NFS is down, pods using NFS volumes will crash-loop. The NFS-scaler component handles this automatically for apps that include it.
Check NFS availability:
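A sketch of two quick probes, one from a workstation with NFS client tools and one from inside the cluster (<nfs-server> is a placeholder for your NFS host):

```shell
# From a machine with NFS client tools: list the server's exports
showmount -e <nfs-server>

# Or probe the NFS port from inside the cluster
kubectl run -it --rm nfs-check --image=busybox --restart=Never -- \
  sh -c 'nc -w 2 <nfs-server> 2049 </dev/null && echo "NFS port open"'
```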
PVC Issues¶
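A Pending claim usually explains itself in its events; a sketch of the usual checks:

```shell
# List claims and their binding status cluster-wide
kubectl get pvc -A

# Events on a Pending claim typically name the problem
kubectl -n <namespace> describe pvc <pvc-name>

# Verify the StorageClass the claim references actually exists
kubectl get storageclass
```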
VM Issues¶
VM Not Starting¶
# Check VirtualMachine status
kubectl -n kubevirt get vm <vm-name>
kubectl -n kubevirt describe vm <vm-name>
# Check VirtualMachineInstance (the running instance)
kubectl -n kubevirt get vmi <vm-name>
# Check CDI DataVolume import status (for new VMs)
kubectl -n kubevirt get dv <vm-name>-disk
kubectl -n kubevirt describe dv <vm-name>-disk
VM Console Not Connecting¶
# Ensure virtctl is installed
virtctl version
# Try direct console access
virtctl console <vm-name> -n kubevirt
# Check VMI is in Running phase
kubectl -n kubevirt get vmi <vm-name> -o jsonpath='{.status.phase}'
VM Live Migration Fails¶
Requirements for live migration:
- Storage must be ReadWriteMany (NFS: nfs-fastclass)
- LiveMigration feature gate must be enabled (default)
- Sufficient resources on the target node
- No hostPath mounts
# Check migration status
kubectl -n kubevirt get vmim
kubectl -n kubevirt describe vmim <migration-name>
Database Issues¶
PostgreSQL Cluster Unhealthy¶
# Check cluster status
kubectl -n database get cluster postgres
# Check individual pod status
kubectl -n database get pods -l cnpg.io/cluster=postgres
# View PostgreSQL logs
kubectl -n database logs -l cnpg.io/cluster=postgres --tail=50
# Check replication status
kubectl -n database exec -it postgres-1 -- psql -U postgres -c 'SELECT * FROM pg_stat_replication;'
PostgreSQL Connection Issues¶
# Verify the service exists
kubectl -n database get svc postgres-rw
# Test connectivity from a debug pod
kubectl run -it --rm pg-debug --image=postgres:17 -- \
psql -h postgres-rw.database.svc.cluster.local -U postgres
PostgreSQL Backup Failures¶
# Check scheduled backup status
kubectl -n database get scheduledbackups
kubectl -n database get backups --sort-by='.metadata.creationTimestamp'
# Check WAL archiving
kubectl -n database get cluster postgres -o jsonpath='{.status.firstRecoverabilityPoint}'
Certificate Issues¶
Certificate Not Ready¶
# Check certificate status
kubectl -n network get certificates
kubectl -n network describe certificate <cert-name>
# Check certificate requests
kubectl get certificaterequests -A
# Check cert-manager logs
kubectl -n cert-manager logs -l app.kubernetes.io/name=cert-manager --tail=50
Common Certificate Problems¶
- DNS challenge failing: Check Cloudflare API token permissions
- Rate limited by Let's Encrypt: Wait and retry (check kubectl describe certificate)
- Secret not found: Verify the certificate secret name matches the TLS secret reference in your Gateway
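For a failing DNS-01 challenge, the ACME resources cert-manager creates usually state the reason directly:

```shell
# Orders and challenges created for pending certificates
kubectl get orders,challenges -A

# The challenge status reports DNS propagation or API-token errors
kubectl -n network describe challenge <challenge-name>
```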
Reset Cluster¶
Danger
This destroys everything. Use only as a last resort.
After reset, re-bootstrap:
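The exact reset and bootstrap commands depend on this repo's Taskfile; as a sketch, a Talos-based cluster is typically wiped and rebuilt along these lines (the task names are illustrative -- confirm the real ones with task --list):

```shell
# Wipe a node (DESTROYS ALL DATA on it; <node-ip> is a placeholder)
talosctl reset --nodes <node-ip> --graceful=false --reboot

# Once the nodes are back up, re-run the bootstrap tasks
# (illustrative names -- check `task --list` for the actual targets)
task bootstrap:talos
task bootstrap:apps
```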