Troubleshooting¶

Flux Issues¶

Resources Not Syncing¶

# Check Flux health
flux check

# Check for failed reconciliations
flux get ks -A --status-selector ready=false
flux get hr -A --status-selector ready=false

# View Flux logs
flux logs --all-namespaces

# Force sync
task reconcile

HelmRelease Stuck¶

# Suspend and resume
flux suspend hr <name> -n <namespace>
flux resume hr <name> -n <namespace>

# Force reconciliation
flux reconcile hr <name> -n <namespace> --force

Template Issues¶

Templates Not Rendering¶

task template:validate-schemas    # Check cluster.yaml & nodes.yaml
task template:render-configs      # Force re-render

Secret Issues¶

Secrets Not Decrypting¶

# Verify age key exists
test -f age.key && echo "Key exists" || echo "Missing key"

# Verify SOPS can decrypt
sops --decrypt bootstrap/sops-age.sops.yaml

# Check SOPS_AGE_KEY_FILE is set (prints an empty line if it is not)
echo "$SOPS_AGE_KEY_FILE"
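
The three checks can also be chained so that the first failure is reported with a distinct exit code. This is a minimal sketch: `check_sops_setup` is a hypothetical helper name, and it assumes the key is located through `SOPS_AGE_KEY_FILE` rather than a fixed `age.key` path.

```shell
# Hypothetical helper: run the decryption checks in order, stop at the first failure.
# Exit codes: 1 = variable unset, 2 = key file missing, 3 = sops cannot decrypt.
check_sops_setup() {
  key_file="${SOPS_AGE_KEY_FILE:-}"
  secret_file="${1:-bootstrap/sops-age.sops.yaml}"

  if [ -z "$key_file" ]; then
    echo "SOPS_AGE_KEY_FILE is not set" >&2
    return 1
  fi
  if [ ! -f "$key_file" ]; then
    echo "age key not found at $key_file" >&2
    return 2
  fi
  if ! sops --decrypt "$secret_file" >/dev/null; then
    echo "sops could not decrypt $secret_file" >&2
    return 3
  fi
  echo "SOPS setup looks OK"
}
```

Call it with no arguments to test the bootstrap secret, or pass another encrypted file as the first argument.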

Verifying Encryption¶

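# All .sops.yaml files should contain 'sops:' metadata.
# grep -L lists files that lack it, so no output means every file is encrypted.
# find is used because ** only recurses in bash when globstar is enabled.
find kubernetes -name '*.sops.yaml' -exec grep -L '^sops:' {} +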
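
The same check can be wrapped as a pass/fail helper, for example for a pre-commit hook. This is a sketch under stated assumptions: `find_unencrypted` is a hypothetical name, and it assumes every encrypted file carries a top-level `sops:` key.

```shell
# Hypothetical helper: print every *.sops.yaml file that has no top-level 'sops:' key
# and return non-zero if any such file exists.
find_unencrypted() {
  dir="${1:-kubernetes}"
  # grep -L prints the names of files that contain no matching line
  missing=$(find "$dir" -type f -name '*.sops.yaml' -exec grep -L '^sops:' {} + || true)
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
}
```

Running `find_unencrypted` with no arguments scans the `kubernetes` directory; a silent, zero exit means nothing unencrypted was found.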

Node Issues¶

Nodes Not Joining¶

talosctl get members --nodes <ip> --insecure
# logs takes a service name (kubelet shown here) and needs an authenticated talosconfig
talosctl logs kubelet --nodes <ip>

Node Health¶

kubectl get nodes -o wide
kubectl describe node <node-name>

Pod Issues¶

General Debugging¶

# List pods in namespace
kubectl -n <namespace> get pods -o wide

# Check pod logs
kubectl -n <namespace> logs <pod-name> -f

# Describe pod for events
kubectl -n <namespace> describe pod <pod-name>

# Check namespace events
kubectl -n <namespace> get events --sort-by='.metadata.creationTimestamp'

CrashLoopBackOff¶

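Check logs from the previous container: kubectl -n <ns> logs <pod> --previous
Check resource limits and the last termination reason (for example OOMKilled): kubectl -n <ns> describe pod <pod>
Check whether the app depends on NFS; if so, add the NFS-scaler component
Check whether a required secret is missing: kubectl -n <ns> get secrets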
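
The checklist above can be collected into one report per pod. A minimal sketch, where `crashloop_report` is a hypothetical helper name and the `--tail=50` limit is an arbitrary choice:

```shell
# Hypothetical helper: gather the usual CrashLoopBackOff evidence for one pod.
crashloop_report() {
  ns="$1"
  pod="$2"

  echo "== Logs from the previous (crashed) container =="
  kubectl -n "$ns" logs "$pod" --previous --tail=50 || true

  echo "== Last termination reason and exit code per container =="
  kubectl -n "$ns" get pod "$pod" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}' || true

  echo "== Secrets in the namespace =="
  kubectl -n "$ns" get secrets
}
```

Usage: `crashloop_report <ns> <pod>`. A termination reason of `OOMKilled` points at memory limits rather than at the application itself.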

Pending Pods¶

Check events: kubectl -n <ns> describe pod <pod>
Check node resources: kubectl top nodes
Check PVC binding: kubectl -n <ns> get pvc
Check node affinity and taints: kubectl describe node <node-name>
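
These checks can be run together for a namespace. A sketch only: `pending_report` is a hypothetical helper name, and `kubectl top` assumes a metrics API is available in the cluster.

```shell
# Hypothetical helper: show why pods in a namespace are not being scheduled.
pending_report() {
  ns="$1"

  echo "== Pending pods =="
  kubectl -n "$ns" get pods --field-selector status.phase=Pending

  echo "== Scheduler messages =="
  kubectl -n "$ns" get events --field-selector reason=FailedScheduling \
    --sort-by='.metadata.creationTimestamp'

  echo "== PVC binding =="
  kubectl -n "$ns" get pvc

  echo "== Node resources =="
  kubectl top nodes || echo "(node metrics not available)"

  echo "== Node taints =="
  kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}
```

Usage: `pending_report <ns>`. The `FailedScheduling` events usually name the exact constraint (resources, affinity, taints, or an unbound PVC) that blocked the pod.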

Network Issues¶

Cilium¶

cilium status
cilium connectivity test

DNS¶

# Test cluster DNS
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Test external DNS resolution
dig @<k8s-gateway-ip> <app>.<domain>

Storage Issues¶

NFS Unavailable¶

If NFS is down, pods using NFS volumes will crash-loop. The NFS-scaler component handles this automatically for apps that include it.

Check the alert rules that monitor NFS availability:

kubectl -n observability get prometheusrule -l app=blackbox-exporter

PVC Issues¶

kubectl get pvc -A
kubectl describe pvc <name> -n <namespace>

VM Issues¶

VM Not Starting¶

# Check VirtualMachine status
kubectl -n kubevirt get vm <vm-name>
kubectl -n kubevirt describe vm <vm-name>

# Check VirtualMachineInstance (the running instance)
kubectl -n kubevirt get vmi <vm-name>

# Check CDI DataVolume import status (for new VMs)
kubectl -n kubevirt get dv <vm-name>-disk
kubectl -n kubevirt describe dv <vm-name>-disk

VM Console Not Connecting¶

# Ensure virtctl is installed
virtctl version

# Try direct console access
virtctl console <vm-name> -n kubevirt

# Check VMI is in Running phase
kubectl -n kubevirt get vmi <vm-name> -o jsonpath='{.status.phase}'
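
The phase check and the console call can be combined so the console is only opened once the VMI is running. A sketch, assuming a kubectl version that supports `--for=jsonpath`; `vm_console` is a hypothetical helper name and the 120 second timeout is arbitrary.

```shell
# Hypothetical helper: wait for the VMI to reach Running, then attach the console.
vm_console() {
  vm="$1"
  ns="${2:-kubevirt}"

  if kubectl -n "$ns" wait "vmi/$vm" \
      --for=jsonpath='{.status.phase}'=Running --timeout=120s; then
    virtctl console "$vm" -n "$ns"
  else
    echo "VMI $vm did not reach Running; inspect it with: kubectl -n $ns describe vmi $vm" >&2
    return 1
  fi
}
```

Usage: `vm_console <vm-name>`, or `vm_console <vm-name> <namespace>` for a VM outside `kubevirt`.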

VM Live Migration Fails¶

Requirements for live migration:

VM disks must use ReadWriteMany storage (the NFS-backed nfs-fast class)
The LiveMigration feature gate must be enabled (it is by default)
The target node must have sufficient free resources
The VM must not use hostPath mounts

# Check migration status
kubectl -n kubevirt get vmim
kubectl -n kubevirt describe vmim <migration-name>
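
Before starting a migration, the VMI's own `LiveMigratable` condition can be read to see whether KubeVirt considers the requirements met. A sketch: `migration_preflight` is a hypothetical helper name, and the PVC listing covers the whole namespace rather than only the disks of the given VM.

```shell
# Hypothetical helper: report whether KubeVirt marks a VMI as live-migratable,
# and print the condition message plus PVC access modes if it does not.
migration_preflight() {
  vm="$1"
  ns="${2:-kubevirt}"

  status=$(kubectl -n "$ns" get vmi "$vm" \
    -o jsonpath='{.status.conditions[?(@.type=="LiveMigratable")].status}' || true)
  if [ "$status" = "True" ]; then
    echo "$vm is live-migratable"
    return 0
  fi

  echo "$vm is not live-migratable:" >&2
  kubectl -n "$ns" get vmi "$vm" \
    -o jsonpath='{.status.conditions[?(@.type=="LiveMigratable")].message}' >&2 || true
  echo >&2
  echo "PVC access modes in $ns:" >&2
  kubectl -n "$ns" get pvc \
    -o custom-columns='NAME:.metadata.name,MODES:.spec.accessModes[*],CLASS:.spec.storageClassName' >&2 || true
  return 1
}
```

Usage: `migration_preflight <vm-name>`. A PVC listed with `ReadWriteOnce` is the most common reason for a `False` condition.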

Database Issues¶

PostgreSQL Cluster Unhealthy¶

# Check cluster status
kubectl -n database get cluster postgres

# Check individual pod status
kubectl -n database get pods -l cnpg.io/cluster=postgres

# View PostgreSQL logs
kubectl -n database logs -l cnpg.io/cluster=postgres --tail=50

# Check replication status (pg_stat_replication only has rows on the primary; postgres-1 is assumed to be it)
kubectl -n database exec postgres-1 -- psql -U postgres -c 'SELECT * FROM pg_stat_replication;'
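
Because the primary can move after a failover, the pod name can be looked up from the cluster status instead of being hard-coded. A sketch, assuming the CloudNativePG `Cluster` resource exposes `.status.currentPrimary`; `pg_replication_status` is a hypothetical helper name.

```shell
# Hypothetical helper: find the current primary and query replication state there.
pg_replication_status() {
  ns="${1:-database}"
  cluster="${2:-postgres}"

  primary=$(kubectl -n "$ns" get cluster "$cluster" \
    -o jsonpath='{.status.currentPrimary}' || true)
  if [ -z "$primary" ]; then
    echo "Could not determine the primary for cluster $cluster in $ns" >&2
    return 1
  fi

  kubectl -n "$ns" exec "$primary" -- psql -U postgres \
    -c 'SELECT application_name, state, sync_state, replay_lag FROM pg_stat_replication;'
}
```

Usage: `pg_replication_status` with no arguments targets the `postgres` cluster in the `database` namespace.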

PostgreSQL Connection Issues¶

# Verify the service exists
kubectl -n database get svc postgres-rw

# Test connectivity from a debug pod
kubectl run -it --rm pg-debug --image=postgres:17 --restart=Never -- \
  psql -h postgres-rw.database.svc.cluster.local -U postgres

PostgreSQL Backup Failures¶

# Check scheduled backup status
kubectl -n database get scheduledbackups
kubectl -n database get backups --sort-by='.metadata.creationTimestamp'

# Check WAL archiving
kubectl -n database get cluster postgres -o jsonpath='{.status.firstRecoverabilityPoint}'

Certificate Issues¶

Certificate Not Ready¶

# Check certificate status
kubectl -n network get certificates
kubectl -n network describe certificate <cert-name>

# Check certificate requests
kubectl get certificaterequests -A

# Check cert-manager logs
kubectl -n cert-manager logs -l app.kubernetes.io/name=cert-manager --tail=50

Common Certificate Problems¶

DNS challenge failing: Check Cloudflare API token permissions
Rate limited by Let's Encrypt: Wait and retry (check kubectl describe certificate)
Secret not found: Verify the certificate secret name matches the TLS secret reference in your Gateway
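
For the "secret not found" case, the name the Certificate writes to can be compared with what exists in the namespace and with what the Gateways reference. A sketch: `cert_secret_check` is a hypothetical helper name, and the Gateway query assumes Gateway API `certificateRefs` on the listeners.

```shell
# Hypothetical helper: confirm the secret named by a Certificate exists,
# and list Gateway TLS references for comparison if it does not.
cert_secret_check() {
  cert="$1"
  ns="${2:-network}"

  secret=$(kubectl -n "$ns" get certificate "$cert" \
    -o jsonpath='{.spec.secretName}' || true)
  if [ -z "$secret" ]; then
    echo "Certificate $cert not found in $ns" >&2
    return 1
  fi

  if kubectl -n "$ns" get secret "$secret" >/dev/null 2>&1; then
    echo "Secret $secret exists in $ns"
    return 0
  fi

  echo "Secret $secret is missing in $ns. Gateway TLS references:" >&2
  kubectl get gateway -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{": "}{.spec.listeners[*].tls.certificateRefs[*].name}{"\n"}{end}' >&2 || true
  return 1
}
```

Usage: `cert_secret_check <cert-name>`. If the secret exists but the Gateway lists a different name, fix the reference rather than the certificate.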

Reset Cluster¶

Danger

This destroys everything. Use it only as a last resort.

task talos:reset
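
Since the reset is irreversible, it can be wrapped in a typed confirmation to guard against running it by accident. This is only a sketch; `confirm_reset` is a hypothetical helper name and is not part of the repository's tasks.

```shell
# Hypothetical guard: require the word "reset" to be typed before destroying the cluster.
confirm_reset() {
  printf 'Type "reset" to destroy the cluster: ' >&2
  read -r answer
  if [ "$answer" != "reset" ]; then
    echo "Aborted" >&2
    return 1
  fi
  task talos:reset
}
```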

After reset, re-bootstrap:

task bootstrap:talos
task bootstrap:apps