Most Kubernetes incidents aren’t exotic. The same handful of failures account for the majority of “my pod won’t start” tickets — and each one has a recognizable signature and a well-worn fix. This guide walks through the errors you’ll actually hit, how to read them with kubectl, and how to resolve them.
For every error below, the diagnostic loop is the same: look at the pod’s status, then its events, then its logs. Keep these three commands close:
kubectl get pods -n <namespace>
kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
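The status step is mechanical enough to script. Below is a minimal Python sketch that assumes you feed it the parsed JSON from kubectl get pod <pod> -n <namespace> -o json. The triage function and its hint wording are hypothetical, and it only recognizes the signatures covered in this guide.

```python
def triage(pod: dict) -> list:
    """Map a pod's status fields (from `kubectl get pod -o json`) to short hints."""
    status = pod.get("status", {})
    hints = []
    # An unscheduled pod has no container statuses yet; the PodScheduled condition says why.
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            hints.append(f"pod: not scheduled ({cond.get('reason')}), read the FailedScheduling event")
    for cs in status.get("containerStatuses", []):
        name = cs["name"]
        state = cs.get("state", {})
        waiting = state.get("waiting", {}).get("reason")
        last = cs.get("lastState", {}).get("terminated", {})
        if waiting in ("ErrImagePull", "ImagePullBackOff"):
            hints.append(f"{name}: image pull failing, read the pod events")
        elif waiting == "CreateContainerConfigError":
            hints.append(f"{name}: missing ConfigMap or Secret, read the Waiting message")
        elif last.get("reason") == "OOMKilled":
            # Checked before CrashLoopBackOff: an OOM loop shows both, and OOM is the cause.
            hints.append(f"{name}: OOMKilled (exit {last.get('exitCode')}), check the memory limit")
        elif waiting == "CrashLoopBackOff":
            hints.append(f"{name}: crash loop (last exit {last.get('exitCode')}), run kubectl logs --previous")
        elif "running" in state and not cs.get("ready", False):
            hints.append(f"{name}: running but not Ready, check the readiness probe")
    return hints
```

For example, call triage(json.load(sys.stdin)) on piped kubectl output and print each hint. Because lastState keeps the previous termination, an old OOMKilled can still be reported after the container has recovered.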
CrashLoopBackOff
The most infamous one. CrashLoopBackOff doesn’t mean Kubernetes is broken. It means your container keeps starting and exiting, and the kubelet waits longer before each restart (the delay doubles from 10 seconds up to a five-minute cap).
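The Kubernetes documentation describes this back-off as 10s, 20s, 40s and so on, capped at five minutes, and resets it once a container has run for 10 minutes without problems. A minimal Python sketch of that schedule, ignoring the reset; crashloop_delays is a hypothetical helper, not a kubelet API:

```python
def crashloop_delays(restarts: int, initial: int = 10, cap: int = 300) -> list:
    """Delay in seconds before each restart: doubles from `initial`, capped at `cap`."""
    delays = []
    delay = initial
    for _ in range(restarts):
        delays.append(delay)
        delay = min(delay * 2, cap)  # exponential growth until the five-minute cap
    return delays
```

With seven restarts this gives 10, 20, 40, 80, 160, 300 and 300 seconds, so the waiting alone adds up to 910 seconds, roughly 15 minutes.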
Diagnose. The exit reason is in the logs of the previous run, not the current one:
kubectl logs <pod> -n <namespace> --previous
kubectl describe pod <pod> -n <namespace> # check Last State + Exit Code
Common causes and fixes.
- Application error on startup — a missing env var, an unreachable database, a bad migration. The logs almost always say which. Fix the config or the dependency.
- Failing liveness probe — if the probe fails, the kubelet kills the container and it loops. See the probe section below.
- Exit code 137: the process was killed with SIGKILL (128 + 9), most often by the OOM killer, covered next.
- Misconfigured command/args: the entrypoint exits immediately. Verify the command and args fields in the spec.
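Exit codes above 128 follow the shell convention of 128 plus the signal number, which is why 137 means SIGKILL (9) and 143 means SIGTERM (15). A small Python sketch that decodes the Exit Code shown by kubectl describe; explain_exit_code and its messages are hypothetical, and signal names come from the local platform’s signal module:

```python
import signal


def explain_exit_code(code: int) -> str:
    """Interpret a container exit code using the 128 + signal-number convention."""
    if code == 0:
        # A clean exit still loops if the restart policy expects a long-running process.
        return "exited cleanly, but the process was expected to keep running"
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            return f"killed by unknown signal {code - 128}"
        hint = " (often the OOM killer; look for OOMKilled)" if name == "SIGKILL" else ""
        return f"killed by {name}{hint}"
    return f"application exited with error code {code}; read kubectl logs --previous"
```
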
ImagePullBackOff and ErrImagePull
The container never even starts because the cluster can’t pull the image.
Diagnose.
kubectl describe pod <pod> -n <namespace> # read the Events at the bottom
The event message is specific: not found, unauthorized, or manifest unknown.
Common causes and fixes.
- Typo in the image name or tag: myapp:lateset. Fix the tag.
- Private registry without credentials: create a docker-registry Secret and list it under imagePullSecrets in the pod spec (or attach it to the service account).
- Tag doesn’t exist: the build never pushed it, or you’re pinning a tag that was deleted. Push it, or pin a digest.
- Rate limiting (Docker Hub): authenticate or mirror the image.
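To catch a typo like myapp:lateset before it reaches the cluster, you can compare the requested tag with the tags your registry actually holds. A simplified Python sketch: split_image_ref handles the common name[:tag][@digest] forms, including a registry host with a port, but is not a full parser of the image reference grammar, and known_tags is assumed to come from your registry. Both helpers are hypothetical.

```python
import difflib
from typing import Optional, Tuple


def split_image_ref(ref: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split 'name[:tag][@digest]' into (name, tag, digest). Simplified, not a full parser."""
    name, _, digest = ref.partition("@")
    tag = None
    # A colon is a tag separator only if it comes after the last '/';
    # otherwise it belongs to a registry host:port such as localhost:5000.
    if name.rfind(":") > name.rfind("/"):
        name, tag = name.rsplit(":", 1)
    if tag is None and not digest:
        tag = "latest"  # no tag and no digest means :latest is pulled
    return name, tag, digest or None


def suggest_tag(ref: str, known_tags: list) -> Optional[str]:
    """Return the closest known tag if the requested one is missing, else None."""
    _, tag, digest = split_image_ref(ref)
    if digest or tag in known_tags:
        return None  # pinned by digest, or the tag exists
    matches = difflib.get_close_matches(tag, known_tags, n=1)
    return matches[0] if matches else None
```
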
OOMKilled
Your container used more memory than its limit allows, so the kernel’s OOM killer terminated it. You’ll see OOMKilled and exit code 137.
Diagnose.
kubectl describe pod <pod> -n <namespace> # Last State: Terminated, Reason: OOMKilled
kubectl top pod <pod> -n <namespace> # live usage (requires metrics-server); compare it with the limit shown by describe
Fixes.
- Raise the memory limit if the workload legitimately needs more.
- Fix the leak if usage climbs forever — a limit just delays the inevitable.
- Set requests close to real usage so the scheduler places the pod on a node that can actually hold it.
A useful rule of thumb: set the request to the steady-state working set and the limit to the peak you can tolerate. Limits far above requests let pods overcommit a node, and under memory pressure the kubelet evicts pods whose usage exceeds their requests first.
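Turning that rule of thumb into numbers needs usage samples (for example, periodic readings of kubectl top) and a way to read Kubernetes memory quantities. A Python sketch under stated assumptions: parse_memory handles the binary (Ki, Mi, Gi, ...) and decimal (k, M, G, ...) suffixes but not milli-units or every edge of the quantity format. suggest_memory is hypothetical, and treating the median as the steady-state working set with 25% headroom over the observed peak are illustrative choices, not Kubernetes defaults.

```python
import math
import statistics

# Suffixes from the Kubernetes quantity format (simplified subset).
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '1G' to bytes (simplified)."""
    for table in (_BINARY, _DECIMAL):
        for suffix, factor in table.items():
            if quantity.endswith(suffix):
                return int(float(quantity[: -len(suffix)]) * factor)
    return int(float(quantity))  # plain bytes, e.g. '128974848'


def suggest_memory(samples: list, headroom: float = 1.25) -> tuple:
    """Hypothetical sizing: request = median sample, limit = peak * headroom, in Mi."""
    usage = [parse_memory(s) for s in samples]
    request_mi = math.ceil(statistics.median(usage) / 2**20)
    limit_mi = math.ceil(max(usage) * headroom / 2**20)
    return f"{request_mi}Mi", f"{limit_mi}Mi"
```

For samples of 100Mi, 120Mi and 110Mi this suggests a 110Mi request and a 150Mi limit.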
Pods stuck in Pending
A Pending pod hasn’t been scheduled to any node. The scheduler is telling you why in the events.
Diagnose.
kubectl describe pod <pod> -n <namespace> # FailedScheduling event explains it
Common causes and fixes.
- Insufficient CPU/memory — no node has room for the pod’s requests. Scale the cluster, lower the requests, or free capacity.
- Unsatisfiable node affinity, selectors, or taints: the pod requires node labels that no node has, or every candidate node carries a taint the pod doesn’t tolerate. Fix the nodeSelector or affinity rules, or add a toleration.
- Unbound PersistentVolumeClaim: the pod waits on storage. Check the PVC (next).
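The scheduler’s filtering can be approximated to see which constraint rules out each node. A deliberately simplified Python sketch: NodeCapacity and why_unschedulable are hypothetical, free capacity means allocatable minus requests already placed on the node, taints and tolerations are compared as exact key=value:effect strings, and every taint is treated as a hard NoSchedule taint. The real scheduler also supports toleration operators such as Exists, affinity expressions, and many more filters.

```python
from dataclasses import dataclass, field


@dataclass
class NodeCapacity:
    name: str
    free_cpu_m: int   # allocatable CPU minus requests already placed, in millicores
    free_mem_mi: int  # allocatable memory minus requests already placed, in MiB
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)  # e.g. {"gpu=true:NoSchedule"}, all treated as hard


def why_unschedulable(cpu_m: int, mem_mi: int, node_selector: dict,
                      tolerations: set, nodes: list) -> dict:
    """Return {node name: [reasons]} for every node that cannot accept the pod."""
    rejected = {}
    for node in nodes:
        reasons = []
        if node.free_cpu_m < cpu_m:
            reasons.append("insufficient cpu")
        if node.free_mem_mi < mem_mi:
            reasons.append("insufficient memory")
        # nodeSelector requires an exact match on every listed label.
        if any(node.labels.get(k) != v for k, v in node_selector.items()):
            reasons.append("nodeSelector mismatch")
        # Simplification: a taint is tolerated only by an identical toleration string.
        if node.taints - set(tolerations):
            reasons.append("untolerated taint")
        if reasons:
            rejected[node.name] = reasons
    return rejected
```

An empty result means at least one node passes these checks, so the FailedScheduling event is pointing at something this sketch does not model.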
PVC stuck in Pending
A PersistentVolumeClaim that never binds blocks every pod that mounts it.
kubectl get pvc -n <namespace>
kubectl describe pvc <claim> -n <namespace>
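Usually it’s a missing or mismatched StorageClass, or no provisioner able to satisfy the request. Point the claim at a valid StorageClass, or provision a matching PersistentVolume. One case is expected behavior: if the StorageClass uses volumeBindingMode: WaitForFirstConsumer, the claim stays Pending until a pod that uses it is scheduled.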
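These checks can be sketched in Python under a few assumptions: pvc is the parsed JSON from kubectl get pvc <claim> -n <namespace> -o json, storage_classes is a hypothetical name-to-spec mapping you build from kubectl get storageclass -o json, and the messages are illustrative.

```python
from typing import Optional


def diagnose_pending_pvc(pvc: dict, storage_classes: dict, default_class: Optional[str]) -> str:
    """Explain why a claim might be Pending, given the StorageClasses in the cluster."""
    requested = pvc.get("spec", {}).get("storageClassName")
    if requested == "":
        # An explicit empty class disables dynamic provisioning for this claim.
        return "storageClassName is empty: only a pre-provisioned PV with no class can bind"
    name = requested if requested is not None else default_class
    if name is None:
        return "no storageClassName and no default StorageClass: create a matching PV"
    sc = storage_classes.get(name)
    if sc is None:
        return f"StorageClass {name!r} does not exist: fix the claim or create the class"
    if sc.get("volumeBindingMode") == "WaitForFirstConsumer":
        return f"StorageClass {name!r} waits for first consumer: Pending until a pod using the claim is scheduled"
    return f"StorageClass {name!r} exists: read the events from provisioner {sc.get('provisioner')!r}"
```
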
CreateContainerConfigError — missing ConfigMap or Secret
The pod is admitted but the container never starts, because it references a ConfigMap or Secret that doesn’t exist. The kubelet reports CreateContainerConfigError with a message like configmap "app-config" not found.
kubectl describe pod <pod> -n <namespace> # Waiting reason + which object is missing
This isn’t a crash loop: the process never runs. The fix is to create the missing ConfigMap or Secret (or to correct a misspelled name in the pod spec), not to restart or rebuild the workload. Once the object exists, the kubelet retries on its own and the container starts.
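Because every reference is spelled out in the pod spec, you can also check for missing objects before applying. A Python sketch, assuming pod_spec is the spec section of the manifest as a dict and existing is a hypothetical set of (kind, name) pairs gathered with kubectl get configmaps,secrets. It skips references marked optional: true and ignores projected volumes. A missing object referenced from a volume usually shows up as a FailedMount event with the pod stuck in ContainerCreating rather than as CreateContainerConfigError, but either keeps the container from starting.

```python
def referenced_objects(pod_spec: dict) -> set:
    """Collect non-optional (kind, name) ConfigMap/Secret references from a pod spec."""
    refs = set()
    containers = pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
    for c in containers:
        # envFrom imports every key of a ConfigMap or Secret as environment variables.
        for src in c.get("envFrom", []):
            for key, kind in (("configMapRef", "ConfigMap"), ("secretRef", "Secret")):
                ref = src.get(key)
                if ref and not ref.get("optional", False):
                    refs.add((kind, ref["name"]))
        # env[].valueFrom pulls a single key from a ConfigMap or Secret.
        for env in c.get("env", []):
            value_from = env.get("valueFrom", {})
            for key, kind in (("configMapKeyRef", "ConfigMap"), ("secretKeyRef", "Secret")):
                ref = value_from.get(key)
                if ref and not ref.get("optional", False):
                    refs.add((kind, ref["name"]))
    # Volumes: note that secret volumes use secretName, not name.
    for vol in pod_spec.get("volumes", []):
        cm = vol.get("configMap")
        if cm and not cm.get("optional", False):
            refs.add(("ConfigMap", cm["name"]))
        sec = vol.get("secret")
        if sec and not sec.get("optional", False):
            refs.add(("Secret", sec["secretName"]))
    return refs


def missing_objects(pod_spec: dict, existing: set) -> list:
    """References in the spec that are not in `existing`, sorted for stable output."""
    return sorted(referenced_objects(pod_spec) - set(existing))
```
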
Failing readiness and liveness probes
Probes are how Kubernetes knows whether your container is alive and ready to serve. They fail differently:
- Readiness probe failing: the container runs but its Ready condition stays False, so the Service keeps it out of rotation. Symptom: “running but not serving traffic.” Check the endpoint, the path, and the timing; a slow starter needs a startupProbe or a longer initialDelaySeconds.
- Liveness probe failing: the kubelet decides the container is unhealthy and restarts it, which can masquerade as a crash loop. If the app is just slow to warm up, your liveness probe is too aggressive.
kubectl describe pod <pod> -n <namespace> # "Liveness/Readiness probe failed" events
Tune the probe to the app’s real behavior: realistic initialDelaySeconds, a startupProbe for slow boots, and a probe endpoint that reflects genuine health rather than a 200 from a static handler.
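Those settings translate into time budgets. With the Kubernetes defaults (initialDelaySeconds 0, periodSeconds 10, failureThreshold 3), a container whose liveness probe never passes is restarted after roughly 30 seconds, and the documentation suggests sizing a startupProbe so that failureThreshold times periodSeconds covers the worst-case startup time. A Python sketch of that arithmetic; Probe, seconds_until_restart, startup_probe_for and the 1.5x safety margin are hypothetical, and timeoutSeconds is ignored.

```python
import math
from dataclasses import dataclass


@dataclass
class Probe:
    # Field defaults mirror the Kubernetes probe defaults.
    initial_delay_seconds: int = 0
    period_seconds: int = 10
    failure_threshold: int = 3


def seconds_until_restart(p: Probe) -> int:
    """Approximate time a never-healthy container survives before being restarted."""
    return p.initial_delay_seconds + p.period_seconds * p.failure_threshold


def startup_probe_for(boot_seconds: float, period_seconds: int = 10,
                      margin: float = 1.5) -> Probe:
    """Size a startupProbe so failureThreshold * periodSeconds covers a slow boot."""
    threshold = math.ceil(boot_seconds * margin / period_seconds)
    return Probe(period_seconds=period_seconds, failure_threshold=threshold)
```

An app that needs 90 seconds to boot would be restarted long before it finishes under the default liveness settings; the sized startup probe gives it about 140 seconds, and liveness checks only begin once the startup probe succeeds.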
From symptom to fix, automatically
Every error above has a deterministic signature — a status, a reason, a threshold. That’s exactly what an automated checker is good at. KubeBolt’s Insights Engine ships 24 built-in rules that watch for these patterns continuously (CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending, missing config, probe failures and more) and turn each one into a plain-language recommendation — no PromQL, no dashboards to wire up.
When you want to act on a finding, Kobi proposes the exact fix and can execute it with your approval — restart, scale, roll back, set image — under RBAC, with a full audit trail. You go from “CrashLoopBackOff” to a resolved incident without leaving the dashboard.
Want to try it on your own cluster? It installs in under two minutes.