MeteorOps | DevOps Dictionary | Kubernetes Liveness Probe

Kubernetes Liveness Probe

A Kubernetes liveness probe is a health check that tells Kubernetes whether a container is still working. If the probe fails enough consecutive times, the kubelet restarts the container so the application can recover from a stuck or deadlocked state.

What a Kubernetes liveness probe does

A liveness probe helps Kubernetes detect containers that are running but no longer working correctly. This matters because a process can stay alive at the operating system level while the application inside it is unusable.

For example, a web server process might still exist, but its request handlers are deadlocked. Without a liveness probe, Kubernetes keeps the broken container running because the process has not exited. With a liveness probe, Kubernetes can restart it automatically.

How it works

The kubelet on the node where the pod is running executes the liveness probe at a regular interval. If the check succeeds, Kubernetes leaves the container alone. If the check fails a configured number of times in a row, the kubelet kills the container and restarts it according to the pod’s restart policy.

Kubernetes supports three main liveness probe types (a fourth, gRPC, is available for services that implement the gRPC health checking protocol):

HTTP probe: Sends an HTTP GET request to a path, such as /healthz. A status code from 200 to 399 is treated as success.
TCP probe: Tries to open a TCP connection to a container port. If the connection succeeds, the probe passes.
Exec probe: Runs a command inside the container. An exit code of 0 means success.
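
As a rough sketch, the three probe types above could look like this in a manifest. The Pod name, container names, images, ports, and the /tmp/healthy file are illustrative assumptions, not part of any real application. Each container shows one probe type.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-types-demo             # hypothetical name
spec:
  containers:
    - name: web                      # HTTP probe
      image: example/web:1.0.0       # hypothetical image
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
    - name: cache                    # TCP probe
      image: example/cache:1.0.0     # hypothetical image
      livenessProbe:
        tcpSocket:
          port: 6379
    - name: worker                   # Exec probe
      image: example/worker:1.0.0    # hypothetical image
      livenessProbe:
        exec:
          command:                   # passes while the file exists (cat exits 0)
            - cat
            - /tmp/healthy
```

The exec variant assumes the application creates /tmp/healthy while it is working and removes it when it is not; any command that exits with 0 on success would serve the same purpose.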

Key configuration fields

A liveness probe is defined in the container spec of a pod, Deployment, StatefulSet, DaemonSet, or another workload object. Common fields include:

initialDelaySeconds: How long Kubernetes waits before running the first probe after the container starts.
periodSeconds: How often Kubernetes runs the probe.
timeoutSeconds: How long Kubernetes waits for a probe response before counting the attempt as a failure.
failureThreshold: How many consecutive failures it takes before Kubernetes restarts the container.
successThreshold: How many successes are required after a failure. For liveness probes, this must be 1.
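
To make the fields concrete, the fragment below spells out every timing field with the value Kubernetes uses when the field is omitted. The /healthz path and port 8080 are assumptions for illustration.

```yaml
livenessProbe:
  httpGet:
    path: /healthz          # assumed endpoint
    port: 8080              # assumed port
  initialDelaySeconds: 0    # default: start probing as soon as the container starts
  periodSeconds: 10         # default: probe every 10 seconds
  timeoutSeconds: 1         # default: each attempt may take up to 1 second
  failureThreshold: 3       # default: restart after 3 consecutive failures
  successThreshold: 1       # default, and the only value allowed for liveness
```

With these values, a container that stops responding is restarted after roughly periodSeconds × failureThreshold = 10 × 3 = 30 seconds. The exact delay varies by up to one period plus the probe timeouts.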

Example

This example checks an HTTP endpoint every 10 seconds after a 30-second startup delay. If the endpoint fails 3 times in a row, Kubernetes restarts the container.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example/api:1.0.0
          ports:
            - containerPort: 8080
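          livenessProbe:
            httpGet:
              path: /healthz           # any status from 200 to 399 counts as success
              port: 8080
            initialDelaySeconds: 30    # wait 30 seconds before the first probe
            periodSeconds: 10          # probe every 10 seconds
            timeoutSeconds: 2          # an attempt fails if no response within 2 seconds
            failureThreshold: 3        # restart after 3 consecutive failures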

Common use cases

Recovering from deadlocks: Restarting an app when internal threads or workers stop making progress.
Handling unrecoverable errors: Restarting a container after it enters a bad state that the app cannot fix itself.
Keeping long-running services healthy: Detecting broken API servers, background workers, or queue consumers.
Reducing manual intervention: Letting Kubernetes handle simple restart-based recovery for known failure modes.

Liveness probe vs readiness probe

A liveness probe answers: “Should this container be restarted?”

A readiness probe answers: “Should this pod receive traffic?”

This difference is important. If an app is temporarily unable to serve requests because it is warming a cache or waiting for a dependency, a readiness probe should fail so Kubernetes removes the pod from Service endpoints. A liveness probe should usually keep passing unless the container is truly stuck and needs a restart.
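
A minimal sketch of the split, assuming the application exposes two separate endpoints: /healthz answers as long as the process can handle requests, and a hypothetical /ready answers only when the cache is warm and dependencies are reachable. The names, port, and thresholds are illustrative.

```yaml
containers:
  - name: api
    image: example/api:1.0.0
    ports:
      - containerPort: 8080
    readinessProbe:           # failing this removes the pod from Service endpoints
      httpGet:
        path: /ready          # hypothetical endpoint that checks dependencies
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
    livenessProbe:            # failing this restarts the container
      httpGet:
        path: /healthz        # checks only that the process itself responds
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```

Keeping the two endpoints separate means a dependency outage takes the pod out of rotation without triggering restarts.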

Liveness probe vs startup probe

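A startup probe is used for applications that take a long time to start. Until the startup probe succeeds, Kubernetes does not run liveness or readiness checks for that container. If the startup probe exhausts its failure threshold, the container is killed and handled according to the pod’s restart policy.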

This prevents Kubernetes from killing a slow-starting app before it has a chance to become healthy. Once the startup probe succeeds, the liveness probe takes over.
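
The fragment below is a sketch of that handover, reusing the /healthz endpoint for both probes. The image, port, and thresholds are assumptions chosen for illustration.

```yaml
containers:
  - name: api
    image: example/api:1.0.0
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30    # allows up to 30 x 10 = 300 seconds to start
    livenessProbe:            # begins only after the startup probe succeeds
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3     # after startup, about 30 seconds of failures triggers a restart
```

This gives the application a generous startup window without loosening the liveness thresholds that apply once it is running.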

Best practices

Do not check every dependency in a liveness probe. If your database is down, restarting every application pod may make the outage worse.
Use readiness probes for traffic routing. If the pod should stop receiving requests but does not need a restart, readiness is the right check.
Set realistic delays and thresholds. For example, an app that normally starts in 45 seconds should not use a 5-second initial delay with a low failure threshold.
Keep the check cheap. A liveness endpoint should be fast and lightweight. Avoid expensive database queries or external API calls.
Test failure behavior. Confirm that a failed probe causes the expected restart and that restart loops do not hide deeper application issues.

Tradeoffs and limitations

Liveness probes can improve self-healing, but bad probe design can cause instability. An aggressive probe can restart healthy containers during short CPU spikes, garbage collection pauses, or slow disk I/O.

A liveness probe also does not fix the root cause of a failure. It restarts the container, which may clear corrupted in-memory state, deadlocks, or stuck connections. If the same bug happens repeatedly, the pod may enter a crash loop (reported as CrashLoopBackOff) and require debugging.

Real-world example

Suppose you run a Go API on Kubernetes. The process can accept TCP connections, but a deadlock prevents request handlers from completing. A TCP liveness probe may still pass because the port is open. An HTTP liveness probe that checks an internal health endpoint may catch the problem more accurately.

In that case, your /healthz endpoint should confirm that the main request loop is responsive, but it should avoid calling every downstream system. If PostgreSQL is unavailable, the pod may need to fail readiness, not liveness.

Where it fits in Kubernetes operations

Liveness probes are part of day-to-day Kubernetes workload management. They are usually defined alongside resource requests, limits, readiness probes, and rollout settings. If you manage manifests with infrastructure as code, you can define probes in the same workflow you use to deploy Kubernetes resources, for example with Terraform.

Probe behavior is also worth checking during cluster version changes. API behavior, controller behavior, and workload timing can shift during upgrades, so teams often review probes as part of Kubernetes upgrade planning.

Summary

A Kubernetes liveness probe checks whether a container should keep running. When the probe fails repeatedly, Kubernetes restarts the container. Used well, it helps workloads recover from stuck states. Used poorly, it can create restart loops and hide application problems. The best liveness probes are simple, fast, and focused on whether the container itself is alive.
