MeteorOps | How to Configure Kubernetes Liveness Probes Without Causing Restart Loops

Kubernetes liveness probes look simple until a deployment starts killing healthy pods. The usual pressure is real: you want broken containers restarted automatically, but a probe with aggressive thresholds can turn a slow startup, temporary CPU throttling, or a downstream outage into a restart loop.

The goal is not to make liveness probes “pass more often.” The goal is to use them only for conditions that a container restart can actually fix, and to give the application enough time to prove it is truly wedged before Kubernetes restarts it.

Understand what a liveness probe is allowed to do

A liveness probe answers one narrow question: “Should Kubernetes restart this container?” If the probe fails enough times, the kubelet restarts the container according to the pod’s restart policy. That is powerful, but it is also blunt.

A liveness probe should detect problems such as:

A process deadlock where the application stops serving all useful work.
A stuck event loop that no longer responds locally.
A broken internal state that only a process restart can clear.
A worker process that is alive at the operating system level but cannot make progress.

A liveness probe should usually avoid checking:

Database reachability.
Message broker availability.
Third-party API status.
Availability of another internal service.
Long-running migrations or warm-up tasks.

If the database is unavailable and every pod fails its liveness probe because of that database check, Kubernetes will restart every pod. The restart will not fix the database. It will add load, increase cold starts, clear in-memory caches, and make recovery harder.

Use the right probe for the right job:

startupProbe: gives slow-starting containers time to initialize before liveness checks begin.
readinessProbe: controls whether the pod receives traffic.
livenessProbe: restarts the container when it is truly unhealthy.

If you manage Kubernetes manifests through infrastructure as code, keep this distinction explicit in your modules or templates. The same principle applies whether you apply raw YAML, use Helm, or deploy Kubernetes resources using Terraform.

Start with a safe baseline configuration

Begin with conservative thresholds, then tighten them after observing real startup time and failure behavior. Do not copy probe values across services without checking how each service starts, warms up, and handles load.

Here is a risky liveness probe:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example/api:1.0.0
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 1
            failureThreshold: 3

This can restart the container after roughly 20 seconds: 5 seconds of initial delay, then 3 failed probes spaced 5 seconds apart. If the service sometimes needs 45 seconds to load configuration, run migrations, compile templates, or warm caches, this pod can restart forever.

A safer pattern separates startup, readiness, and liveness:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example/api:1.0.0
          ports:
            - name: http
              containerPort: 8080

          startupProbe:
            httpGet:
              path: /startupz
              port: http
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 24

          readinessProbe:
            httpGet:
              path: /readyz
              port: http
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 3
            successThreshold: 1

          livenessProbe:
            httpGet:
              path: /livez
              port: http
            periodSeconds: 10
            timeoutSeconds: 2
            failureThreshold: 3

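In this example, the startup probe allows up to about 120 seconds for startup: 24 failures multiplied by a 5-second period. Until the startup probe succeeds, Kubernetes runs neither the liveness probe nor the readiness probe. After startup succeeds, liveness and readiness checks begin. If the startup probe never succeeds within that window, the kubelet restarts the container.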

This matters for applications with variable boot times, such as Java services, applications that load large models, services that hydrate caches, or workers that recover state during startup.

Use practical threshold math before shipping

You should be able to explain every probe value in a deployment review. If the answer is “we copied it,” the probe is not tuned.

Use this rough calculation:

restart window ≈ initialDelaySeconds + (failureThreshold × periodSeconds)

For a liveness probe like this:

livenessProbe:
  httpGet:
    path: /livez
    port: http
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3

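Kubernetes may restart the container about 60 seconds after it starts if the liveness check never passes: 30 seconds of initial delay plus 3 failed probes spaced 10 seconds apart. Once the container is running, initialDelaySeconds no longer applies, so a wedged process is restarted after roughly 30 seconds of consecutive failures. That may be fine for a stateless API with a 10-second normal startup time. It may be too aggressive for a batch worker that can pause during garbage collection, checkpoint recovery, or CPU pressure.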
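The arithmetic for the probe above can be sketched as two small helpers. This is an illustrative calculation that follows the rough formula in this section, not a model of exact kubelet timing, which also depends on probe timeouts and scheduling jitter. The function names are made up for this example; the defaults mirror the Kubernetes probe defaults.

```javascript
// Rough restart-window math for a liveness probe.
// Hypothetical helpers; field names mirror the Kubernetes probe spec.

// Seconds from container start to restart if the probe never passes.
function firstRestartWindowSeconds(probe) {
  const { initialDelaySeconds = 0, periodSeconds = 10, failureThreshold = 3 } = probe;
  return initialDelaySeconds + failureThreshold * periodSeconds;
}

// Seconds an already running container can keep failing before restart.
function steadyStateRestartWindowSeconds(probe) {
  const { periodSeconds = 10, failureThreshold = 3 } = probe;
  return failureThreshold * periodSeconds;
}

const liveness = { initialDelaySeconds: 30, periodSeconds: 10, failureThreshold: 3 };

console.log(firstRestartWindowSeconds(liveness));       // 60
console.log(steadyStateRestartWindowSeconds(liveness)); // 30
```

Running the same helpers against every service's probe values during review makes it easier to spot a window that is shorter than a known pause.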

Use these starting points as a practical baseline, then adjust with real data:

startupProbe: set failureThreshold × periodSeconds higher than your slow but acceptable startup time. If p95 startup is 70 seconds, start around 120 seconds.
readinessProbe: keep it responsive. A period of 5 to 10 seconds is common because it controls traffic routing.
livenessProbe: keep it less aggressive than readiness. A period of 10 to 30 seconds with a failure threshold of 3 is a safer default for many services.
timeoutSeconds: avoid the default of 1 second for applications that can pause briefly under load. Start with 2 or 3 seconds if you have no better data.
initialDelaySeconds: prefer startupProbe for startup handling. Use initialDelaySeconds only when startup behavior is simple and predictable.
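As a worked example of the startupProbe guidance above, the following sketch turns a measured slow startup time into a failureThreshold. The helper name and the 1.7x headroom factor are assumptions, chosen only so that a 70-second p95 lands near the 120-second window mentioned above; pick a factor that matches your own data.

```javascript
// Suggest a startupProbe failureThreshold from a measured slow startup time.
// Hypothetical helper: the default headroom factor is an assumption, not a rule.
function suggestStartupFailureThreshold(slowStartupSeconds, periodSeconds, headroom = 1.7) {
  const targetWindowSeconds = slowStartupSeconds * headroom;
  // Round up so the allowed window never falls below the target.
  return Math.ceil(targetWindowSeconds / periodSeconds);
}

const periodSeconds = 5;
const failureThreshold = suggestStartupFailureThreshold(70, periodSeconds);

console.log(failureThreshold);                 // 24
console.log(failureThreshold * periodSeconds); // 120
```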

Probe tuning should live next to deployment configuration, not in someone’s notes. If your platform team provisions services and dependencies through Kubernetes-native control planes, document probe defaults in the same place you manage patterns such as deploying AWS resources using Crossplane on Kubernetes.

Design health endpoints that match probe intent

The endpoint behind the probe matters more than the YAML. A clean probe configuration still fails if /health does too much work.

A good pattern is to expose separate endpoints:

/livez: checks that the process can respond and has not entered a fatal internal state.
/readyz: checks whether the application can serve traffic right now.
/startupz: checks whether initialization has completed.

For example, an HTTP service might implement behavior like this:

GET /livez
200 OK if the process event loop is responsive
500 only if the process is internally unrecoverable

GET /readyz
200 OK if the service can accept traffic
503 if required dependencies are unavailable or the app is draining

GET /startupz
200 OK after bootstrapping is complete
503 while migrations, cache loading, or state recovery are still running

Here is a minimal Node.js example that keeps liveness local and pushes dependency checks into readiness:

import express from "express";

const app = express();

let started = false;
let shuttingDown = false;

async function checkDatabase() {
  // Replace with a cheap ping or connection-pool status check.
  // Do not run expensive queries here.
  return true;
}

app.get("/startupz", (req, res) => {
  if (started) {
    return res.status(200).send("ok");
  }

  return res.status(503).send("starting");
});

app.get("/livez", (req, res) => {
  // Keep this local. Do not check the database, cache, or third-party APIs.
  return res.status(200).send("ok");
});

app.get("/readyz", async (req, res) => {
  if (shuttingDown || !started) {
    return res.status(503).send("not ready");
  }

  const databaseOk = await checkDatabase();

  if (!databaseOk) {
    return res.status(503).send("database unavailable");
  }

  return res.status(200).send("ok");
});

process.on("SIGTERM", () => {
  shuttingDown = true;
  setTimeout(() => process.exit(0), 10_000);
});

app.listen(8080, async () => {
  // Run startup work here.
  started = true;
});
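The /livez handler above always returns 200, which is enough to catch a process that cannot respond at all, because the probe then times out. To also cover the "internally unrecoverable" and "cannot make progress" cases described earlier, one option is a local state check like the sketch below. The state shape, the function names, and the 30-second stall budget are all assumptions for illustration.

```javascript
// Local-only liveness decision. Hypothetical state shape and names.
const MAX_HEARTBEAT_AGE_MS = 30_000; // assumed stall budget; tune per service

const health = {
  fatalError: null,            // set when the app hits an unrecoverable state
  lastHeartbeatMs: Date.now(), // refreshed by the main work loop
};

// Call from the main work loop each time it completes a unit of work.
function recordHeartbeat(nowMs = Date.now()) {
  health.lastHeartbeatMs = nowMs;
}

// Returns the HTTP status /livez should send. No network calls, no I/O.
function livenessStatus(state, nowMs = Date.now()) {
  if (state.fatalError) return 500;
  if (nowMs - state.lastHeartbeatMs > MAX_HEARTBEAT_AGE_MS) return 500;
  return 200;
}

// Wiring this into the Express route from the example above could look like:
// app.get("/livez", (req, res) => res.sendStatus(livenessStatus(health)));

console.log(livenessStatus(health)); // 200
```

The check reads only in-process state, so it stays consistent with the rule that liveness must not depend on the database, cache, or third-party APIs.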

For worker services without HTTP servers, an exec probe can work, but use it carefully. Each exec probe run spawns a process inside the container. At scale, expensive exec probes can add measurable overhead.

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - test -f /tmp/worker-alive
  periodSeconds: 15
  timeoutSeconds: 2
  failureThreshold: 4

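If you use an exec probe, keep the command fast and deterministic. Avoid commands that call external systems, perform file tree scans, invoke package managers, or depend on shells that may not exist in minimal images. The example above assumes the image ships /bin/sh. It also only checks that the file exists, so it detects a stall only if the worker removes the file when it stops making progress.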
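A file-based liveness signal is more useful when the probe checks how recently the file was refreshed rather than only whether it exists. The sketch below shows one way a Node.js worker could do that. The helper names, the file location, and the 60-second budget are assumptions, and CommonJS require is used so the snippet runs as a plain script.

```javascript
// Heartbeat-file sketch for a worker without an HTTP server.
// Hypothetical helpers; the path and the 60-second budget are assumptions.
const fs = require("fs");
const os = require("os");
const path = require("path");

// Worker side: refresh the file after each completed unit of work.
function touchHeartbeat(file, now = new Date()) {
  fs.writeFileSync(file, String(now.getTime()));
  fs.utimesSync(file, now, now); // set the modification time explicitly
}

// Probe side: true only if the file exists and was refreshed recently.
function isHeartbeatFresh(file, maxAgeMs, nowMs = Date.now()) {
  try {
    return nowMs - fs.statSync(file).mtimeMs <= maxAgeMs;
  } catch {
    return false; // a missing or unreadable file counts as not alive
  }
}

const heartbeatFile = path.join(os.tmpdir(), "worker-alive");
touchHeartbeat(heartbeatFile);
console.log(isHeartbeatFresh(heartbeatFile, 60_000)); // true
```

The probe command would then need to exit non-zero when the heartbeat is stale. Whatever performs that check should stay as cheap as the advice above requires, since it runs on every probe period.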

Roll out probe changes with evidence

Treat probe changes like production behavior changes. They can cause restarts, remove pods from service, and change rollout timing.

Use this rollout process:

Measure current startup time. Check container logs, application metrics, and Kubernetes events. Record typical and slow startup durations.
Add or tune startupProbe. Make sure the allowed startup window exceeds slow but valid startup.
Split liveness and readiness endpoints. Keep restart logic separate from traffic-routing logic.
Deploy to a small scope first. Use one environment, one namespace, or one replica set before changing every workload.
Watch events during rollout. Confirm that restarts are not increasing unexpectedly.
Load test or replay traffic if possible. Validate that probes still pass during CPU pressure, garbage collection pauses, and dependency latency.

Useful commands:

kubectl get pods -n app

kubectl describe pod -n app api-7c9f6d8f7f-x2abc

kubectl logs -n app api-7c9f6d8f7f-x2abc --previous

kubectl get events -n app --sort-by=.lastTimestamp

Look for messages like:

Liveness probe failed: HTTP probe failed with statuscode: 503
Back-off restarting failed container
Readiness probe failed: Get "http://10.0.1.25:8080/readyz": context deadline exceeded

If --previous logs show that the application was still starting when it was killed, your startup window is too short or your liveness probe is starting too early. If logs show request latency spikes before liveness failures, your timeout may be too low or the endpoint may share overloaded application threads.
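When there are many events to read, a quick tally by probe type can show whether liveness, readiness, or startup failures dominate. The helper below is a hypothetical sketch that works on event message text you have already collected; the sample strings are the ones quoted above, and it assumes startup probe failures use the same "Startup probe failed" prefix.

```javascript
// Tally probe failures by probe type from event message text.
// Hypothetical helper; it only inspects the message prefix.
function summarizeProbeFailures(messages) {
  const counts = { liveness: 0, readiness: 0, startup: 0, backoff: 0 };
  for (const message of messages) {
    const probe = message.match(/^(Liveness|Readiness|Startup) probe failed/);
    if (probe) {
      counts[probe[1].toLowerCase()] += 1;
    } else if (message.startsWith("Back-off restarting failed container")) {
      counts.backoff += 1;
    }
  }
  return counts;
}

const sample = [
  "Liveness probe failed: HTTP probe failed with statuscode: 503",
  "Back-off restarting failed container",
  'Readiness probe failed: Get "http://10.0.1.25:8080/readyz": context deadline exceeded',
];

console.log(summarizeProbeFailures(sample));
// { liveness: 1, readiness: 1, startup: 0, backoff: 1 }
```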

When Kubernetes is part of a larger platform rollout, probe settings should be reviewed with resource requests, limits, rollout strategy, and dependency provisioning. For example, an application deployed on Amazon Elastic Kubernetes Service (EKS) can still fail because a probe ignores slow boot behavior, even if the cluster itself is healthy. The same operational checks apply when you deploy Apache Airflow on AWS EKS or run a custom API service.

Avoid the common liveness probe traps

Most restart loops come from a few repeatable mistakes.

Trap 1: Dependency checks in liveness

If /livez fails when PostgreSQL, Redis, Kafka, or an external API is unavailable, you are asking Kubernetes to restart your application because another system has a problem.

Move dependency checks to readiness:

readinessProbe:
  httpGet:
    path: /readyz
    port: http
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3

livenessProbe:
  httpGet:
    path: /livez
    port: http
  periodSeconds: 15
  timeoutSeconds: 2
  failureThreshold: 4

Trap 2: No startup probe for slow applications

initialDelaySeconds can work for simple services, but it is a fixed delay. It does not adapt to real startup progress. A startupProbe makes Kubernetes hold off liveness checks until startup succeeds, up to the limit set by failureThreshold × periodSeconds.

Use startup probes for services with:

Large dependency injection graphs.
Database migrations or schema checks.
Cache warm-up.
Model loading.
State recovery after shutdown.

Trap 3: Timeout too low under CPU pressure

The default timeoutSeconds is 1 second. That can be too low for applications under CPU throttling, garbage collection, or temporary I/O pressure.

If probes fail during load but the application recovers without a restart, increase timeoutSeconds, increase failureThreshold, or make the probe endpoint cheaper. Also check CPU limits. A pod with a tight CPU limit can fail probes because the process cannot get scheduled quickly enough.
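One way to keep a probe endpoint cheap is to bound any dependency check it performs, so the handler answers before the probe's timeoutSeconds expires instead of hanging. The sketch below applies that idea to a readiness check. The helper names are hypothetical, the 1500 ms deadline assumes timeoutSeconds: 2, and slowPing is only a stand-in for a real dependency ping.

```javascript
// Bound a dependency check so the handler answers before the probe timeout.
// Hypothetical helpers; 1500 ms assumes timeoutSeconds: 2 on the probe.
function withDeadline(promise, ms) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Stand-in for a dependency ping that resolves true after delayMs.
const slowPing = (delayMs) =>
  new Promise((resolve) => setTimeout(() => resolve(true), delayMs));

// Returns the HTTP status a /readyz handler could send.
async function readinessStatus(check) {
  const ok = await withDeadline(check().catch(() => false), 1500);
  return ok ? 200 : 503;
}

readinessStatus(() => slowPing(10)).then((status) => console.log(status)); // 200
```

A check that misses the deadline yields 503 from readiness, which removes the pod from traffic without restarting it.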

Trap 4: Redirects, auth, or middleware on health routes

Health endpoints should avoid authentication middleware, redirects, rate limits, and expensive request logging. The kubelet treats any HTTP status from 200 through 399 as success and anything else as failure, so a 401 or 403 fails the probe. It follows redirects, so the outcome of a 301 or 302 depends on where the redirect leads.

Make health routes boring:

Return directly from the application.
Do not require tokens.
Do not redirect HTTP to HTTPS inside the pod unless the probe is configured for HTTPS.
Do not call external services from liveness.

Trap 5: Probe port does not match the container

Named ports reduce mistakes when container ports change:

ports:
  - name: http
    containerPort: 8080

livenessProbe:
  httpGet:
    path: /livez
    port: http

If you use service meshes or sidecars, confirm whether the kubelet probes the application container directly or whether probe rewriting is active. Misconfigured sidecar behavior can make a healthy application look unhealthy.
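A named probe port only helps if the name is actually declared on the container. The sketch below is a hypothetical lint helper that checks a container spec, already parsed into a plain object, for probes that reference an undeclared port name. It deliberately ignores numeric ports, and the "web" name in the sample is an intentional mistake.

```javascript
// Check that every named probe port matches a declared container port name.
// Hypothetical lint helper operating on a parsed container spec object.
function findUnknownProbePorts(container) {
  const declared = new Set((container.ports ?? []).map((p) => p.name).filter(Boolean));
  const problems = [];
  for (const probeName of ["startupProbe", "readinessProbe", "livenessProbe"]) {
    const probe = container[probeName];
    const port = probe?.httpGet?.port ?? probe?.tcpSocket?.port;
    // Only string ports are names; numeric ports are not checked here.
    if (typeof port === "string" && !declared.has(port)) {
      problems.push(`${probeName} references undeclared port "${port}"`);
    }
  }
  return problems;
}

const container = {
  name: "api",
  ports: [{ name: "http", containerPort: 8080 }],
  livenessProbe: { httpGet: { path: "/livez", port: "web" } },
  readinessProbe: { httpGet: { path: "/readyz", port: "http" } },
};

console.log(findUnknownProbePorts(container));
// [ 'livenessProbe references undeclared port "web"' ]
```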

Use this final checklist before merging

Before you merge a liveness probe change, verify these points:

/livez checks only local process health.
/readyz handles dependency checks and traffic eligibility.
/startupz exists for slow or variable startup.
The startup window is longer than slow but valid startup time.
The liveness restart window is long enough to survive short pauses.
timeoutSeconds is realistic for your runtime and CPU limits.
Health endpoints bypass auth, redirects, and expensive middleware.
Probe failures are visible in logs, metrics, or Kubernetes events.
The rollout plan limits blast radius.

Liveness probes are useful when they restart containers that cannot recover on their own. They become dangerous when they replace readiness checks, dependency monitoring, or startup handling. Start conservative, separate probe responsibilities, watch real failure events, and tune based on observed behavior rather than copied defaults.
