Kubernetes resource requests and limits often get tuned during an incident, when pods are pending, nodes are packed too tightly, or latency jumps after a rollout. The usual pressure is simple: stop wasting capacity without starving the application. The hard part is that CPU and memory behave very differently, and a safe-looking limit can create throttling long before a node is actually busy.
This guide walks through a practical way to set requests and limits, check for throttling, and adjust workloads without turning every deployment into a resource guessing game.
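Before tuning numbers, separate scheduling behavior from runtime behavior. Requests are what the scheduler uses to place a pod on a node. Limits are what the kubelet and the kernel enforce while the container runs.
CPU is compressible: a container that reaches its CPU limit is throttled, not killed. Memory is not compressible: a container that exceeds its memory limit is OOM-killed.
That distinction drives most good defaults: set requests from normal usage so scheduling is accurate, set memory limits to contain runaway containers, and treat CPU limits with care because they throttle bursts.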
If the app already runs in Kubernetes, collect usage and throttling data before changing manifests. A few minutes of data can be misleading. Use a window that includes startup, normal traffic, background jobs, and any scheduled bursts.
Start with basic checks:
kubectl top pod -n production
kubectl top pod -n production --containers
kubectl describe pod -n production my-app-abc123
Look for current requests, limits, restarts, OOM kills, and scheduling events:
kubectl get pod -n production my-app-abc123 -o jsonpath='{range .spec.containers[*]}{.name}{" requests="}{.resources.requests}{" limits="}{.resources.limits}{"\n"}{end}'
kubectl get pod -n production my-app-abc123 -o jsonpath='{range .status.containerStatuses[*]}{.name}{" restartCount="}{.restartCount}{" lastState="}{.lastState}{"\n"}{end}'
If you use Prometheus with kubelet and cAdvisor metrics, check CPU usage and throttling. Metric names can vary by Kubernetes version, runtime, and scrape configuration, so confirm them in your own Prometheus first.
CPU usage by container:
sum by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total{
    namespace="production",
    container!="",
    image!=""
  }[5m])
)
CPU throttling ratio:
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_throttled_periods_total{
    namespace="production",
    container!="",
    image!=""
  }[5m])
)
/
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_periods_total{
    namespace="production",
    container!="",
    image!=""
  }[5m])
)
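To make the ratio concrete, here is a minimal Python sketch that computes the same fraction from two samples of the underlying counters. The function name and the sample values are hypothetical and exist only for illustration; the arithmetic mirrors the query above.

```python
def throttling_ratio(periods_start, periods_end, throttled_start, throttled_end):
    """Fraction of CFS periods in the window during which the container was throttled."""
    periods = periods_end - periods_start
    throttled = throttled_end - throttled_start
    if periods <= 0:
        # No enforcement periods elapsed in the window, so there is nothing to divide by.
        return 0.0
    return throttled / periods


# Hypothetical counter samples taken five minutes apart.
ratio = throttling_ratio(periods_start=12_000, periods_end=15_000,
                         throttled_start=900, throttled_end=1_650)
print(f"{ratio:.0%} of periods were throttled")
```

With these made-up samples, 750 of 3,000 periods were throttled, so the ratio is 25%. What counts as too high depends on the workload, so read it alongside latency rather than against a fixed threshold.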
Memory working set:
max by (namespace, pod, container) (
  container_memory_working_set_bytes{
    namespace="production",
    container!="",
    image!=""
  }
)
OOM kills from kube-state-metrics, if available:
kube_pod_container_status_last_terminated_reason{
  namespace="production",
  reason="OOMKilled"
} == 1
Do not tune from average CPU alone. A service can average low CPU while still getting throttled during request bursts, garbage collection, TLS handshakes, JSON serialization, or short background tasks.
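The gap between average and burst demand is easy to see with a few numbers. The following sketch uses invented per-second CPU demand samples and a hypothetical 500m limit; none of these values come from a real workload.

```python
import statistics

# Hypothetical per-second CPU demand in millicores for one container over 20 seconds:
# mostly quiet, with a two-second burst.
samples = [80] * 18 + [900, 950]

average = statistics.mean(samples)
peak = max(samples)
limit_m = 500  # hypothetical CPU limit in millicores

# Seconds in which demand was above the limit, i.e. where throttling would occur.
seconds_over_limit = sum(1 for s in samples if s > limit_m)
print(f"average={average:.1f}m peak={peak}m seconds_over_limit={seconds_over_limit}")
```

The average of 164.5m looks comfortable against a 500m limit, yet the container wants almost double the limit during the burst. That is the pattern an average-only dashboard hides.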
CPU limits are the most common source of surprise. A container can be throttled even when the node still has idle CPU, because the limit is enforced at the container cgroup level.
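A rough worked example shows why this happens. The sketch below assumes the common default enforcement period of 100 ms (clusters can configure a different value) and a container whose threads all run flat out. The function names are illustrative and not part of any Kubernetes API.

```python
CFS_PERIOD_MS = 100.0  # assumed default enforcement period; your cluster may differ


def quota_ms(cpu_limit_millicores, period_ms=CFS_PERIOD_MS):
    """CPU time the container may consume per period under the given limit."""
    return cpu_limit_millicores / 1000.0 * period_ms


def wall_ms_until_throttled(cpu_limit_millicores, busy_threads, period_ms=CFS_PERIOD_MS):
    """Wall-clock time into a period before the quota is spent, if every thread stays busy."""
    return min(period_ms, quota_ms(cpu_limit_millicores, period_ms) / busy_threads)


limit = 500   # hypothetical limit of 500m
threads = 4   # hypothetical number of threads busy at the same time
runs_for = wall_ms_until_throttled(limit, threads)
print(f"quota={quota_ms(limit)}ms runs_for={runs_for}ms stalled_for={CFS_PERIOD_MS - runs_for}ms")
```

Under these assumptions a 500m limit allows 50 ms of CPU time per period. Four busy threads spend it in 12.5 ms and then wait out the remaining 87.5 ms, regardless of how idle the node is.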
A practical CPU process:
Example for a service that usually needs around 250 millicores and bursts higher during traffic spikes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: checkout-api
          image: example.com/checkout-api:1.2.3
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              memory: "768Mi"
This example intentionally omits a CPU limit. That does not mean every workload should omit CPU limits. It means you should avoid adding tight CPU quotas to services that need short bursts to keep latency stable.
CPU limits may still make sense for:
If your organization requires CPU limits, start with a limit higher than the request and validate it with throttling data:
resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "1000m"
    memory: "768Mi"
After rollout, watch throttling and application latency together. Throttling by itself is not always an outage, but throttling plus rising p95 or p99 latency is a clear signal that the CPU limit is too low or the app needs more replicas.
Memory tuning has a different failure mode. If memory usage exceeds the limit, nothing slows the process down: the container is OOM-killed and restarted. If this happens during a rollout, new pods may never become ready and the deployment can stall.
A practical memory process:
Example:
resources:
  requests:
    cpu: "300m"
    memory: "1Gi"
  limits:
    memory: "1536Mi"
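One way to sanity-check a limit like this is to compare it with the peak working set you observed. The sketch below uses the 1536Mi limit from the example and a hypothetical peak value. How much headroom is enough depends on the runtime and workload, so treat the output as an input to a decision, not a rule.

```python
MIB = 1024 * 1024


def memory_headroom(limit_bytes, peak_working_set_bytes):
    """Fraction of the limit still unused at the observed peak working set."""
    return (limit_bytes - peak_working_set_bytes) / limit_bytes


limit = 1536 * MIB  # the 1536Mi limit from the example
peak = 1152 * MIB   # hypothetical peak working set taken from monitoring
print(f"headroom at peak: {memory_headroom(limit, peak):.0%}")
```

With these numbers, 25% of the limit is still free at peak.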
For memory-heavy workloads, do not copy limits between services just because the images look similar. A Java API, a Node.js API, and a Python worker can have very different memory behavior even when they serve the same product area.
If you see OOM kills, confirm whether the container exceeded its own limit or the node was under pressure:
kubectl describe pod -n production my-app-abc123
kubectl get events -n production \
  --field-selector involvedObject.name=my-app-abc123 \
  --sort-by=.lastTimestamp
Look for output such as:
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
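Exit code 137 follows the usual convention of 128 plus the signal number, and signal 9 is SIGKILL. The small helper below decodes that convention; the function name is made up for this example. Note that 137 on its own only says the process was killed. The OOMKilled reason is what ties it to memory.

```python
import signal


def describe_exit_code(code):
    """Explain a container exit code, decoding the 128 + N convention for signals."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```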
Common fixes:
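Kubernetes assigns a Quality of Service (QoS) class to each pod based on requests and limits. A pod is Guaranteed when every container has CPU and memory limits with requests equal to them, BestEffort when no container sets any request or limit, and Burstable in every other case. The class affects eviction order when a node runs out of resources.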
Most production application pods should be Burstable. That lets you set realistic requests while still allowing CPU bursting if you omit CPU limits or set them higher than requests.
Check a pod QoS class:
kubectl get pod -n production my-app-abc123 -o jsonpath='{.status.qosClass}{"\n"}'
Avoid BestEffort for production services. The scheduler has no useful resource signal, and the pod is a strong eviction candidate under pressure.
Guaranteed can work for critical infrastructure components when you know exact needs and want strict reservation. It can also waste capacity if you set high CPU limits equal to high CPU requests just to reach the class.
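The classification rules can be sketched in a few lines of Python. This is a simplified approximation for reasoning about manifests, not the kubelet's implementation: it compares quantity strings literally (so "1" and "1000m" would not match) and only looks at the containers you pass in. The function name and input shape are assumptions for this example.

```python
def qos_class(containers):
    """Approximate the pod QoS class from per-container requests and limits.

    Each container is a dict like {"requests": {...}, "limits": {...}} holding
    optional "cpu" and "memory" quantity strings.
    """
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, Kubernetes defaults the request to the limit.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# A container with requests plus a memory limit only, and no CPU limit.
print(qos_class([{"requests": {"cpu": "250m", "memory": "512Mi"},
                  "limits": {"memory": "768Mi"}}]))
```

The sample container is Burstable because it has no CPU limit and its memory request differs from its memory limit.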
Changing resource settings can affect scheduling, rollout speed, autoscaling, and node pressure. Treat it like a production change, not a YAML cleanup.
A safe workflow:
Patch a deployment quickly during investigation:
kubectl set resources deployment/checkout-api \
  -n production \
  -c checkout-api \
  --requests=cpu=300m,memory=768Mi \
  --limits=memory=1Gi
Then verify the generated pod template:
kubectl get deployment checkout-api -n production -o yaml
kubectl rollout status deployment/checkout-api -n production
kubectl get pods -n production -l app=checkout-api
For repeatable changes, update the source manifest, Helm values, Kustomize patch, or Terraform code instead of leaving a manual patch in the cluster. If your team manages Kubernetes objects through Terraform, keep resource settings close to the deployment definition and review the plan before applying. This fits the same workflow used to deploy Kubernetes resources using Terraform.
Example Helm values pattern:
resources:
  requests:
    cpu: 300m
    memory: 768Mi
  limits:
    memory: 1Gi
Example Kustomize patch:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  template:
    spec:
      containers:
        - name: checkout-api
          resources:
            requests:
              cpu: "300m"
              memory: "768Mi"
            limits:
              memory: "1Gi"
If you are provisioning cloud resources from inside Kubernetes, keep the boundary clear. Crossplane can manage cloud infrastructure, while standard Kubernetes controllers manage pods and deployments. The same separation applies when you deploy AWS resources using Crossplane on Kubernetes: infrastructure capacity and pod resource settings should be reviewed together, but they are still different control loops.
Requests drive scheduling, and many autoscaling setups use resource metrics. Bad requests can make autoscaling noisy or ineffective.
For a Horizontal Pod Autoscaler (HPA) with a Utilization target, CPU utilization is calculated as a percentage of the CPU request, not of the limit or of node capacity. If the request is too low, utilization looks high and the HPA may scale out too aggressively. If the request is too high, utilization looks low and the HPA may not add replicas soon enough.
Example HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Before applying an HPA, confirm every target container has a CPU request. Without CPU requests, CPU utilization-based scaling cannot work correctly.
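To see how much the request changes what the HPA does, the sketch below applies the core scaling rule (current replicas multiplied by current over target utilization, rounded up) to the same hypothetical usage with three different requests. It deliberately ignores the tolerance band, stabilization windows, and the min and max replica bounds that the real controller also applies. The usage figure and function names are illustrative.

```python
import math


def cpu_utilization_percent(usage_millicores, request_millicores):
    """Utilization as the HPA sees it: usage as a percentage of the CPU request."""
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Core scaling rule: scale replicas by the ratio of current to target utilization."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


usage = 210  # hypothetical average CPU usage per pod, in millicores
for request in (150, 250, 600):
    utilization = cpu_utilization_percent(usage, request)
    print(request, round(utilization), desired_replicas(3, utilization, 70))
```

With identical real usage and a 70% target, a 150m request asks for 6 replicas, a 250m request asks for 4, and a 600m request asks for 2, which the HPA would then raise to the configured minimum.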
Check requests across a namespace:
kubectl get pods -n production -o custom-columns='POD:.metadata.name,CONTAINERS:.spec.containers[*].name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,CPU_LIMIT:.spec.containers[*].resources.limits.cpu,MEM_LIMIT:.spec.containers[*].resources.limits.memory'
Also check whether your requests fit available node capacity:
kubectl describe nodes | grep -A 8 "Allocated resources"
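If you want to do the same comparison offline, the arithmetic is simple once quantities are parsed. The following sketch handles only the common forms (millicores or whole and fractional cores for CPU, binary suffixes or plain bytes for memory). The node and pod figures are invented, and the helper names are not from any Kubernetes client library.

```python
def parse_cpu_millicores(quantity):
    """Parse a CPU quantity such as "250m" or "2" into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_memory_bytes(quantity):
    """Parse a memory quantity with a binary suffix (or plain bytes) into bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)


def fits_on_node(pod, allocatable, requested):
    """True if the pod's requests fit in what the node still has left to allocate."""
    cpu_free = parse_cpu_millicores(allocatable["cpu"]) - parse_cpu_millicores(requested["cpu"])
    mem_free = parse_memory_bytes(allocatable["memory"]) - parse_memory_bytes(requested["memory"])
    return (parse_cpu_millicores(pod["cpu"]) <= cpu_free
            and parse_memory_bytes(pod["memory"]) <= mem_free)


# Hypothetical node: 4 cores and 16Gi allocatable, most of the CPU already requested.
node = {"cpu": "4", "memory": "16Gi"}
requested = {"cpu": "3800m", "memory": "6Gi"}
print(fits_on_node({"cpu": "250m", "memory": "512Mi"}, node, requested))
```

Here the pod does not fit: only 200m of CPU is still unrequested on the node, even though memory is plentiful. The scheduler makes this decision from requests, not from live usage.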
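Gotchas to watch: a ResourceQuota can reject pods that would push the namespace over its totals, and a LimitRange can inject default requests and limits that you never wrote in the manifest.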
Inspect namespace policies:
kubectl get resourcequota -n production
kubectl describe resourcequota -n production
kubectl get limitrange -n production
kubectl describe limitrange -n production
For heavier platform workloads such as workflow schedulers, resource settings deserve extra care because web servers, schedulers, and workers often need different profiles. If you are running Airflow on Kubernetes, review each component separately rather than applying one resource block everywhere. The same principle applies when you deploy Apache Airflow on Amazon Elastic Kubernetes Service.
When you need a practical default, use this model:
A good starting manifest for many production services looks like this:
resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    memory: "768Mi"
A stricter shared-cluster version may look like this:
resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "1000m"
    memory: "768Mi"
Do not treat either block as universal. Treat it as a starting point, then validate with real usage, throttling, restarts, scheduling events, and application behavior.
Set requests to help Kubernetes schedule pods based on normal resource needs. Set memory limits to prevent one container from consuming too much memory. Be careful with CPU limits because they can throttle applications during short bursts even when the node has spare CPU.
The safest path is measured and repeatable: collect usage data, set realistic requests, avoid tight CPU limits for latency-sensitive services, leave memory headroom, roll changes through your normal delivery process, and verify the result with both Kubernetes metrics and application metrics.