DevOps Dictionary

Prometheus

Prometheus is an open-source monitoring and alerting toolkit that collects time-series metrics from applications, infrastructure, and services. It stores those metrics with timestamps and labels, then lets teams query them, build dashboards, and trigger alerts when systems behave unexpectedly.

What Prometheus does

Prometheus helps teams understand the health and performance of distributed systems. It is commonly used in Kubernetes, cloud-native platforms, microservices, and infrastructure monitoring.

Teams use Prometheus to track signals such as:

  • HTTP request rate, latency, and error count
  • CPU, memory, disk, and network usage
  • Container and pod status in Kubernetes
  • Queue depth and job processing time
  • Database connections, query duration, and replication lag
  • Custom business or application metrics, such as orders processed per minute

How Prometheus works

Prometheus usually uses a pull model. It scrapes metrics from configured HTTP endpoints at regular intervals, such as every 15 or 30 seconds. Each scrape returns metric data in a text-based format that Prometheus stores in its time-series database.

Each metric can include labels, which are key-value pairs that describe the source or type of measurement. For example, an HTTP request counter might include labels for method, status, service, and instance.
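
To make the label model concrete, here is a minimal sketch of how a single labeled sample might be rendered as one line of the text-based format. The function names and the sample metric `http_requests_total` are illustrative assumptions; in practice an application would normally rely on a client library rather than hand-written formatting.

```python
def escape_label_value(value: str) -> str:
    # The text format escapes backslash, double quote, and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict, value) -> str:
    """Render one sample as `name{label="value",...} value` (hypothetical helper)."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


line = render_sample(
    "http_requests_total",
    {"method": "GET", "status": "200", "service": "checkout"},
    1027,
)
print(line)
# http_requests_total{method="GET",service="checkout",status="200"} 1027
```

Each distinct combination of label values is stored as its own time series, which is why the choice of labels matters for storage and query cost.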

Prometheus queries use PromQL, a query language designed for time-series data. PromQL can calculate rates, averages, error ratios, and other derived signals from raw metrics, and it can estimate percentiles from histogram buckets.
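
The idea behind a per-second rate over a counter can be sketched in a few lines of Python. This is a simplified illustration, not the PromQL implementation: the real `rate()` function also extrapolates toward the edges of the query window, which this sketch omits. The function name `simple_rate` and the sample data are assumptions.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")

    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # A drop means the counter was reset (for example, a process restart),
            # so the new value is counted as growth from zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    return increase / elapsed


# Three scrapes, 15 seconds apart, of a request counter.
print(simple_rate([(0, 100), (15, 130), (30, 160)]))  # 2.0
```

Handling resets this way is what lets a rate stay meaningful across restarts, where the raw counter value would otherwise appear to go backwards.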

Key parts of Prometheus

  • Prometheus server: Scrapes, stores, and queries metrics.
  • Exporters: Small services that expose metrics for systems that do not provide Prometheus metrics directly, such as Node Exporter for Linux host metrics.
  • Client libraries: Language-specific libraries for instrumenting application code in Go, Java, Python, Ruby, and other languages.
  • PromQL: The query language used to analyze metrics.
  • Alerting rules: Conditions that evaluate metric queries and fire alerts when thresholds are met.
  • Alertmanager: A companion component that groups, deduplicates, silences, and routes alerts to tools such as Slack, PagerDuty, email, or webhooks.
  • Service discovery: A way to find scrape targets, either dynamically through systems such as Kubernetes, Consul, or EC2, or through static configuration.
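
The grouping step that Alertmanager performs can be pictured as bucketing alerts that share the same values for a chosen set of labels. The sketch below is plain Python for illustration only, not Alertmanager code; `group_alerts` and the sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Missing labels are treated as empty strings in this sketch.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "HighErrorRate", "service": "checkout", "instance": "10.0.0.1:8080"},
    {"alertname": "HighErrorRate", "service": "checkout", "instance": "10.0.0.2:8080"},
    {"alertname": "HighLatency", "service": "search", "instance": "10.0.0.3:8080"},
]

grouped = group_alerts(alerts, ["alertname", "service"])
for key, members in grouped.items():
    print(key, len(members))
```

Grouping by `alertname` and `service` here would turn two per-instance alerts for the checkout service into a single notification, which is the main way grouping reduces noise.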

Common use cases

  • Application monitoring: Track request latency, throughput, errors, and saturation for APIs and background workers.
  • Kubernetes monitoring: Observe pods, nodes, containers, deployments, and control plane components.
  • Infrastructure monitoring: Collect metrics from Linux hosts, databases, load balancers, message queues, and storage systems.
  • SLO monitoring: Measure service level indicators such as availability, latency, and error rate against reliability targets.
  • Alerting: Notify teams when systems breach thresholds or show signs of failure.
  • Capacity planning: Review usage trends to decide when to add resources or tune workloads.

Simple example

A team runs a checkout API in Kubernetes. The application exposes a /metrics endpoint with counters and histograms for request volume, errors, and duration. Prometheus scrapes that endpoint every 15 seconds.

The team writes a PromQL query to calculate the percentage of failed checkout requests over five minutes. If the error rate stays above 5% for 10 minutes, Prometheus fires an alert. Alertmanager sends the alert to the on-call engineer with labels showing the affected service, namespace, and severity.
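
The "above 5% for 10 minutes" condition can be sketched as a small state machine. The sketch below is an illustration in Python, not Prometheus code: the function name, the evaluation timestamps, and the ratios are assumptions. The state names inactive, pending, and firing follow the states Prometheus uses for alerts.

```python
def alert_states(evaluations, threshold=0.05, hold_seconds=600):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, error_ratio), sorted by time.
    """
    states = []
    active_since = None
    for timestamp, ratio in evaluations:
        if ratio > threshold:
            if active_since is None:
                # First evaluation where the condition holds.
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= hold_seconds else "pending")
        else:
            # The condition cleared, so the hold timer starts over.
            active_since = None
            states.append("inactive")
    return states


# One evaluation every 5 minutes: the ratio crosses 5% at t=300 and recovers at t=1200.
checks = [(0, 0.02), (300, 0.08), (600, 0.09), (900, 0.07), (1200, 0.01)]
print(alert_states(checks))
# ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

The pending state is what keeps a short spike from paging anyone: the alert only fires once the condition has held for the full duration.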

Benefits

  • Strong fit for cloud-native systems: Prometheus works well with dynamic environments where instances, containers, and pods change often.
  • Powerful querying: PromQL supports practical calculations for reliability, performance, and capacity analysis.
  • Simple instrumentation model: Applications can expose metrics over HTTP using standard client libraries.
  • Large integration base: Many databases, message brokers, operating systems, and cloud tools have existing exporters.
  • Open-source and widely adopted: Teams can run it themselves and integrate it with common DevOps workflows.

Tradeoffs and limitations

  • High-cardinality labels can be expensive: Labels such as user ID, request ID, or session ID can create too many time series and cause performance problems.
  • Long-term storage is not its main strength: Prometheus stores local time-series data, but many teams add systems such as Thanos, Cortex, or Mimir for long retention and global querying.
  • Pull-based scraping needs reachable targets: Prometheus must be able to connect to scrape endpoints, which can require network and service discovery planning.
  • Metrics are not logs or traces: Prometheus shows numeric trends and states, but it does not replace log search or distributed tracing tools.
  • Alert quality depends on rule design: Poor thresholds can create noisy alerts, missed incidents, or alerts without useful context.
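
The cost of high-cardinality labels follows from simple arithmetic: the worst-case number of time series for one metric is the product of the number of distinct values each label can take. The sketch below illustrates this; the label names and counts are made-up assumptions, and real systems rarely reach every combination.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


bounded = {"service": 5, "route": 20, "method": 4, "status": 10}
print(max_series(bounded))  # 4000

# Adding a single unbounded label multiplies the upper bound.
with_user_id = {**bounded, "user_id": 100_000}
print(max_series(with_user_id))  # 400000000
```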

Prometheus vs nearby terms

  • Prometheus vs Grafana: Prometheus collects, stores, and queries metrics. Grafana is commonly used to visualize those metrics in dashboards.
  • Prometheus vs Alertmanager: Prometheus evaluates alert rules. Alertmanager handles alert routing, grouping, silencing, and notification delivery.
  • Prometheus vs OpenTelemetry: OpenTelemetry is a broader observability framework for metrics, logs, and traces. Prometheus is focused on metrics and alerting, though the two can work together.
  • Prometheus vs logs: Prometheus answers questions such as “What is the error rate?” Logs help answer questions such as “What happened in this specific request?”

Good practices

  • Use stable, low-cardinality labels such as service, route, method, status, region, and instance.
  • Avoid labels with unbounded values, such as user ID, email address, transaction ID, or full URL path with IDs.
  • Track the main reliability signals: latency, traffic, errors, and saturation.
  • Write alerts for user impact, not every minor internal condition.
  • Pair metrics with logs and traces when debugging production incidents.
  • Review alert rules regularly to remove noise and improve response quality.
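
One way to keep a path label low-cardinality is to replace ID-like segments with a placeholder before using the path as a label value. The sketch below is a minimal illustration under stated assumptions: `route_label`, the `:id` placeholder, and the rule that numeric or UUID-shaped segments count as IDs are all hypothetical choices. Many web frameworks expose the matched route template directly, which is usually a better source.

```python
import re

# Matches the common 8-4-4-4-12 hexadecimal UUID layout.
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def route_label(path: str) -> str:
    """Collapse ID-like path segments so the result is safe to use as a label value."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)


print(route_label("/orders/12345/items/7?expand=true"))
# /orders/:id/items/:id
```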