DevOps Dictionary

SRE (Site Reliability Engineering)

SRE (Site Reliability Engineering) is an approach to operating and improving software services that applies engineering methods to reliability, performance, and availability in production. It addresses a scaling problem: as systems grow, manual operations and ad hoc fixes stop working consistently. SRE teams therefore set measurable reliability targets (service level objectives, or SLOs), instrument services with monitoring and alerting, automate repetitive operational work, and run structured incident response followed by blameless post-incident reviews to prevent repeat failures. With SRE, teams balance shipping features against protecting reliability using data, automation, and clear thresholds, which tends to produce fewer outages and faster recovery. Without it, reliability work becomes reactive and dependent on individual heroics, and incidents tend to be detected later and resolved less predictably.
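
To make "measurable reliability targets" and "clear thresholds" concrete, here is a minimal Python sketch of checking a request-based availability SLO. It also computes how much of the allowed failure allowance has been used, a figure commonly called the error budget in SRE practice (that term is not used in the definition above). The function names, the 99.9% target, and the request counts are hypothetical values chosen only for illustration.

```python
def availability(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return 1 - failed_requests / total_requests


def slo_is_met(total_requests: int, failed_requests: int, slo_target: float) -> bool:
    """True when measured availability is at or above the SLO target."""
    return availability(total_requests, failed_requests) >= slo_target


def error_budget_consumed(total_requests: int, failed_requests: int,
                          slo_target: float) -> float:
    """Share of the allowed failures already used (1.0 means fully spent)."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or an empty window) leaves no room for failure.
        return 0.0 if failed_requests == 0 else float("inf")
    return failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 400 failures, 99.9% target.
    total, failed, target = 1_000_000, 400, 0.999
    print("SLO met:", slo_is_met(total, failed, target))
    print("Budget consumed: {:.0%}".format(
        error_budget_consumed(total, failed, target)))
```

A team might use a number like this as one of its thresholds, for example slowing feature rollouts once most of the budget is spent. The exact policy is a local decision, not something this sketch prescribes.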
