How to Design On-Call Before You Hire SRE

Define rotations, escalation paths, alert rules, and ownership before scaling reliability teams.

Arthur Azrieli

On-call usually appears before a startup is ready for it. A database alarm fires at 2:13 a.m., a founder gets a customer escalation, and the person who built the service is suddenly the incident commander, database admin, and release engineer at the same time.

You do not need a dedicated Site Reliability Engineering (SRE) team to design sane on-call. You do need clear ownership, actionable alerts, escalation rules, and a process that does not depend on one tired engineer remembering how production works.

Start with service ownership, not a pager schedule

A rotation alone does not create accountability. Before you decide who carries the pager, define what each team or engineer owns in production.

For an early-stage company, this can be simple. You might have one backend team, one frontend team, and one person who understands infrastructure. That is fine, but write it down anyway.

At minimum, define ownership for:

  • Customer-facing services: APIs, web apps, mobile backends, payment flows, authentication, background jobs.
  • Data stores: relational databases, caches, queues, search indexes, object storage.
  • Deployment systems: continuous integration and continuous delivery, often shortened to CI/CD.
  • Cloud infrastructure: networking, compute, Kubernetes clusters, identity and access management, secrets, infrastructure as code.
  • Observability: logs, metrics, traces, dashboards, alert rules, incident tooling.

The key question is simple: if this breaks in production, who knows enough to make the first safe decision?

If the answer is “only one person,” that is a risk you should reduce before the company grows. Pair people on deploys, document recovery steps, and rotate secondary responsibility before you rotate primary on-call.
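
One lightweight way to write ownership down is a small map kept in version control. The sketch below is illustrative only: the component names, owner names, and the `find_single_points_of_failure` helper are hypothetical and not tied to any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class Ownership:
    """Who can make the first safe decision for a production component."""
    primary: str
    backups: list[str] = field(default_factory=list)


# Hypothetical components and people; replace with your own.
OWNERSHIP = {
    "checkout-api": Ownership(primary="backend-team", backups=["alice"]),
    "postgres-main": Ownership(primary="bob"),
    "ci-cd-pipeline": Ownership(primary="bob", backups=["carol"]),
}


def find_single_points_of_failure(ownership: dict[str, Ownership]) -> list[str]:
    """Return components that list a primary owner but no backup."""
    return sorted(name for name, entry in ownership.items() if not entry.backups)
```

Running `find_single_points_of_failure(OWNERSHIP)` on this sample data flags `postgres-main`, which is the kind of "only one person" risk worth reducing first.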

Define incident severity before the first bad night

Teams without severity levels waste time during incidents. Every alert feels urgent, every Slack message competes for attention, and nobody knows when to wake the CTO.

Keep severity levels practical. You do not need a large enterprise framework. You need shared language.

  • SEV1: A major customer-facing outage, data loss risk, security incident, or failed critical business flow such as login, checkout, or core API access.
  • SEV2: Partial degradation, a major internal system failure, a serious performance issue, or a problem affecting a subset of customers.
  • SEV3: A production issue that needs attention during working hours, such as elevated error rates below customer impact thresholds or a flaky background job.
  • SEV4: A low-risk issue, cleanup task, noisy alert, or known bug that belongs in the backlog.

Then map severity to response expectations:

  • SEV1: Page the primary on-call immediately. Escalate if the page is not acknowledged within a few minutes, for example 5.
  • SEV2: Page during agreed coverage hours or notify the on-call channel, depending on impact.
  • SEV3 and SEV4: Create a ticket or send a non-urgent notification. Do not page someone at night.

This removes most judgment calls under pressure. If the checkout API returns errors for most users, it is a SEV1. If a staging deployment failed, it is not a page.
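
The severity-to-response mapping can be captured in a few lines so that routing is not decided from memory at night. This is a minimal sketch: the `response_for` function and its return values are hypothetical, and it assumes a SEV2 outside coverage hours goes to the on-call channel, which is only one way to read "depending on impact".

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def response_for(severity: Severity, in_coverage_hours: bool) -> str:
    """Map a severity level to a response type."""
    if severity is Severity.SEV1:
        return "page"
    if severity is Severity.SEV2:
        # Assumption: outside coverage hours, SEV2 notifies the channel instead of paging.
        return "page" if in_coverage_hours else "notify-channel"
    # SEV3 and SEV4 never page; they become tickets or non-urgent notifications.
    return "ticket"
```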

Design the rotation around the team you actually have

Early on-call often fails because the rotation assumes more people, more context, and more operational maturity than the company has.

If you have three backend engineers and no platform team, do not copy a large-company rotation. Start with something small and honest:

  • Primary on-call: Receives pages and owns the first response.
  • Secondary on-call: Helps with complex incidents and takes over if the primary is unavailable.
  • Engineering lead escalation: Joins for customer impact, unclear ownership, risky rollback decisions, or repeated failures.

For very small teams, you may need business-hours on-call first, then expand coverage as the product and customer base demand it. Pretending you have 24/7 coverage when one person answers every page creates burnout and hides the real cost of reliability.
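
A small rotation is simple enough to generate with a short script. The sketch below assumes weekly shifts and a round-robin order; the `build_rotation` function and the engineer names are hypothetical.

```python
from datetime import date, timedelta


def build_rotation(engineers: list[str], start: date, weeks: int) -> list[dict]:
    """Assign weekly primary and secondary shifts in round-robin order."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # The next person in the list is secondary and becomes primary next week.
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append({
            "week_start": start + timedelta(weeks=week),
            "primary": primary,
            "secondary": secondary,
        })
    return schedule


rotation = build_rotation(["ana", "ben", "chen"], date(2025, 1, 6), weeks=4)
```

Because each week's secondary becomes the next week's primary, every engineer gets a week of context before owning the first response.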

Make the schedule visible. Put it where engineers already work, such as your incident tool, calendar, Slack channel, or team wiki. Include handoff notes when the rotation changes:

  • Recent deploys that changed production behavior.
  • Known incidents or degraded dependencies.
  • Temporary alerts or noisy monitors.
  • Upcoming migrations, load tests, or infrastructure changes.

If you are deciding when to formalize this into a dedicated role or team, it helps to think through the operating model early. This guide on how to build a DevOps team gives useful structure for that decision.

Build escalation paths that match real failure modes

Escalation is not a sign that the primary on-call failed. It is how you keep incidents moving when the problem crosses service boundaries or requires risky decisions.

Good escalation rules answer four questions:

  1. Who gets paged first? Usually the service owner or general primary on-call.
  2. When does the alert escalate? For example, no acknowledgement within 5 minutes for SEV1.
  3. Who joins next? Secondary on-call, database owner, infrastructure owner, or engineering lead.
  4. Who can make business-risk decisions? For example, disabling a feature, rolling back a release, rate-limiting customers, or putting the product into read-only mode.

Do not bury escalation rules in tribal knowledge. Put them in the runbook. During an incident, the on-call engineer should not need to ask, “Am I allowed to roll this back?”
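
The first three questions translate naturally into an ordered list of steps. The sketch below is illustrative: the `EscalationStep` type, the target names, and the 15-minute step are assumptions, while the 5-minute step mirrors the SEV1 example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgement before this step pages
    target: str


# Hypothetical SEV1 policy.
SEV1_POLICY = [
    EscalationStep(0, "primary-on-call"),
    EscalationStep(5, "secondary-on-call"),
    EscalationStep(15, "engineering-lead"),
]


def who_is_paged(policy: list[EscalationStep], minutes_unacknowledged: int) -> list[str]:
    """Return everyone who should have been paged by now, in policy order."""
    return [step.target for step in policy if minutes_unacknowledged >= step.after_minutes]
```

Most incident tools let you configure the same idea directly, so treat this as a way to agree on the policy before you encode it in a tool.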

Common escalation triggers include:

  • The issue affects paying customers or a critical user flow.
  • The primary cannot identify the failing component within a short window.
  • A rollback may cause data inconsistency.
  • The fix requires cloud, database, Kubernetes, or security access the primary does not have.
  • The same alert fires repeatedly after mitigation.

This is where many startups discover gaps in access control. The person on-call needs enough access to diagnose and mitigate incidents, but not unlimited production privileges by default. Use temporary access, audited admin roles, and documented break-glass procedures where possible.

Make alerts actionable or remove them

Bad alerting breaks on-call faster than a small rotation does. If every CPU spike, pod restart, and failed health check pages someone, engineers will stop trusting the system.

An alert should meet three tests:

  • It represents customer impact or an urgent risk to customer impact.
  • The on-call engineer can take a clear action.
  • It includes enough context to start debugging quickly.

For example, “CPU above 80% for 5 minutes” is often a weak page. “API error rate above the agreed threshold for the login service, affecting production traffic” is better. The first alert reports resource pressure. The second points to user impact.
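
A customer-impact check of this kind might look like the sketch below. The `should_page` function, the 5% threshold in the usage line, and the `min_requests` guard are assumptions for illustration; your agreed threshold and traffic floor will differ.

```python
def should_page(total_requests: int, failed_requests: int,
                threshold: float, min_requests: int = 100) -> bool:
    """Page only when the error rate over a window exceeds the agreed threshold.

    min_requests is an assumption: it avoids paging on a handful of failures
    during very low traffic.
    """
    if total_requests == 0 or total_requests < min_requests:
        return False
    return failed_requests / total_requests > threshold


# 80 failures out of 1,000 login requests against a hypothetical 5% threshold.
page_login_owner = should_page(total_requests=1000, failed_requests=80, threshold=0.05)
```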

Each paging alert should include:

  • Service name and environment.
  • Severity level.
  • Dashboard link.
  • Recent deploy link or commit range if available.
  • Runbook link.
  • Suggested first checks.
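
You can enforce this checklist mechanically before an alert is allowed to page. The sketch below assumes alerts are plain dictionaries. The field names and the `missing_context` helper are hypothetical, and the recent deploy link is left out of the required set because it is only included when available.

```python
REQUIRED_FIELDS = (
    "service",
    "environment",
    "severity",
    "dashboard_url",
    "runbook_url",
    "first_checks",
)


def missing_context(alert: dict) -> list[str]:
    """Return required fields that are absent or empty in a paging alert."""
    return [name for name in REQUIRED_FIELDS if not alert.get(name)]


# Hypothetical alert definition with an incomplete payload.
alert = {
    "service": "login",
    "environment": "production",
    "severity": "SEV1",
    "dashboard_url": "https://example.com/dashboards/login",
    "runbook_url": "",
}
```

Here `missing_context(alert)` reports `runbook_url` and `first_checks`, so the alert would be sent back for more context instead of paging someone.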

You should also separate paging alerts from informational notifications. A failed nightly report may need a ticket. A saturated database connection pool during peak traffic may need a page.

If your team already ignores alerts, fix that before expanding the rotation. This article on how to handle alert fatigue covers practical ways to cut noise without missing real incidents.

Create runbooks for the top incidents, then improve them after use

Runbooks do not need to be perfect. They need to help a tired engineer avoid unsafe guesses.

Start with the incidents you can predict:

  • Rollback a bad deployment.
  • Restart a stuck worker or job processor.
  • Scale a service during traffic spikes.
  • Investigate high API error rates.
  • Handle database connection exhaustion.
  • Recover from failed CI/CD deployment steps.
  • Rotate a leaked secret.

A useful runbook should include:

  • Symptoms: What the alert or customer report looks like.
  • First checks: Logs, dashboards, recent deploys, dependency status.
  • Safe mitigations: Rollback, disable a feature flag, scale workers, drain traffic, pause a queue.
  • Risky actions: Anything that can cause data loss, downtime, or security exposure.
  • Escalation: Who to contact and when.
  • Aftercare: What to monitor after mitigation.
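
If you keep runbooks as structured data, a small check can show which sections are still empty. This is a sketch under that assumption: the `Runbook` class, its field names, and the sample entries are hypothetical.

```python
from dataclasses import dataclass, field

SECTIONS = (
    "symptoms",
    "first_checks",
    "safe_mitigations",
    "risky_actions",
    "escalation",
    "aftercare",
)


@dataclass
class Runbook:
    title: str
    symptoms: list[str] = field(default_factory=list)
    first_checks: list[str] = field(default_factory=list)
    safe_mitigations: list[str] = field(default_factory=list)
    risky_actions: list[str] = field(default_factory=list)
    escalation: list[str] = field(default_factory=list)
    aftercare: list[str] = field(default_factory=list)

    def empty_sections(self) -> list[str]:
        """Return the names of sections nobody has filled in yet."""
        return [name for name in SECTIONS if not getattr(self, name)]


# Hypothetical draft runbook with only the first three sections written.
draft = Runbook(
    title="Database connection exhaustion",
    symptoms=["Connection pool saturation alert fires"],
    first_checks=["Check recent deploys", "Check the database dashboard"],
    safe_mitigations=["Scale down noisy workers"],
)
```

Calling `draft.empty_sections()` lists the risky actions, escalation, and aftercare sections as still missing, which is a useful prompt during the next incident review.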

Treat runbooks like production code. Keep them close to the systems they describe, review them after incidents, and delete steps that no longer apply.

Your tooling should support this workflow instead of making it harder. If your team is choosing monitoring, CI/CD, infrastructure as code, or incident tools, this guide on choosing the right DevOps tools can help you avoid buying complexity before you need it.

Review incidents without blame and turn pain into backlog

After an incident, the goal is to improve the system, not find the person who made the last visible mistake.

Keep the review short and specific:

  • What happened?
  • How did customers or internal users experience it?
  • How did we detect it?
  • How long did it take to acknowledge and mitigate?
  • What made diagnosis harder?
  • Which alert, dashboard, runbook, or ownership gap should we fix?
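
The acknowledgement and mitigation question is easier to answer if you record three timestamps per incident. The sketch below assumes you capture detection, acknowledgement, and mitigation times; the `response_times` function and the sample timestamps are hypothetical.

```python
from datetime import datetime


def response_times(detected: datetime, acknowledged: datetime,
                   mitigated: datetime) -> dict[str, float]:
    """Return minutes from detection to acknowledgement and to mitigation."""
    return {
        "minutes_to_acknowledge": (acknowledged - detected).total_seconds() / 60,
        "minutes_to_mitigate": (mitigated - detected).total_seconds() / 60,
    }


# Hypothetical incident: alarm at 2:13 a.m., acknowledged at 2:17, mitigated at 2:58.
times = response_times(
    detected=datetime(2025, 1, 6, 2, 13),
    acknowledged=datetime(2025, 1, 6, 2, 17),
    mitigated=datetime(2025, 1, 6, 2, 58),
)
```

Tracking these two numbers across incidents shows whether changes to alerts, runbooks, or escalation are actually shortening the response.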

Do not let every incident produce ten cleanup tasks that nobody owns. Pick one or two changes that reduce repeat risk. For example:

  • Add a rollback command to the deployment runbook.
  • Change a noisy CPU alert into a customer-impact alert.
  • Add a dashboard panel for queue age instead of queue length alone.
  • Move ownership of a shared service from “everyone” to a named team.
  • Test database restore steps before you need them.

If you use Azure DevOps or are considering it for delivery workflows, this guide on setting up Azure DevOps for startups may help you connect work tracking, pipelines, and operational follow-up in a cleaner way.

Takeaway

You can design useful on-call before you hire SRE. Start with ownership, severity levels, escalation paths, actionable alerts, and runbooks for the incidents you expect. Keep the system small enough that your team will use it, then improve it after real incidents.

If your current setup depends on one engineer, noisy alerts, or undocumented recovery steps, fix those first. If you want help reviewing your production readiness and on-call design, you can book a free one-click environment consultation call.