Dashboards & Runbooks

Last Updated: January 7, 2026

Ashish Pratap Singh

You have metrics, logs, traces, and alerts. But at 3 AM when an alert fires, the on-call engineer needs to know what to do. They need to quickly understand the system state, identify the problem, and take action. Raw data is not enough.

Dashboards transform raw metrics into actionable visualizations. They answer questions at a glance: Is the system healthy? What changed? Where is the problem?

Runbooks transform knowledge into repeatable procedures. They answer: What do I do when this alert fires? They capture the expertise of senior engineers and make it available to whoever is on call. Without runbooks, every incident requires discovering the fix from scratch.

In this chapter, you will learn:

  • How to design effective dashboards
  • Dashboard hierarchy and organization
  • Creating useful runbooks
  • Incident response procedures
  • Connecting dashboards, runbooks, and alerts

This chapter brings together everything we have covered. Dashboards and runbooks are how your team actually uses observability data during incidents.

Dashboard Hierarchy

Dashboards work best when they are organized in layers, from a high-level overview to deep investigation. The goal is to help someone answer questions quickly without drowning in charts.

A good hierarchy follows how incidents actually unfold:

Level 1: Executive Overview

This is the single screen you open first. It should tell you, at a glance, whether the system is healthy and whether you are meeting SLOs.

Characteristics:

  • fits on one screen, no scrolling
  • minimal charts, mostly summary numbers and status
  • focuses on outcomes, not internals
  • makes it obvious where to drill down next

This dashboard is for everyone: on-call engineers, team leads, even non-engineering stakeholders.

Level 2: Service Health

Once you know something is wrong, the next question is “where?” This dashboard should show all services side by side using the same handful of metrics.

Characteristics:

  • consistent metrics across services (same definitions, same units)
  • encourages comparison (“payment looks different from everyone else”)
  • highlights the outlier quickly
  • acts like a service directory with health baked in

Level 3: Service Details

Now you zoom into one service and ask “what changed?” This dashboard is for the owning team and the on-call engineer.

Characteristics:

  • one service only, but deep coverage
  • shows history (you need trends, not just “now”)
  • includes change events so you can correlate spikes with deployments
  • makes it easy to jump from “symptom” to “evidence” (traces and logs)

Level 4: Component Details

Finally, when you suspect a dependency, you need a component-focused dashboard. This can be a database, cache cluster, queue, or an external provider integration.

Characteristics:

  • narrow scope, high signal
  • tuned for root cause analysis
  • pairs well with tracing (the trace points here, this dashboard explains why)

Dashboard Design Principles

Great dashboards are not collections of charts. They are tools for making decisions quickly. When the system is on fire, nobody wants to interpret a wall of graphs. They want answers, context, and a clear next step.

1. Answer Specific Questions

Every dashboard should have one primary question it answers.

  • Good: “Is checkout working right now?”
  • Bad: “All our metrics”

If you cannot describe a dashboard’s purpose in one sentence, it is trying to do too much. Split it into smaller dashboards with clear roles.

A simple rule: one dashboard, one decision.

2. Use Visual Hierarchy

People scan dashboards in a predictable way. Put the highest signal at the top and toward the left. Make it obvious what matters.

A useful structure:

  • Top row: status indicators (availability, error rate, p99 latency, traffic)
  • Middle: supporting detail (breakdowns by endpoint, region, status code)
  • Bottom: investigation helpers (links to logs, traces, deployments, runbooks)

You want the dashboard to work even if someone only looks at it for five seconds.

3. Show Context

A chart without context forces the viewer to guess what “normal” looks like and what caused the change.

Bad: a plain latency line graph

Good: latency with context overlays:

  • an SLO threshold line
  • deploy and config change markers
  • comparison to last week or last day
  • normal range shading (baseline band)

Context turns “the line went up” into “p99 crossed the SLO right after a deployment.”

4. Use Appropriate Time Ranges

The same metric looks completely different at different time windows. Dashboards should default to ranges that match their job.

| Dashboard Type | Default Range | Granularity |
| --- | --- | --- |
| Real-time monitoring | 15 min - 1 hour | Seconds |
| Incident investigation | 6 hours | Minutes |
| Daily review | 24-48 hours | 5 minutes |
| Weekly trends | 7 days | Hours |
| Capacity planning | 30-90 days | Days |

You can always zoom in or out, but defaults should be sensible. Wrong defaults waste time and hide issues.

5. Consistent Layout

Consistency reduces cognitive load. If every service dashboard has a different layout, engineers spend time re-learning the UI instead of debugging.

A common pattern that works well:

Row 1: Service health (RED method)

  • Rate (traffic)
  • Errors
  • Duration (latency percentiles)

Row 2: Saturation and resources

  • CPU, memory, GC
  • thread pool / connection pool
  • queue depth

Row 3: Dependencies

  • database latency and errors
  • cache hit rate and latency
  • external APIs and timeouts

Row 4: Events and breadcrumbs

  • deployments
  • feature flags and config changes
  • alerts and incidents

The exact rows may differ by system, but the structure should remain stable.

6. Link to Logs, Traces, and Runbooks

Dashboards should not be dead ends. Every important chart should lead somewhere useful.

Ideally, each panel has links to:

  • logs filtered to the same timeframe and service
  • traces for exemplars or representative requests
  • the relevant runbook section
  • alert definitions and SLO configuration

The best dashboards act like a navigation system: detect → narrow down → investigate → fix.

Building Effective Graphs

Choose the Right Visualization

| Data Type | Good Visualization | Bad Visualization |
| --- | --- | --- |
| Time series | Line chart | Pie chart |
| Current value | Gauge, stat panel | Line chart |
| Distribution | Histogram, heatmap | Line chart |
| Comparison | Bar chart | Pie chart |
| Status | Traffic light, table | Graph |

Visualization choice matters because it changes what the viewer notices first.

Time Series Best Practices

A good time-series panel is focused and readable.

  • one metric or a small family (2–5 lines)
  • clear legend and consistent units
  • y-axis starts at 0 when it makes sense
  • threshold line for SLO or alert limits
  • deploy and change markers on the timeline
  • time window aligned to the dashboard’s purpose

When in doubt, optimize for readability. People make mistakes when graphs are crowded.

Avoid Common Mistakes

1) Too many lines

Bad: 20 endpoints on one graph

Good: top 5 endpoints + “others” aggregated, with drill-down links

If you need more than five lines, you usually need a breakdown table and filters.

2) Misleading axes

Bad: y-axis from 99.0 to 100.0 (makes tiny changes look massive)

Good: axis ranges that match the meaning of the metric, and avoid exaggeration unless there is a clear reason

A dashboard should build trust, not create drama.

3) Wrong aggregation

Bad: averages of averages, especially for latency

Good: percentiles computed from histograms

Latency should almost always be reported as percentiles (p50, p95, p99). Averages hide tail pain.
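
For example, with Prometheus-style histogram metrics the difference looks like this (a sketch; the metric name is illustrative):

```
# Misleading: mean latency, which hides tail pain
sum(rate(http_request_duration_seconds_sum[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))

# Better: p99 computed from histogram buckets
histogram_quantile(
  0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
```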

What Are Runbooks?

A runbook is a written, step-by-step guide for handling an operational task or responding to an incident. It captures the practical knowledge you need when something breaks, including how to confirm impact, how to diagnose the cause, and how to restore the system safely.

Why Runbooks Matter

Without runbooks

  • senior engineers end up handling most incidents because they “know the system”
  • knowledge stays in people’s heads and disappears when they leave or change teams
  • incident response is inconsistent, depending on who is on call
  • time to resolution increases because everyone starts from scratch
  • mistakes are more likely because people improvise under pressure

With runbooks

  • any trained engineer can respond, even if they did not build the system
  • responses are consistent and easier to audit
  • mean time to recovery improves because the first steps are already written down
  • onboarding gets easier because operational knowledge is documented
  • teams learn over time because the runbook evolves after each incident

The best runbooks also reduce stress. When it is 3 AM, you want a clear checklist, not a puzzle.

When to Create Runbooks

Create a runbook when:

  • an alert can page someone
  • a task is performed repeatedly (rotations, restarts, backfills, deployments)
  • multiple people might need to execute it (on-call rotation, follow-the-sun teams)
  • the consequences of a wrong action are high (data loss, outages, security impact)
  • the system has sharp edges (manual failover, emergency switches, rate limit changes)

A practical rule: every paging alert should link to a runbook. If you wake someone up, you should also tell them what to do next.

What Makes a Runbook Good

A good runbook is:

  • fast to use during an incident (checklists, short steps, direct links)
  • specific (exact dashboards, exact queries, exact commands)
  • safe (clear warnings, permissions, rollback steps)
  • actionable (it leads to a decision or a fix, not just background theory)
  • maintained (reviewed regularly and updated after incidents)

Avoid runbooks that read like architecture docs. During an incident, nobody wants a long explanation. They want the next step.

Runbook Structure

A consistent template makes runbooks easier to scan. You should be able to find the right section in seconds.

Template
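
A minimal sketch of such a template, built from the sections this chapter keeps referring to (overview, trigger, impact, diagnosis, resolution, escalation); section names and links are illustrative:

```
# Runbook: <Alert or Procedure Name>

## Overview
One or two sentences on what this covers and when to use it.

## Trigger
Which alert fires this runbook, with a link to the alert definition.

## Impact
Who or what is affected, and how severe it is.

## Diagnosis
1. Open <service dashboard link> and check error rate and p99 latency.
2. Run <command> and compare the output with the expected values below.
3. Check recent deployments and config changes.

## Resolution
Mitigations ordered from lowest to highest risk, each with exact
commands and a rollback step.

## Escalation
When to escalate, to whom, and what context to include.
```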

Writing Effective Runbooks

A runbook is only useful if it works under pressure. The reader might be new to the system, half-asleep, and trying to restore service quickly. That means your runbook has to be concrete, copy-pasteable, and opinionated.

Here are the practices that consistently separate great runbooks from useless ones.

1. Be Specific

Vague instructions create hesitation and mistakes.

Bad: “Check the database.”

Good: “Check active connections on the primary and compare against the safe threshold.”

Example:
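
A sketch of what the specific version can look like, assuming a PostgreSQL primary (host, user, and thresholds are illustrative):

```
# Count active connections on the primary
psql -h db-primary.internal -U readonly -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# Healthy: comfortably below max_connections (e.g. under 400 of 500)
# Investigate: sustained values near max_connections
```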

2. Include Commands

A runbook should minimize typing and improvisation. Provide commands that can be copied directly, ideally with placeholders clearly marked.
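
For instance, a deployment-related runbook might provide ready-to-paste commands with placeholders clearly marked; this sketch assumes the service runs on Kubernetes (names are illustrative):

```
# Check pod status for the service
kubectl get pods -n <namespace> -l app=<service>

# Tail the last 15 minutes of logs
kubectl logs -n <namespace> deployment/<service> --since=15m

# Roll back the latest deployment if it is the suspected cause
kubectl rollout undo deployment/<service> -n <namespace>
```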

3. Show Expected Output

People waste time second-guessing whether a command “worked.” Show what healthy and unhealthy results look like.

Example:
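
A sketch, assuming the service exposes a health endpoint (URL and fields are illustrative):

```
$ curl -s http://payment-service.internal/healthz
{"status":"ok","db":"ok","cache":"ok"}              # healthy: continue monitoring

$ curl -s http://payment-service.internal/healthz
{"status":"degraded","db":"timeout","cache":"ok"}   # unhealthy: follow the database diagnosis steps
```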

4. Explain Why

Runbooks should not be long essays, but a short “why” prevents blind actions and helps engineers learn the system.

Example:
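
One sentence of rationale next to a risky step is usually enough. A sketch (the commands and names are illustrative):

```
# Restart the workers one at a time, not all at once.
# Why: restarting everything drops all in-flight payments and triggers a
# thundering herd of retries against the database.
kubectl rollout restart deployment/payment-worker -n payments
```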

5. Include Decision Points

Runbooks should branch. If everything is a linear list, people will follow irrelevant steps and waste time.

Example:
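
A sketch of a branching step (thresholds and section names are illustrative):

```
3. Check the error rate on the service dashboard.
   - Errors started right after a deployment  → go to "Roll back the deployment".
   - Errors concentrated on one dependency    → go to "Database diagnosis".
   - Error rate below 1% and recovering       → monitor for 15 minutes; do not intervene.
```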

6. Document Escalation Clearly

During incidents, ambiguity about escalation slows response. Be explicit about when to escalate, to whom, and what to include.

Example:
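
A sketch of an escalation section (names, channels, and thresholds are illustrative):

```
Escalate if:
  - the error rate stays above 5% for more than 15 minutes, or
  - you reach the end of the mitigation steps without recovery.

Escalate to:
  - the secondary on-call via the paging tool
  - #payments-incidents for coordination

Include:
  - the alert link, the dashboard link, what you have already tried,
    and the current error rate and latency.
```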

Runbook Types

Not all runbooks are the same. Different runbooks serve different goals, and treating them as interchangeable usually leads to either missing detail during incidents or overly long documents that nobody reads.

A useful way to categorize them is by when they are used and how specific the trigger is.

Alert Runbooks

These are the most common and the most important for on-call. The trigger is explicit, the response needs to be fast, and the runbook should be tightly scoped.

Rule of thumb: one alert, one runbook, linked directly from the alert notification.

Example:
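
A sketch of the top of an alert runbook (the alert name, thresholds, and links are illustrative):

```
# Runbook: HighCheckoutErrorRate

Trigger: checkout error rate above 2% for 5 minutes.
Dashboard: <link to the checkout service-details dashboard>
Impact: customers cannot complete purchases; revenue-impacting.

First steps:
1. Confirm impact on the checkout dashboard (error rate, p99 latency).
2. Check for a deployment in the last 30 minutes; if there is one, roll it back.
3. If there was no deployment, follow the database diagnosis section.
```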

Operational Runbooks

These are procedures for routine, repeatable tasks. The goal is correctness and safety, not urgent incident response.

Troubleshooting Guides

These are broader investigation guides used when you have symptoms, but no single alert points to a clear root cause. They are less checklist-driven and more diagnostic.

Connecting Everything

Dashboards, runbooks, and alerts are not separate tools. They form a single incident response loop. When they are properly connected, on-call work becomes fast and predictable instead of frantic guessing.

A good flow looks like this:

Alert → Runbook

Every alert notification includes:

  • Link to the relevant runbook
  • Link to the relevant dashboard
  • Key context (current values, threshold)
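
For example, a Prometheus-style alerting rule can carry these links as annotations (a sketch, assuming Prometheus and Alertmanager; the alert name, expression, and URLs are illustrative):

```
groups:
  - name: checkout
    rules:
      - alert: HighCheckoutErrorRate
        # assumes a recording rule that computes the 5-minute error ratio
        expr: checkout:error_rate:ratio5m > 0.02
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Checkout error rate is {{ $value }} (threshold 0.02)"
          runbook_url: "https://runbooks.internal/checkout/high-error-rate"
          dashboard: "https://grafana.internal/d/checkout-service"
```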

Runbook → Dashboard

Every runbook includes:

  • Links to dashboards mentioned in steps
  • Pre-filtered log queries
  • Trace search queries

Dashboard → Investigation

Every dashboard panel links to:

  • Deeper dashboards
  • Filtered log searches
  • Trace queries for that service

Example Flow
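
A sketch of how the connected flow plays out in practice (service names are illustrative):

```
1. Alert fires: HighCheckoutErrorRate. The page shows the current value, the
   threshold, a runbook link, and a dashboard link.
2. Open the runbook from the alert. Its first diagnosis step links to the
   checkout service-details dashboard.
3. The dashboard shows errors starting right at a deployment marker. The
   error-rate panel links to logs filtered to checkout for that window.
4. The logs show failing calls to the payments provider introduced by the
   deploy. The runbook's mitigation section says to roll it back.
5. Roll back, confirm recovery on the dashboard, and record follow-ups for
   the postmortem.
```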

Notice how every step has a direct link to the next. No searching. No “where is that dashboard again?”

Maintenance and Governance

Dashboards and runbooks rot over time. Systems change, teams change, and links break. A lightweight maintenance process keeps them reliable.

Dashboard Maintenance

| Task | Frequency | Owner |
| --- | --- | --- |
| Review for accuracy | Monthly | Service team |
| Remove unused dashboards | Quarterly | Platform team |
| Update after architecture changes | As needed | Service team |
| Performance optimization | Quarterly | Platform team |

Runbook Maintenance

| Task | Frequency | Owner |
| --- | --- | --- |
| Verify commands still work | Quarterly | Service team |
| Update after incidents | After each incident | Incident responder |
| Review for accuracy | Bi-annually | Service team |
| Remove obsolete runbooks | Annually | Platform team |

Review Checklist

Dashboard review

  • every panel shows data (no permanent “No data”)
  • thresholds and SLO lines still match current targets
  • links work (logs, traces, deeper dashboards, runbooks)
  • queries load fast enough for incident use
  • it still answers the question it claims to answer

Runbook review

  • commands execute correctly with current tooling and permissions
  • expected outputs still match reality
  • links work (dashboards, logs, traces, status pages)
  • contact information and escalation paths are current
  • mitigations are safe and ordered from lowest to highest risk
  • it reflects lessons from recent incidents

Summary

Dashboards make observability data accessible:

  • Hierarchy: Executive overview → service health → service details → components
  • Design: Clear purpose, visual hierarchy, context, appropriate time ranges
  • Graphs: Right visualization for data type, avoid too many lines, show thresholds
  • Links: Connect to logs, traces, and runbooks

Runbooks make knowledge actionable:

  • Structure: Overview, trigger, impact, diagnosis, resolution, escalation
  • Content: Specific commands, expected output, decision points, explanations
  • Types: Alert runbooks, operational procedures, troubleshooting guides
  • Maintenance: Update after incidents, periodic review, verify commands work

Integration is key:

  • Alerts link to runbooks and dashboards
  • Runbooks link to dashboards and investigation tools
  • Dashboards link to deeper investigation

Together, dashboards and runbooks enable anyone on call to respond effectively to incidents, not just the experts who built the system.