Last Updated: January 7, 2026
You have metrics, logs, traces, and alerts. But at 3 AM when an alert fires, the on-call engineer needs to know what to do. They need to quickly understand the system state, identify the problem, and take action. Raw data is not enough.
Dashboards transform raw metrics into actionable visualizations. They answer questions at a glance: Is the system healthy? What changed? Where is the problem?
Runbooks transform knowledge into repeatable procedures. They answer: What do I do when this alert fires? They capture the expertise of senior engineers and make it available to whoever is on call. Without runbooks, every incident requires discovering the fix from scratch.
In this chapter, you will learn:

- How to organize dashboards into a hierarchy, from system overview to component deep dives
- Design principles that make dashboards useful under pressure
- How to write runbooks that hold up at 3 AM
- How to connect dashboards, runbooks, and alerts into a single incident response loop
This chapter brings together everything we have covered. Dashboards and runbooks are how your team actually uses observability data during incidents.
Dashboards work best when they are organized in layers, from a high-level overview to deep investigation. The goal is to help someone answer questions quickly without drowning in charts.
A good hierarchy follows how incidents actually unfold: start with the whole system, then narrow to a service, then to a specific component.
The first layer is the system overview: the single screen you open first. It should tell you, at a glance, whether the system is healthy and whether you are meeting SLOs.
This dashboard is for everyone: on-call engineers, team leads, even non-engineering stakeholders.
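As a minimal sketch of the arithmetic this screen surfaces, here is the error-budget view of an SLO. The 99.9% target and the request counts are illustrative, not from any particular system:

```python
# Error-budget math behind an "are we meeting our SLO?" panel.
# Numbers are illustrative placeholders.
slo_target = 0.999              # 99.9% availability objective
total_requests = 12_500_000     # requests in the SLO window
failed_requests = 9_800         # requests counted against the SLO

availability = 1 - failed_requests / total_requests
error_budget = (1 - slo_target) * total_requests     # failures we can afford this window
budget_used = failed_requests / error_budget

print(f"availability: {availability:.4%}")           # 99.9216%
print(f"error budget used: {budget_used:.0%}")       # 78%
print("SLO at risk" if budget_used > 0.8 else "healthy")
```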
Once you know something is wrong, the next question is “where?” This dashboard should show all services side by side using the same handful of metrics.
Now you zoom into one service and ask “what changed?” This dashboard is for the owning team and the on-call engineer.
Finally, when you suspect a dependency, you need a component-focused dashboard. This can be a database, cache cluster, queue, or an external provider integration.
Great dashboards are not collections of charts. They are tools for making decisions quickly. When the system is on fire, nobody wants to interpret a wall of graphs. They want answers, context, and a clear next step.
Every dashboard should have one primary question it answers.
If you cannot describe a dashboard’s purpose in one sentence, it is trying to do too much. Split it into smaller dashboards with clear roles.
A simple rule: one dashboard, one decision.
People scan dashboards in a predictable way. Put the highest signal at the top and toward the left. Make it obvious what matters.
A useful structure: overall health and SLO status in the top-left corner, the key traffic and error panels across the top row, and supporting detail further down.
You want the dashboard to work even if someone only looks at it for five seconds.
A chart without context forces the viewer to guess what “normal” looks like and what caused the change.
Bad: a plain latency line graph
Good: latency with context overlays such as the SLO threshold and deployment markers
Context turns “the line went up” into “p99 crossed the SLO right after a deployment.”
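One common way to get deployment markers onto charts is to have the deploy pipeline post an annotation. A sketch assuming Grafana's annotations HTTP API; the URL, token, service name, and tags are placeholders:

```python
import time
import requests

GRAFANA_URL = "https://grafana.example.com"   # placeholder
API_TOKEN = "..."                             # service account token (placeholder)

def annotate_deployment(service: str, version: str) -> None:
    """Post a deployment marker that dashboards can overlay on their panels."""
    response = requests.post(
        f"{GRAFANA_URL}/api/annotations",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "time": int(time.time() * 1000),        # epoch milliseconds
            "tags": ["deployment", service],
            "text": f"{service} deployed {version}",
        },
        timeout=5,
    )
    response.raise_for_status()

annotate_deployment("checkout", "v2.14.1")
```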
The same metric looks completely different at different time windows. Dashboards should default to ranges that match their job.
| Dashboard Type | Default Range | Granularity |
|---|---|---|
| Real-time monitoring | 15 min - 1 hour | Seconds |
| Incident investigation | 6 hours | Minutes |
| Daily review | 24-48 hours | 5 minutes |
| Weekly trends | 7 days | Hours |
| Capacity planning | 30-90 days | Days |
You can always zoom in or out, but defaults should be sensible. Wrong defaults waste time and hide issues.
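If you build dashboards as code, the granularity column above translates into a query step. A small sketch; the 300-points-per-panel target and the 15-second scrape interval are assumptions:

```python
def query_step(range_seconds: int, scrape_interval: int = 15, max_points: int = 300) -> int:
    """Pick a resolution that keeps roughly max_points samples per panel,
    but never finer than the scrape interval."""
    return max(range_seconds // max_points, scrape_interval)

print(query_step(60 * 60))           # 1 hour  -> 15 seconds
print(query_step(6 * 60 * 60))       # 6 hours -> 72 seconds
print(query_step(7 * 24 * 60 * 60))  # 7 days  -> 2016 seconds (~34 minutes)
```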
Consistency reduces cognitive load. If every service dashboard has a different layout, engineers spend time re-learning the UI instead of debugging.
A common pattern that works well:
- Row 1: Service health (RED method: rate, errors, duration)
- Row 2: Saturation and resources
- Row 3: Dependencies
- Row 4: Events and breadcrumbs
The exact rows may differ by system, but the structure should remain stable.
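Keeping that layout consistent is easier when dashboards are generated rather than hand-edited. A simplified sketch of the Grafana dashboard JSON model; the metric names, PromQL expressions, and service label are illustrative:

```python
import json

def panel(title: str, expr: str, x: int, y: int) -> dict:
    """A single time-series panel in (simplified) Grafana dashboard JSON."""
    return {
        "type": "timeseries",
        "title": title,
        "gridPos": {"h": 8, "w": 8, "x": x, "y": y},
        "targets": [{"expr": expr, "refId": "A"}],
    }

def service_dashboard(service: str) -> dict:
    # Row 1: RED method. Rows 2-4 (saturation, dependencies, events)
    # follow the same pattern at y = 8, 16, 24.
    red_row = [
        panel("Request rate",
              f'sum(rate(http_requests_total{{service="{service}"}}[5m]))', 0, 0),
        panel("Error rate",
              f'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[5m]))', 8, 0),
        panel("p99 latency",
              f'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{{service="{service}"}}[5m])) by (le))', 16, 0),
    ]
    return {"title": f"{service} - service health", "panels": red_row}

print(json.dumps(service_dashboard("checkout"), indent=2))
```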
Dashboards should not be dead ends. Every important chart should lead somewhere useful.
Ideally, each panel has links to: the runbook for the alerts it backs, the underlying logs or traces, and the next, more detailed dashboard.
The best dashboards act like a navigation system: detect → narrow down → investigate → fix.
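In dashboard-as-code terms, these cross-links can live on the panel itself. A sketch using the same simplified panel JSON as above; the links field follows Grafana's panel links, and all URLs are placeholders:

```python
# A panel that is not a dead end: every destination is one click away.
error_rate_panel = {
    "type": "timeseries",
    "title": "Error rate",
    "gridPos": {"h": 8, "w": 8, "x": 8, "y": 0},
    "targets": [{"expr": 'sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m]))', "refId": "A"}],
    "links": [
        {"title": "Runbook: high error rate", "url": "https://wiki.example.com/runbooks/checkout-errors", "targetBlank": True},
        {"title": "Logs for this service", "url": "https://logs.example.com/?query=service:checkout", "targetBlank": True},
        {"title": "Checkout deep-dive dashboard", "url": "https://grafana.example.com/d/checkout-detail", "targetBlank": True},
    ],
}
```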
Match the visualization to the data:

| Data Type | Good Visualization | Bad Visualization |
|---|---|---|
| Time series | Line chart | Pie chart |
| Current value | Gauge, stat panel | Line chart |
| Distribution | Histogram, heatmap | Line chart |
| Comparison | Bar chart | Pie chart |
| Status | Traffic light, table | Graph |
Visualization choice matters because it changes what the viewer notices first.
A good time-series panel is focused and readable.
When in doubt, optimize for readability. People make mistakes when graphs are crowded.
Bad: 20 endpoints on one graph
Good: top 5 endpoints + “others” aggregated, with drill-down links
If you need more than five lines, you usually need a breakdown table and filters.
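The same idea as a small aggregation sketch; the endpoint names and counts are made up:

```python
from collections import Counter

# Requests per endpoint over the dashboard window (illustrative numbers).
requests_by_endpoint = Counter({
    "/checkout": 48_000, "/search": 31_000, "/login": 12_500, "/cart": 9_800,
    "/profile": 4_200, "/health": 900, "/admin": 300, "/export": 120,
})

def top_n_plus_others(counts: Counter, n: int = 5) -> dict:
    """Keep the n largest series and collapse the rest into 'others'."""
    top = counts.most_common(n)
    others = sum(counts.values()) - sum(v for _, v in top)
    return dict(top) | ({"others": others} if others else {})

print(top_n_plus_others(requests_by_endpoint))
```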
Bad: y-axis from 99.0 to 100.0 (makes tiny changes look massive)
Good: axis ranges that match the meaning of the metric and avoid exaggeration unless there is a clear reason to zoom in
A dashboard should build trust, not create drama.
Bad: averages of averages, especially for latency
Good: percentiles computed from histograms
Latency should almost always be reported as percentiles (p50, p95, p99). Averages hide tail pain.
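Here is a sketch of how a percentile is read from a histogram, mirroring the linear interpolation that Prometheus-style histogram_quantile performs; the bucket bounds and counts are illustrative:

```python
def quantile_from_histogram(buckets: list[tuple[float, int]], q: float) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count), sorted by bound.
    """
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # Linear interpolation inside the first bucket that reaches the target.
            frac = (target - prev_count) / max(count - prev_count, 1)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

latency_buckets = [(0.05, 7200), (0.1, 9100), (0.25, 9650), (0.5, 9830), (1.0, 9960), (2.5, 10000)]

print(quantile_from_histogram(latency_buckets, 0.50))  # ~0.035 s: most requests are fast
print(quantile_from_histogram(latency_buckets, 0.99))  # ~0.77 s: the tail an average would hide
```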
A runbook is a written, step-by-step guide for handling an operational task or responding to an incident. It captures the practical knowledge you need when something breaks, including how to confirm impact, how to diagnose the cause, and how to restore the system safely.
The best runbooks also reduce stress. When it is 3 AM, you want a clear checklist, not a puzzle.
Create a runbook when: an alert can page a human, a task is performed repeatedly, a procedure is risky enough that a mistake is expensive, or the knowledge currently lives in one engineer's head.
A practical rule: every paging alert should link to a runbook. If you wake someone up, you should also tell them what to do next.
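One way to enforce that rule is a small check in CI. A sketch assuming Prometheus-style alert rule files and the common runbook_url annotation convention; the file path is a placeholder:

```python
import sys
import yaml  # PyYAML

def alerts_missing_runbooks(rule_file: str) -> list[str]:
    """Return alert names in a Prometheus rule file that lack a runbook_url annotation."""
    with open(rule_file) as f:
        doc = yaml.safe_load(f) or {}
    missing = []
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" in rule and not rule.get("annotations", {}).get("runbook_url"):
                missing.append(rule["alert"])
    return missing

if __name__ == "__main__":
    missing = alerts_missing_runbooks("alerts/checkout.yml")  # placeholder path
    if missing:
        print("Paging alerts without a runbook link:", ", ".join(missing))
        sys.exit(1)
```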
A good runbook is: specific rather than vague, actionable rather than descriptive, short enough to scan under stress, and tested so that its commands still work.
Avoid runbooks that read like architecture docs. During an incident, nobody wants a long explanation. They want the next step.
A consistent template makes runbooks easier to scan. You should be able to find the right section in seconds. A typical template covers, in order: what the alert means and its impact, quick checks to confirm it, diagnosis steps, mitigation, escalation, and links to the relevant dashboards.
A runbook is only useful if it works under pressure. The reader might be new to the system, half-asleep, and trying to restore service quickly. That means your runbook has to be concrete, copy-pasteable, and opinionated.
Here are the practices that consistently separate great runbooks from useless ones.
Vague instructions create hesitation and mistakes.
Bad: “Check the database.”
Good: “Check active connections on the primary and compare against the safe threshold.”
Example:
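A minimal sketch of that check, assuming a PostgreSQL primary and psycopg2; the connection string and the 80% threshold are placeholders:

```python
import psycopg2

# Placeholder DSN: point this at the primary, not a replica.
conn = psycopg2.connect("host=db-primary.example.com dbname=orders user=oncall")

with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    (active,) = cur.fetchone()
    cur.execute("SHOW max_connections;")
    (max_conn,) = cur.fetchone()

usage = active / int(max_conn)
print(f"{active}/{max_conn} connections in use ({usage:.0%})")
print("OK" if usage < 0.8 else "WARNING: connection pool close to exhaustion")
```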
A runbook should minimize typing and improvisation. Provide commands that can be copied directly, ideally with placeholders clearly marked.
People waste time second-guessing whether a command “worked.” Show what healthy and unhealthy results look like.
Example: after the connection check above, 140 of 200 connections in use is normal; 198 of 200, with many sessions stuck in "idle in transaction", means the pool is effectively exhausted (numbers are illustrative).
Runbooks should not be long essays, but a short “why” prevents blind actions and helps engineers learn the system.
Example (illustrative): "Restart the consumer, not the broker: the broker retains the queued messages, so restarting the stuck consumer clears the backlog without losing data."
Runbooks should branch. If everything is a linear list, people will follow irrelevant steps and waste time.
Example (endpoints and step numbers illustrative): "If the error rate is elevated only for /checkout, continue to step 4 (payment provider checks). If it is elevated across all endpoints, skip to step 7 (database checks)."
During incidents, ambiguity about escalation slows response. Be explicit about when to escalate, to whom, and what to include.
Example (threshold and ownership illustrative): "If the error rate is still above 5% fifteen minutes after mitigation, page the database on-call and include the incident channel link, the current error rate, and what you have already tried."
Not all runbooks are the same. Different runbooks serve different goals, and treating them as interchangeable usually leads to either missing detail during incidents or overly long documents that nobody reads.
A useful way to categorize them is by when they are used and how specific the trigger is.
These are the most common and the most important for on-call. The trigger is explicit, the response needs to be fast, and the runbook should be tightly scoped.
Rule of thumb: one alert, one runbook, linked directly from the alert notification.
Example (names illustrative): a HighCheckoutLatency alert whose notification links straight to a runbook covering only that alert: confirm impact on the checkout latency dashboard, check recent deployments, then roll back or escalate.
These are procedures for routine, repeatable tasks. The goal is correctness and safety, not urgent incident response.
These are broader investigation guides used when you have symptoms, but no single alert points to a clear root cause. They are less checklist-driven and more diagnostic.
Dashboards, runbooks, and alerts are not separate tools. They form a single incident response loop. When they are properly connected, on-call work becomes fast and predictable instead of frantic guessing.
A good flow looks like this: an alert fires and links to its runbook, the runbook's first steps point at the right dashboard, the dashboard narrows the cause, and the runbook guides the fix or the escalation.

Every alert notification includes a one-line summary of the impact, a link to the matching runbook, and a link to the relevant dashboard.

Every runbook includes links to the dashboards used in its diagnosis steps, the exact commands to run, and the escalation path.

Every dashboard panel links to the runbooks for its alerts, the underlying logs and traces, and the next, more detailed dashboard.
Notice how every step has a direct link to the next. No searching. No “where is that dashboard again?”
Dashboards and runbooks rot over time. Systems change, teams change, and links break. A lightweight maintenance process keeps them reliable.
For dashboards:

| Task | Frequency | Owner |
|---|---|---|
| Review for accuracy | Monthly | Service team |
| Remove unused dashboards | Quarterly | Platform team |
| Update after architecture changes | As needed | Service team |
| Performance optimization | Quarterly | Platform team |
For runbooks:

| Task | Frequency | Owner |
|---|---|---|
| Verify commands still work | Quarterly | Service team |
| Update after incidents | After each incident | Incident responder |
| Review for accuracy | Every six months | Service team |
| Remove obsolete runbooks | Annually | Platform team |
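Some of this maintenance can be automated. A sketch that flags dead links inside runbook files; the directory layout and URL pattern are assumptions:

```python
import re
from pathlib import Path

import requests

URL_PATTERN = re.compile(r'https?://[^\s)>"]+')

def dead_links(runbook_dir: str = "runbooks") -> list[tuple[str, str]]:
    """Return (file, url) pairs for links in runbook markdown that no longer resolve."""
    broken = []
    for path in Path(runbook_dir).glob("*.md"):
        for url in URL_PATTERN.findall(path.read_text()):
            try:
                resp = requests.head(url, allow_redirects=True, timeout=5)
                if resp.status_code >= 400:
                    broken.append((path.name, url))
            except requests.RequestException:
                broken.append((path.name, url))
    return broken

for name, url in dead_links():
    print(f"{name}: broken link {url}")
```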
Dashboards make observability data accessible: they turn raw metrics into visualizations that answer "is the system healthy, what changed, and where is the problem" at a glance.

Runbooks make knowledge actionable: they turn the expertise of senior engineers into step-by-step procedures that whoever is on call can follow.

Integration is key: alerts link to runbooks, runbooks link to dashboards, and dashboards link back to the underlying data, so responders never have to go searching.
Together, dashboards and runbooks enable anyone on call to respond effectively to incidents, not just the experts who built the system.