Design Monitoring and Alerting System

Ashish Pratap Singh

medium

In this chapter, we'll explore the high-level design of a monitoring and alerting system, including how it works, the key components involved, and the architecture behind real-time detection and alerting.

Let’s start by clarifying the requirements.

1. Clarifying Requirements

Before diving into the design, let's narrow down the scope of the problem. In an interview, this usually happens through a short discussion between candidate and interviewer; the requirements below are what such a discussion might settle on.

1.1 Functional Requirements

  • Data Collection: Collect diverse metrics (CPU, memory, disk I/O, latency, error rates) from various sources (application servers, databases, containers).
  • Data Storage: Efficiently store and manage metrics data with configurable retention periods.
  • Alerting: Allow users to define rules like “CPU usage > 90% for 5 minutes” and trigger alerts accordingly.
  • Notifications: Send alerts via email, SMS, Slack, PagerDuty, or other integrations.
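To make the alerting requirement concrete, here is a minimal sketch of how a rule like “CPU usage > 90% for 5 minutes” could be evaluated against a stream of samples. The `ThresholdRule` class, its field names, and the one-sample-per-minute data are assumptions made for illustration; the chapter does not prescribe a specific rule engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    """Hypothetical rule: fire when a value stays above `threshold` for `for_seconds`."""
    threshold: float
    for_seconds: float
    breach_started_at: Optional[float] = None  # timestamp when the current breach began

    def observe(self, timestamp: float, value: float) -> bool:
        """Feed one sample; return True if the rule is firing at this sample."""
        if value <= self.threshold:
            # Any sample at or below the threshold resets the breach window.
            self.breach_started_at = None
            return False
        if self.breach_started_at is None:
            self.breach_started_at = timestamp
        return timestamp - self.breach_started_at >= self.for_seconds


# "CPU usage > 90% for 5 minutes" expressed with the sketch above.
rule = ThresholdRule(threshold=90.0, for_seconds=300)

# Assumed input: one (timestamp_seconds, cpu_percent) sample per minute.
samples = [(0, 95.0), (60, 97.0), (120, 93.0), (180, 96.0), (240, 92.0), (300, 94.0)]
firing = [rule.observe(t, v) for t, v in samples]
```

In this sketch the rule only fires once the breach has lasted the full 300 seconds, and a single sample back under the threshold restarts the clock. A production system would likely also need to handle missing samples and out-of-order timestamps.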

1.2 Non-Functional Requirements

  • Scalability: Handle millions of metrics per second across large, distributed systems.
  • Availability: Stay operational even during failures or partial outages.
  • Low Latency: Ensure alerts and dashboards reflect recent data within 1–2 minutes.
  • Durability: Preserve historical data reliably for long-term analysis.
  • Extensibility: Make it easy to integrate new metric sources, alert types, and notification channels.
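One possible way to meet the extensibility requirement for notification channels is a small registry that maps a channel name to a send function, so adding a channel does not require touching the dispatch code. The names `NOTIFIERS`, `register`, and `notify` are hypothetical, and the `sent` list stands in for real delivery to email or Slack.

```python
from typing import Callable, Dict, List

# Hypothetical registry: channel name -> function that delivers a message.
Notifier = Callable[[str], None]
NOTIFIERS: Dict[str, Notifier] = {}

sent: List[str] = []  # stand-in for real delivery, used only in this sketch


def register(channel: str) -> Callable[[Notifier], Notifier]:
    """Decorator that adds a send function to the registry under `channel`."""
    def decorator(fn: Notifier) -> Notifier:
        NOTIFIERS[channel] = fn
        return fn
    return decorator


@register("email")
def send_email(message: str) -> None:
    sent.append(f"email: {message}")


@register("slack")
def send_slack(message: str) -> None:
    sent.append(f"slack: {message}")


def notify(channels: List[str], message: str) -> None:
    """Dispatch one alert message to every requested channel."""
    for channel in channels:
        NOTIFIERS[channel](message)


notify(["email", "slack"], "CPU usage > 90% for 5 minutes")
```

Supporting SMS or PagerDuty in this sketch would mean adding one more decorated function; `notify` itself stays unchanged.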

2. Capacity Estimation

Metrics Ingestion

  • Total metrics per minute: 10,000 × 100 = 1,000,000 metrics/minute
  • Per second: 1,000,000 ÷ 60 ≈ 16,700 metrics/sec

Storage Estimation

  • Average size per metric: 150 bytes (including metadata)
  • Per second: 16,700 × 150 ≈ 2.5 MB/sec
  • Per day: 2.5 MB/sec × 86,400 sec ≈ 216 GB/day

Alerting Load

If 0.1% of the metrics generate alerts:

  • 0.1% of 1,000,000 = 1,000 alerts/minute ≈ 17 alerts/sec
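The arithmetic in this section can be reproduced with a short script, which makes it easy to re-run the estimate with different inputs. The constant names are illustrative, and the script assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), which is what the figures above imply.

```python
# Inputs taken from the estimates in this section.
METRICS_PER_MINUTE = 10_000 * 100   # 1,000,000 metrics/minute
BYTES_PER_METRIC = 150              # average size, including metadata
ALERT_FRACTION = 0.001              # 0.1% of metrics generate alerts
SECONDS_PER_DAY = 86_400

# Ingestion rate.
metrics_per_sec = METRICS_PER_MINUTE / 60

# Storage throughput, in decimal units.
mb_per_sec = metrics_per_sec * BYTES_PER_METRIC / 1e6
gb_per_day = mb_per_sec * SECONDS_PER_DAY / 1e3

# Alerting load.
alerts_per_minute = METRICS_PER_MINUTE * ALERT_FRACTION
alerts_per_sec = alerts_per_minute / 60

print(f"{metrics_per_sec:,.0f} metrics/sec")
print(f"{mb_per_sec:.1f} MB/sec, {gb_per_day:.0f} GB/day")
print(f"{alerts_per_minute:,.0f} alerts/minute, {alerts_per_sec:.0f} alerts/sec")
```

The unrounded ingestion rate is about 16,667 metrics/sec; the section rounds this to 16,700, which does not change the storage figures at this precision.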

3. High-Level Architecture
