A Monitoring and Alerting System continuously collects metrics and logs from applications, servers, and infrastructure components. It analyzes this data to detect performance issues, errors, or anomalies and promptly notifies the appropriate teams so they can take action before users are impacted.
Popular tools that implement these ideas include Prometheus, Datadog, and Nagios.
In this chapter, we'll explore the high-level design of such a system, including how it works, the key components involved, and the architecture behind real-time detection and alerting.
Let’s start by clarifying the requirements.
Before diving into the design, let's narrow down the scope of the problem. Here's an example of how a discussion between the candidate and the interviewer might flow:
Candidate: "Should the system support both metrics and logs?"
Interviewer: "Lets focus only on metrics for this design"
Candidate: "Do we need to support real-time monitoring?"
Interviewer: "Yes, metrics should be collected and evaluated in near real time."
Candidate: "How should alerts be delivered?"
Interviewer: "Alerts should be sent through multiple channels such as email, SMS, and integrations with tools like Slack."
Candidate: "Should the system support custom alerting rules?"
Interviewer: "Yes, users should be able to define thresholds, anomaly detection rules, or queries."
Candidate: "Do we need visualization dashboards?"
Interviewer: "No, lets skip visualizations and dashboards for now."
Candidate: "What kind of scale should we assume?"
Interviewer: "Assume millions of metrics per second from thousands of servers and services."
Candidate: "Should the system support historical analysis?"
Interviewer: "Yes, metrics should be retained for weeks or months."
Let's put some rough numbers behind these requirements with a quick back-of-the-envelope estimation:

Number of servers: 10,000
Each server emits: 100 metrics per minute
Total ingestion: 10,000 × 100 = 1,000,000 metrics per minute (roughly 16,700 metrics per second)
If 0.1% of the metrics generate alerts: 1,000,000 × 0.001 = about 1,000 alerts per minute
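The same arithmetic as a tiny script makes the inputs easy to tweak; the server count, per-server rate, and 0.1% alerting fraction are simply the assumptions stated above.

```python
# Back-of-the-envelope estimate using the assumptions above.
servers = 10_000
metrics_per_server_per_minute = 100
alerting_fraction = 0.001  # 0.1% of metrics generate alerts

metrics_per_minute = servers * metrics_per_server_per_minute  # 1,000,000
metrics_per_second = metrics_per_minute / 60                  # ~16,667
alerts_per_minute = metrics_per_minute * alerting_fraction    # ~1,000

print(f"Ingestion: {metrics_per_minute:,} metrics/min (~{metrics_per_second:,.0f}/sec)")
print(f"Alerts: ~{alerts_per_minute:,.0f} per minute")
```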