A notification service is a system that delivers messages or alerts to users across multiple channels such as email, SMS, push notifications, or in-app messages.
Examples include Firebase Cloud Messaging (FCM), Amazon SNS, and notification systems built into apps like WhatsApp or Twitter.
In this chapter, we will walk through the high-level design of a notification service.
Before diving into the design, let's outline the functional and non-functional requirements.
This means the system should handle:
Assume an average notification and user data size of 1 KB.
On a high level, our system will consist of the following components:
The Notification Service is the entry point for all notification requests, either from external applications or internal systems. It exposes APIs that various clients can call to trigger notifications.
These could be requests to send transactional notifications (e.g., password reset emails), promotional notifications (e.g., discount offers), or system alerts (e.g., downtime warnings).
Each request is validated to ensure it contains all the necessary information, such as the recipient’s ID, notification type, message content, and channels through which the notification should be sent (email, SMS, etc.).
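The validation step can be sketched as follows. This is a minimal illustration, not the service's actual API: the `NotificationRequest` fields, the set of notification types, and the channel names are assumptions chosen to match the examples in this chapter.

```python
from dataclasses import dataclass, field

# Hypothetical channel and type names, used only for illustration.
SUPPORTED_CHANNELS = {"email", "sms", "push", "in_app"}
NOTIFICATION_TYPES = {"transactional", "promotional", "system_alert"}


@dataclass
class NotificationRequest:
    recipient_id: str
    notification_type: str
    message: str
    channels: list[str] = field(default_factory=list)


def validate_request(req: NotificationRequest) -> list[str]:
    """Return a list of validation errors; an empty list means the request is valid."""
    errors = []
    if not req.recipient_id:
        errors.append("recipient_id is required")
    if req.notification_type not in NOTIFICATION_TYPES:
        errors.append(f"unknown notification_type: {req.notification_type!r}")
    if not req.message.strip():
        errors.append("message must not be empty")
    if not req.channels:
        errors.append("at least one channel is required")
    unknown = sorted(set(req.channels) - SUPPORTED_CHANNELS)
    if unknown:
        errors.append(f"unsupported channels: {unknown}")
    return errors
```

Returning every error at once, rather than failing on the first, lets the calling application fix a malformed request in a single round trip.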
For notifications that need to be sent at a future date or time, the Notification Service integrates with the Scheduler Service.
After processing the request, the Notification Service pushes the notifications to a Notification Queue (e.g., Kafka or RabbitMQ).
The User Preference Service allows users to control how they receive notifications.
It stores and retrieves individual user preferences for receiving notifications across different channels.
The service tracks which types of notifications users have explicitly opted into or out of.
Example: Users may opt out of marketing or promotional content.
To prevent users from being overwhelmed by notifications, the User Preference Service enforces frequency limits for certain types of notifications, especially promotional messages.
Example: A user may receive at most 2 promotional notifications per day.
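A rough sketch of how opt-outs and a daily promotional cap could be enforced together is shown below. The `PreferenceStore` class and its in-memory dictionaries are hypothetical stand-ins for the User Preference Service's real storage, and the limit of 2 simply mirrors the example above.

```python
from collections import defaultdict
from datetime import date


class PreferenceStore:
    """In-memory stand-in for the User Preference Service's storage (illustrative only)."""

    def __init__(self, daily_promo_limit: int = 2):
        self.daily_promo_limit = daily_promo_limit
        self.opted_out = defaultdict(set)    # user_id -> {notification types}
        self.promo_sent = defaultdict(int)   # (user_id, day) -> promotional sends so far

    def opt_out(self, user_id: str, notification_type: str) -> None:
        self.opted_out[user_id].add(notification_type)

    def allow(self, user_id: str, notification_type: str, today: date) -> bool:
        """Return True and record the send if the notification may go out."""
        if notification_type in self.opted_out[user_id]:
            return False
        if notification_type == "promotional":
            key = (user_id, today)
            if self.promo_sent[key] >= self.daily_promo_limit:
                return False
            self.promo_sent[key] += 1
        return True
```

Keying the counter on `(user_id, day)` means the cap resets naturally at the day boundary. A production system would more likely keep these counters in a shared store with an expiry, so that every service instance sees the same count.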
The Scheduler Service is responsible for storing and tracking scheduled notifications, that is, notifications that need to be sent at a specific future time.
These can include reminders, promotional campaigns, or other time-sensitive notifications that are not sent immediately but must be triggered based on a predefined schedule.
Example: A promotional message might be scheduled for delivery next week.
Once the scheduled time arrives, the Scheduler Service pulls the notification from its storage and sends it to the Notification Queue.
The Notification Queue acts as a buffer between the Notification Service and the Channel Processors.
By decoupling the notification request submission from the notification delivery, the queue enables the system to scale much more effectively, particularly during high-traffic periods.
The Queue System provides guarantees around message delivery.
Depending on the use case, it can be configured for:
The Channel Processors are responsible for pulling notifications from the Notification Queue and delivering them to users via specific channels, such as email, SMS, push notifications, and in-app notifications.
By decoupling the Notification Service from the actual delivery, Channel Processors enable independent scaling and asynchronous processing of notifications.
This setup allows each processor to focus on its designated channel, ensuring reliable delivery with built-in retry mechanisms and handling failures efficiently.
The Database/Storage layer manages large volumes of data, including notification content, user preferences, scheduled notifications, delivery logs, and metadata.
The system requires a mix of storage solutions to support various needs:
An external system (e.g., an e-commerce platform, a system alert generator, or a marketing system) generates a notification request.
The Notification Service (via an API Gateway / Load Balancer) receives the notification request.
The request is authenticated and validated to ensure it’s coming from an authorized source and all necessary information (recipient, message, channels, etc.) is present and correct.
The Notification Service queries the User Preference Service to retrieve:
If the notification is scheduled for future delivery (e.g., a reminder for tomorrow or a marketing email next week), the Notification Service sends it to the Scheduler Service. The Scheduler Service stores the notification along with its scheduled delivery time in a time-based or NoSQL database that supports efficient time-range queries.
The scheduled_notifications table is partitioned on scheduled_time so that the system can efficiently retrieve only the notifications that fall within the relevant time range, rather than scanning the entire table.
The Scheduler Service continuously queries the storage for notifications that are due for delivery.
Example: Every minute (or at a finer interval), the service queries for notifications that need to be delivered in the next time window (e.g., the next 1–5 minutes).
When the scheduled time arrives, the Scheduler Service takes the notification and sends it to the Notification Queue.
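The polling loop described above can be approximated in memory. In this sketch a min-heap ordered by scheduled time stands in for the partitioned `scheduled_notifications` table, and a plain list stands in for the Notification Queue. The `Scheduler` class and its method names are assumptions made for illustration.

```python
import heapq
from datetime import datetime, timedelta


class Scheduler:
    """In-memory sketch; a real service would query storage partitioned on scheduled_time."""

    def __init__(self):
        self._heap = []   # entries are (scheduled_time, sequence, notification)
        self._seq = 0     # tie-breaker so notification dicts are never compared

    def schedule(self, scheduled_time: datetime, notification: dict) -> None:
        heapq.heappush(self._heap, (scheduled_time, self._seq, notification))
        self._seq += 1

    def poll(self, now: datetime, queue: list) -> int:
        """Move every notification due at or before `now` onto the queue."""
        moved = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, notification = heapq.heappop(self._heap)
            queue.append(notification)
            moved += 1
        return moved
```

Calling `poll` once per interval touches only the entries that are due, which is the same property the time-based partitioning gives the database query: work is proportional to what is due, not to everything that is scheduled.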
Based on the user’s preferences and the request, the Notification Service uses templates (if needed) to dynamically generate and format the message for each channel:
Once the Notification Service has created and formatted the messages for the required channels, it places each message into the respective topic in the Notification Queue System (e.g., Kafka, RabbitMQ, AWS SQS).
Each channel (email, SMS, push, etc.) has its own dedicated topic, ensuring that the messages are processed independently by the relevant Channel Processors.
Example: If the notification needs to be sent via email, SMS, and push, the Notification Service generates three messages, each tailored to the respective channel.
These topics allow each Channel Processor to focus on consuming messages relevant to its channel, reducing complexity and improving processing efficiency.
Each message contains the notification payload, channel-specific information, and metadata (such as priority and retry count).
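As a sketch of this fan-out, the function below builds one message per channel and keys each by the topic it would be published to. The `notifications.<channel>` topic naming and the exact message fields are assumptions, and actual publishing to Kafka, RabbitMQ, or SQS is deliberately left out.

```python
def build_channel_messages(notification_id: str, recipient_id: str, body: str,
                           channels: list[str], priority: str = "normal") -> dict[str, dict]:
    """Return one message per channel, keyed by the topic it would be published to."""
    messages = {}
    for channel in channels:
        topic = f"notifications.{channel}"   # hypothetical topic naming scheme
        messages[topic] = {
            "notification_id": notification_id,
            "recipient_id": recipient_id,
            "channel": channel,
            "payload": body,
            "metadata": {"priority": priority, "retry_count": 0},
        }
    return messages
```

For a notification sent via email, SMS, and push, this yields three independent messages, so a slow or failing channel does not hold up the other two.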
The Notification Queue stores the messages until the relevant Channel Processors pull them for processing.
Each Channel Processor acts as a consumer of the queue and is responsible for consuming its own messages:
Each Channel Processor handles the delivery of the notification through the specified channel:
Each Channel Processor waits for an acknowledgment from the external provider:
The Channel Processors log each notification's status in the notification_logs table for future reference, auditing, and reporting.
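A minimal sketch of status logging is shown below, using an in-memory SQLite database in place of the real log store. Only the table name `notification_logs` comes from the text; the column names and the helper functions are assumptions.

```python
import sqlite3


def open_log_store() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical notification_logs schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE notification_logs (
               notification_id TEXT,
               channel         TEXT,
               status          TEXT,
               logged_at       TEXT
           )"""
    )
    return conn


def log_status(conn: sqlite3.Connection, notification_id: str, channel: str,
               status: str, logged_at: str) -> None:
    """Append one delivery status row; parameters are bound to avoid SQL injection."""
    conn.execute(
        "INSERT INTO notification_logs VALUES (?, ?, ?, ?)",
        (notification_id, channel, status, logged_at),
    )
    conn.commit()
```

Writing one row per notification and channel keeps the log append-only, which suits auditing and makes later archiving by age straightforward.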
If a notification delivery fails due to a temporary issue (e.g., third-party provider downtime), the Channel Processor will attempt to resend the notification.
Typically, an exponential backoff strategy is used, where each retry is delayed by progressively longer intervals.
If the notification remains undelivered after a set number of retries, it is moved to the Dead Letter Queue (DLQ) for further handling.
Administrators can then manually review and reprocess messages in the DLQ as needed.
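The retry and dead-letter flow can be sketched as follows. The `TemporaryDeliveryError` type, the `send` callable, and the default of three retries with a one-second base delay are all assumptions. A list stands in for the DLQ, and the `sleep` function is injectable so the logic can be exercised without real waiting.

```python
import time
from typing import Callable


class TemporaryDeliveryError(Exception):
    """Hypothetical error type for transient provider failures."""


def backoff_delays(base_seconds: float, max_retries: int) -> list[float]:
    """Delays before each retry: base, 2 * base, 4 * base, ..."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]


def deliver_with_retry(message: dict, send: Callable[[dict], None],
                       dead_letter_queue: list, max_retries: int = 3,
                       base_seconds: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try once, then retry up to max_retries times; move the message to the DLQ on failure."""
    delays = backoff_delays(base_seconds, max_retries)
    for attempt in range(max_retries + 1):
        try:
            send(message)
            return True
        except TemporaryDeliveryError:
            if attempt == max_retries:
                break                              # retries exhausted
            message["retry_count"] = attempt + 1   # record how many retries were scheduled
            sleep(delays[attempt])
    dead_letter_queue.append(message)
    return False
```

With the defaults, a failing message is retried after 1, 2, and 4 seconds before it lands in the DLQ. Real deployments often add random jitter to these delays so that many failed messages do not all retry at the same instant.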
The system should be designed for horizontal scalability, meaning components can scale by adding more instances as the load increases.
To efficiently handle large datasets, particularly for user data and notification logs, sharding and partitioning distribute the load across multiple databases or geographic regions:
Implement caching with solutions like Redis or Memcached to store frequently accessed data, such as user preferences.
Caching reduces the load on the database and improves response times for real-time notifications by avoiding repeated database lookups.
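One common way to apply this is the cache-aside pattern, sketched below. A dictionary stands in for Redis or Memcached, and the `PreferenceCache` class, the `load_from_db` callable, and the five-minute TTL are assumptions for illustration.

```python
import time
from typing import Callable


class PreferenceCache:
    """Cache-aside lookup with a TTL; a dict stands in for Redis or Memcached."""

    def __init__(self, load_from_db: Callable[[str], dict], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load_from_db = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}   # user_id -> (expires_at, preferences)

    def get(self, user_id: str) -> dict:
        entry = self._entries.get(user_id)
        if entry is not None and entry[0] > self._clock():
            return entry[1]                      # cache hit
        prefs = self._load_from_db(user_id)      # cache miss or expired entry
        self._entries[user_id] = (self._clock() + self._ttl, prefs)
        return prefs

    def invalidate(self, user_id: str) -> None:
        """Call after a preference update so stale data is not served."""
        self._entries.pop(user_id, None)
```

The TTL bounds how long a stale preference can be served, and explicit invalidation on update matters most for opt-outs, where honoring a user's change promptly is important.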
For high availability, data (e.g., user preferences, logs) should be replicated across multiple data centers or regions. This ensures that even if one region fails, the data is available elsewhere.
Multi-AZ Replication: Store data in multiple availability zones to provide redundancy.
A load balancer should be used to distribute incoming traffic evenly among instances of the Notification Service, ensuring that no single instance becomes a bottleneck.
To ensure smooth operation at scale, the system should have:
Implement robust authentication (e.g., OAuth 2.0) for all incoming requests to the notification service. Use Role-Based Access Control (RBAC) to limit access to critical services.
Protect the service from abuse by implementing rate limiting on the API gateway to mitigate denial-of-service (DoS) attacks.
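Rate limiting is often implemented with a token bucket, and a minimal sketch follows. The text does not say which algorithm the gateway uses, so the choice of token bucket, the `TokenBucket` class, and its parameters are assumptions. In practice there would be one bucket per client or API key.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket limiter: refills `rate` tokens per second, allows bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A request that finds the bucket empty would typically be rejected with HTTP 429 rather than queued, so that a misbehaving client cannot build up a backlog inside the service.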
As a notification system handles large volumes of data over time, it is important to implement a strategy for archiving old data.
Archiving involves moving outdated or less frequently accessed data (e.g., old delivery logs, notification content, and user history) from the main storage to a lower-cost, longer-term storage solution.
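The selection step of an archiving job can be sketched as a split on record age. The 90-day retention window and the `created_at` field are assumptions, and actually writing the cold rows to cheaper storage is omitted.

```python
from datetime import datetime, timedelta


def split_for_archive(logs: list[dict], now: datetime,
                      retention_days: int = 90) -> tuple[list[dict], list[dict]]:
    """Return (hot, cold): rows at or after the cutoff stay; older rows are archived."""
    cutoff = now - timedelta(days=retention_days)
    hot = [row for row in logs if row["created_at"] >= cutoff]
    cold = [row for row in logs if row["created_at"] < cutoff]
    return hot, cold
```

If the log table is already partitioned by time, the same cutoff can be applied to whole partitions instead of individual rows, which makes archiving a cheap metadata operation.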