Design Distributed Job Scheduler

Ashish Pratap Singh

In this article, we will walk through the design of a scalable distributed job scheduling service that can handle millions of jobs while remaining highly available.

Let’s begin by clarifying the requirements.

1. Requirements Gathering

Before diving into the design, let’s outline the functional and non-functional requirements.

Functional Requirements:

  1. Users can submit one-time or periodic jobs for execution.
  2. Users can cancel jobs they have submitted.
  3. The system should distribute jobs across multiple worker nodes for execution.
  4. The system should provide monitoring of job status (queued, running, completed, failed).
  5. The system should prevent the same job from being executed multiple times concurrently.
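To make the functional requirements concrete, here is a minimal in-memory sketch of a job record and the submit, cancel, and status operations. It is an illustration only, not the design this article goes on to describe: `Job`, `InMemoryJobStore`, the `command` and `interval_seconds` fields, and the extra `CANCELLED` status are all assumptions introduced for the example.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class JobStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"  # assumption: not in the status list above


# Statuses from which a job can no longer be cancelled.
TERMINAL = {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED}


@dataclass
class Job:
    job_id: str
    command: str                            # hypothetical payload: what to run
    interval_seconds: Optional[int] = None  # None means a one-time job
    status: JobStatus = JobStatus.QUEUED

    @property
    def is_periodic(self) -> bool:
        return self.interval_seconds is not None


class InMemoryJobStore:
    """Toy stand-in for the scheduler's job store (single process only)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def submit(self, command: str, interval_seconds: Optional[int] = None) -> Job:
        # Every new job starts in the QUEUED state with a fresh unique id.
        job = Job(str(uuid.uuid4()), command, interval_seconds)
        self._jobs[job.job_id] = job
        return job

    def cancel(self, job_id: str) -> bool:
        # Returns False for unknown jobs and for jobs that already finished.
        job = self._jobs.get(job_id)
        if job is None or job.status in TERMINAL:
            return False
        job.status = JobStatus.CANCELLED
        return True

    def status(self, job_id: str) -> JobStatus:
        return self._jobs[job_id].status
```

A one-time job would be submitted as `store.submit("send-report")` and a periodic one as `store.submit("backup", interval_seconds=3600)`. A real service would keep these records in a replicated database rather than in a Python dictionary.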

Non-Functional Requirements:

  • Scalability: The system should be able to schedule and execute millions of jobs.
  • High Availability: The system should be fault-tolerant with no single point of failure. If a worker node fails, the system should reschedule its jobs on other available nodes.
  • Latency: Jobs should be scheduled and executed with minimal delay.
  • Consistency: Each job should run exactly once where possible. When failures make duplicate runs unavoidable, duplication should be minimal and job results should stay consistent.
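One common way to reconcile "no concurrent execution of the same job" with "reschedule when a worker fails" is a time-limited lease: a worker must claim a job before running it, and the claim expires on its own if the worker dies. The sketch below shows the idea under stated assumptions. `LeaseTable`, `try_claim`, and `release` are hypothetical names, and the in-process lock stands in for an atomic compare-and-set in a shared store. It is not necessarily the mechanism this article's design uses.

```python
import threading
import time
from typing import Dict, Optional, Tuple


class LeaseTable:
    """In-memory stand-in for a shared store with atomic updates."""

    def __init__(self, lease_seconds: float) -> None:
        self._lease_seconds = lease_seconds
        self._lock = threading.Lock()
        # job_id -> (worker_id, expires_at)
        self._leases: Dict[str, Tuple[str, float]] = {}

    def try_claim(self, job_id: str, worker_id: str,
                  now: Optional[float] = None) -> bool:
        """Claim or renew the lease; fail if another worker holds a live one."""
        now = time.monotonic() if now is None else now
        with self._lock:
            holder = self._leases.get(job_id)
            if holder is not None and holder[0] != worker_id and holder[1] > now:
                return False  # someone else is still running this job
            # Free, expired, or our own lease: (re)take it.
            self._leases[job_id] = (worker_id, now + self._lease_seconds)
            return True

    def release(self, job_id: str, worker_id: str) -> None:
        """Drop the lease when the job finishes; only the holder may do so."""
        with self._lock:
            holder = self._leases.get(job_id)
            if holder is not None and holder[0] == worker_id:
                del self._leases[job_id]
```

A running worker would call `try_claim` periodically as a heartbeat to renew its lease. If it crashes, the lease lapses and another worker can claim the job. A worker that merely stalls past its lease can still overlap with its replacement, so this gives at-least-once execution with rare duplicates rather than strict exactly-once, which matches the "minimal duplication" wording above.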

Additional Requirements (Out of Scope):

  1. Job prioritization: The system should support scheduling based on job priority.
  2. Job dependencies: The system should handle jobs with dependencies.

2. High Level Design
