Rate Limiting and API Management

Last Updated: March 15, 2026

Ashish Pratap Singh

AI applications often rely on external APIs such as LLM providers, embedding services, vector databases, and third-party tools. These services typically enforce strict rate limits, usage quotas, and pricing tiers. Without proper control mechanisms, your system can quickly exceed limits, experience service disruptions, or incur unexpected costs.

In this chapter, we explore how to manage and protect AI systems using rate limiting and API management techniques.

Rate Limiting Algorithms for AI Endpoints

If you have built web applications before, you have probably seen rate limiting at the HTTP layer, something like "100 requests per minute per IP." AI endpoints need a more nuanced approach because not all requests are equal. A request that generates 50 tokens consumes roughly a hundredth of the resources of one that generates 5,000 tokens, so counting raw requests is a rough proxy at best.

Let's look at the two most practical algorithms and how they apply to AI traffic.

Token Bucket

The token bucket is the most intuitive rate limiting algorithm, and the name is unfortunately confusing in an AI context since "tokens" here means rate limit tokens, not LLM tokens. Think of it as a bucket that holds a fixed number of permits. Each request consumes one or more permits. The bucket refills at a steady rate. If the bucket is empty, the request either waits or gets rejected.

Here is why this works well for AI endpoints: you can make the "cost" of each request proportional to the actual resources it consumes. A short summarization request might cost 1 permit, while a long document analysis costs 10. This gives you much fairer allocation than raw request counting.

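Here is a minimal sketch of a cost-aware token bucket. Class, method, and parameter names are my own, and the capacity and refill rate are illustrative:

```python
import time

class TokenBucket:
    """Token bucket limiter where each request can cost multiple permits."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity            # max permits the bucket can hold
        self.refill_rate = refill_rate      # permits added per second
        self.permits = float(capacity)      # start with a full bucket
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.permits = min(self.capacity, self.permits + elapsed * self.refill_rate)
        self.last_refill = now

    def try_acquire(self, cost: int = 1) -> bool:
        """Consume `cost` permits if available; otherwise reject."""
        self._refill()
        if self.permits >= cost:
            self.permits -= cost
            return True
        return False

# A short summarization costs 1 permit, a long document analysis costs 10
bucket = TokenBucket(capacity=100, refill_rate=10 / 60)  # refills 10 permits/minute
print(bucket.try_acquire(cost=1))    # True
print(bucket.try_acquire(cost=10))   # True
```

The `cost` parameter is where the AI-specific fairness comes from: callers charge each request according to the tier it falls into.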

The key design decision is what "cost" means for your application. Some teams use 1 permit per request regardless of size, which is simple but unfair. Others charge based on estimated input + output tokens, which is fairer but requires you to estimate output length before the request completes. A practical middle ground is to define 3-4 cost tiers (small, medium, large, extra-large) based on the prompt template or feature being called.

Sliding Window

The token bucket allows bursts: a user could burn through all 100 permits in the first minute of the hour and then wait 59 minutes. Sometimes you want smoother traffic. The sliding window algorithm tracks requests over a rolling time window, giving you more even distribution.

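A sliding window log is the simplest correct variant: keep a timestamp per allowed request and evict anything older than the window. This sketch uses names of my choosing:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests in any rolling `window`-second span."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.timestamps: deque[float] = deque()  # times of recent allowed requests

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that have fallen out of the rolling window
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

limiter = SlidingWindowLimiter(limit=3, window=60.0)
for _ in range(4):
    print(limiter.try_acquire())   # True, True, True, False
```

For high-traffic endpoints, the memory cost of one timestamp per request can matter; the sliding window *counter* approximation (weighting the previous window's count) trades a little accuracy for constant memory.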

Which algorithm should you use? For most AI applications, the token bucket is the better default. It handles variable-cost requests naturally, and the burst tolerance is actually desirable. Users tend to interact in bursts (open the app, send five messages, close it), and penalizing bursty behavior creates a worse user experience. Use the sliding window when you need strict, even distribution, like when you are trying to stay under an upstream provider's per-minute limit.

| Algorithm | Burst Tolerance | Variable Cost | Best For |
|---|---|---|---|
| Token Bucket | Yes (up to capacity) | Native support | Per-user limits, API endpoints |
| Sliding Window | No (even distribution) | Needs adaptation | Upstream limit compliance, fair scheduling |

Layered Rate Limits

A single rate limit is rarely enough. In practice, you need multiple layers working together. Think of it like a building's security: there's a front door lock, a floor-level badge reader, and individual office keys. Each layer catches different types of abuse.

Let's build a layered rate limiter that handles all of these in one clean interface:

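Here is a simplified sketch of that idea, reusing the two algorithms from earlier. All names and limit values are illustrative, and a production version would reserve permits atomically across layers rather than consuming the global permit before the per-user check:

```python
import time
from collections import deque

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity, self.refill_rate = capacity, refill_rate
        self.permits, self.last = float(capacity), time.monotonic()

    def try_acquire(self, cost: int = 1) -> bool:
        now = time.monotonic()
        self.permits = min(self.capacity, self.permits + (now - self.last) * self.refill_rate)
        self.last = now
        if self.permits >= cost:
            self.permits -= cost
            return True
        return False

class SlidingWindow:
    def __init__(self, limit: int, window: float):
        self.limit, self.window, self.hits = limit, window, deque()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False

class LayeredRateLimiter:
    """Check global, per-user, and per-feature limits in order,
    and report which layer blocked the request."""

    def __init__(self):
        self.global_limit = SlidingWindow(limit=1000, window=60)   # smooth upstream traffic
        self.user_buckets: dict[str, TokenBucket] = {}             # bursty per-user allowance
        self.feature_limits: dict[str, SlidingWindow] = {}         # per-feature fairness

    def check(self, user_id: str, feature: str, cost: int = 1) -> tuple[bool, str]:
        if not self.global_limit.try_acquire():
            return False, "global limit reached"
        bucket = self.user_buckets.setdefault(user_id, TokenBucket(100, 100 / 3600))
        if not bucket.try_acquire(cost):
            return False, f"user {user_id} out of permits"
        window = self.feature_limits.setdefault(feature, SlidingWindow(limit=30, window=60))
        if not window.try_acquire():
            return False, f"feature '{feature}' limit reached"
        return True, "allowed"

limiter = LayeredRateLimiter()
print(limiter.check("alice", "chat", cost=1))   # (True, 'allowed')
```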

A few things to notice here. The global limiter uses a sliding window because you want smooth, even traffic toward your upstream provider. The per-user limiter uses a token bucket because users interact in bursts and you do not want to penalize natural behavior. The feature limiter uses a sliding window because you want to prevent any single feature from dominating a user's allocation.

This layered approach also makes debugging straightforward. When a request is rejected, you know exactly which layer blocked it and can give the user a specific, helpful error message rather than a generic "too many requests."

Priority Queuing

Rate limiting tells you whether a request is allowed. Priority queuing tells you which request goes first when resources are scarce. This distinction matters a lot for AI applications because you almost always have mixed workloads.

Consider a typical AI product. You have users chatting in real time (they need responses in 1-2 seconds), analysts running summarization jobs on batches of documents (they can wait minutes), and background jobs re-indexing content (they can wait hours). If all three hit your LLM endpoint at the same time and you have capacity for only one, which one should go first?

The answer is obvious: the interactive user. But without priority queuing, your system treats all requests equally, and "equally" usually means first-come-first-served: whichever request arrived first wins, regardless of urgency.

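A small sketch using `asyncio.PriorityQueue`, with queue items shaped as `(priority, timestamp, seq, payload)` tuples. The `Priority` names are illustrative; the extra sequence counter is my addition, a tie-breaker for the rare case where two timestamps collide:

```python
import asyncio
import itertools
import time
from enum import IntEnum

class Priority(IntEnum):
    INTERACTIVE = 0   # real-time chat: needs responses in seconds
    STANDARD = 1      # normal API calls
    BATCH = 2         # document jobs that can wait

_seq = itertools.count()  # tie-breaker when timestamps collide

def make_item(priority: Priority, request: str):
    # Smallest tuple dequeues first, so equal priorities
    # fall back to FIFO ordering by arrival timestamp
    return (priority, time.monotonic(), next(_seq), request)

async def main():
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    # Enqueue deliberately out of priority order
    await queue.put(make_item(Priority.BATCH, "re-index documents"))
    await queue.put(make_item(Priority.INTERACTIVE, "user chat message"))
    await queue.put(make_item(Priority.STANDARD, "summarize report"))
    while not queue.empty():
        priority, _, _, request = await queue.get()
        print(priority.name, request)

asyncio.run(main())
# INTERACTIVE user chat message
# STANDARD summarize report
# BATCH re-index documents
```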

The key insight here is that asyncio.PriorityQueue dequeues the smallest item first. Since Priority.INTERACTIVE has value 0 and Priority.BATCH has value 2, interactive requests always jump ahead. Within the same priority level, the timestamp field ensures FIFO ordering so no single request gets starved indefinitely.

Starvation Prevention

There is a real risk with strict priority queuing: low-priority requests can starve. If interactive traffic is constant, batch jobs might never run. The standard fix is priority aging, where you gradually boost the priority of requests the longer they wait:


With this approach, a batch request that has been waiting for 5 minutes gets promoted to the same priority as interactive requests. You get the benefits of prioritization without the risk of any request waiting forever.

Budget Management

Rate limits protect against traffic spikes. Budget management protects against cost overruns. They solve different problems and you need both.

Here is the core idea: every AI request has a dollar cost. You track spending in real time and enforce caps at multiple levels. When a cap is hit, new requests are either queued, downgraded (routed to a cheaper model), or rejected.

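A sketch of that pre-flight check / post-flight record pattern. Scope names, caps, and the per-million-token prices in the estimator are all illustrative, not any provider's real pricing:

```python
from collections import defaultdict

class BudgetManager:
    """Pre-flight budget checks plus post-flight recording of actual spend."""

    def __init__(self, daily_caps: dict[str, float]):
        self.daily_caps = daily_caps                     # e.g. {"global": 500.0, "user:alice": 5.0}
        self.spent: dict[str, float] = defaultdict(float)

    def can_spend(self, scope: str, estimated_cost: float) -> bool:
        cap = self.daily_caps.get(scope)
        return cap is None or self.spent[scope] + estimated_cost <= cap

    def record(self, scope: str, actual_cost: float) -> None:
        # Called after the response, once real token counts are known
        self.spent[scope] += actual_cost

def estimate_cost(input_tokens: int, est_output_tokens: int,
                  in_price: float = 3.0, out_price: float = 15.0) -> float:
    """Rough dollar estimate from per-million-token prices (illustrative numbers)."""
    return input_tokens / 1e6 * in_price + est_output_tokens / 1e6 * out_price

budget = BudgetManager({"global": 500.0, "user:alice": 5.0})
est = estimate_cost(input_tokens=1200, est_output_tokens=800)
if budget.can_spend("user:alice", est) and budget.can_spend("global", est):
    # ... call the LLM, then record what the response actually cost ...
    budget.record("user:alice", 0.013)
    budget.record("global", 0.013)
```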

There is an important subtlety here. You check the budget before the request and record the actual cost after. The estimated cost might differ from the actual cost because you do not know exactly how many tokens the LLM will generate. That is fine. The pre-flight check prevents obvious overruns, and the post-flight recording keeps the books accurate. Over time, you can tune your cost estimates based on historical data per feature.

Smart Degradation

Hard budget caps create a bad user experience. Imagine you are chatting with an AI assistant and suddenly get "budget exceeded, try again tomorrow." A better approach is graceful degradation: when a budget threshold is hit, route requests to a cheaper model instead of rejecting them.

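Degradation can be a simple routing function over the fraction of budget already spent. The model names and thresholds below are placeholders to show the shape of the policy:

```python
def pick_model(budget_used_fraction: float) -> str:
    """Route to cheaper models as the budget fills up.
    Model names and thresholds are illustrative."""
    if budget_used_fraction < 0.50:
        return "large-model"      # full quality while budget is healthy
    if budget_used_fraction < 0.80:
        return "medium-model"     # cheaper, similar on routine queries
    if budget_used_fraction < 0.95:
        return "small-model"      # cheapest still-functional option
    return "reject"               # last resort: hard cutoff

print(pick_model(0.30))   # large-model
print(pick_model(0.85))   # small-model
```

The fraction input composes naturally with the budget manager above: `spent / cap` for whichever scope is tightest.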

This gives users a progressively worse, but still functional, experience rather than a hard cutoff. Most users will never notice the degradation, because for routine queries the cheaper models produce similar enough results.

Handling Upstream Rate Limits

So far we have talked about rate limiting your own users. But you are also on the receiving end of rate limits from providers like OpenAI, Anthropic, and Google. These providers enforce limits on requests per minute (RPM), tokens per minute (TPM), and sometimes requests per day. When you hit their limits, you get HTTP 429 responses with a Retry-After header.

The naive approach is to catch 429 errors and retry with exponential backoff. That works for light traffic. But when you have 50 concurrent users all hitting the same upstream API, naive retries cause a thundering herd: all retried requests land at roughly the same time, triggering another round of 429s.

The right approach is proactive rate limiting: track your upstream provider's limits locally and throttle yourself before you hit them. Most providers include rate limit headers in every response (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, x-ratelimit-reset-requests). Use those to keep a local model of how much headroom you have.

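A sketch of a local headroom tracker. The header names follow OpenAI's conventions mentioned above; other providers use different names, and the headroom thresholds here are assumptions to tune:

```python
class UpstreamLimitTracker:
    """Track provider rate limit headers locally and throttle before hitting 429s."""

    def __init__(self, min_request_headroom: int = 5, min_token_headroom: int = 2000):
        self.remaining_requests: int | None = None
        self.remaining_tokens: int | None = None
        self.consecutive_429s = 0
        self.min_request_headroom = min_request_headroom
        self.min_token_headroom = min_token_headroom

    def update_from_headers(self, headers: dict, status_code: int) -> None:
        """Call after every upstream response to refresh the local model."""
        if status_code == 429:
            self.consecutive_429s += 1
        else:
            self.consecutive_429s = 0
        if "x-ratelimit-remaining-requests" in headers:
            self.remaining_requests = int(headers["x-ratelimit-remaining-requests"])
        if "x-ratelimit-remaining-tokens" in headers:
            self.remaining_tokens = int(headers["x-ratelimit-remaining-tokens"])

    def should_throttle(self, estimated_tokens: int) -> bool:
        """Call before every upstream request."""
        if self.consecutive_429s >= 3:
            return True  # escalation territory: alert on-call or fail over
        if (self.remaining_requests is not None
                and self.remaining_requests < self.min_request_headroom):
            return True
        if (self.remaining_tokens is not None
                and self.remaining_tokens - estimated_tokens < self.min_token_headroom):
            return True
        return False
```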

The consecutive_429s counter is useful for escalation logic. If you are getting rate limited repeatedly, something is fundamentally wrong: maybe your traffic has grown beyond your tier, or another service sharing the same API key is consuming your quota. At 3+ consecutive 429s, you might want to alert on-call or switch to a fallback provider.

Load Shedding

When upstream limits are hit and your queue is growing, you need to decide which requests to drop. This is where priority queuing and load shedding work together:

  1. Drop BACKGROUND priority requests first
  2. Then drop BATCH requests
  3. Downgrade STANDARD requests to a cheaper model
  4. Keep INTERACTIVE requests on the best available model

This is essentially triage. You protect the experience of your highest-value traffic by sacrificing lower-value work that can be retried later.
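The triage steps above can be sketched as a single policy function. The `pressure` input (0.0 idle to 1.0 saturated) and the thresholds are assumptions; a real system would derive pressure from queue depth and upstream headroom:

```python
from enum import IntEnum

class Priority(IntEnum):
    INTERACTIVE = 0
    STANDARD = 1
    BATCH = 2
    BACKGROUND = 3

def shed(request_priority: Priority, pressure: float) -> str:
    """Triage decision under load. Thresholds are illustrative."""
    if pressure > 0.70 and request_priority == Priority.BACKGROUND:
        return "drop"         # retry later
    if pressure > 0.85 and request_priority == Priority.BATCH:
        return "drop"
    if pressure > 0.95 and request_priority == Priority.STANDARD:
        return "downgrade"    # route to a cheaper model
    return "serve"            # INTERACTIVE stays on the best available model

print(shed(Priority.BACKGROUND, 0.80))   # drop
print(shed(Priority.INTERACTIVE, 0.99))  # serve
```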

API Key Management

API key management sounds mundane until your key leaks and someone mines cryptocurrency on your OpenAI account. Or until a departing team member's personal API key, hardcoded in production, gets deactivated and your entire AI pipeline goes down on a Friday evening.

Good API key management has three pillars: rotation, scoping, and tracking.

Key Rotation

Never use a single API key for everything. Create separate keys for each environment (development, staging, production) and each service. Rotate keys on a schedule, and make the rotation process automated so it actually happens.

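A minimal sketch of key metadata and a two-key ring. The class names, environment variable names, and budget figures are all placeholders; the actual secret lives in the environment or a secrets manager, never in the code:

```python
import os
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class ManagedKey:
    name: str                    # e.g. "prod-primary"; never the key material itself
    env_var: str                 # where the secret actually lives
    created_at: datetime
    daily_budget_usd: float      # caps the damage if this key leaks
    max_age: timedelta = timedelta(days=90)   # force rotation on a schedule

    @property
    def expired(self) -> bool:
        return datetime.now(timezone.utc) - self.created_at > self.max_age

    def load_secret(self) -> str:
        # Load from environment / secrets manager; never hardcode
        secret = os.environ.get(self.env_var)
        if secret is None:
            raise RuntimeError(f"no secret configured for key '{self.name}'")
        return secret

class KeyRing:
    """Two keys per environment: rotate one while the other keeps serving."""

    def __init__(self, primary: ManagedKey, secondary: ManagedKey):
        self.primary, self.secondary = primary, secondary

    def active(self) -> ManagedKey:
        # Zero-downtime rotation: fall back to the secondary while the primary rotates
        return self.secondary if self.primary.expired else self.primary

old = ManagedKey("prod-primary", "OPENAI_KEY_PRIMARY",
                 datetime.now(timezone.utc) - timedelta(days=120), 200.0)
new = ManagedKey("prod-secondary", "OPENAI_KEY_SECONDARY",
                 datetime.now(timezone.utc), 20.0)
ring = KeyRing(old, new)
print(ring.active().name)   # prod-secondary
```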

A few practices that save you from operational headaches:

  • Two keys minimum per environment. When you rotate one, the other keeps working. Zero-downtime rotation.
  • 90-day expiration maximum. Force rotation by making keys expire. If your code cannot handle rotation, you will find out in staging, not production.
  • Per-key budgets. If a key leaks, the damage is capped. The secondary production key has a $20/day budget, enough to keep things running during rotation but not enough to bankrupt you.
  • Never hardcode keys. Always load from environment variables or a secrets manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault). This is table stakes, yet a surprising number of AI projects still have keys in config files checked into git.

Putting It All Together

Let's see how these components compose into a complete API management layer. In a real system, you would wire these into your API framework (FastAPI, Flask, etc.) as middleware or dependency injection.


When to Use What

Not every AI application needs all of these components. Here is a guide for right-sizing your API management:

| Stage | What You Need | Skip For Now |
|---|---|---|
| Prototype / internal tool | Per-user token bucket, basic budget alerts | Priority queuing, key rotation, load shedding |
| Early production (< 100 users) | Layered rate limits, budget caps, upstream backoff | Priority queuing, multi-provider failover |
| Growth (100-10,000 users) | All of the above + priority queuing, key rotation | Load shedding (unless you have batch workloads) |
| Scale (10,000+ users) | Everything in this chapter, plus a dedicated API gateway (Kong, Tyk, or custom) | Nothing. You need all of it. |

The managed API gateway services (AWS API Gateway, Kong, Tyk) handle basic rate limiting and key management well. Where they fall short is AI-specific logic: token-based cost tracking, model-aware routing, and priority queuing across different workload types. That is the gap the custom code in this chapter fills.
