
Admission Control System Design: Protecting Systems from Overload

April 12, 2026 · 29 min read


Introduction

Admission control is the gatekeeper that decides whether to accept or reject incoming work based on the system's current capacity. Instead of accepting every request and degrading under load — slow responses, timeouts, cascading failures — admission control rejects excess requests upfront with a fast, clear error. The rejected client can retry or route elsewhere. The accepted clients get full-quality service.

The principle: it's better to serve 80% of requests well than 100% of requests poorly.


Why Admission Control Matters

Without admission control (overloaded system):

  Request rate: 10,000 req/s
  System capacity: 5,000 req/s

  All 10,000 requests enter the system.
  Each request gets half the resources it needs.
  Response time: 200ms → 4,000ms
  Timeouts cause retries → now 15,000 req/s
  System collapses. Everyone gets errors.

With admission control:

  Request rate: 10,000 req/s
  System capacity: 5,000 req/s

  5,000 requests admitted → served in 200ms ✓
  5,000 requests rejected instantly with 429/503
  Rejected clients retry after backoff
  System stays healthy. Admitted requests get full quality.

Admission Control Strategies

1. Rate Limiting

Limit the number of requests per time window:

import time
from collections import defaultdict

class TokenBucketRateLimiter:
    """Token bucket: allows bursts up to bucket size, refills at steady rate."""
    
    def __init__(self, rate, burst):
        self.rate = rate       # tokens per second
        self.burst = burst     # max tokens (bucket size)
        self.tokens = burst
        self.last_refill = time.monotonic()
    
    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_refill = now
        
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Per-client rate limiting
client_limiters = defaultdict(lambda: TokenBucketRateLimiter(rate=100, burst=200))

def handle_request(request):
    limiter = client_limiters[request.client_id]
    if not limiter.allow():
        return Response(status=429, body="Rate limit exceeded",
                       headers={"Retry-After": "1"})
    return process(request)
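A quick sanity check of the bucket behavior, using the same class with tiny limits so the numbers are easy to follow. Rewinding `last_refill` simply simulates idle time; note that the refill is capped at the burst size:

```python
import time

class TokenBucketRateLimiter:
    """Token bucket: allows bursts up to bucket size, refills at a steady rate."""

    def __init__(self, rate, burst):
        self.rate = rate       # tokens per second
        self.burst = burst     # max tokens (bucket size)
        self.tokens = burst
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucketRateLimiter(rate=1, burst=3)

# The full burst is admitted...
burst_results = [limiter.allow() for _ in range(3)]
# ...but the next request finds an empty bucket.
drained = limiter.allow()

# Simulate 5 seconds of idle time: the refill is capped at burst, min(3, 5) = 3.
limiter.last_refill -= 5.0
refilled = [limiter.allow() for _ in range(4)]

print(burst_results, drained, refilled)
# → [True, True, True] False [True, True, True, False]
```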

2. Concurrency Limiting

Limit the number of in-flight requests:

import asyncio

class ConcurrencyLimiter:
    """Reject requests when too many are already in progress."""
    
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_flight = 0
    
    async def handle(self, request, handler):
        # A plain counter is safe here: there is no await between the check
        # and the increment, so the event loop cannot interleave another task.
        if self.in_flight >= self.max_concurrent:
            return Response(status=503, body="Server busy")
        
        self.in_flight += 1
        try:
            return await handler(request)
        finally:
            self.in_flight -= 1

# If max_concurrent = 500 and 500 requests are being processed,
# request 501 gets an immediate 503 instead of queuing and waiting.
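The same mechanism at toy scale: two slots, three simultaneous requests. The `Response` dataclass below is a minimal stand-in for whatever response type your framework provides:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Response:              # minimal stand-in for a framework response type
    status: int
    body: str = ""

class ConcurrencyLimiter:
    """Reject requests when too many are already in progress."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_flight = 0

    async def handle(self, request, handler):
        if self.in_flight >= self.max_concurrent:
            return Response(503, "Server busy")
        self.in_flight += 1
        try:
            return await handler(request)
        finally:
            self.in_flight -= 1

async def slow_handler(request):
    await asyncio.sleep(0.05)      # simulate work holding a slot
    return Response(200, "OK")

async def main():
    limiter = ConcurrencyLimiter(max_concurrent=2)
    # Three requests arrive at once; only two slots exist.
    results = await asyncio.gather(*(limiter.handle(i, slow_handler) for i in range(3)))
    return [r.status for r in results]

print(asyncio.run(main()))   # → [200, 200, 503]
```

The third request is rejected in microseconds rather than waiting 50 ms behind the others.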

3. Load-Based Admission

Admit or reject based on real-time system load:

import psutil

class LoadBasedAdmission:
    """Reject requests when system resources are near capacity."""
    
    def __init__(self, cpu_threshold=0.85, memory_threshold=0.90):
        self.cpu_threshold = cpu_threshold
        self.memory_threshold = memory_threshold
        # Prime the counter: the first cpu_percent(interval=None) call returns 0.0
        psutil.cpu_percent(interval=None)
    
    def should_admit(self):
        cpu = psutil.cpu_percent(interval=None) / 100
        memory = psutil.virtual_memory().percent / 100
        
        if cpu > self.cpu_threshold:
            return False, f"CPU at {cpu:.0%}"
        if memory > self.memory_threshold:
            return False, f"Memory at {memory:.0%}"
        return True, "OK"

# More sophisticated: Google's CoDel-inspired approach
# Track request latency. If latency exceeds target for a sustained
# period, start dropping requests probabilistically.
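The latency-based idea in that comment can be sketched like this. This is a simplified, illustrative take, not any production implementation; the class name, thresholds, and the linear ramp for drop probability are all assumptions:

```python
import time

class LatencyBasedAdmission:
    """Shed load probabilistically when observed latency stays above a
    target for a sustained interval (CoDel-inspired, simplified sketch)."""

    def __init__(self, target_latency=0.1, interval=5.0, max_drop_prob=0.9):
        self.target = target_latency     # acceptable latency (seconds)
        self.interval = interval         # how long latency must stay high
        self.max_drop_prob = max_drop_prob
        self.above_since = None          # when latency first exceeded target
        self.drop_prob = 0.0

    def record_latency(self, latency, now=None):
        now = time.monotonic() if now is None else now
        if latency <= self.target:
            self.above_since = None      # latency recovered: stop shedding
            self.drop_prob = 0.0
        elif self.above_since is None:
            self.above_since = now       # start the sustained-overload timer
        elif now - self.above_since >= self.interval:
            # Sustained overload: ramp drops with how far over target we are.
            self.drop_prob = min(self.max_drop_prob,
                                 (latency - self.target) / latency)

    def should_admit(self, rand_value):
        # Pass random.random() in production; a parameter keeps this testable.
        return rand_value >= self.drop_prob

ctrl = LatencyBasedAdmission(target_latency=0.1, interval=5.0)
ctrl.record_latency(0.4, now=1.0)    # latency spikes: timer starts
ctrl.record_latency(0.4, now=7.0)    # still high 6 s later: start shedding
print(f"{ctrl.drop_prob:.0%}")       # → 75%
```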

4. Priority-Based Admission

Under load, admit high-priority requests and reject low-priority ones:

from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0    # Health checks, auth
    HIGH = 1        # Paid tier API calls
    NORMAL = 2      # Free tier API calls
    LOW = 3         # Background syncs, analytics

class PriorityAdmission:
    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0
        # At each load level, which priorities are admitted
        self.thresholds = {
            0.50: {Priority.CRITICAL, Priority.HIGH, Priority.NORMAL, Priority.LOW},
            0.75: {Priority.CRITICAL, Priority.HIGH, Priority.NORMAL},
            0.90: {Priority.CRITICAL, Priority.HIGH},
            0.95: {Priority.CRITICAL},
        }
    
    def should_admit(self, priority):
        load = self.in_flight / self.capacity
        
        for threshold in sorted(self.thresholds.keys(), reverse=True):
            if load >= threshold:
                return priority in self.thresholds[threshold]
        
        return True  # Under 50% load, admit everything

# At 92% load:
#   CRITICAL requests → admitted ✓
#   HIGH requests     → admitted ✓
#   NORMAL requests   → rejected (429)
#   LOW requests      → rejected (429)
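Running the 92%-load scenario from the comment above through the same class confirms the cut line:

```python
from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0    # Health checks, auth
    HIGH = 1        # Paid tier API calls
    NORMAL = 2      # Free tier API calls
    LOW = 3         # Background syncs, analytics

class PriorityAdmission:
    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0
        self.thresholds = {
            0.50: {Priority.CRITICAL, Priority.HIGH, Priority.NORMAL, Priority.LOW},
            0.75: {Priority.CRITICAL, Priority.HIGH, Priority.NORMAL},
            0.90: {Priority.CRITICAL, Priority.HIGH},
            0.95: {Priority.CRITICAL},
        }

    def should_admit(self, priority):
        load = self.in_flight / self.capacity
        for threshold in sorted(self.thresholds, reverse=True):
            if load >= threshold:
                return priority in self.thresholds[threshold]
        return True

admission = PriorityAdmission(capacity=100)
admission.in_flight = 92     # 92% load

decisions = {p.name: admission.should_admit(p) for p in Priority}
print(decisions)
# → {'CRITICAL': True, 'HIGH': True, 'NORMAL': False, 'LOW': False}
```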

Architecture: Where to Place Admission Control

Layer 1: Edge / CDN
  Block bad actors, DDoS mitigation
  WAF rules, IP rate limits
  
Layer 2: Load Balancer / API Gateway
  Per-client rate limiting
  Request routing, authentication
  Connection limits

Layer 3: Application / Service
  Concurrency limits
  Load-based admission
  Priority-based decisions
  Queue depth limits

Layer 4: Dependencies (DB, cache, downstream)
  Connection pool limits
  Query timeout budgets
  Circuit breakers

┌─────────────────────────────────────────┐
│                 Internet                 │
└────────────────────┬────────────────────┘
                     ▼
           ┌──────────────────┐
           │  CDN / WAF       │  ← IP rate limits, DDoS
           │  (Layer 1)       │
           └────────┬─────────┘
                    ▼
           ┌──────────────────┐
           │  API Gateway     │  ← Per-client rate limits
           │  (Layer 2)       │
           └────────┬─────────┘
                    ▼
           ┌──────────────────┐
           │  Application     │  ← Concurrency + priority
           │  (Layer 3)       │
           └────────┬─────────┘
                    ▼
           ┌──────────────────┐
           │  Database        │  ← Connection pool limits
           │  (Layer 4)       │
           └──────────────────┘

Defense in depth: each layer catches what the previous missed.
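Inside a single service, the layered checks can be composed as a pipeline where the first rejection wins. A hedged sketch; the check names and request fields below are illustrative, not from any specific framework:

```python
def make_admission_pipeline(checks):
    """Chain (name, check) pairs; each check returns (admit, reason)."""
    def admit(request):
        for name, check in checks:
            ok, reason = check(request)
            if not ok:
                return False, f"{name}: {reason}"
        return True, "admitted"
    return admit

# Illustrative stand-ins for the per-layer checks described above.
def ip_allowlist(request):
    return (request["ip"] not in {"203.0.113.9"}, "blocked IP")

def client_rate_limit(request):
    return (request["requests_this_second"] <= 100, "rate limit exceeded")

def concurrency_limit(request):
    return (request["in_flight"] < 500, "server busy")

pipeline = make_admission_pipeline([
    ("edge", ip_allowlist),
    ("gateway", client_rate_limit),
    ("app", concurrency_limit),
])

ok, reason = pipeline({"ip": "198.51.100.7", "requests_this_second": 250, "in_flight": 10})
print(ok, reason)   # → False gateway: rate limit exceeded
```

The gateway check fires before the application check ever runs, which is the point: each layer only sees traffic the layers above already vetted.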

Adaptive Admission Control

Instead of fixed thresholds, adapt based on observed performance:

package admission

import (
    "math"
    "math/rand"
    "sync"
)

// AdaptiveThrottle implements Google's client-side throttling.
// Client tracks its own accept/reject ratio and proactively
// drops requests when the backend is clearly overloaded.
type AdaptiveThrottle struct {
    mu       sync.Mutex
    accepts  float64  // exponentially weighted accepts
    requests float64  // exponentially weighted total requests
    decay    float64  // decay factor (e.g., 0.95)
}

func (t *AdaptiveThrottle) RecordResult(accepted bool) {
    t.mu.Lock()
    defer t.mu.Unlock()
    
    t.requests = t.requests*t.decay + 1
    if accepted {
        t.accepts = t.accepts*t.decay + 1
    }
}

func (t *AdaptiveThrottle) ShouldThrottle() bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    
    // Rejection probability: max(0, (requests - K*accepts) / (requests + 1))
    // K is a multiplier (typically 2.0) — allows 2× the accepted rate
    K := 2.0
    rejectionProb := math.Max(0, (t.requests - K*t.accepts) / (t.requests + 1))
    
    return rand.Float64() < rejectionProb
}

// When the backend accepts everything → rejectionProb ≈ 0 → send everything
// With K = 2, throttling only starts once the backend rejects more than half
// of requests: at 75% backend rejection the client drops ~50%, and at 90%
// it drops ~80%, protecting the backend from total collapse.
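Plugging numbers into the rejection-probability formula shows how client-side drops track backend rejections. This assumes steady state (the EWMA ratios have converged to the true rates), per 1,000 recent requests:

```python
def rejection_prob(requests, accepts, k=2.0):
    """max(0, (requests - K*accepts) / (requests + 1)) from the Go code above."""
    return max(0.0, (requests - k * accepts) / (requests + 1))

for rejected in (0, 500, 750, 900):
    p = rejection_prob(1000, 1000 - rejected)
    print(f"backend rejects {rejected / 10:.0f}% -> client drops {p:.0%}")
# → backend rejects 0% -> client drops 0%
# → backend rejects 50% -> client drops 0%
# → backend rejects 75% -> client drops 50%
# → backend rejects 90% -> client drops 80%
```

K = 2 means the client tolerates a backend rejecting up to half its traffic before throttling locally; larger K wastes more requests probing, smaller K recovers more slowly.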

Queue-Based Admission Control

Instead of immediate reject, buffer requests in a bounded queue:

Incoming     ┌─────────────────────────┐     Workers
Requests ───▶│  Bounded Queue (size=N) │────▶ Processing
             └─────────────────────────┘
                      │
                      ▼ (when full)
                 429 Rejected

Queue advantages:
  - Smooths short bursts (brief spikes shorter than the queue drain time)
  - Maintains ordering (FIFO or priority)
  
Queue risks:
  - If queue drains slowly, requests sit for seconds
  - By the time you process them, the client has timed out
  - You do the work but the response goes nowhere → wasted capacity
  
Fix: Set a deadline on queued requests.
  If request has waited > 2 seconds in queue, drop it.
  The client has likely already timed out and retried.
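The bounded queue plus the deadline fix can be sketched together. The class name and the `offer`/`poll` API are illustrative; explicit `now` arguments keep the example deterministic (production code would just use `time.monotonic()`):

```python
import time
from collections import deque

class DeadlineQueue:
    """Bounded FIFO that rejects when full and discards entries
    that have waited past their deadline."""

    def __init__(self, maxsize, max_wait=2.0):
        self.maxsize = maxsize
        self.max_wait = max_wait        # seconds a request may sit queued
        self.items = deque()            # (enqueue_time, request)

    def offer(self, request, now=None):
        now = time.monotonic() if now is None else now
        if len(self.items) >= self.maxsize:
            return False                # full → reject immediately (429)
        self.items.append((now, request))
        return True

    def poll(self, now=None):
        """Return the next request still worth serving, else None."""
        now = time.monotonic() if now is None else now
        while self.items:
            enqueued, request = self.items.popleft()
            if now - enqueued <= self.max_wait:
                return request
            # Past deadline: the client has likely timed out, so drop it.
        return None

q = DeadlineQueue(maxsize=2, max_wait=2.0)
print(q.offer("a", now=0.0))   # → True
print(q.offer("b", now=0.1))   # → True
print(q.offer("c", now=0.2))   # → False  (queue full)
print(q.poll(now=1.0))         # → a      (waited 1.0 s, within deadline)
print(q.poll(now=3.0))         # → None   ("b" waited 2.9 s > 2.0 s, dropped)
```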

Client-Side Behavior

Admission control works best when clients cooperate:

import time
import random
import requests  # third-party HTTP client (pip install requests)

class RetryWithBackoff:
    """Client-side retry logic for 429/503 responses."""
    
    def request(self, url, max_retries=5):
        for attempt in range(max_retries):
            response = requests.get(url)
            
            if response.status_code == 200:
                return response
            
            if response.status_code == 429:
                # Use the server-provided Retry-After header if available
                wait = int(response.headers.get("Retry-After", 0))
                if not wait:
                    # Exponential backoff with jitter
                    wait = min(2 ** attempt + random.uniform(0, 1), 30)
                time.sleep(wait)
                continue
            
            if response.status_code == 503:
                # Server overloaded — back off more aggressively
                wait = min(2 ** (attempt + 1) + random.uniform(0, 2), 60)
                time.sleep(wait)
                continue
            
            return response  # Other error, don't retry
        
        raise RuntimeError("Max retries exceeded")

Key Takeaways

  1. Reject early and clearly rather than degrading for everyone — a fast 429 is better than a slow timeout; rejected clients can retry while admitted requests get full-quality service
  2. Layer admission control at multiple levels — WAF at the edge, rate limits at the gateway, concurrency limits in the application, connection limits at the database
  3. Use priority-based shedding under load — critical paths (auth, health checks, paid tier) get priority; background and low-value work gets shed first
  4. Adaptive throttling outperforms fixed thresholds — track accept/reject ratios and adjust admission probability dynamically; Google's approach reduces requests proportionally to backend rejection rate
  5. Clients must cooperate with backoff — without exponential backoff and jitter on the client side, rejected requests create retry storms that amplify overload
  6. Drop queued requests past their deadline — if a request has waited longer than the client's timeout, processing it wastes capacity; dequeue and discard stale work


© 2026 Vidhya Sagar Thakur. All rights reserved.