Admission Control System Design: Protecting Systems from Overload
Introduction
Admission control is the gatekeeper that decides whether to accept or reject incoming work based on the system's current capacity. Instead of accepting every request and degrading under load — slow responses, timeouts, cascading failures — admission control rejects excess requests upfront with a fast, clear error. The rejected client can retry or route elsewhere. The accepted clients get full-quality service.
The principle: it's better to serve 80% of requests well than 100% of requests poorly.
Why Admission Control Matters
Without admission control (overloaded system):
Request rate: 10,000 req/s
System capacity: 5,000 req/s
All 10,000 requests enter the system.
Each request gets half the resources it needs.
Response time: 200ms → 4,000ms
Timeouts cause retries → now 15,000 req/s
System collapses. Everyone gets errors.
With admission control:
Request rate: 10,000 req/s
System capacity: 5,000 req/s
5,000 requests admitted → served in 200ms ✓
5,000 requests rejected instantly with 429/503
Rejected clients retry after backoff
System stays healthy. Admitted requests get full quality.
Admission Control Strategies
1. Rate Limiting
Limit the number of requests per time window:
import time
from collections import defaultdict
class TokenBucketRateLimiter:
    """Token bucket: allows bursts up to bucket size, refills at steady rate."""
    def __init__(self, rate, burst):
        self.rate = rate    # tokens per second
        self.burst = burst  # max tokens (bucket size)
        self.tokens = burst
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
# Per-client rate limiting
client_limiters = defaultdict(lambda: TokenBucketRateLimiter(rate=100, burst=200))

def handle_request(request):
    limiter = client_limiters[request.client_id]
    if not limiter.allow():
        return Response(status=429, body="Rate limit exceeded",
                        headers={"Retry-After": "1"})
    return process(request)
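To make the burst-then-throttle behavior concrete, here is the limiter again, reproduced verbatim so the snippet runs standalone, driven with a deliberately small bucket (the rate and burst values are illustrative):

```python
import time

class TokenBucketRateLimiter:
    """Same limiter as above, reproduced so this snippet is self-contained."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucketRateLimiter(rate=10, burst=3)
# A burst of 5 back-to-back calls: the first 3 drain the full bucket, the rest fail
results = [limiter.allow() for _ in range(5)]
```

The first three calls succeed immediately; the remaining calls fail until roughly 100 ms of refill has accrued at 10 tokens/second.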
2. Concurrency Limiting
Limit the number of in-flight requests:
class ConcurrencyLimiter:
    """Reject requests when too many are already in progress."""
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_flight = 0

    async def handle(self, request, handler):
        if self.in_flight >= self.max_concurrent:
            return Response(status=503, body="Server busy")
        self.in_flight += 1
        try:
            return await handler(request)
        finally:
            self.in_flight -= 1
# If max_concurrent = 500 and 500 requests are being processed,
# request 501 gets an immediate 503 instead of queuing and waiting.
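The same fast-fail behavior can also be built on `asyncio.Semaphore`, whose `locked()` method reports whether all permits are taken. A self-contained sketch; the tuple responses and `slow_handler` are placeholders for the `Response` and handler objects used elsewhere in this article:

```python
import asyncio

class SemaphoreLimiter:
    """Fast-fail concurrency limit built on asyncio.Semaphore."""
    def __init__(self, max_concurrent):
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def handle(self, request, handler):
        if self.semaphore.locked():   # all permits taken → reject immediately
            return ("503", "Server busy")
        async with self.semaphore:    # hold a permit for the request's duration
            return await handler(request)

async def demo():
    limiter = SemaphoreLimiter(max_concurrent=2)

    async def slow_handler(request):
        await asyncio.sleep(0.05)
        return ("200", "ok")

    # Fire 3 concurrent requests at a limit of 2: the third is rejected
    return await asyncio.gather(*(limiter.handle(i, slow_handler) for i in range(3)))

results = asyncio.run(demo())
```

Because the `locked()` check and the acquisition happen with no `await` in between, they are atomic within a single event loop, so there is no race between the check and the acquire.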
3. Load-Based Admission
Admit or reject based on real-time system load:
import psutil
class LoadBasedAdmission:
    """Reject requests when system resources are near capacity."""
    def __init__(self, cpu_threshold=0.85, memory_threshold=0.90):
        self.cpu_threshold = cpu_threshold
        self.memory_threshold = memory_threshold

    def should_admit(self):
        # interval=None is non-blocking, but the very first call returns 0.0
        cpu = psutil.cpu_percent(interval=None) / 100
        memory = psutil.virtual_memory().percent / 100
        if cpu > self.cpu_threshold:
            return False, f"CPU at {cpu:.0%}"
        if memory > self.memory_threshold:
            return False, f"Memory at {memory:.0%}"
        return True, "OK"
# More sophisticated: a CoDel-inspired (controlled delay) approach:
# track request latency, and if it exceeds a target for a sustained
# period, start dropping requests probabilistically.
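A minimal sketch of that latency-based idea. The class name, window size, and drop rule here are illustrative simplifications, not CoDel's actual control law:

```python
import random
from collections import deque

class LatencyBasedAdmission:
    """Drop requests probabilistically when recent latency stays above target."""
    def __init__(self, target_ms=100, window=100):
        self.target_ms = target_ms
        self.latencies = deque(maxlen=window)  # recent request latencies (ms)

    def record_latency(self, latency_ms):
        self.latencies.append(latency_ms)

    def should_admit(self):
        if not self.latencies:
            return True
        over = sum(1 for l in self.latencies if l > self.target_ms)
        overload_fraction = over / len(self.latencies)
        # Drop probability grows with the fraction of recent requests over target
        return random.random() >= overload_fraction
```

When every recent request is under target the drop probability is zero; when every recent request is over target, everything is shed until latencies recover.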
4. Priority-Based Admission
Under load, admit high-priority requests and reject low-priority ones:
from enum import IntEnum
class Priority(IntEnum):
    CRITICAL = 0  # Health checks, auth
    HIGH = 1      # Paid tier API calls
    NORMAL = 2    # Free tier API calls
    LOW = 3       # Background syncs, analytics

class PriorityAdmission:
    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0
        # At each load level, which priorities are admitted
        self.thresholds = {
            0.50: {Priority.CRITICAL, Priority.HIGH, Priority.NORMAL, Priority.LOW},
            0.75: {Priority.CRITICAL, Priority.HIGH, Priority.NORMAL},
            0.90: {Priority.CRITICAL, Priority.HIGH},
            0.95: {Priority.CRITICAL},
        }

    def should_admit(self, priority):
        load = self.in_flight / self.capacity
        for threshold in sorted(self.thresholds.keys(), reverse=True):
            if load >= threshold:
                return priority in self.thresholds[threshold]
        return True  # Under 50% load, admit everything
# At 92% load:
# CRITICAL requests → admitted ✓
# HIGH requests → admitted ✓
# NORMAL requests → rejected (429)
# LOW requests → rejected (429)
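The threshold lookup can be checked in isolation with a condensed, function-style restatement of the class above:

```python
from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    NORMAL = 2
    LOW = 3

def admitted(load, priority):
    """Which priorities get through at a given load fraction (same table as above)."""
    thresholds = {
        0.50: {Priority.CRITICAL, Priority.HIGH, Priority.NORMAL, Priority.LOW},
        0.75: {Priority.CRITICAL, Priority.HIGH, Priority.NORMAL},
        0.90: {Priority.CRITICAL, Priority.HIGH},
        0.95: {Priority.CRITICAL},
    }
    # Find the highest threshold at or below current load
    for threshold in sorted(thresholds, reverse=True):
        if load >= threshold:
            return priority in thresholds[threshold]
    return True  # under 50% load, admit everything
```

At 92% load this confirms the outcomes listed above: CRITICAL and HIGH pass, NORMAL and LOW are rejected.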
Architecture: Where to Place Admission Control
Layer 1: Edge / CDN
Block bad actors, DDoS mitigation
WAF rules, IP rate limits
Layer 2: Load Balancer / API Gateway
Per-client rate limiting
Request routing, authentication
Connection limits
Layer 3: Application / Service
Concurrency limits
Load-based admission
Priority-based decisions
Queue depth limits
Layer 4: Dependencies (DB, cache, downstream)
Connection pool limits
Query timeout budgets
Circuit breakers
┌─────────────────────────────────────────┐
│                Internet                 │
└────────────────────┬────────────────────┘
                     ▼
           ┌──────────────────┐
           │    CDN / WAF     │ ← IP rate limits, DDoS
           │    (Layer 1)     │
           └────────┬─────────┘
                    ▼
           ┌──────────────────┐
           │   API Gateway    │ ← Per-client rate limits
           │    (Layer 2)     │
           └────────┬─────────┘
                    ▼
           ┌──────────────────┐
           │   Application    │ ← Concurrency + priority
           │    (Layer 3)     │
           └────────┬─────────┘
                    ▼
           ┌──────────────────┐
           │     Database     │ ← Connection pool limits
           │    (Layer 4)     │
           └──────────────────┘
Defense in depth: each layer catches what the previous missed.
Adaptive Admission Control
Instead of fixed thresholds, adapt based on observed performance:
package admission
import (
	"math"
	"math/rand"
	"sync"
)
// AdaptiveThrottle implements Google's client-side throttling.
// Client tracks its own accept/reject ratio and proactively
// drops requests when the backend is clearly overloaded.
type AdaptiveThrottle struct {
	mu       sync.Mutex
	accepts  float64 // exponentially weighted accepts
	requests float64 // exponentially weighted total requests
	decay    float64 // decay factor (e.g., 0.95)
}

func (t *AdaptiveThrottle) RecordResult(accepted bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.requests = t.requests*t.decay + 1
	if accepted {
		t.accepts = t.accepts*t.decay + 1
	}
}

func (t *AdaptiveThrottle) ShouldThrottle() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	// Rejection probability: max(0, (requests - K*accepts) / (requests + 1))
	// K is a multiplier (typically 2.0) — allows 2× the accepted rate
	K := 2.0
	rejectionProb := math.Max(0, (t.requests-K*t.accepts)/(t.requests+1))
	return rand.Float64() < rejectionProb
}
// When backend accepts 100% of requests → rejectionProb ≈ 0 → send everything
// With K = 2, client-side rejection only begins once the backend rejects over 50%
// When backend rejects 90% → client drops ~80% → protects backend from total collapse
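The steady-state figures in those comments follow directly from the formula; a quick arithmetic check (in Python, mirroring the Go expression):

```python
def rejection_prob(requests, accepts, K=2.0):
    """max(0, (requests - K*accepts) / (requests + 1)), as in the throttle above."""
    return max(0.0, (requests - K * accepts) / (requests + 1))

# Backend accepts everything → never throttle client-side
p_healthy = rejection_prob(1000, 1000)          # → 0.0
# Backend rejects 90% (accepts 100 of 1000) → client drops ~80% itself
p_overload = round(rejection_prob(1000, 100), 2)  # → 0.8
```

With K = 2, client-side dropping only starts once the backend's accept rate falls below 1/K = 50%, which is what keeps the throttle inactive on a healthy backend.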
Queue-Based Admission Control
Instead of immediate reject, buffer requests in a bounded queue:
Incoming      ┌─────────────────────────┐       Workers
Requests ───▶ │ Bounded Queue (size=N)  │ ────▶ Processing
              └─────────────────────────┘
                          │
                          ▼ (when full)
                     429 Rejected
Queue advantages:
- Smooths short bursts (microsecond spikes)
- Maintains ordering (FIFO or priority)
Queue risks:
- If queue drains slowly, requests sit for seconds
- By the time you process them, the client has timed out
- You do the work but the response goes nowhere → wasted capacity
Fix: Set a deadline on queued requests.
If request has waited > 2 seconds in queue, drop it.
The client has likely already timed out and retried.
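A sketch of a bounded queue that applies that fix; the size and deadline values are illustrative:

```python
import time
from collections import deque

class DeadlineQueue:
    """Bounded FIFO that rejects when full and discards entries past their deadline."""
    def __init__(self, max_size=1000, deadline_s=2.0):
        self.max_size = max_size
        self.deadline_s = deadline_s
        self.queue = deque()

    def enqueue(self, request):
        if len(self.queue) >= self.max_size:
            return False  # full → caller responds 429
        self.queue.append((time.monotonic(), request))
        return True

    def dequeue(self):
        """Return the next request still worth processing, or None."""
        while self.queue:
            enqueued_at, request = self.queue.popleft()
            if time.monotonic() - enqueued_at <= self.deadline_s:
                return request
            # Past deadline: the client has likely timed out — drop, don't process
        return None
```

Workers call `dequeue()` in a loop; stale entries are skipped for free on the way to the next live request, so no separate reaper is needed.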
Client-Side Behavior
Admission control works best when clients cooperate:
import time
import random
class RetryWithBackoff:
    """Client-side retry logic for 429/503 responses."""
    def request(self, url, max_retries=5):
        for attempt in range(max_retries):
            response = http.get(url)  # "http" stands in for your HTTP client
            if response.status == 200:
                return response
            if response.status == 429:
                # Use server-provided Retry-After if available
                wait = int(response.headers.get("Retry-After", 0))
                if not wait:
                    # Exponential backoff with jitter
                    wait = min(2 ** attempt + random.uniform(0, 1), 30)
                time.sleep(wait)
                continue
            if response.status == 503:
                # Server overloaded — back off more aggressively
                wait = min(2 ** (attempt + 1) + random.uniform(0, 2), 60)
                time.sleep(wait)
                continue
            return response  # Other error, don't retry
        raise Exception("Max retries exceeded")
Key Takeaways
- Reject early and clearly rather than degrading for everyone — a fast 429 is better than a slow timeout; rejected clients can retry while admitted requests get full-quality service
- Layer admission control at multiple levels — WAF at the edge, rate limits at the gateway, concurrency limits in the application, connection limits at the database
- Use priority-based shedding under load — critical paths (auth, health checks, paid tier) get priority; background and low-value work gets shed first
- Adaptive throttling outperforms fixed thresholds — track accept/reject ratios and adjust admission probability dynamically; Google's approach reduces requests proportionally to backend rejection rate
- Clients must cooperate with backoff — without exponential backoff and jitter on the client side, rejected requests create retry storms that amplify overload
- Drop queued requests past their deadline — if a request has waited longer than the client's timeout, processing it wastes capacity; dequeue and discard stale work