
Async Processing System Design: Decoupling Work from Request-Response Cycles

April 16, 2026 · 20 min read

Introduction

Async processing means moving work out of the synchronous request-response path. Instead of making the user wait while you send an email, resize an image, or update a search index, you acknowledge the request immediately and process the work in the background. This improves response times, enables retry logic, absorbs traffic spikes, and lets you scale producers and consumers independently.

The core pattern: the request handler enqueues a task and returns immediately. A separate worker process dequeues and processes it later — "later" being milliseconds to minutes depending on the queue depth.
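The core pattern can be sketched in-process with Python's standard-library queue and a worker thread. This is a minimal illustration, not a production setup; names like `handle_request` are hypothetical:

```python
import queue
import threading

# In-process sketch: the "API handler" enqueues and returns at once;
# a background worker drains the queue at its own pace.
task_queue = queue.Queue()
results = []

def handle_request(payload):
    """Enqueue the task and return immediately (the 202 Accepted path)."""
    task_queue.put(payload)
    return {"status": "accepted"}

def worker():
    """Dequeue and process tasks until a sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:  # sentinel: shut down
            break
        results.append(f"processed:{task}")
        task_queue.task_done()

t = threading.Thread(target=worker)
t.start()
for i in range(3):
    handle_request(f"task-{i}")  # each call returns without waiting
task_queue.put(None)
t.join()
# results now holds the three processed tasks, in FIFO order
```

In a real system the queue is an external service (SQS, RabbitMQ) and the worker is a separate process, but the shape is the same: enqueue, return, process later.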


When to Go Async

Synchronous (keep in request path):
  ✅ Reading data the user needs in the response
  ✅ Validating input
  ✅ Deducting payment (user needs confirmation)
  ✅ Anything the response depends on

Asynchronous (move out of request path):
  ✅ Sending emails and notifications
  ✅ Image/video processing
  ✅ Updating search indexes
  ✅ Generating reports or PDFs
  ✅ Syncing data to analytics systems
  ✅ Calling slow third-party APIs
  ✅ Any work that can tolerate seconds-to-minutes delay

Architecture Patterns

Simple Producer-Consumer

┌──────────┐     ┌───────────┐     ┌──────────┐
│ API      │────▶│ Queue     │────▶│ Worker   │
│ Server   │     │ (SQS/     │     │ Process  │
│          │     │  RabbitMQ)│     │          │
│ Enqueue  │     │           │     │ Dequeue  │
│ task     │     │ Buffered  │     │ & process│
└──────────┘     └───────────┘     └──────────┘

API server returns 202 Accepted immediately.
Worker processes the task at its own pace.
Queue absorbs spikes — if 1000 requests arrive in 1 second,
the worker processes them at whatever rate it can sustain.

Fan-Out for Parallel Processing

Single event → Multiple consumers process different aspects:

Order placed event
  ├──▶ Queue: email-notifications → Worker: Send confirmation email
  ├──▶ Queue: inventory-updates  → Worker: Decrement stock
  ├──▶ Queue: analytics-events   → Worker: Update dashboards
  └──▶ Queue: search-index       → Worker: Update product availability

Each consumer is independently scalable.
Is the email worker slow? Add more email workers. A backlog in the
analytics worker doesn't delay email delivery.
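The fan-out above can be sketched with plain lists standing in for queues. This is a hypothetical in-memory illustration; `publish` and `drain` are made-up names, and only three of the four topics are shown:

```python
from collections import defaultdict

# One "order placed" event is copied into several independent queues,
# each drained by its own handler.
queues = defaultdict(list)
TOPICS = ["email-notifications", "inventory-updates", "analytics-events"]

def publish(event):
    """Fan out: every topic queue receives its own copy of the event."""
    for topic in TOPICS:
        queues[topic].append(event)

handled = []

def drain(topic, handler):
    """Each consumer processes its queue independently of the others."""
    while queues[topic]:
        handler(queues[topic].pop(0))

publish({"order_id": 42})
drain("email-notifications", lambda e: handled.append(("email", e["order_id"])))
drain("inventory-updates", lambda e: handled.append(("inventory", e["order_id"])))
# analytics-events still holds its copy: draining it later (or slowly)
# has no effect on the queues already drained
```

In practice this is what SNS→SQS fan-out or a Kafka topic with multiple consumer groups gives you: one publish, N independent delivery streams.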

Task with Status Tracking

import uuid
import json
from datetime import datetime

class TaskQueue:
    def __init__(self, queue_client, db):
        self.queue = queue_client
        self.db = db
    
    def submit(self, task_type, payload):
        """Submit a task and return a task ID for status polling."""
        task_id = str(uuid.uuid4())
        
        # Store task metadata for status queries
        self.db.execute(
            "INSERT INTO tasks (id, type, status, created_at) VALUES (%s,%s,%s,%s)",
            (task_id, task_type, "pending", datetime.utcnow())
        )
        
        # Enqueue the task
        self.queue.send_message(json.dumps({
            "task_id": task_id,
            "type": task_type,
            "payload": payload,
        }))
        
        return task_id
    
    def get_status(self, task_id):
        """Check task progress."""
        row = self.db.query("SELECT status, result, error FROM tasks WHERE id=%s", (task_id,))
        return row

# API handler
def create_report(request):
    task_id = task_queue.submit("generate_report", {
        "user_id": request.user_id,
        "date_range": request.json["date_range"],
    })
    
    # Return immediately with a task ID
    return {"task_id": task_id, "status_url": f"/tasks/{task_id}"}, 202

# Client polls for status:
# GET /tasks/abc-123 → {"status": "processing", "progress": 45}
# GET /tasks/abc-123 → {"status": "complete", "result_url": "/reports/abc-123"}

Message Queue Patterns

At-Least-Once Delivery

Most queues guarantee at-least-once delivery — your worker might receive the same message twice:

class Worker:
    def process_message(self, message):
        task = json.loads(message.body)
        
        # Idempotency check: have we already processed this?
        if self.db.exists("processed_tasks", task["task_id"]):
            message.delete()  # Already done, acknowledge and skip
            return
        
        # Process the task
        result = self.do_work(task)
        
        # Mark as processed (idempotency key) BEFORE deleting from queue
        self.db.insert("processed_tasks", {
            "task_id": task["task_id"],
            "completed_at": datetime.utcnow(),
        })
        
        # Acknowledge (delete from queue)
        message.delete()

Dead Letter Queue (DLQ)

When a message fails repeatedly, move it to a DLQ for investigation:

Main Queue → Worker attempts processing
  → Failure (3 times) → Dead Letter Queue

DLQ contains poison messages that couldn't be processed.
An operator reviews them, fixes the bug, and replays them.

AWS SQS redrive policy configuration:
{
  "maxReceiveCount": 3,
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:my-queue-dlq"
}

After 3 failed attempts, message moves to DLQ automatically.
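The redrive behavior can be simulated in a few lines. This is a hypothetical in-memory sketch of the policy above, not the SQS mechanism itself (SQS counts receives across redeliveries rather than looping in one call):

```python
# After MAX_RECEIVE_COUNT failed attempts, park the message in the DLQ
# along with the last error, instead of retrying forever.
MAX_RECEIVE_COUNT = 3

def process_with_dlq(message, handler, dlq):
    """Retry up to MAX_RECEIVE_COUNT times, then move the message to the DLQ."""
    for _ in range(MAX_RECEIVE_COUNT):
        try:
            return handler(message)
        except Exception as exc:
            last_error = str(exc)
    dlq.append({"message": message, "error": last_error})
    return None

dlq = []

def poison(msg):  # a handler that always fails
    raise ValueError("cannot parse payload")

process_with_dlq({"id": "m1"}, poison, dlq)
# dlq now holds the poison message with its last error;
# a healthy message is processed normally and never reaches the DLQ
process_with_dlq({"id": "m2"}, lambda m: "ok", dlq)
```

Keeping the last error alongside the parked message is what makes the operator's review-and-replay loop practical.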

Visibility Timeout and Acknowledgment

SQS model:
  1. Worker receives message (message becomes invisible to other workers)
  2. Visibility timeout = 30 seconds
  3. Worker processes the message
  4. Worker deletes the message (acknowledgment)
  
  If worker crashes before deleting:
    → After 30 seconds, message becomes visible again
    → Another worker picks it up (at-least-once delivery)
  
  If processing takes longer than 30 seconds:
    → Message becomes visible → ANOTHER worker picks it up
    → Now two workers process the same message!
    Fix: Extend visibility timeout during processing
    Or: Set visibility timeout > max expected processing time
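The timeline above can be made concrete with a small simulation. This is a hypothetical model of the visibility mechanism (a `VisibilityQueue` class invented for illustration), driven by an explicit clock so the redelivery is easy to see:

```python
# receive() hides a message until now + timeout; if it isn't deleted
# by then, the next receive() redelivers it (at-least-once delivery).
class VisibilityQueue:
    def __init__(self, timeout):
        self.timeout = timeout
        self.messages = {}  # msg_id -> (body, visible_at)

    def send(self, msg_id, body, now=0.0):
        self.messages[msg_id] = (body, now)

    def receive(self, now):
        """Return the first visible message and hide it for `timeout`."""
        for msg_id, (body, visible_at) in self.messages.items():
            if visible_at <= now:
                self.messages[msg_id] = (body, now + self.timeout)
                return msg_id, body
        return None  # nothing visible right now

    def delete(self, msg_id):
        """Acknowledge: remove the message permanently."""
        self.messages.pop(msg_id, None)

q = VisibilityQueue(timeout=30)
q.send("m1", "resize-image")
first = q.receive(now=0)     # worker A receives; hidden until t=30
nothing = q.receive(now=10)  # still invisible: no redelivery yet
second = q.receive(now=35)   # worker A never deleted it; redelivered
```

With real SQS, the fix for long-running work is to call `ChangeMessageVisibility` periodically (a heartbeat) to extend the timeout while processing is still in progress.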

Scaling Workers

Worker scaling strategy:
  Scale based on queue depth and processing rate.

  queue_depth = 10,000 messages
  processing_rate = 100 messages/second/worker
  target_drain_time = 60 seconds

  workers_needed = queue_depth / (processing_rate × target_drain_time)
  workers_needed = 10,000 / (100 × 60) ≈ 2 workers

  Kubernetes HPA can scale on custom metrics:
    - Queue depth (SQS ApproximateNumberOfMessages)
    - Processing latency
    - Consumer lag (Kafka)
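The sizing formula above translates directly into a small helper. The function name is made up for illustration; the rounding up and the floor of one worker are the only additions to the formula in the text:

```python
import math

def workers_needed(queue_depth, processing_rate, target_drain_time):
    """Workers required to drain the queue within the target time.

    processing_rate is messages/second per worker; round up, and keep
    at least one worker running even when the queue is empty.
    """
    return max(1, math.ceil(queue_depth / (processing_rate * target_drain_time)))

# The example above: 10,000 messages, 100 msg/s per worker, drain in 60 s
workers_needed(10_000, 100, 60)  # → 2
```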
# Kubernetes: scale workers based on SQS queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_depth
          selector:
            matchLabels:
              queue: task-queue
        target:
          type: AverageValue
          averageValue: "100"  # Target 100 messages per worker

Async Processing Pitfalls

1. Fire-and-forget without retry
   Task fails silently. Nobody knows. Data is lost.
   Fix: DLQ, monitoring on queue depth and error rate.

2. No idempotency
   Message delivered twice → user gets two emails, or payment charged twice.
   Fix: Idempotency keys. Check before processing.

3. Queue as a database
   Storing millions of messages for days → queue backs up → memory issues.
   Fix: Process messages promptly. Use a database for long-term storage.

4. No backpressure
   Producers enqueue faster than consumers can process → unbounded queue growth.
   Fix: Rate-limit producers, auto-scale consumers, set queue size limits.

5. Ordering assumptions
   SQS standard queues don't guarantee order.
   Processing message 3 before message 1 can cause incorrect state.
   Fix: Use FIFO queues when order matters, or design for out-of-order.
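For pitfall 4, the simplest form of backpressure is a bounded queue: when it fills, the producer fails fast instead of letting the backlog grow without limit. A minimal sketch using the standard library (`try_enqueue` is a hypothetical name):

```python
import queue

# A bounded queue: put_nowait raises queue.Full once maxsize is reached,
# pushing the pressure back onto the producer.
bounded = queue.Queue(maxsize=2)

def try_enqueue(task):
    """Reject work when the queue is full, signaling backpressure upstream."""
    try:
        bounded.put_nowait(task)
        return True
    except queue.Full:
        return False  # caller should rate-limit, shed load, or retry later

accepted = [try_enqueue(i) for i in range(4)]
# accepted == [True, True, False, False]: the last two are rejected
```

The same idea applies at system scale: cap queue size (or alert on depth), and have producers back off or shed load when enqueues start failing.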

Key Takeaways

  1. Move non-essential work out of the request path — return 202 Accepted immediately; process emails, reports, and indexing asynchronously
  2. Queues decouple producers and consumers — producers and consumers scale independently; the queue absorbs traffic spikes
  3. Design workers to be idempotent — at-least-once delivery means your worker WILL see duplicates; use idempotency keys to handle them safely
  4. Dead letter queues catch poison messages — after N failures, park the message for investigation instead of retrying forever
  5. Scale workers on queue depth — more messages waiting = more workers needed; use HPA with custom queue metrics
  6. Provide task status for long-running work — return a task ID and a status endpoint so clients can poll for completion instead of waiting
  7. Monitor queue depth and consumer lag — a growing queue means consumers can't keep up; alert before it becomes a problem

© 2026 Vidhya Sagar Thakur. All rights reserved.