GraphQL at Scale: Where It Shines and Where It Quietly Breaks You
Schema stitching vs federation, N+1 problems in resolvers, persisted queries, and the organizational complexity of schema ownership — the things nobody tells you when you adopt GraphQL.
The Seduction
GraphQL solves real problems. Clients request exactly what they need. No over-fetching. No under-fetching. Strongly typed. Self-documenting. One endpoint. Introspection for tooling.
For frontend developers, it's a revelation. For backend developers maintaining it at scale, it's a different story.
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE GRAPHQL REALITY CURVE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Complexity │
│ ▲ │
│ │ GraphQL │
│ │ (at scale) │
│ │ ┌───────────────────────── │
│ │ ────/ │
│ │ ────/ │
│ │ ────/ │
│ │ ────/ │
│ │ ────/ │
│ │ ────/ │
│ │ ────/ REST (at scale) │
│ │ ───/───────────────────────────────────────────── │
│ │/ │
│ │ GraphQL (small) │
│ └──────────────────────────────────────────────────────────▶ │
│ Scale │
│ │
│ At small scale, GraphQL is simpler than REST. │
│ At large scale, GraphQL introduces complexity REST doesn't have. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
This post is about that inflection point — and how to navigate it.
The N+1 Problem: Your First Crisis
Every GraphQL team discovers this within their first month in production.
The Problem
# Client query
query {
posts(first: 20) {
id
title
author {
name
avatar
}
comments(first: 5) {
text
user {
name
}
}
}
}
// Naive resolver implementation
const resolvers = {
Query: {
posts: () => db.posts.findMany({ take: 20 }),
},
Post: {
author: (post) => db.users.findUnique({ where: { id: post.authorId } }),
comments: (post) => db.comments.findMany({ where: { postId: post.id }, take: 5 }),
},
Comment: {
user: (comment) => db.users.findUnique({ where: { id: comment.userId } }),
},
};
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATABASE QUERIES GENERATED │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1 query: SELECT * FROM posts LIMIT 20 │
│ 20 queries: SELECT * FROM users WHERE id = ? (one per post author) │
│ 20 queries: SELECT * FROM comments WHERE post_id = ? LIMIT 5 │
│ 100 queries: SELECT * FROM users WHERE id = ? (one per comment user) │
│ ────────────────────────────────────────────────────────────────────── │
│ TOTAL: 141 database queries for ONE GraphQL query │
│ │
│ This scales linearly with result size. 100 posts = 701 queries. │
│ Your database will not forgive you. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
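The arithmetic in the box generalizes: for P posts with C comments each, the naive resolvers issue 1 + P + P + P·C queries. A quick sketch:

```typescript
// Query count for the naive resolvers above: 1 for the posts list,
// 1 per post for the author, 1 per post for its comments page,
// and 1 per comment for the comment's user.
function naiveQueryCount(posts: number, commentsPerPost = 5): number {
  return 1 + posts + posts + posts * commentsPerPost;
}

naiveQueryCount(20);  // 141 — matches the box above
naiveQueryCount(100); // 701
```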
The Solution: DataLoader
DataLoader batches and caches requests within a single GraphQL execution:
import DataLoader from 'dataloader';
// Create loaders per-request (important!)
function createLoaders(db: Database) {
return {
userById: new DataLoader<string, User>(async (ids) => {
// Single batched query
const users = await db.users.findMany({
where: { id: { in: ids as string[] } },
});
// Return in same order as requested (DataLoader requirement)
const userMap = new Map(users.map(u => [u.id, u]));
return ids.map(id => userMap.get(id) ?? new Error(`User ${id} not found`));
}),
commentsByPostId: new DataLoader<string, Comment[]>(async (postIds) => {
const comments = await db.comments.findMany({
where: { postId: { in: postIds as string[] } },
take: 5, // This is wrong! Can't apply per-post limit here
});
// Group by postId
const commentMap = new Map<string, Comment[]>();
for (const comment of comments) {
const existing = commentMap.get(comment.postId) ?? [];
existing.push(comment);
commentMap.set(comment.postId, existing);
}
return postIds.map(id => commentMap.get(id) ?? []);
}),
};
}
// Resolvers use loaders from context
const resolvers = {
Post: {
author: (post, _, { loaders }) => loaders.userById.load(post.authorId),
comments: (post, _, { loaders }) => loaders.commentsByPostId.load(post.id),
},
Comment: {
user: (comment, _, { loaders }) => loaders.userById.load(comment.userId),
},
};
// Context factory
const server = new ApolloServer({
  typeDefs,
  resolvers,
  context: ({ req }) => ({
    // Fresh loaders for each request — critical for caching correctness
    loaders: createLoaders(db),
  }),
});
┌─────────────────────────────────────────────────────────────────────────────┐
│ WITH DATALOADER │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1 query: SELECT * FROM posts LIMIT 20 │
│ 1 query: SELECT * FROM users WHERE id IN (?, ?, ?, ...) -- authors │
│ 1 query: SELECT * FROM comments WHERE post_id IN (?, ?, ...) -- all posts │
│ 1 query: SELECT * FROM users WHERE id IN (?, ?, ?, ...) -- commenters │
│ ────────────────────────────────────────────────────────────────────── │
│ TOTAL: 4 database queries │
│ │
│ From 141 to 4. This is why DataLoader is mandatory. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
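DataLoader's batching hinges on one trick: every `.load()` call in the same tick is queued, and the batch function fires once on the next microtask. A stripped-down sketch of that mechanic (not the real library — no caching, no error handling):

```typescript
// Minimal DataLoader-style batcher: keys loaded in the same tick are
// collected and resolved with a single batch call on the next microtask.
class TinyLoader<K, V> {
  private queue: { key: K; resolve: (v: V) => void }[] = [];

  constructor(private batchFn: (keys: K[]) => Promise<V[]>) {}

  load(key: K): Promise<V> {
    return new Promise((resolve) => {
      this.queue.push({ key, resolve });
      // The first key in this tick schedules the flush; the rest piggyback
      if (this.queue.length === 1) queueMicrotask(() => this.flush());
    });
  }

  private async flush() {
    const batch = this.queue;
    this.queue = [];
    const values = await this.batchFn(batch.map((b) => b.key));
    batch.forEach((b, i) => b.resolve(values[i]));
  }
}
```

Two `load()` calls from sibling resolvers therefore produce one `batchFn` invocation — which is exactly why the 20 per-post author lookups collapse into a single `WHERE id IN (...)` query.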
The Limit Problem
Notice the bug in commentsByPostId? You can't apply per-entity limits in a batched query.
// This doesn't work:
// "Get 5 comments for each of these 20 posts"
// You'd get 5 comments total, not 5 per post
// Solution 1: Window functions (Postgres)
const commentsByPostId = new DataLoader<string, Comment[]>(async (postIds) => {
const comments = await db.$queryRaw`
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY post_id ORDER BY created_at DESC) as rn
FROM comments
WHERE post_id = ANY(${postIds}::uuid[])
) ranked
WHERE rn <= 5
`;
// ... group by postId
});
// Solution 2: Lateral joins (Postgres)
const comments = await db.$queryRaw`
SELECT c.* FROM unnest(${postIds}::uuid[]) AS pid
CROSS JOIN LATERAL (
SELECT * FROM comments
WHERE post_id = pid
ORDER BY created_at DESC
LIMIT 5
) c
`;
// Solution 3: Accept the over-fetch, filter in memory
const commentsByPostId = new DataLoader<string, Comment[]>(async (postIds) => {
const comments = await db.comments.findMany({
where: { postId: { in: postIds } },
orderBy: { createdAt: 'desc' },
// No limit — fetch all, then filter
});
const grouped = groupBy(comments, 'postId');
return postIds.map(id => (grouped[id] ?? []).slice(0, 5));
});
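Solution 3 (and the federation loaders later on) lean on a `groupBy` helper that isn't imported from anywhere above; a minimal version:

```typescript
// Minimal groupBy: buckets items by the string value of one key.
function groupBy<T>(items: T[], key: keyof T): Record<string, T[]> {
  const grouped: Record<string, T[]> = {};
  for (const item of items) {
    (grouped[String(item[key])] ??= []).push(item);
  }
  return grouped;
}
```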
Schema Design: The Organizational Minefield
Schema Ownership
In REST, ownership is clear: the team that owns /users owns user endpoints. In GraphQL, you have one schema. Who owns it?
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE SCHEMA OWNERSHIP PROBLEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ GraphQL Schema (one file, or feels like one) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ type User { ← Who owns this? │ │
│ │ id: ID! - Identity team (core fields) │ │
│ │ email: String! - Identity team │ │
│ │ profile: Profile! - Profile team │ │
│ │ orders: [Order!]! - Commerce team │ │
│ │ recommendations: [Product!]! ← ML team │ │
│ │ subscription: Subscription ← Billing team │ │
│ │ } │ │
│ │ │ │
│ │ 5 teams contribute to ONE type. │ │
│ │ Every change requires coordination. │ │
│ │ Breaking changes break everyone. │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ In REST, each team has their own endpoints. No coordination needed. │
│ In GraphQL, schema changes are cross-team by default. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Approach 1: Schema Stitching (Legacy)
Combine multiple schemas at the gateway level:
// Gateway server
import { stitchSchemas } from '@graphql-tools/stitch';
const gatewaySchema = stitchSchemas({
  subschemas: [
    {
      schema: await fetchRemoteSchema('http://users-service/graphql'),
      executor: createRemoteExecutor('http://users-service/graphql'),
      // Type merging is configured per subschema via `merge`
      merge: {
        User: {
          selectionSet: '{ id }',
          fieldName: 'user',
          args: ({ id }) => ({ id }),
        },
      },
    },
    {
      schema: await fetchRemoteSchema('http://products-service/graphql'),
      executor: createRemoteExecutor('http://products-service/graphql'),
    },
    {
      schema: await fetchRemoteSchema('http://orders-service/graphql'),
      executor: createRemoteExecutor('http://orders-service/graphql'),
    },
  ],
});
Problems with stitching:
- Gateway must understand all subschemas
- Gateway is a bottleneck for changes
- Type merging is gateway logic, not service logic
- Hard to evolve independently
Approach 2: Apollo Federation (Current Standard)
Services declare how they contribute to types:
# users-service/schema.graphql
type User @key(fields: "id") {
id: ID!
email: String!
name: String!
}
type Query {
user(id: ID!): User
me: User
}
# orders-service/schema.graphql
type User @key(fields: "id") @extends {
id: ID! @external
orders: [Order!]!
}
type Order @key(fields: "id") {
id: ID!
total: Float!
items: [OrderItem!]!
}
type Query {
order(id: ID!): Order
}
# recommendations-service/schema.graphql
type User @key(fields: "id") @extends {
id: ID! @external
recommendations: [Product!]!
}
type Product @key(fields: "id") {
id: ID!
name: String!
price: Float!
}
┌─────────────────────────────────────────────────────────────────────────────┐
│ FEDERATION QUERY FLOW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Client Query: │
│ query { │
│ user(id: "123") { │
│ name ← users-service │
│ orders { total } ← orders-service │
│ recommendations { name } ← recommendations-service │
│ } │
│ } │
│ │
│ Gateway (Apollo Router) orchestrates: │
│ │
│ 1. Query users-service: │
│ query { user(id: "123") { id name } } │
│ │
│ 2. In parallel: │
│ orders-service: │
│ query { _entities(representations: [{__typename: "User", id: "123"}])│
│ { ... on User { orders { total } } } } │
│ │
│ recommendations-service: │
│ query { _entities(representations: [{__typename: "User", id: "123"}])│
│ { ... on User { recommendations { name } } } } │
│ │
│ 3. Merge results │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Federation Trade-offs
┌─────────────────────────────────────────────────────────────────────────────┐
│ FEDERATION PROS & CONS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PROS: │
│ ───── │
│ ✓ Services own their schema contributions │
│ ✓ Services can be deployed independently │
│ ✓ Gateway is "dumb" — just routes queries │
│ ✓ Type ownership is explicit (@key, @extends) │
│ ✓ Parallel execution of subgraph queries │
│ │
│ CONS: │
│ ───── │
│ ✗ Added network hops (gateway → services) │
│ ✗ Debugging is harder (distributed tracing required) │
│ ✗ Schema composition can fail at deploy time │
│ ✗ Performance overhead for cross-service queries │
│ ✗ @key must be efficiently resolvable (N+1 at gateway level) │
│ ✗ Apollo Router/Gateway is a critical dependency │
│ ✗ _entities resolver in every service │
│ │
│ HIDDEN COST: │
│ Query that touches 4 services = 4+ network round trips │
│ Each round trip adds latency (p99 can explode) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
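The hidden cost in the box is easy to model: the first hop is sequential, the fan-out step costs as much as its slowest subgraph, plus gateway merge time. A toy model with assumed numbers:

```typescript
// Toy latency model for the federated query flow above:
// sequential first hop + slowest parallel hop + gateway merge time.
function federatedLatencyMs(
  firstHopMs: number,
  parallelHopsMs: number[],
  mergeMs: number,
): number {
  return firstHopMs + Math.max(...parallelHopsMs) + mergeMs;
}

// One slow subgraph (120ms) dominates the whole query:
federatedLatencyMs(30, [40, 120], 5); // 155
```

This is why p99 explodes: the gateway's tail latency is the max of its subgraphs' tails, not the average.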
The _entities Resolver
Every federated service must implement this efficiently:
// orders-service/resolvers.ts
const resolvers = {
  Query: {
    order: (_, { id }) => db.orders.findUnique({ where: { id } }),
  },
  User: {
    // Called once per user representation the gateway sends:
    // [{__typename: "User", id: "1"}, {__typename: "User", id: "2"}, ...]
    __resolveReference: (reference) => reference,
    // The contributed field resolves from the reference.
    // Must batch — otherwise it's N+1 at the gateway level.
    orders: (user, _, { loaders }) => loaders.ordersByUserId.load(user.id),
  },
};
// DataLoader is critical here too
function createLoaders(db: Database) {
return {
ordersByUserId: new DataLoader<string, Order[]>(async (userIds) => {
const orders = await db.orders.findMany({
where: { userId: { in: userIds as string[] } },
});
const grouped = groupBy(orders, 'userId');
return userIds.map(id => grouped[id] ?? []);
}),
};
}
Persisted Queries: Security and Performance
The Problem with Arbitrary Queries
# Malicious query
query {
users(first: 1000) {
posts(first: 1000) {
comments(first: 1000) {
replies(first: 1000) {
author {
posts(first: 1000) {
comments(first: 1000) {
# ... infinite nesting
}
}
}
}
}
}
}
}
This single query could fetch millions of records.
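Query depth limits are the blunt first defense against this. Real implementations (e.g. the `graphql-depth-limit` validation rule) walk the parsed AST; as a toy illustration, nesting depth is just brace depth:

```typescript
// Toy depth check: max '{' nesting in a query string. An AST-based
// rule also handles fragments, strings, and comments correctly —
// don't use brace counting in production.
function queryDepth(query: string): number {
  let depth = 0;
  let max = 0;
  for (const ch of query) {
    if (ch === '{') max = Math.max(max, ++depth);
    else if (ch === '}') depth--;
  }
  return max;
}

queryDepth('{ user { posts { title } } }'); // 3
```

A depth cap stops the infinite-nesting attack, but not wide queries — which is what complexity analysis is for.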
Query Complexity Analysis
import { createComplexityLimitRule } from 'graphql-validation-complexity';
const complexityRule = createComplexityLimitRule(1000, {
scalarCost: 1,
objectCost: 10,
listFactor: 10,
  // Per-field cost overrides — illustrative pseudocode; the exact hook
  // varies by library, so check your library's API before copying this shape
  fieldCost: (field, args) => {
    if (field.name === 'recommendations') {
      return 100; // ML inference is expensive
    }
    if (args.first && args.first > 100) {
      return Infinity; // Reject large first arguments
    }
    return undefined; // Use default
  },
});
const server = new ApolloServer({
typeDefs,
resolvers,
validationRules: [complexityRule],
});
┌─────────────────────────────────────────────────────────────────────────────┐
│ COMPLEXITY CALCULATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ query { │
│ users(first: 10) { # 10 × (1 + children) │
│ name # 1 (scalar) │
│ posts(first: 5) { # 5 × (1 + children) │
│ title # 1 (scalar) │
│ comments(first: 10) { # 10 × (1 + children) │
│ text # 1 (scalar) │
│ } │
│ } │
│ } │
│ } │
│ │
│ Calculation: │
│ comments: 10 × (10 + 1) = 110 │
│ posts: 5 × (10 + 1 + 110) = 605 │
│ users: 10 × (10 + 1 + 605) = 6160 │
│ │
│ Total complexity: 6160 │
│ If limit is 1000, query is rejected. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
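The calculation in the box can be reproduced in a few lines — the cost of a list field is its page size times (object cost + cost of children), with the stated costs (scalar 1, object 10) as assumptions:

```typescript
type Field =
  | { kind: 'scalar' }
  | { kind: 'list'; first: number; children: Field[] };

const SCALAR_COST = 1;
const OBJECT_COST = 10;

// Cost of a list field: page size × (object cost + cost of children).
function cost(field: Field): number {
  if (field.kind === 'scalar') return SCALAR_COST;
  const children = field.children.reduce((sum, c) => sum + cost(c), 0);
  return field.first * (OBJECT_COST + children);
}

// The query from the box: users(10) { name posts(5) { title comments(10) { text } } }
const comments: Field = { kind: 'list', first: 10, children: [{ kind: 'scalar' }] };
const posts: Field = { kind: 'list', first: 5, children: [{ kind: 'scalar' }, comments] };
const users: Field = { kind: 'list', first: 10, children: [{ kind: 'scalar' }, posts] };

cost(users); // 6160 — rejected under a 1000 limit
```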
Persisted Queries
The nuclear option: only allow pre-approved queries.
// Build time: extract queries from client code
// Usually via babel/webpack plugin
const extractedQueries = {
'abc123': 'query GetUser($id: ID!) { user(id: $id) { name email } }',
'def456': 'query GetPosts($first: Int!) { posts(first: $first) { title } }',
// ...
};
// Generate query manifest
fs.writeFileSync(
'query-manifest.json',
JSON.stringify(extractedQueries)
);
// Server: only execute registered queries
const queryManifest = JSON.parse(fs.readFileSync('query-manifest.json'));
const server = new ApolloServer({
typeDefs,
resolvers,
persistedQueries: {
cache: new InMemoryLRUCache(),
},
plugins: [
{
async requestDidStart({ request }) {
// If using APQ (Automatic Persisted Queries)
if (request.extensions?.persistedQuery) {
const hash = request.extensions.persistedQuery.sha256Hash;
if (!queryManifest[hash]) {
throw new ForbiddenError('Query not in allowlist');
}
}
// In strict mode, reject all non-persisted queries
if (!request.extensions?.persistedQuery && process.env.NODE_ENV === 'production') {
throw new ForbiddenError('Only persisted queries allowed');
}
},
},
],
});
Automatic Persisted Queries (APQ) flow:
┌─────────────────────────────────────────────────────────────────────────────┐
│ APQ PROTOCOL │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ First request (query not cached): │
│ ───────────────────────────────── │
│ 1. Client: POST { extensions: { persistedQuery: { sha256Hash: "abc..." }}} │
│ 2. Server: 404 "PersistedQueryNotFound" │
│ 3. Client: POST { query: "query {...}", extensions: { persistedQuery... }} │
│ 4. Server: Caches query, returns result │
│ │
│ Subsequent requests (query cached): │
│ ──────────────────────────────────── │
│ 1. Client: POST { extensions: { persistedQuery: { sha256Hash: "abc..." }}} │
│ 2. Server: Looks up hash, executes cached query, returns result │
│ │
│ Benefits: │
│ • Smaller request payloads (hash instead of query string) │
│ • CDN caching possible (hash as cache key) │
│ • Query allowlisting for security │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
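On the client side, the APQ hash is the SHA-256 hex digest of the exact query text (protocol version 1). A sketch using Node's crypto:

```typescript
import { createHash } from 'node:crypto';

// Build the APQ extension for a query: version 1 + SHA-256 of the
// query string. Server and client must hash the same bytes, so any
// whitespace difference between their copies breaks the lookup.
function persistedQueryExtension(query: string) {
  return {
    persistedQuery: {
      version: 1,
      sha256Hash: createHash('sha256').update(query).digest('hex'),
    },
  };
}
```

This byte-exactness is why build-time query extraction must emit precisely the strings clients will send.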
Caching: The Unsolved Problem
Why GraphQL Caching is Hard
┌─────────────────────────────────────────────────────────────────────────────┐
│ REST vs GRAPHQL CACHING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ REST: │
│ GET /users/123 │
│ Cache-Control: max-age=3600 │
│ → Cache this URL for 1 hour │
│ │
│ Simple. URL is cache key. Headers control caching. │
│ CDN caches naturally. │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ GraphQL: │
│ POST /graphql │
│ Body: { query: "{ user(id: 123) { name email orders { total } } }" } │
│ │
│ Problems: │
│ 1. POST requests aren't cached by CDNs by default │
│ 2. Same "user" query with different field selections = different caches │
│ 3. Response contains multiple entities with different cache policies │
│ 4. Partial cache hits? { name: cached, orders: fresh } │
│ 5. Cache invalidation across nested entities │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Response Caching (Full Query)
// Cache full query responses
import ApolloServerPluginResponseCache from '@apollo/server-plugin-response-cache';

const server = new ApolloServer({
  typeDefs,
  resolvers,
  plugins: [
    ApolloServerPluginResponseCache({
      // Separate cache entries per session so private data isn't shared
      sessionId: async (ctx) =>
        ctx.request.http?.headers.get('authorization') ?? null,
    }),
  ],
});
// In schema, declare cache hints
const typeDefs = gql`
type User @cacheControl(maxAge: 3600) {
id: ID!
name: String!
email: String! @cacheControl(maxAge: 0) # PII - no caching
publicProfile: Profile @cacheControl(maxAge: 3600)
orders: [Order!]! @cacheControl(maxAge: 60) # Changes more often
}
type Query {
user(id: ID!): User
publicPosts: [Post!]! @cacheControl(maxAge: 300, scope: PUBLIC)
}
`;
The catch: a response's cache policy is the minimum of all its field hints. A single maxAge: 0 field makes the whole response uncacheable.
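That minimum rule is trivial to state in code — one zero hint anywhere in the result zeroes the whole response:

```typescript
// Overall response policy = the most restrictive (minimum) field hint.
function responseMaxAge(fieldHints: number[]): number {
  return fieldHints.length === 0 ? 0 : Math.min(...fieldHints);
}

responseMaxAge([3600, 3600, 60]); // 60 — the orders hint wins
responseMaxAge([3600, 0, 60]);    // 0  — the email hint makes it uncacheable
```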
Entity Caching (Normalized)
Store entities independently, reconstruct responses:
// Conceptual normalized cache
const cache = {
'User:123': { id: '123', name: 'Alice', __typename: 'User' },
'User:456': { id: '456', name: 'Bob', __typename: 'User' },
'Post:1': { id: '1', title: 'Hello', authorId: '123', __typename: 'Post' },
'Post:2': { id: '2', title: 'World', authorId: '456', __typename: 'Post' },
};
// Query: { posts { title author { name } } }
// Can be reconstructed from cache without network request
// IF all required fields are cached
Apollo Client does this client-side. Server-side is harder because:
- Multiple clients, shared cache
- Cache invalidation across services
- Partial cache reconstruction complexity
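Serving `{ posts { title author { name } } }` from that normalized cache is a join-and-bail exercise — any missing entity forces a network fetch. A sketch over the cache shape above:

```typescript
type Entity = Record<string, any>;

// Answer { posts { title author { name } } } from a normalized cache.
// Returns null on any missing entity — a partial hit can't be served.
function readPostsFromCache(cache: Record<string, Entity>) {
  const posts = Object.keys(cache)
    .filter((key) => key.startsWith('Post:'))
    .map((key) => cache[key]);
  const result: { title: string; author: { name: string } }[] = [];
  for (const post of posts) {
    const author = cache[`User:${post.authorId}`];
    if (!author) return null; // cache miss: fall through to the network
    result.push({ title: post.title, author: { name: author.name } });
  }
  return result;
}
```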
Practical Server-Side Caching
// Cache at the resolver level
const resolvers = {
Query: {
user: async (_, { id }, { cache }) => {
const cacheKey = `user:${id}`;
const cached = await cache.get(cacheKey);
if (cached) return JSON.parse(cached);
const user = await db.users.findUnique({ where: { id } });
await cache.set(cacheKey, JSON.stringify(user), { ttl: 3600 });
return user;
},
},
User: {
orders: async (user, _, { cache, loaders }) => {
// Don't cache here — DataLoader already batches
// Caching in nested resolvers gets complicated fast
return loaders.ordersByUserId.load(user.id);
},
},
};
// Cache invalidation on mutation
const resolvers = {
Mutation: {
updateUser: async (_, { id, input }, { cache }) => {
const user = await db.users.update({ where: { id }, data: input });
// Invalidate user cache
await cache.del(`user:${id}`);
// What about cached query responses that include this user?
// What about other services that cached this user?
// This is where it gets hard.
return user;
},
},
};
Error Handling
The GraphQL Error Model
GraphQL returns 200 OK even when things fail. Errors are in the response body.
{
"data": {
"user": {
"name": "Alice",
"orders": null
}
},
"errors": [
{
"message": "Failed to fetch orders",
"path": ["user", "orders"],
"extensions": {
"code": "DOWNSTREAM_SERVICE_ERROR"
}
}
]
}
Partial failures are normal. The orders field failed, but name succeeded.
Error Design Patterns
// Pattern 1: Union types for expected errors
const typeDefs = gql`
type User {
id: ID!
name: String!
}
type UserNotFoundError {
message: String!
userId: ID!
}
type PermissionDeniedError {
message: String!
requiredRole: String!
}
union UserResult = User | UserNotFoundError | PermissionDeniedError
type Query {
user(id: ID!): UserResult!
}
`;
// Resolver
const resolvers = {
Query: {
user: async (_, { id }, { currentUser }) => {
if (!currentUser) {
return {
__typename: 'PermissionDeniedError',
message: 'Authentication required',
requiredRole: 'USER',
};
}
const user = await db.users.findUnique({ where: { id } });
if (!user) {
return {
__typename: 'UserNotFoundError',
message: `User ${id} not found`,
userId: id,
};
}
return { __typename: 'User', ...user };
},
},
UserResult: {
__resolveType: (obj) => obj.__typename,
},
};
// Client query
const query = gql`
query GetUser($id: ID!) {
user(id: $id) {
... on User {
name
}
... on UserNotFoundError {
message
userId
}
... on PermissionDeniedError {
message
requiredRole
}
}
}
`;
Error Classification
// Extend GraphQL's error class with typed extensions
import { GraphQLError } from 'graphql';
// User errors (client can fix)
class ValidationError extends GraphQLError {
constructor(message: string, field: string) {
super(message, {
extensions: {
code: 'VALIDATION_ERROR',
field,
},
});
}
}
// System errors (client can't fix)
class DownstreamError extends GraphQLError {
constructor(service: string, originalError: Error) {
super(`Service ${service} unavailable`, {
extensions: {
code: 'DOWNSTREAM_SERVICE_ERROR',
service,
// Don't expose internal error details in production
...(process.env.NODE_ENV === 'development' && {
originalMessage: originalError.message,
}),
},
});
}
}
// Error formatting
const server = new ApolloServer({
typeDefs,
resolvers,
formatError: (formattedError, error) => {
// Log full error internally
logger.error('GraphQL Error', {
message: formattedError.message,
code: formattedError.extensions?.code,
path: formattedError.path,
originalError: error,
});
// In production, hide internal errors
if (
process.env.NODE_ENV === 'production' &&
formattedError.extensions?.code === 'INTERNAL_SERVER_ERROR'
) {
return {
message: 'Internal server error',
extensions: { code: 'INTERNAL_SERVER_ERROR' },
};
}
return formattedError;
},
});
When GraphQL Actually Shines
┌─────────────────────────────────────────────────────────────────────────────┐
│ GRAPHQL IS THE RIGHT CHOICE WHEN │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ✓ Multiple clients with different data needs │
│ (mobile needs fewer fields than web) │
│ │
│ ✓ Rapid frontend iteration │
│ (frontend can change queries without backend deploys) │
│ │
│ ✓ Complex, interconnected data model │
│ (social graphs, e-commerce catalogs) │
│ │
│ ✓ Need for strong typing and documentation │
│ (schema is the contract) │
│ │
│ ✓ Aggregating multiple data sources │
│ (federation, gateway pattern) │
│ │
│ ✓ Team has frontend-heavy needs │
│ (GraphQL optimizes for client developers) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
When GraphQL Quietly Breaks You
┌─────────────────────────────────────────────────────────────────────────────┐
│ GRAPHQL IS THE WRONG CHOICE WHEN │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ✗ Simple CRUD with few clients │
│ (REST is simpler, HTTP caching works) │
│ │
│ ✗ File uploads are primary use case │
│ (GraphQL handles this poorly) │
│ │
│ ✗ Real-time requirements │
│ (subscriptions work but add complexity; consider WebSockets directly) │
│ │
│ ✗ Microservices without API gateway investment │
│ (federation requires significant infrastructure) │
│ │
│ ✗ Performance-critical, low-latency requirements │
│ (parsing, validation, resolution overhead) │
│ │
│ ✗ Team is backend-heavy with little frontend ownership │
│ (GraphQL's benefits accrue to frontend) │
│ │
│ ✗ Caching is critical and must be HTTP-based │
│ (CDN caching is harder with GraphQL) │
│ │
│ ✗ You need to move fast with a small team │
│ (GraphQL has higher upfront and ongoing cost) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Production Checklist
┌─────────────────────────────────────────────────────────────────────────────┐
│ GRAPHQL AT SCALE CHECKLIST │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PERFORMANCE: │
│ ──────────── │
│ □ DataLoader in every service (N+1 prevention) │
│ □ Query complexity limits │
│ □ Query depth limits │
│ □ Timeout per resolver │
│ □ Response size limits │
│ □ Pagination required on all lists (no unbounded arrays) │
│ │
│ SECURITY: │
│ ───────── │
│ □ Persisted queries in production (or complexity limits) │
│ □ Introspection disabled in production │
│ □ Field-level authorization │
│ □ Rate limiting per client/operation │
│ □ Input validation on all arguments │
│ │
│ OBSERVABILITY: │
│ ────────────── │
│ □ Distributed tracing (trace ID through federation) │
│ □ Per-resolver timing metrics │
│ □ Error tracking with path context │
│ □ Query logging (sanitized for PII) │
│ □ Slow query alerting │
│ │
│ SCHEMA GOVERNANCE: │
│ ────────────────── │
│ □ Schema linting in CI │
│ □ Breaking change detection │
│ □ Schema registry (for federation) │
│ □ Deprecation policy and timeline │
│ □ Field usage analytics │
│ │
│ OPERATIONS: │
│ ─────────── │
│ □ Health checks for all subgraphs │
│ □ Graceful degradation (partial responses) │
│ □ Circuit breakers for downstream services │
│ □ Rollback plan for schema changes │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Summary
GraphQL is a powerful tool that solves real problems. It's also a complex system that introduces challenges REST doesn't have.
The teams that succeed with GraphQL at scale:
- Invest in DataLoader from day one — N+1 will bite you immediately
- Treat schema as a product — ownership, governance, versioning
- Implement query complexity limits — unbounded queries will take you down
- Build observability early — debugging distributed GraphQL is hard
- Accept partial responses — errors in one field shouldn't fail the whole query
- Plan for caching complexity — HTTP caching doesn't work the same way
The teams that struggle:
- Adopt GraphQL because "it's what big companies use"
- Ignore N+1 until production is on fire
- Let the schema grow organically without ownership
- Expose GraphQL directly to untrusted clients without limits
- Expect REST-like caching to just work
GraphQL isn't better or worse than REST. It's a different set of trade-offs. Know what you're trading for, and make sure it's worth it for your specific situation.
The best API technology is the one that solves your actual problems without creating bigger ones. Sometimes that's GraphQL. Sometimes it's REST. Sometimes it's both.