DEV Community
Grade 10
3d ago
From 500 to 50,000 Concurrent Users: Architecture Decisions That Actually Scale
TL;DR: Black Friday on a 2M-product platform. Before adding a single server, EXPLAIN ANALYZE found three queries doing sequential scans — fixing the indexes dropped query times from 800ms to 12ms. PgBouncer multiplexed 8,000 connections down to 150 and gave us 4x throughput. Redis cached the right three things and nothing else. Here's the full picture. Black Friday at 12,000 concurrent users on a platform you migrated from Magento six months ago is a different kind of stress test. The team that ran it before me had a sensible instinct when load spiked: add app servers. That instinct is almost always wrong, and I spent the three weeks before the event proving it. Start with the database, not the app servers Every scaling conversation jumps to horizontal app server scaling. The database is almost always the first bottleneck. Before touching anything else, run: -- Find slow queries in production (pg_stat_statements must be enabled) SELECT query , calls , mean_exec_time , total_exec_time , rows FROM pg_stat_statements WHERE mean_exec_time > 50 -- anything over 50ms is a candidate ORDER BY mean_exec_time DESC LIMIT 20 ; We found three queries. One of them, a product listing query joining categories and inventory: -- Before: full sequential scan, 800ms at scale EXPLAIN ANALYZE SELECT p . * , c . name AS category_name , i . quantity FROM products p JOIN categories c ON c . id = p . category_id JOIN inventory i ON i . product_id = p . id WHERE p . status = 'active' AND c . slug = 'electronics' ORDER BY p . created_at DESC LIMIT 50 ; -- Result: Seq Scan on products (cost=0.00..48234.12 rows=2089432) -- Execution Time: 847.392 ms The fix was a composite index that matched the filter and sort together: -- Covering index for the status + category_id + created_at access pattern CREATE INDEX CONCURRENTLY idx_products_status_category_created ON products ( status , category_id , created_at DESC ) INCLUDE ( id , title , price , slug ); -- Separate index for the categories join on slug CREATE INDEX CONCURRENTLY idx_categories_slug ON categories ( slug ); -- After: index scan, 12ms -- Result: Index Scan using idx_products_status_category_created -- Execution Time: 11.843 ms 800ms to 12ms. No new servers, no code changes, no cost increase. Index analysis before Black Friday is table stakes. Connection pooling is not optional At 12,000 concurrent users, your Node.js app servers will each maintain a connection pool. Ten ECS tasks, 100 connections each: you're already at 1,000 connections against Postgres. Scale to 20 tasks under load and PostgreSQL starts refusing connections entirely — it degrades sharply above roughly 200 active connections. The fix is a connection pooler sitting between your app tier and Postgres. We used PgBouncer in transaction pooling mode : # pgbouncer.ini [databases] production = host=your-rds-endpoint port=5432 dbname=production [pgbouncer] listen_addr = 0.0.0.0 listen_port = 5432 pool_mode = transaction # key: connection released after each transaction max_client_conn = 10000 # app-side connections (generous ceiling) default_pool_size = 150 # actual Postgres connections (stay under 200) reserve_pool_size = 10 reserve_pool_timeout = 5 server_idle_timeout = 600 log_connections = 0 log_disconnections = 0 Before PgBouncer: 8,000 app-side connections → Postgres melting. After PgBouncer: 8,000 app-side connections → 150 Postgres connections → 4x throughput increase. One caveat with transaction pooling mode: you cannot use SET statements, advisory locks, or LISTEN/NOTIFY across transactions — the connection you get back may not be the same one. Keep that in mind if you rely on session-level state. Redis: cache the right data and nothing else Redis is not a general caching layer. The wrong data in Redis creates distributed consistency problems that are much harder to debug than slow queries. We cached exactly three things: 1. Product page dataTTL: 5 minutes 2. User session dataTTL: 30 minutes (sliding) 3. Rate limit countersTTL: 1 minute (sliding window) We deliberately did not cache inventory counts. Here is why that matters: // Wrong: cached inventory leads to overselling async function getProductPage ( productId ) { const cached = await redis . get ( `product: ${ productId } ` ); if ( cached ) return JSON . parse ( cached ); // inventory count may be stale const product = await db . query ( `SELECT p.*, i.quantity FROM products p JOIN inventory i ON i.product_id = p.id WHERE p.id = $1` , [ productId ] ); await redis . setex ( `product: ${ productId } ` , 300 , JSON . stringify ( product . rows [ 0 ])); return product . rows [ 0 ]; } // Right: cache everything except the inventory count async function getProductPage ( productId ) { const [ cachedProduct , liveInventory ] = await Promise . all ([ redis . get ( `product: ${ productId } ` ), db . query ( `SELECT quantity FROM inventory WHERE product_id = $1` , [ productId ]) ]); const product = cachedProduct ? JSON . parse ( cachedProduct ) : await fetchAndCach
TL;DR: Black Friday on a 2M-product platform. Before adding a single server, EXPLAIN ANALYZE found three queries doing sequential scans — fixing the indexes dropped query times from 800ms to 12ms. PgBouncer multiplexed 8,000 connections down to 150 and gave us 4x throughput. Redis cached the right three things and nothing else. Here's the full picture. Black Friday at 12,000 concurrent users on a platform you migrated from Magento six months ago is a different kind of stress test. The team that ran it before me had a sensible instinct when load spiked: add app servers. That instinct is almost always wrong, and I spent the three weeks before the event proving it. Start with the database, not the app servers Every scaling conversation jumps to horizontal app server scaling. The database is almost always the first bottleneck. Before touching anything else, run: -- Find slow queries in production (pg_stat_statements must be enabled) SELECT query, calls, mean_exec_time, total_exec_time, rows FROM pg_stat_statements WHERE mean_exec_time > 50 -- anything over 50ms is a candidate ORDER BY mean_exec_time DESC LIMIT 20; We found three queries. One of them, a product listing query joining categories and inventory: -- Before: full sequential scan, 800ms at scale EXPLAIN ANALYZE SELECT p.*, c.name AS category_name, i.quantity FROM products p JOIN categories c ON c.id = p.category_id JOIN inventory i ON i.product_id = p.id WHERE p.status = 'active' AND c.slug = 'electronics' ORDER BY p.created_at DESC LIMIT 50; -- Result: Seq Scan on products (cost=0.00..48234.12 rows=2089432) -- Execution Time: 847.392 ms The fix was a composite index that matched the filter and sort together: -- Covering index for the status + category_id + created_at access pattern CREATE INDEX CONCURRENTLY idx_products_status_category_created ON products (status, category_id, created_at DESC) INCLUDE (id, title, price, slug); -- Separate index for the categories join on slug CREATE INDEX CONCURRENTLY idx_categories_slug ON categories (slug); -- After: index scan, 12ms -- Result: Index Scan using idx_products_status_category_created -- Execution Time: 11.843 ms 800ms to 12ms. No new servers, no code changes, no cost increase. Index analysis before Black Friday is table stakes. Connection pooling is not optional At 12,000 concurrent users, your Node.js app servers will each maintain a connection pool. Ten ECS tasks, 100 connections each: you're already at 1,000 connections against Postgres. Scale to 20 tasks under load and PostgreSQL starts refusing connections entirely — it degrades sharply above roughly 200 active connections. The fix is a connection pooler sitting between your app tier and Postgres. We used PgBouncer in transaction pooling mode: # pgbouncer.ini [databases] production = host=your-rds-endpoint port=5432 dbname=production [pgbouncer] listen_addr = 0.0.0.0 listen_port = 5432 pool_mode = transaction # key: connection released after each transaction max_client_conn = 10000 # app-side connections (generous ceiling) default_pool_size = 150 # actual Postgres connections (stay under 200) reserve_pool_size = 10 reserve_pool_timeout = 5 server_idle_timeout = 600 log_connections = 0 log_disconnections = 0 Before PgBouncer: 8,000 app-side connections → Postgres melting. After PgBouncer: 8,000 app-side connections → 150 Postgres connections → 4x throughput increase. One caveat with transaction pooling mode: you cannot use SET statements, advisory locks, or LISTEN/NOTIFY across transactions — the connection you get back may not be the same one. Keep that in mind if you rely on session-level state. Redis: cache the right data and nothing else Redis is not a general caching layer. The wrong data in Redis creates distributed consistency problems that are much harder to debug than slow queries. We cached exactly three things: 1. Product page data TTL: 5 minutes 2. User session data TTL: 30 minutes (sliding) 3. Rate limit counters TTL: 1 minute (sliding window) We deliberately did not cache inventory counts. Here is why that matters: // Wrong: cached inventory leads to overselling async function getProductPage(productId) { const cached = await redis.get(`product:${productId}`); if (cached) return JSON.parse(cached); // inventory count may be stale const product = await db.query( `SELECT p.*, i.quantity FROM products p JOIN inventory i ON i.product_id = p.id WHERE p.id = $1`, [productId] ); await redis.setex(`product:${productId}`, 300, JSON.stringify(product.rows[0])); return product.rows[0]; } // Right: cache everything except the inventory count async function getProductPage(productId) { const [cachedProduct, liveInventory] = await Promise.all([ redis.get(`product:${productId}`), db.query(`SELECT quantity FROM inventory WHERE product_id = $1`, [productId]) ]); const product = cachedProduct ? JSON.parse(cachedProduct) : await fetchAndCacheProduct(productId); // separate function return { ...product, quantity: liveInventory.rows[0].quantity }; } Stale stock counts cost conversions and cause overselling. Read inventory directly from the database, optimised by index. The indexed inventory read takes 3ms — that is a fine trade for correctness. Serverless for spiky, stateful for baseline Not everything needs to survive sustained load. We split the workload by its traffic shape: Kept on ECS (pre-warmed instances): - Checkout pipeline — latency-sensitive, stateful, must not cold-start - Product search — sustained load, stateful Elasticsearch client - Session API — low latency requirement, high call frequency Moved to Lambda: - Email notifications — spiky (order confirmation bursts), fully async - PDF receipt generation — CPU-intensive but async, nobody waits for it - Inventory update event processing — event-driven, SQS-triggered ECS (checkout, search, session) ← baseline traffic, pre-warmed Lambda (email, PDF, inventory) ← burst traffic, event-driven This cut our baseline EC2 spend by 35% while giving us effectively unlimited burst capacity for async workloads. The checkout service never competes for resources with a PDF generation spike. Load test before you need to Three weeks before Black Friday we ran k6 against a production-sized staging environment: // k6 load test script (simplified) import http from 'k6/http'; import { check, sleep } from 'k6'; export const options = { stages: [ { duration: '5m', target: 5000 }, // ramp up { duration: '10m', target: 15000 }, // hold at 15k (above our target) { duration: '5m', target: 0 }, // ramp down ], thresholds: { http_req_duration: ['p95 r.status === 200 }); sleep(1); } The test revealed one non-obvious problem: our CloudFront cache behaviour headers were wrong. Browsers were re-fetching product images on every page load because Cache-Control: no-cache had been set during development and never changed. Before fix: ~40% of image requests reached origin After fix: CDN cache hit rate → 94% One CloudFront distribution setting. Under actual Black Friday load, the CDN absorbed 94% of all traffic. Origin never saw the full load. What actually happened on Black Friday - Peak concurrent users: ~12,000 - p95 response time on product pages: 210ms - Error rate: 0.3% - Database connections at peak: 148 (PgBouncer keeping us safely under 150) - Zero rollbacks, zero incidents The database index changes, PgBouncer, and the CDN cache fix together did more work than any amount of horizontal scaling would have. We added zero new ECS tasks during the event. The checklist before your next traffic event - [ ] Run pg_stat_statements analysis — fix any query over 50ms - [ ] Deploy PgBouncer in transaction pooling mode between app and Postgres - [ ] Audit Redis cache keys — evict anything that must be consistent - [ ] Split traffic-shaped workloads: stateful/latency-sensitive on ECS, async/spiky on Lambda - [ ] Run a k6 load test at 125% of your peak estimate, three weeks out - [ ] Verify CDN cache headers with curl -I on key asset URLs - [ ] Set a CloudWatch alarm on Postgres connection count (alert at 80% of max_connections ) Originally published at https://shubhamkansal.com/blog/scaling-from-500-to-50000-concurrent-users. I'm Shubham Kansal, a freelance Full Stack & DevOps engineer — more at https://shubhamkansal.com. Top comments (0)
Comments