Surviving Black Friday: How We Scaled a Next.js E-Commerce Site to 50K Concurrent Users
The week before Black Friday, load tests showed the site would crash at 8K concurrent users. Here is every optimization we made to hit 50K.
Six days before Black Friday, our client's e-commerce site buckled at 8,000 concurrent users in load testing. They expected 40,000-50,000. This is the story of how we made it work.
First, we profiled. The bottleneck was not where anyone expected. The product listing pages were fine -- they were statically generated with ISR. The problem was the cart and checkout flow. Every add-to-cart action triggered a full server round-trip that recalculated the entire cart (prices, discounts, shipping estimates, tax) synchronously. At 8K users, the Node.js server was spending all its time on cart calculations.
Fix 1: We moved cart state to the client with optimistic updates. When a user adds an item, the UI updates immediately from local state. The server calculation happens in the background via a debounced API call. If the server disagrees (price changed, item out of stock), we reconcile. This cut perceived latency from 800ms to 40ms and reduced server load by 60%.
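The two moving parts here are a reconciliation step and a debounced sync. A minimal sketch of both, with hypothetical names (`CartItem`, `reconcileCart`, `debounce` are illustrative, not the actual implementation):

```typescript
interface CartItem {
  sku: string;
  qty: number;
  price: number; // unit price the client last saw
}

// Client-side reconciliation: wherever the server disagrees, the server
// wins -- out-of-stock items are dropped, changed prices overwrite local.
function reconcileCart(local: CartItem[], server: CartItem[]): CartItem[] {
  const bySku = new Map(server.map((item) => [item.sku, item]));
  return local
    .filter((item) => bySku.has(item.sku)) // drop items the server removed
    .map((item) => ({ ...item, ...bySku.get(item.sku)! })); // server fields win
}

// Debounce: collapse a burst of add-to-cart clicks into one background
// API call instead of one round-trip per click.
function debounce<T extends unknown[]>(fn: (...args: T) => void, ms: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: T) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
}
```

The UI renders from local state immediately; the debounced call posts the cart to the server, and `reconcileCart` is applied to whatever comes back.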
Fix 2: Redis caching for product data. Product prices, stock levels, and discount rules were being fetched from PostgreSQL on every cart calculation. We added a Redis layer with 30-second TTL. Cache hit rate was 94% during peak traffic. This alone dropped the p95 API response time from 320ms to 45ms.
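The pattern is plain cache-aside with a TTL. A sketch with an in-memory `Map` standing in for Redis (in production this would be `GET`/`SETEX` against a Redis client; all names here are illustrative):

```typescript
type Entry<T> = { value: T; expiresAt: number };

class TtlCache<T> {
  private store = new Map<string, Entry<T>>();
  // `now` is injectable so TTL expiry can be tested without waiting.
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  async getOrLoad(key: string, load: () => Promise<T>): Promise<T> {
    const hit = this.store.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value; // cache hit
    const value = await load(); // cache miss: fall through to PostgreSQL
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}
```

Every cart calculation reads prices, stock, and discount rules through a wrapper like this with a 30-second TTL, so the database only sees one load per key per window instead of one per request.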
Fix 3: Edge caching for product pages. We were already using ISR, but the revalidation period was 60 seconds. During Black Friday, inventory changes every few seconds. We switched to on-demand revalidation triggered by inventory webhook events. When stock drops below 10 units, the page revalidates immediately. Otherwise, the 60-second ISR continues. This prevented showing 'Add to Cart' on out-of-stock items while keeping CDN cache hit rates above 95%.
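The webhook-side logic can be sketched as a small pure function. The revalidation call is injected so the decision stays testable; in the app it would be Next.js's on-demand revalidation (e.g. `res.revalidate(path)` in the Pages Router). The threshold and names are illustrative:

```typescript
interface InventoryEvent {
  slug: string;  // product page slug
  stock: number; // units remaining after this inventory change
}

const LOW_STOCK_THRESHOLD = 10;

// Near-sellout pages must not serve a stale "Add to Cart" button, so
// they bypass the normal 60-second ISR window.
async function handleInventoryEvent(
  event: InventoryEvent,
  revalidate: (path: string) => Promise<void>
): Promise<"revalidated" | "deferred"> {
  if (event.stock < LOW_STOCK_THRESHOLD) {
    await revalidate(`/products/${event.slug}`); // bust the cached page now
    return "revalidated";
  }
  // Plenty of stock: let the regular 60-second ISR cycle pick it up.
  return "deferred";
}
```

Because only near-sellout pages trigger immediate revalidation, the CDN keeps serving cached pages for everything else, which is how the hit rate stayed above 95%.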
Fix 4: Database connection pooling. The original setup used a direct PostgreSQL connection per serverless function invocation. At scale, this exhausted the 100-connection limit within minutes. We added PgBouncer in transaction mode, which multiplexes hundreds of function invocations across 20 actual database connections. Connection errors dropped to zero.
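The relevant PgBouncer settings fit in a few lines. An illustrative `pgbouncer.ini` fragment (host and database names are placeholders, not the client's real config):

```ini
; Placeholder values -- only pool_mode and the pool sizes matter here.
[databases]
shop = host=db.internal port=5432 dbname=shop

[pgbouncer]
pool_mode = transaction   ; release the server connection at COMMIT/ROLLBACK
default_pool_size = 20    ; actual PostgreSQL connections held open
max_client_conn = 1000    ; serverless invocations allowed to connect/queue
```

Transaction mode is what makes the multiplexing work: a serverless function only holds a real connection for the duration of a transaction, not for the lifetime of the invocation. The trade-off is that session-level features (prepared statements, `SET` session variables) need care in this mode.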
Fix 5: Stripe webhook handling. The checkout flow called Stripe synchronously: wait for payment confirmation, then update the database, then send the confirmation email, all before responding to the user. We made it async: create a Stripe PaymentIntent, redirect to the success page immediately, and let the webhook handler update the database and send the email once Stripe confirms. The user sees 'Order Confirmed' in 1.2 seconds instead of 4.5 seconds.
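The order lifecycle behind this can be sketched as a pure state machine (names are illustrative; `payment_intent.succeeded` and `payment_intent.payment_failed` are real Stripe event types). The checkout route marks the order `pending` before redirecting; the webhook handler later applies the confirmation:

```typescript
type OrderStatus = "pending" | "confirmed" | "failed";

// Applied by the webhook handler when Stripe delivers an event.
function applyStripeEvent(status: OrderStatus, eventType: string): OrderStatus {
  // Only a pending order reacts to payment events, so Stripe's webhook
  // retries against an already-confirmed order are idempotent no-ops.
  if (status !== "pending") return status;
  switch (eventType) {
    case "payment_intent.succeeded":
      return "confirmed"; // handler also writes the DB row and sends the email
    case "payment_intent.payment_failed":
      return "failed";
    default:
      return status; // ignore unrelated event types
  }
}
```

Making the transition idempotent matters because Stripe retries webhook delivery until it gets a 2xx response; without it, a retry could double-send the confirmation email.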
The result: on Black Friday, the site handled 52,000 concurrent users with a p95 page load of 1.1 seconds. Zero downtime, zero lost orders. The client's revenue was 3.2x the previous year's Black Friday.
The meta-lesson: performance optimization is almost never about rewriting code in a faster language. It is about finding the actual bottleneck (not the assumed one), then making the minimum change that removes it. Four of our five fixes were architectural changes, not code-level optimizations. Profile first, optimize second.