💥 Production Incidents

Real-World Failures

The best system design education comes from studying real catastrophic failures. These incidents shaped how the industry thinks about reliability, redundancy, and operational excellence.

Meta / Facebook

The BGP Route Withdrawal — 6 Hours of Total Darkness

P0 — Total Outage ⏱ 6 hours 28 minutes ~3.5B users affected Oct 4, 2021

On October 4, 2021, Facebook, Instagram, WhatsApp, and Messenger disappeared from the internet for roughly six and a half hours. The cause: a routine BGP (Border Gateway Protocol) configuration change withdrew all of Facebook's IP prefixes from the global routing table. From the internet's perspective, Facebook ceased to exist.

Timeline

15:40 UTC — T+0
A configuration change is pushed to the backbone routers that coordinate network traffic between Facebook's data centers. The command was intended to assess the capacity of the global backbone.
15:41 UTC — T+1 min
The faulty command causes all BGP connections from Facebook's backbone to the internet to be severed. All 36 BGP prefixes (IP blocks) are withdrawn from routing tables worldwide — Facebook disappears from the internet.
15:41–15:43 UTC
DNS servers that rely on Facebook's internal backbone for connectivity also go offline. The authoritative nameservers that translate facebook.com into IP addresses stop responding. Pings to Facebook's nameservers fail globally.
15:45 UTC
Facebook's internal tools (Workplace, internal chat, access systems) are also down — they depend on the same infrastructure. Engineers cannot communicate. Physical access cards to data centers stop working.
16:00–21:00 UTC
Engineers scramble to physically access data centers (access cards are down, requiring physical security escorts). Once inside, the misconfigured backbone routers must be manually rolled back. The process is slow — the tools used to manage the routers are also offline.
22:00 UTC — T+6h28m
Services begin recovering globally as BGP routes are restored and DNS propagates. Full recovery takes another hour due to the "thundering herd" as billions of clients simultaneously try to reconnect.
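A standard mitigation for that reconnect stampede is client-side exponential backoff with jitter, so retries spread out over a window instead of arriving in one synchronized wave. A minimal sketch of the idea (function names and limits are illustrative, not Facebook's client code):

```python
import random
import time

def reconnect_with_backoff(connect, base_delay=1.0, max_delay=300.0, max_attempts=10):
    """Retry `connect` with exponential backoff and full jitter.

    Spreading retries over a randomized window prevents millions of
    clients from hammering freshly restored servers at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Full jitter: sleep a random duration in [0, capped exponential].
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
    raise ConnectionError("still unreachable after backoff retries")
```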
Root Cause
  • BGP configuration audit tool had a bug that allowed a command to remove all BGP routes instead of individual prefixes
  • No circuit breaker on backbone config changes to detect "this will disconnect us from the internet"
  • Internal tooling co-located on the same infrastructure it manages — went down with the outage
Fixes & Lessons
  • Add verification step to BGP changes: "if applied, will this disconnect us from the internet?" (a sketch of such a check follows this list)
  • Out-of-band management network for critical infrastructure tools — separate from production
  • Emergency access procedures that don't depend on internal systems
  • Manual overrides and emergency contacts for data center physical access
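The verification bullet above can be made concrete as a pre-flight dry run that refuses to apply any change that would shrink the announced prefix set to nothing. A hedged sketch, where `simulate_announcements` is a hypothetical helper rather than Facebook's actual audit tool:

```python
def verify_bgp_change(current_prefixes: set[str], proposed_config: dict,
                      simulate_announcements) -> None:
    """Refuse to apply a backbone config change that would withdraw
    every prefix we currently announce to the internet.

    `simulate_announcements(proposed_config)` is assumed to return the
    set of prefixes that would remain announced after the change.
    """
    remaining = simulate_announcements(proposed_config)
    withdrawn = current_prefixes - remaining
    if not remaining:
        raise RuntimeError("change would withdraw ALL prefixes; refusing to apply")
    if len(withdrawn) / len(current_prefixes) > 0.5:
        raise RuntimeError(f"change withdraws {len(withdrawn)} of "
                           f"{len(current_prefixes)} prefixes; require manual review")
```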
System Design Insight The Facebook outage demonstrates that BGP (Border Gateway Protocol) is the routing protocol that makes the internet work — and it has almost no validation or safety mechanisms. A single misconfiguration can make a large portion of the internet unreachable. Designing for external network dependencies means assuming any of them can disappear suddenly and catastrophically.

Financial Impact

Estimated $100M in lost advertising revenue. Zuckerberg's personal net worth dropped ~$7B during the outage as Facebook stock fell 4.9%. The outage also amplified trust issues, landing in the same week as the Frances Haugen whistleblower revelations.

Netflix

Christmas Eve AWS EBS Outage — The Birth of Chaos Engineering

P0 — Major Degradation ⏱ 8+ hours ~1M users affected Dec 24, 2011

On Christmas Eve 2011 — one of Netflix's highest-traffic days of the year — Amazon's US-EAST-1 region suffered a major EBS (Elastic Block Store) failure, and Netflix streaming was down in the US for hours. This incident became a major catalyst for Chaos Monkey and Netflix's famous "chaos engineering" discipline.

What Failed

Netflix's architecture at the time relied heavily on Amazon EBS volumes for database storage. When EBS in us-east-1 experienced a cascade failure (storage API calls timing out, then failing outright), Netflix's Cassandra nodes could no longer write to their underlying EBS volumes. The failure then cascaded: Cassandra replication backlogs grew, nodes fell behind and became inconsistent, and the cluster degraded. Netflix's service discovery and autoscaling, which also depended on EBS-backed instances, failed to recover properly.

Root Cause
  • Single-region deployment — all compute in us-east-1
  • Cassandra data stored on EBS volumes — single point of failure at the storage layer
  • No chaos testing — failures were only discovered in production under real conditions
  • Service dependencies not isolated — a storage failure cascaded to service discovery
How Netflix Responded
  • Moved Cassandra to instance-store (ephemeral SSD) — faster, no EBS dependency
  • Built Chaos Monkey — randomly terminates production instances to force resilience
  • Multi-region active-active — services distributed across us-east-1, us-west-2, eu-west-1
  • Hystrix circuit breakers on every service dependency
  • Graceful degradation — if recommendations fail, show popular titles instead (see the sketch after this list)
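The last two bullets describe the circuit-breaker-plus-fallback pattern. A simplified sketch of the idea (not Hystrix itself; class and function names are illustrative):

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated failures; probe again after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        # While open, skip the failing dependency entirely and degrade gracefully.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: allow one probe through
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()

# Usage: if the recommendation service is down, fall back to popular titles.
# breaker = CircuitBreaker()
# titles = breaker.call(fetch_personalized_recommendations, fetch_popular_titles)
```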
System Design Insight Netflix's most important post-mortem insight: "Failures that haven't been tested in production will eventually happen in production at the worst possible time." Chaos Monkey (later expanded into the Simian Army) injects failures constantly so Netflix engineers are always prepared. The philosophy: "Anything that can fail will fail. Build systems that degrade gracefully, not catastrophically." This is why Netflix tolerates more chaos than any other major company — they deliberately cause it.

Legacy: The Birth of Chaos Engineering

Netflix published "Chaos Engineering" as a discipline and open-sourced the Simian Army tools. Principles of Chaos Engineering became an industry standard. Today, every major tech company has a chaos engineering practice. This single Christmas Eve outage arguably improved the resilience of the entire industry by normalizing the idea of intentional, controlled failure injection.

GitHub

MySQL Cluster Partitioning — 24 Hours of Data Inconsistency

P0 — Data Inconsistency ⏱ 24+ hours All GitHub users Oct 21, 2018

During network maintenance, a 43-second network partition cut off GitHub's US East Coast datacenter, which housed the primary MySQL cluster, from the rest of GitHub's network. During the partition, automated failover promoted a replica in another datacenter, leaving a primary on each side of the split (split-brain), and both accepted writes. When the network healed, GitHub had two diverged copies of their primary database — a data consistency crisis.

Timeline

22:52 UTC — Maintenance begins
Network equipment upgrade in US-East datacenter causes a brief network interruption to the MySQL primary cluster.
22:52 — 22:53 UTC (43 seconds)
MySQL cluster loses quorum. Orchestrator (GitHub's MySQL HA tool) promotes a replica to primary in a different datacenter. Now two primaries exist simultaneously — split brain. Both accept writes for 43 seconds.
22:53 UTC
Network restored. Two MySQL primaries with diverged data. GitHub detects the issue and immediately stops all write traffic to prevent further divergence. Site enters degraded read-only mode.
22:53 UTC Oct 21 — 11:23 UTC Oct 22 (12+ hours)
GitHub engineers write tooling to identify which rows diverged between the two copies (a toy sketch of this row-diff idea follows the timeline). Using Vitess tooling, they carefully merge the diverged data, resolving conflicts row by row. Pull requests, issues, and comments written during the 43-second window are at risk.
Oct 22 11:23 UTC
Write traffic restored. Full recovery achieved after 24+ hours of engineer effort. Some data from the 43-second window could not be fully reconciled — GitHub issued a blog post disclosing the issue transparently.
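The reconciliation work in the timeline boils down to diffing the two copies row by row for data touched during the divergence window. A toy sketch of that idea (table and column names are hypothetical, and GitHub's actual tooling was far more involved):

```python
def find_diverged_rows(primary_a, primary_b, table, window_start, window_end):
    """Compare rows updated during the split-brain window on both primaries.

    `primary_a` / `primary_b` are DB-API connections; rows are compared by
    primary key, and any mismatch is surfaced for manual conflict resolution.
    """
    query = (f"SELECT id, updated_at, payload FROM {table} "
             "WHERE updated_at BETWEEN %s AND %s")

    def snapshot(conn):
        with conn.cursor() as cur:
            cur.execute(query, (window_start, window_end))
            return {row[0]: row for row in cur.fetchall()}

    rows_a, rows_b = snapshot(primary_a), snapshot(primary_b)
    diverged = []
    for pk in rows_a.keys() | rows_b.keys():
        if rows_a.get(pk) != rows_b.get(pk):
            diverged.append((pk, rows_a.get(pk), rows_b.get(pk)))
    return diverged
```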
Root Cause
  • Orchestrator's automatic failover promoted a new leader without verifying the old leader was truly dead
  • No mechanism to prevent two primaries accepting writes simultaneously (fencing tokens)
  • 43-second partition was long enough for automatic failover to trigger
Fixes
  • Require fencing tokens: the old primary must be STONITH'd (Shoot The Other Node In The Head) before the new primary accepts writes (a minimal fencing sketch follows this list)
  • Increase automatic failover delay to reduce false positives
  • Implement write blocking on demoted primaries via client certificate revocation
  • Better tooling to detect and alert on split-brain immediately
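Fencing can be expressed as a monotonically increasing token that every write must carry; storage rejects writes stamped with a stale token, so a deposed primary cannot corrupt data even if it still believes it is the leader. A minimal sketch of the concept (not GitHub's or Orchestrator's implementation):

```python
class FencedStore:
    """Accept writes only from the holder of the newest fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def grant_leadership(self) -> int:
        # Issued by the failover coordinator (e.g. a consensus service) on every promotion.
        self.highest_token += 1
        return self.highest_token

    def write(self, token: int, key, value):
        if token < self.highest_token:
            # The caller was deposed; a newer primary exists. Reject the write.
            raise PermissionError(f"stale fencing token {token} < {self.highest_token}")
        self.data[key] = value

# store = FencedStore()
# old = store.grant_leadership()   # token 1: original primary
# new = store.grant_leadership()   # token 2: replica promoted after failover
# store.write(new, "issue:42", "comment")      # accepted
# store.write(old, "issue:42", "stale write")  # raises PermissionError
```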
System Design Insight This incident is the textbook example of why "split-brain" in distributed databases is dangerous. The lesson: when a primary fails, you must be certain it has stopped accepting writes before promoting a replica. "Shoot The Other Node In The Head" (STONITH) is the industry term — physically or programmatically forcing the old primary to stop before allowing the new one to start. Raft consensus avoids this by design — only nodes that receive majority votes can become leader.
Discord

Cassandra Hot Partition — Why Discord Moved to ScyllaDB

P1 — Performance Degradation ⏱ Chronic (months) 177M+ users 2022–2023

Discord stored messages in Cassandra for years. By 2022, they had 177M monthly active users, billions of messages, and a Cassandra cluster that was showing serious performance problems. Latency spikes, garbage collection pauses, and hotspot partitions were causing user-visible degradation. This led to a multi-year engineering effort that culminated in a complete migration to ScyllaDB (a C++ reimplementation of Cassandra).

The Hot Partition Problem

Discord's message data model: partition key = (channel_id, bucket) where bucket = month. Large Discord servers (e.g., game communities with 200K+ members) generate enormous message volume in a single channel. This creates a "hot partition" — a single Cassandra node that receives a disproportionate fraction of all traffic, becoming a bottleneck regardless of cluster size.

The Hot Partition Math A popular Discord server with 200K members and one highly active channel can generate 10,000 messages/hour in that single channel. That entire month's messages hash to one Cassandra partition, owned by one node (and its replicas). A 100-node cluster gives you zero benefit — that one node handles everything. No amount of horizontal scaling fixes a bad data model.
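The arithmetic behind that claim, using the article's illustrative numbers and the (channel_id, bucket) data model, looks like this:

```python
# Hot-partition math under the article's assumptions (illustrative numbers).
messages_per_hour = 10_000
hours_per_month = 24 * 30
rows_in_one_partition = messages_per_hour * hours_per_month   # 7,200,000 rows

def partition_key(channel_id: int, bucket: str) -> tuple:
    # Every message in this channel for the whole bucket (a month, per the
    # article) shares the same key, so it hashes to the same token range and
    # lands on the same node and its replicas, regardless of cluster size.
    return (channel_id, bucket)

print(rows_in_one_partition)                      # 7200000
print(partition_key(81384788765712384, "2022-06"))
```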

Why ScyllaDB?

ScyllaDB is a C++ reimplementation of Cassandra's architecture. It eliminates the JVM garbage collection pauses that caused Cassandra's latency spikes. ScyllaDB uses a "shard-per-core" architecture — each CPU core owns a subset of data and never shares memory with other cores (no locking). This delivers more consistent, predictable latency. Discord benchmarked ScyllaDB at 3× better p99 latency with the same hardware.

Cassandra Problems at Discord Scale
  • JVM garbage collection: "stop the world" GC pauses caused latency spikes of 100ms–2s
  • Hot partitions: large channels create uneven load on individual nodes
  • Tombstone accumulation: deleted messages leave tombstones that slow reads
  • Compaction pressure: background compaction competed with read/write latency
What Discord Did
  • Migrated 4PB of message data from Cassandra to ScyllaDB over 2 years
  • Used a "dual write" pattern: write to both, gradually shift reads (sketched after this list)
  • Fixed data model: added data locality hints to reduce hot partition severity
  • Result: 99th percentile latency improved from 40ms to 15ms, p999 from 250ms to 25ms
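The dual-write cutover mentioned above is a standard migration pattern: write to both stores, keep serving reads from the old one, and shift read traffic gradually once the new store is verified. A hedged sketch with hypothetical store interfaces (not Discord's code):

```python
import random

class DualWriteMessageStore:
    """Write to both stores during migration; shift reads over gradually."""

    def __init__(self, cassandra, scylla, read_from_new_pct=0):
        self.old = cassandra
        self.new = scylla
        self.read_from_new_pct = read_from_new_pct  # raise as confidence grows

    def write(self, channel_id, message_id, body):
        # The old store remains the source of truth until cutover completes.
        self.old.insert(channel_id, message_id, body)
        try:
            self.new.insert(channel_id, message_id, body)
        except Exception:
            pass  # log and backfill later; never fail the user-facing write

    def read(self, channel_id, message_id):
        if random.random() * 100 < self.read_from_new_pct:
            return self.new.get(channel_id, message_id)
        return self.old.get(channel_id, message_id)
```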
System Design Insight Database selection is not a one-time decision. At Discord's scale (billions of messages/day), the difference between Cassandra and ScyllaDB — same data model, same CQL API, same architecture — resulted in 3× better tail latency and millions of dollars saved in hardware. The lesson: don't just choose a database category (NoSQL wide-column); understand the implementation details (JVM GC vs C++ shard-per-core) that matter at your scale.
Cloudflare

Network Misconfiguration — 19 Data Centers Offline

P0 — Partial Global Outage ⏱ 1 hour 27 minutes ~50% of global traffic affected June 21, 2022

On June 21, 2022, Cloudflare took 19 of its data centers offline simultaneously — including some of its highest traffic locations (São Paulo, London, Tokyo, Mumbai). The cause: a change to its BGP prefix announcement strategy, intended to optimize traffic routing, accidentally made those 19 locations unreachable. This cascaded into a massive traffic disruption affecting thousands of websites that rely on Cloudflare's CDN.

What Happened

Cloudflare was changing its strategy for announcing IP prefixes to BGP peers. The new strategy was designed to route traffic more efficiently. However, the configuration change accidentally deaggregated their IP blocks in a way that caused many internet service providers (ISPs) to reject Cloudflare's routes due to filtering policies. The 19 affected data centers couldn't attract traffic because ISPs were dropping their route announcements.

06:27 UTC
Change deployed to production: new BGP prefix announcement strategy rolled out globally.
06:34 UTC (T+7 min)
19 data centers begin showing degraded traffic — significantly fewer requests than normal. Alarms trigger.
06:51 UTC (T+24 min)
Rollback initiated. However, BGP propagation is slow — withdrawing and re-announcing routes takes time to propagate globally, and BGP convergence can take 15–30 minutes.
07:54 UTC (T+1h27m)
All 19 data centers restored. Post-incident analysis begins.
Root Cause
  • BGP prefix deaggregation caused routes to be rejected by ISP filters (RPKI/IRR filtering)
  • Change was not tested against real ISP filtering policies before global rollout
  • No staged rollout — change deployed to all 19 affected locations simultaneously
Fixes
  • Staged BGP changes: deploy to 1 location → monitor → expand (a generic rollout sketch follows this list)
  • Better simulation environment that mirrors real ISP filtering behavior
  • Automated verification: "Will this route announcement be accepted by our top 100 peers?"
  • Faster rollback: pre-staged rollback commands ready to execute
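The staged-rollout bullet generalizes beyond BGP; the control loop is the same for any risky infrastructure change. A simplified sketch in which `deploy_to`, `health_ok`, and `rollback` are hypothetical hooks:

```python
import time

def staged_rollout(locations, deploy_to, health_ok, rollback, soak_seconds=1800):
    """Deploy to one location at a time; abort and roll back on any regression."""
    done = []
    for location in locations:          # e.g. smallest / lowest-traffic site first
        deploy_to(location)
        time.sleep(soak_seconds)        # soak period: let traffic and alerts catch problems
        if not health_ok(location):
            for loc in [location] + done:
                rollback(loc)
            raise RuntimeError(f"regression detected at {location}; rolled back")
        done.append(location)
    return done
```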
System Design Insight Cloudflare operates at global internet scale — any mistake can affect millions of websites simultaneously. The core lesson: "blast radius reduction through staged rollouts" applies even to infrastructure-level changes like BGP routing. Start with 1 location, wait 30 minutes, then expand. The cost of a slow rollout is operational delay (minutes). The cost of a simultaneous global change is a global outage (hours). The math always favors staged deployment.
Amazon Web Services

Kinesis Cascading Failure — Why Dependencies Matter

P0 — Cascading Outage ⏱ 8+ hours Thousands of AWS customers Nov 25, 2020

On November 25, 2020, a capacity increase for the Amazon Kinesis service in US-East-1 triggered a cascading failure that took down not just Kinesis, but also Cognito, CloudWatch, Auto Scaling, and dozens of other AWS services that depended on Kinesis internally. This incident revealed the hidden coupling between AWS services that most users (and even AWS operators) weren't fully aware of.

The Cascade Chain

AWS was adding capacity to the Kinesis front-end fleet. Each front-end server builds a shard-map cache and maintains an operating-system thread for every other server in the front-end fleet, so the added servers pushed the per-server thread count past the operating system's thread limit and the front-end servers became unable to serve requests. Then the cascade began: CloudWatch internally uses Kinesis for log streaming → CloudWatch failed. Auto Scaling uses CloudWatch metrics → Auto Scaling failed. Cognito (auth) uses CloudWatch → Cognito failed. Route 53 health checks use CloudWatch → health checks degraded.
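The trigger was essentially arithmetic: per-server thread count grows with fleet size, so a capacity add can silently push servers past the OS limit. A back-of-the-envelope check like the one below (all numbers are illustrative, not AWS's real figures) would flag the change before rollout:

```python
# Pre-scaling sanity check for thread growth (illustrative numbers only).
OS_THREAD_LIMIT = 4096        # hypothetical per-process limit from OS configuration
BASELINE_THREADS = 600        # request handling, timers, etc. (assumed)

def projected_threads(fleet_size: int) -> int:
    # Each front-end server keeps one thread per *other* server in the fleet.
    return BASELINE_THREADS + (fleet_size - 1)

current, after_scale_up = 3400, 3600
for size in (current, after_scale_up):
    threads = projected_threads(size)
    status = "OK" if threads <= OS_THREAD_LIMIT else "EXCEEDS OS LIMIT"
    print(f"fleet={size:5d} -> {threads} threads per server [{status}]")
# fleet= 3400 -> 3999 threads per server [OK]
# fleet= 3600 -> 4199 threads per server [EXCEEDS OS LIMIT]
```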

Root Cause Chain
  • Kinesis capacity increase → thread limit exceeded on front-end servers
  • Kinesis failure → CloudWatch (internal dependency) degraded
  • CloudWatch failure → Auto Scaling, Cognito, Route 53 health checks degraded
  • Hidden internal dependencies between AWS services were not fully documented
Key Lessons
  • Capacity changes must be staged and monitored before global rollout
  • Service dependency maps must be maintained and tested (chaos engineering)
  • Critical services should have fallback modes when their dependencies fail
  • AWS now publishes internal service dependency information more transparently
System Design Insight Hidden dependencies are more dangerous than known ones. AWS's Kinesis outage took down services that had nothing to do with Kinesis — because they used it internally without users (or apparently some AWS teams) knowing. In your own systems: draw a dependency graph of every service. Ask: "If service X fails, what else breaks?" The answer is almost always "more than you think." Techniques: dependency injection for testability, circuit breakers at every dependency, graceful degradation when dependencies are unavailable.
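One concrete way to answer "if service X fails, what else breaks?" is to walk the reverse edges of a dependency graph. A small sketch whose example graph mirrors the cascade described above:

```python
from collections import deque

# service -> services it depends on
DEPENDS_ON = {
    "cloudwatch": {"kinesis"},
    "autoscaling": {"cloudwatch"},
    "cognito": {"cloudwatch"},
    "route53_health_checks": {"cloudwatch"},
}

def blast_radius(failed_service: str) -> set[str]:
    """Return every service transitively impacted when `failed_service` fails."""
    # Invert the edges: who depends on whom.
    dependents: dict[str, set[str]] = {}
    for svc, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(svc)

    impacted, queue = set(), deque([failed_service])
    while queue:
        svc = queue.popleft()
        for dependent in dependents.get(svc, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

print(blast_radius("kinesis"))
# {'cloudwatch', 'autoscaling', 'cognito', 'route53_health_checks'}
```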