The best system design education comes from studying real catastrophic failures. These incidents shaped how the industry thinks about reliability, redundancy, and operational excellence.
On October 4, 2021, Facebook, Instagram, WhatsApp, and Messenger disappeared from the internet for roughly six hours. The cause: a routine BGP (Border Gateway Protocol) configuration change withdrew all of Facebook's IP prefixes from the global internet routing table. From the internet's perspective, Facebook ceased to exist.
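To make the routing-table mechanics concrete, here is a minimal, illustrative Python sketch (the prefixes are made up and this is not Facebook's infrastructure) of a longest-prefix-match lookup: once a network's prefixes are withdrawn from the table, every lookup for its addresses fails, and the rest of the internet simply has no path to it.

```python
import ipaddress

# Toy routing table: prefix -> next hop. Real tables hold hundreds of thousands of routes.
routing_table = {
    ipaddress.ip_network("203.0.113.0/24"): "peer-A",   # hypothetical "Facebook" prefix
    ipaddress.ip_network("198.51.100.0/24"): "peer-B",  # unrelated prefix
}

def lookup(addr: str):
    """Longest-prefix match: pick the most specific prefix containing addr."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in routing_table if ip in net]
    if not matches:
        return None  # no route: the destination is unreachable
    return routing_table[max(matches, key=lambda net: net.prefixlen)]

print(lookup("203.0.113.10"))   # "peer-A" -- reachable

# A BGP withdrawal removes the prefix from every router's table...
del routing_table[ipaddress.ip_network("203.0.113.0/24")]

print(lookup("203.0.113.10"))   # None -- packets for this address are dropped
```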
Estimated $100M in lost advertising revenue. Zuckerberg's personal net worth dropped ~$7B during the outage. Facebook stock fell 4.9%. The outage also amplified trust issues, coming in the same week as the Frances Haugen whistleblower revelations.
On Christmas Eve 2012, one of Netflix's highest-traffic days of the year, Amazon's US-EAST-1 region suffered a major ELB (Elastic Load Balancing) failure. Netflix streaming was down in the US for hours on Christmas Eve. This incident directly catalyzed Netflix's push into active-active multi-region failover and helped cement its famous "chaos engineering" discipline (Chaos Monkey itself predates the outage).
Netflix's architecture at the time fronted its streaming traffic in us-east-1 with Amazon ELB load balancers. During the incident, a maintenance process run by an AWS developer accidentally deleted a portion of the ELB control plane's state data. Running load balancers were not affected immediately, but as the control plane performed routine scaling and configuration operations, load balancers whose state had been lost began degrading and misrouting requests. A fraction of Netflix's load balancers were affected, so streaming requests from many devices could no longer reach Netflix's otherwise healthy backend services, and full recovery had to wait for AWS's slow, careful restoration of the ELB state data.
Netflix published "Chaos Engineering" as a discipline and open-sourced the Simian Army tools. Principles of Chaos Engineering became an industry standard, and chaos engineering practices are now common at major tech companies. This single Christmas Eve outage arguably improved the resilience of the entire industry by normalizing the idea of intentional, controlled failure injection.
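As an illustration of the idea, here is a minimal sketch (not Netflix's Chaos Monkey implementation; the instance inventory and terminate call are hypothetical) of controlled failure injection: randomly terminate one instance per group, only during business hours, and let monitoring confirm the service survives.

```python
import random
from datetime import datetime

# Hypothetical inventory: auto-scaling group name -> instance IDs.
GROUPS = {
    "api": ["i-01", "i-02", "i-03"],
    "playback": ["i-10", "i-11"],
}

def terminate(instance_id: str) -> None:
    # Stand-in for a real cloud API call that kills the instance.
    print(f"terminating {instance_id}")

def chaos_round(groups: dict[str, list[str]], opt_out: tuple[str, ...] = ()) -> None:
    """Terminate one random instance per group, only during working hours
    so engineers are around to watch the dashboards and respond."""
    if not (9 <= datetime.now().hour < 17):
        return
    for name, instances in groups.items():
        if name in opt_out or not instances:
            continue
        terminate(random.choice(instances))

chaos_round(GROUPS, opt_out=("playback",))
```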
On October 21, 2018, during routine network maintenance, GitHub lost connectivity for 43 seconds between its US East Coast network hub and its primary US East Coast datacenter, partitioning the primary MySQL cluster. During that window, automated failover (Orchestrator) promoted a new primary on the other side of the partition, while the original primary had already accepted writes that were never replicated (split-brain). When the network healed, GitHub had two diverged copies of its primary database: a data consistency crisis.
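A minimal sketch of why split-brain is so painful, using made-up transaction IDs rather than real MySQL GTID sets: after a clean failover, one server's history is a subset of the other's and the lagging side can simply replay the gap; after a split-brain, each side holds writes the other has never seen, and there is no automatic way to merge them.

```python
def compare_histories(a: set[str], b: set[str]) -> str:
    """Compare the sets of transactions applied on two database servers."""
    if a == b:
        return "in sync"
    if a <= b or b <= a:
        return "clean failover possible: one side is strictly behind, replay the gap"
    return "split-brain: both sides hold writes the other lacks, manual reconciliation needed"

# Transactions applied on each side of the partition (hypothetical IDs).
east = {"txn-1", "txn-2", "txn-3", "txn-4"}   # old primary kept taking writes
west = {"txn-1", "txn-2", "txn-5"}            # newly promoted primary took new writes

print(compare_histories(east, west))  # split-brain: manual reconciliation needed
```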
Discord stored messages in Cassandra for years. By 2022, they had 177M monthly active users, billions of messages, and a Cassandra cluster that was showing serious performance problems. Latency spikes, garbage collection pauses, and hotspot partitions were causing user-visible degradation. This led to a multi-year engineering effort that culminated in a complete migration to ScyllaDB (a C++ reimplementation of Cassandra).
Discord's message data model: partition key = (channel_id, bucket), where bucket is a fixed window of time. Large Discord servers (e.g., game communities with 200K+ members) generate enormous message volume in a single channel. This creates a "hot partition": a single Cassandra node receives a disproportionate fraction of all traffic and becomes a bottleneck regardless of cluster size.
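To illustrate the bucketing, here is a sketch of the key computation. Discord snowflake IDs encode a millisecond timestamp offset from 2015-01-01 in their upper bits, but the 10-day bucket width below is an assumption chosen for illustration: every message in the same channel and time window lands on the same partition, which is exactly why one hyperactive channel can overwhelm a single node.

```python
DISCORD_EPOCH_MS = 1_420_070_400_000       # 2015-01-01T00:00:00Z, the snowflake epoch
BUCKET_MS = 10 * 24 * 60 * 60 * 1000       # assumed bucket width: 10 days

def make_bucket(message_id: int) -> int:
    """Derive the time bucket from the timestamp embedded in a snowflake ID."""
    timestamp_ms = (message_id >> 22) + DISCORD_EPOCH_MS
    return timestamp_ms // BUCKET_MS

def partition_key(channel_id: int, message_id: int) -> tuple[int, int]:
    """(channel_id, bucket): all messages for one channel in one window share a partition."""
    return (channel_id, make_bucket(message_id))

# Two messages sent a few minutes apart in the same busy channel hit the same partition.
print(partition_key(123456789, 1050000000000000000))
print(partition_key(123456789, 1050001000000000000))
```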
ScyllaDB is a C++ reimplementation of Cassandra's architecture. It eliminates the JVM garbage collection pauses that caused Cassandra's latency spikes. ScyllaDB uses a "shard-per-core" architecture — each CPU core owns a subset of data and never shares memory with other cores (no locking). This delivers more consistent, predictable latency. Discord benchmarked ScyllaDB at 3× better p99 latency with the same hardware.
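A rough sketch of the shard-per-core idea (illustrative only, not ScyllaDB's actual hashing): each partition key deterministically maps to one core, so requests for that partition are handled entirely with that core's data and memory, with no cross-core locking.

```python
import hashlib

NUM_CORES = 8  # assumed number of shards (one per CPU core)

def owning_shard(partition_key: bytes, num_cores: int = NUM_CORES) -> int:
    """Deterministically route a partition key to the core that owns it."""
    digest = hashlib.md5(partition_key).digest()
    return int.from_bytes(digest[:8], "big") % num_cores

# Every request for the same partition always lands on the same core,
# so that core can keep the partition's data in core-local memory, lock-free.
for key in [b"channel:123:bucket:1933", b"channel:456:bucket:1933"]:
    print(key, "-> shard", owning_shard(key))
```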
On June 21, 2022, Cloudflare took 19 of its data centers offline simultaneously — including some of its highest traffic locations (São Paulo, London, Tokyo, Mumbai). The cause: a change to its BGP prefix announcement strategy, intended to optimize traffic routing, accidentally made those 19 locations unreachable. This cascaded into a massive traffic disruption affecting thousands of websites that rely on Cloudflare's CDN.
Cloudflare was rolling out a change to its prefix advertisement policy as part of a long-running project (its Multi-Colo PoP, or MCP, architecture) to make its busiest locations more flexible and resilient. An unintended re-ordering of terms in the new policy caused the routers in those locations to withdraw a critical subset of prefixes instead of announcing them. With those routes gone, the 19 affected data centers could no longer attract traffic, and the same withdrawal made it harder for Cloudflare's own engineers to reach the routers and revert the change.
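A toy sketch of why term ordering in a routing policy matters (purely illustrative; the terms and prefixes are invented, and real router policy languages differ): policies are evaluated top to bottom and the first matching term wins, so moving a broad withdraw/reject term above the terms that announce specific prefixes silently changes which routes get advertised.

```python
# Each term: (predicate over a prefix, action). First matching term wins.
def evaluate(policy, prefix: str) -> str:
    for matches, action in policy:
        if matches(prefix):
            return action
    return "reject"  # implicit default

announce_specifics = (lambda p: p.startswith("198.51.100."), "announce")
catch_all = (lambda p: True, "withdraw")

intended_policy = [announce_specifics, catch_all]   # specifics evaluated first
reordered_policy = [catch_all, announce_specifics]  # broad term moved to the top

for policy, label in [(intended_policy, "intended"), (reordered_policy, "re-ordered")]:
    print(label, "->", evaluate(policy, "198.51.100.0/24"))
# intended   -> announce
# re-ordered -> withdraw   (the critical prefix is silently withdrawn)
```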
On November 25, 2020, a capacity increase for the Amazon Kinesis service in US-East-1 triggered a cascading failure that took down not just Kinesis, but also Cognito, CloudWatch, Auto Scaling, and dozens of other AWS services that depended on Kinesis internally. This incident revealed the hidden coupling between AWS services that most users (and even AWS operators) weren't fully aware of.
AWS was adding capacity to the Kinesis front-end fleet. Each front-end server creates an operating system thread for every other server in the fleet (these threads are used to share membership and shard ownership information), so per-server thread count grows with fleet size. Adding the new servers pushed every server in the fleet past the operating system's maximum thread limit; the front-end servers could no longer build a usable shard map and became unable to serve requests. Then the cascade began: CloudWatch internally depends on Kinesis for data streaming → CloudWatch degraded. Auto Scaling uses CloudWatch metrics → Auto Scaling failed. Cognito (auth) uses Kinesis to collect API usage data → Cognito failed. Route 53 health checks use CloudWatch → health checks degraded.
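A back-of-the-envelope sketch of the failure mode (the numbers are invented; the real fleet size and OS limit were not disclosed in this form): because every front-end server keeps one thread per peer, the per-server thread count grows linearly with fleet size, so adding capacity pushes every existing server toward the same OS limit at once.

```python
OS_THREAD_LIMIT = 10_000       # assumed per-process thread cap
BASE_THREADS = 2_000           # assumed threads used for request handling, etc.

def threads_per_server(fleet_size: int) -> int:
    """One peer thread per *other* front-end server, plus a fixed baseline."""
    return BASE_THREADS + (fleet_size - 1)

for fleet_size in [7_500, 8_000, 8_100]:
    t = threads_per_server(fleet_size)
    status = "OK" if t <= OS_THREAD_LIMIT else "EXCEEDS LIMIT on every server"
    print(f"fleet={fleet_size:>6}  threads/server={t:>6}  {status}")

# Adding servers doesn't just strain the new machines: it pushes the whole
# fleet over the limit simultaneously, which is why the failure was fleet-wide.
```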