💥 Production Incidents

Real-World Failures

The best system design education comes from studying real catastrophic failures. These incidents shaped how the industry thinks about reliability, redundancy, and operational excellence.

Meta / Facebook

The BGP Route Withdrawal — 6 Hours of Total Darkness

P0 — Total Outage ⏱ 6 hours 28 minutes ~3.5B users affected Oct 4, 2021

On October 4, 2021, Facebook, Instagram, WhatsApp, and Messenger disappeared from the internet for roughly six and a half hours. The cause: a routine BGP (Border Gateway Protocol) configuration change withdrew all of Facebook's IP prefixes from the global routing table. From the internet's perspective, Facebook ceased to exist.

Timeline

15:40 UTC — T+0
A configuration change is pushed to the backbone routers that coordinate network traffic between Facebook's data centers. The command was intended to assess the capacity of the global backbone.
15:41 UTC — T+1 min
The faulty command causes all BGP connections from Facebook's backbone to the internet to be severed. All 36 BGP prefixes (IP blocks) are withdrawn from routing tables worldwide — Facebook disappears from the internet.
15:41–15:43 UTC
DNS servers that rely on Facebook's internal backbone for connectivity also go offline. The authoritative nameservers that translate facebook.com into IP addresses stop responding. Pings to Facebook's nameservers fail globally.
15:45 UTC
Facebook's internal tools (Workplace, internal chat, access systems) are also down — they depend on the same infrastructure. Engineers cannot communicate. Physical access cards to data centers stop working.
16:00–21:00 UTC
Engineers scramble to physically access data centers (access cards are down, requiring physical security escorts). Once inside, the misconfigured backbone routers must be manually rolled back. The process is slow — the tools used to manage the routers are also offline.
22:00 UTC — T+6h28m
Services begin recovering globally as BGP routes are restored and DNS propagates. Full recovery takes another hour due to the "thundering herd" as billions of clients simultaneously try to reconnect.
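A standard mitigation for that reconnect stampede is client-side exponential backoff with jitter, so retries spread out over a window instead of arriving in one synchronized wave. A minimal sketch of the idea (function names and limits are illustrative, not Facebook's client code):

```python
import random
import time

def reconnect_with_backoff(connect, base_delay=1.0, max_delay=300.0, max_attempts=10):
    """Retry `connect` with exponential backoff and full jitter.

    Spreading retries over a randomized window prevents millions of
    clients from hammering freshly restored servers at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Full jitter: sleep a random duration in [0, capped exponential].
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
    raise ConnectionError("still unreachable after backoff retries")
```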
Root Cause
  • BGP configuration audit tool had a bug that allowed a command to remove all BGP routes instead of individual prefixes
  • No circuit breaker on backbone config changes to detect "this will disconnect us from the internet"
  • Internal tooling co-located on the same infrastructure it manages — went down with the outage
Fixes & Lessons
  • Add verification step to BGP changes: "if applied, will this disconnect us from the internet?" (a sketch of such a check follows this list)
  • Out-of-band management network for critical infrastructure tools — separate from production
  • Emergency access procedures that don't depend on internal systems
  • Manual overrides and emergency contacts for data center physical access
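The verification bullet above can be made concrete as a pre-flight dry run that refuses to apply any change that would shrink the announced prefix set to nothing. A hedged sketch, where `simulate_announcements` is a hypothetical helper rather than Facebook's actual audit tool:

```python
def verify_bgp_change(current_prefixes: set[str], proposed_config: dict,
                      simulate_announcements) -> None:
    """Refuse to apply a backbone config change that would withdraw
    every prefix we currently announce to the internet.

    `simulate_announcements(proposed_config)` is assumed to return the
    set of prefixes that would remain announced after the change.
    """
    remaining = simulate_announcements(proposed_config)
    withdrawn = current_prefixes - remaining
    if not remaining:
        raise RuntimeError("change would withdraw ALL prefixes; refusing to apply")
    if len(withdrawn) / len(current_prefixes) > 0.5:
        raise RuntimeError(f"change withdraws {len(withdrawn)} of "
                           f"{len(current_prefixes)} prefixes; require manual review")
```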
System Design Insight The Facebook outage demonstrates that BGP (Border Gateway Protocol) is the routing protocol that makes the internet work — and it has almost no validation or safety mechanisms. A single misconfiguration can make a large portion of the internet unreachable. Designing for external network dependencies means assuming any of them can disappear suddenly and catastrophically.

Financial Impact

Estimated $100M in lost advertising revenue. Zuckerberg's personal net worth dropped ~$7B during the outage as Facebook stock fell 4.9%. The outage also amplified trust issues, landing in the same week as the Frances Haugen whistleblower revelations.

Netflix

Christmas Eve AWS EBS Outage — The Birth of Chaos Engineering

P0 — Major Degradation ⏱ 8+ hours ~1M users affected Dec 24, 2011

On Christmas Eve 2011 — one of Netflix's highest-traffic days of the year — Amazon's US-EAST-1 region suffered a major EBS (Elastic Block Store) failure, and Netflix streaming was down in the US for hours. This incident became a major catalyst for Chaos Monkey and Netflix's famous "chaos engineering" discipline.

What Failed

Netflix's architecture at the time relied heavily on Amazon EBS volumes for database storage. When EBS in us-east-1 experienced a cascade failure (storage API calls timing out, then failing outright), Netflix's Cassandra nodes could no longer write to their underlying EBS volumes. The failure then cascaded: Cassandra replication backlogs grew, nodes fell behind and became inconsistent, and the cluster degraded. Netflix's service discovery and autoscaling, which also depended on EBS-backed instances, failed to recover properly.

Root Cause
  • Single-region deployment — all compute in us-east-1
  • Cassandra data stored on EBS volumes — single point of failure at the storage layer
  • No chaos testing — failures were only discovered in production under real conditions
  • Service dependencies not isolated — a storage failure cascaded to service discovery
How Netflix Responded
  • Moved Cassandra to instance-store (ephemeral SSD) — faster, no EBS dependency
  • Built Chaos Monkey — randomly terminates production instances to force resilience
  • Multi-region active-active — services distributed across us-east-1, us-west-2, eu-west-1
  • Hystrix circuit breakers on every service dependency
  • Graceful degradation — if recommendations fail, show popular titles instead (see the sketch after this list)
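The last two bullets describe the circuit-breaker-plus-fallback pattern. A simplified sketch of the idea (not Hystrix itself; class and function names are illustrative):

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated failures; probe again after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        # While open, skip the failing dependency entirely and degrade gracefully.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: allow one probe through
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()

# Usage: if the recommendation service is down, fall back to popular titles.
# breaker = CircuitBreaker()
# titles = breaker.call(fetch_personalized_recommendations, fetch_popular_titles)
```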
System Design Insight Netflix's most important post-mortem insight: "Failures that haven't been tested in production will eventually happen in production at the worst possible time." Chaos Monkey (later expanded into the Simian Army) injects failures constantly so Netflix engineers are always prepared. The philosophy: "Anything that can fail will fail. Build systems that degrade gracefully, not catastrophically." This is why Netflix tolerates more chaos than any other major company — they deliberately cause it.

Legacy: The Birth of Chaos Engineering

Netflix published "Chaos Engineering" as a discipline and open-sourced the Simian Army tools. Principles of Chaos Engineering became an industry standard. Today, every major tech company has a chaos engineering practice. This single Christmas Eve outage arguably improved the resilience of the entire industry by normalizing the idea of intentional, controlled failure injection.

GitHub

MySQL Cluster Partitioning — 24 Hours of Data Inconsistency

P0 — Data Inconsistency ⏱ 24+ hours All GitHub users Oct 21, 2018

During network maintenance, a 43-second network partition cut off GitHub's US East Coast datacenter, which housed the primary MySQL cluster, from the rest of GitHub's network. During the partition, automated failover promoted a replica in another datacenter, leaving a primary on each side of the split (split-brain), and both accepted writes. When the network healed, GitHub had two diverged copies of their primary database — a data consistency crisis.

Timeline

22:52 UTC — Maintenance begins
Network equipment upgrade in US-East datacenter causes a brief network interruption to the MySQL primary cluster.
22:52 — 22:53 UTC (43 seconds)
MySQL cluster loses quorum. Orchestrator (GitHub's MySQL HA tool) promotes a replica to primary in a different datacenter. Now two primaries exist simultaneously — split brain. Both accept writes for 43 seconds.
22:53 UTC
Network restored. Two MySQL primaries with diverged data. GitHub detects the issue and immediately stops all write traffic to prevent further divergence. Site enters degraded read-only mode.
22:53 UTC Oct 21 — 11:23 UTC Oct 22 (12+ hours)
GitHub engineers write tooling to identify which rows diverged between the two copies (a toy sketch of this row-diff idea follows the timeline). Using Vitess tooling, they carefully merge the diverged data, resolving conflicts row by row. Pull requests, issues, and comments written during the 43-second window are at risk.
Oct 22 11:23 UTC
Write traffic restored. Full recovery achieved after 24+ hours of engineer effort. Some data from the 43-second window could not be fully reconciled — GitHub issued a blog post disclosing the issue transparently.
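The reconciliation work in the timeline boils down to diffing the two copies row by row for data touched during the divergence window. A toy sketch of that idea (table and column names are hypothetical, and GitHub's actual tooling was far more involved):

```python
def find_diverged_rows(primary_a, primary_b, table, window_start, window_end):
    """Compare rows updated during the split-brain window on both primaries.

    `primary_a` / `primary_b` are DB-API connections; rows are compared by
    primary key, and any mismatch is surfaced for manual conflict resolution.
    """
    query = (f"SELECT id, updated_at, payload FROM {table} "
             "WHERE updated_at BETWEEN %s AND %s")

    def snapshot(conn):
        with conn.cursor() as cur:
            cur.execute(query, (window_start, window_end))
            return {row[0]: row for row in cur.fetchall()}

    rows_a, rows_b = snapshot(primary_a), snapshot(primary_b)
    diverged = []
    for pk in rows_a.keys() | rows_b.keys():
        if rows_a.get(pk) != rows_b.get(pk):
            diverged.append((pk, rows_a.get(pk), rows_b.get(pk)))
    return diverged
```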
Root Cause
  • Orchestrator's automatic failover promoted a new leader without verifying the old leader was truly dead
  • No mechanism to prevent two primaries accepting writes simultaneously (fencing tokens)
  • 43-second partition was long enough for automatic failover to trigger
Fixes
  • Require fencing tokens: the old primary must be STONITH'd (Shoot The Other Node In The Head) before the new primary accepts writes (a minimal fencing sketch follows this list)
  • Increase automatic failover delay to reduce false positives
  • Implement write blocking on demoted primaries via client certificate revocation
  • Better tooling to detect and alert on split-brain immediately
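Fencing can be expressed as a monotonically increasing token that every write must carry; storage rejects writes stamped with a stale token, so a deposed primary cannot corrupt data even if it still believes it is the leader. A minimal sketch of the concept (not GitHub's or Orchestrator's implementation):

```python
class FencedStore:
    """Accept writes only from the holder of the newest fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def grant_leadership(self) -> int:
        # Issued by the failover coordinator (e.g. a consensus service) on every promotion.
        self.highest_token += 1
        return self.highest_token

    def write(self, token: int, key, value):
        if token < self.highest_token:
            # The caller was deposed; a newer primary exists. Reject the write.
            raise PermissionError(f"stale fencing token {token} < {self.highest_token}")
        self.data[key] = value

# store = FencedStore()
# old = store.grant_leadership()   # token 1: original primary
# new = store.grant_leadership()   # token 2: replica promoted after failover
# store.write(new, "issue:42", "comment")      # accepted
# store.write(old, "issue:42", "stale write")  # raises PermissionError
```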
System Design Insight This incident is the textbook example of why "split-brain" in distributed databases is dangerous. The lesson: when a primary fails, you must be certain it has stopped accepting writes before promoting a replica. "Shoot The Other Node In The Head" (STONITH) is the industry term — physically or programmatically forcing the old primary to stop before allowing the new one to start. Raft consensus avoids this by design — only nodes that receive majority votes can become leader.
Discord

Cassandra Hot Partition — Why Discord Moved to ScyllaDB

P1 — Performance Degradation ⏱ Chronic (months) 177M+ users 2022–2023

Discord stored messages in Cassandra for years. By 2022, they had 177M monthly active users, billions of messages, and a Cassandra cluster that was showing serious performance problems. Latency spikes, garbage collection pauses, and hotspot partitions were causing user-visible degradation. This led to a multi-year engineering effort that culminated in a complete migration to ScyllaDB (a C++ reimplementation of Cassandra).

The Hot Partition Problem

Discord's message data model: partition key = (channel_id, bucket) where bucket = month. Large Discord servers (e.g., game communities with 200K+ members) generate enormous message volume in a single channel. This creates a "hot partition" — a single Cassandra node that receives a disproportionate fraction of all traffic, becoming a bottleneck regardless of cluster size.

The Hot Partition Math A popular Discord server with 200K members and one highly active channel can generate 10,000 messages/hour in that single channel. That entire month's messages hash to one Cassandra partition, owned by one node (and its replicas). A 100-node cluster gives you zero benefit — that one node handles everything. No amount of horizontal scaling fixes a bad data model.
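The arithmetic behind that claim, using the article's illustrative numbers and the (channel_id, bucket) data model, looks like this:

```python
# Hot-partition math under the article's assumptions (illustrative numbers).
messages_per_hour = 10_000
hours_per_month = 24 * 30
rows_in_one_partition = messages_per_hour * hours_per_month   # 7,200,000 rows

def partition_key(channel_id: int, bucket: str) -> tuple:
    # Every message in this channel for the whole bucket (a month, per the
    # article) shares the same key, so it hashes to the same token range and
    # lands on the same node and its replicas, regardless of cluster size.
    return (channel_id, bucket)

print(rows_in_one_partition)                      # 7200000
print(partition_key(81384788765712384, "2022-06"))
```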

Why ScyllaDB?

ScyllaDB is a C++ reimplementation of Cassandra's architecture. It eliminates the JVM garbage collection pauses that caused Cassandra's latency spikes. ScyllaDB uses a "shard-per-core" architecture — each CPU core owns a subset of data and never shares memory with other cores (no locking). This delivers more consistent, predictable latency. Discord benchmarked ScyllaDB at 3× better p99 latency with the same hardware.

Cassandra Problems at Discord Scale
  • JVM garbage collection: "stop the world" GC pauses caused latency spikes of 100ms–2s
  • Hot partitions: large channels create uneven load on individual nodes
  • Tombstone accumulation: deleted messages leave tombstones that slow reads
  • Compaction pressure: background compaction competed with read/write latency
What Discord Did
  • Migrated 4PB of message data from Cassandra to ScyllaDB over 2 years
  • Used a "dual write" pattern: write to both, gradually shift reads (sketched after this list)
  • Fixed data model: added data locality hints to reduce hot partition severity
  • Result: 99th percentile latency improved from 40ms to 15ms, p999 from 250ms to 25ms
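The dual-write cutover mentioned above is a standard migration pattern: write to both stores, keep serving reads from the old one, and shift read traffic gradually once the new store is verified. A hedged sketch with hypothetical store interfaces (not Discord's code):

```python
import random

class DualWriteMessageStore:
    """Write to both stores during migration; shift reads over gradually."""

    def __init__(self, cassandra, scylla, read_from_new_pct=0):
        self.old = cassandra
        self.new = scylla
        self.read_from_new_pct = read_from_new_pct  # raise as confidence grows

    def write(self, channel_id, message_id, body):
        # The old store remains the source of truth until cutover completes.
        self.old.insert(channel_id, message_id, body)
        try:
            self.new.insert(channel_id, message_id, body)
        except Exception:
            pass  # log and backfill later; never fail the user-facing write

    def read(self, channel_id, message_id):
        if random.random() * 100 < self.read_from_new_pct:
            return self.new.get(channel_id, message_id)
        return self.old.get(channel_id, message_id)
```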
System Design Insight Database selection is not a one-time decision. At Discord's scale (billions of messages/day), the difference between Cassandra and ScyllaDB — same data model, same CQL API, same architecture — resulted in 3× better tail latency and millions of dollars saved in hardware. The lesson: don't just choose a database category (NoSQL wide-column); understand the implementation details (JVM GC vs C++ shard-per-core) that matter at your scale.
Cloudflare

Network Misconfiguration — 19 Data Centers Offline

P0 — Partial Global Outage ⏱ 1 hour 27 minutes ~50% of global traffic affected June 21, 2022

On June 21, 2022, Cloudflare took 19 of its data centers offline simultaneously — including some of its highest traffic locations (São Paulo, London, Tokyo, Mumbai). The cause: a change to its BGP prefix announcement strategy, intended to optimize traffic routing, accidentally made those 19 locations unreachable. This cascaded into a massive traffic disruption affecting thousands of websites that rely on Cloudflare's CDN.

What Happened

Cloudflare was changing its strategy for announcing IP prefixes to BGP peers. The new strategy was designed to route traffic more efficiently. However, the configuration change accidentally deaggregated their IP blocks in a way that caused many internet service providers (ISPs) to reject Cloudflare's routes due to filtering policies. The 19 affected data centers couldn't attract traffic because ISPs were dropping their route announcements.

06:27 UTC
Change deployed to production: new BGP prefix announcement strategy rolled out globally.
06:34 UTC (T+7 min)
19 data centers begin showing degraded traffic — significantly fewer requests than normal. Alarms trigger.
06:51 UTC (T+24 min)
Rollback initiated. However, BGP propagation is slow — withdrawing and re-announcing routes takes time to propagate globally, and BGP convergence can take 15–30 minutes.
07:54 UTC (T+1h27m)
All 19 data centers restored. Post-incident analysis begins.
Root Cause
  • BGP prefix deaggregation caused routes to be rejected by ISP filters (RPKI/IRR filtering)
  • Change was not tested against real ISP filtering policies before global rollout
  • No staged rollout — change deployed to all 19 affected locations simultaneously
Fixes
  • Staged BGP changes: deploy to 1 location → monitor → expand (a generic rollout sketch follows this list)
  • Better simulation environment that mirrors real ISP filtering behavior
  • Automated verification: "Will this route announcement be accepted by our top 100 peers?"
  • Faster rollback: pre-staged rollback commands ready to execute
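The staged-rollout bullet generalizes beyond BGP; the control loop is the same for any risky infrastructure change. A simplified sketch in which `deploy_to`, `health_ok`, and `rollback` are hypothetical hooks:

```python
import time

def staged_rollout(locations, deploy_to, health_ok, rollback, soak_seconds=1800):
    """Deploy to one location at a time; abort and roll back on any regression."""
    done = []
    for location in locations:          # e.g. smallest / lowest-traffic site first
        deploy_to(location)
        time.sleep(soak_seconds)        # soak period: let traffic and alerts catch problems
        if not health_ok(location):
            for loc in [location] + done:
                rollback(loc)
            raise RuntimeError(f"regression detected at {location}; rolled back")
        done.append(location)
    return done
```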
System Design Insight Cloudflare operates at global internet scale — any mistake can affect millions of websites simultaneously. The core lesson: "blast radius reduction through staged rollouts" applies even to infrastructure-level changes like BGP routing. Start with 1 location, wait 30 minutes, then expand. The cost of a slow rollout is operational delay (minutes). The cost of a simultaneous global change is a global outage (hours). The math always favors staged deployment.
Amazon Web Services

Kinesis Cascading Failure — Why Dependencies Matter

P0 — Cascading Outage ⏱ 8+ hours Thousands of AWS customers Nov 25, 2020

On November 25, 2020, a capacity increase for the Amazon Kinesis service in US-East-1 triggered a cascading failure that took down not just Kinesis, but also Cognito, CloudWatch, Auto Scaling, and dozens of other AWS services that depended on Kinesis internally. This incident revealed the hidden coupling between AWS services that most users (and even AWS operators) weren't fully aware of.

The Cascade Chain

AWS was adding capacity to the Kinesis front-end fleet. Each front-end server builds a shard-map cache and maintains an operating-system thread for every other server in the front-end fleet, so the added servers pushed the per-server thread count past the operating system's thread limit and the front-end servers became unable to serve requests. Then the cascade began: CloudWatch internally uses Kinesis for log streaming → CloudWatch failed. Auto Scaling uses CloudWatch metrics → Auto Scaling failed. Cognito (auth) uses CloudWatch → Cognito failed. Route 53 health checks use CloudWatch → health checks degraded.
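The trigger was essentially arithmetic: per-server thread count grows with fleet size, so a capacity add can silently push servers past the OS limit. A back-of-the-envelope check like the one below (all numbers are illustrative, not AWS's real figures) would flag the change before rollout:

```python
# Pre-scaling sanity check for thread growth (illustrative numbers only).
OS_THREAD_LIMIT = 4096        # hypothetical per-process limit from OS configuration
BASELINE_THREADS = 600        # request handling, timers, etc. (assumed)

def projected_threads(fleet_size: int) -> int:
    # Each front-end server keeps one thread per *other* server in the fleet.
    return BASELINE_THREADS + (fleet_size - 1)

current, after_scale_up = 3400, 3600
for size in (current, after_scale_up):
    threads = projected_threads(size)
    status = "OK" if threads <= OS_THREAD_LIMIT else "EXCEEDS OS LIMIT"
    print(f"fleet={size:5d} -> {threads} threads per server [{status}]")
# fleet= 3400 -> 3999 threads per server [OK]
# fleet= 3600 -> 4199 threads per server [EXCEEDS OS LIMIT]
```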

Root Cause Chain
  • Kinesis capacity increase → thread limit exceeded on front-end servers
  • Kinesis failure → CloudWatch (internal dependency) degraded
  • CloudWatch failure → Auto Scaling, Cognito, Route 53 health checks degraded
  • Hidden internal dependencies between AWS services were not fully documented
Key Lessons
  • Capacity changes must be staged and monitored before global rollout
  • Service dependency maps must be maintained and tested (chaos engineering)
  • Critical services should have fallback modes when their dependencies fail
  • AWS now publishes internal service dependency information more transparently
System Design Insight Hidden dependencies are more dangerous than known ones. AWS's Kinesis outage took down services that had nothing to do with Kinesis — because they used it internally without users (or apparently some AWS teams) knowing. In your own systems: draw a dependency graph of every service. Ask: "If service X fails, what else breaks?" The answer is almost always "more than you think." Techniques: dependency injection for testability, circuit breakers at every dependency, graceful degradation when dependencies are unavailable.
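One concrete way to answer "if service X fails, what else breaks?" is to walk the reverse edges of a dependency graph. A small sketch whose example graph mirrors the cascade described above:

```python
from collections import deque

# service -> services it depends on
DEPENDS_ON = {
    "cloudwatch": {"kinesis"},
    "autoscaling": {"cloudwatch"},
    "cognito": {"cloudwatch"},
    "route53_health_checks": {"cloudwatch"},
}

def blast_radius(failed_service: str) -> set[str]:
    """Return every service transitively impacted when `failed_service` fails."""
    # Invert the edges: who depends on whom.
    dependents: dict[str, set[str]] = {}
    for svc, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(svc)

    impacted, queue = set(), deque([failed_service])
    while queue:
        svc = queue.popleft()
        for dependent in dependents.get(svc, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

print(blast_radius("kinesis"))
# {'cloudwatch', 'autoscaling', 'cognito', 'route53_health_checks'}
```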