DevOps &
Infrastructure
How modern systems are deployed, scaled, monitored, and kept alive in production.
DevOps 01
Docker & Containers
A container packages an application with all its dependencies into a single portable unit. Unlike VMs, containers share the host OS kernel, making them lightweight (MB vs GB) and fast to start (ms vs seconds).
A VM is like shipping your entire house (OS, libraries, app). A container is like a shipping container: a standardised box with just your goods inside. The ship (host OS) handles the physical layer. Containers are interchangeable and stack efficiently.
Docker Architecture
Dockerfile → docker build → Docker Image (immutable layers)
Docker Image → docker run → Container (running instance)
Layers (each instruction creates a layer):
FROM ubuntu:22.04 # Base layer (cached)
RUN apt-get update && apt-get install -y python3 # Python layer (cached if unchanged)
COPY app.py /app/ # App layer (rebuilt on change)
CMD ["python3", "app.py"] # Entrypoint
Layer caching = fast builds. Put frequently changed layers last.
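Applied to a Python service, that rule gives the standard "install dependencies before copying code" pattern. A sketch only; the file names (requirements.txt, app.py) are illustrative:

```dockerfile
FROM python:3.11-slim
# Dependency list changes rarely: copy and install it first so the
# expensive pip layer stays cached across most rebuilds.
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt
# Application code changes often: copy it last so an edit only
# invalidates this cheap final layer.
COPY app.py /app/
CMD ["python", "/app/app.py"]
```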
Why Containers Matter for System Design
- Environment consistency: "Works on my machine" is eliminated. Same container runs in dev, staging, production.
- Fast scaling: Spin up a new container in seconds vs minutes for a VM.
- Resource efficiency: Pack more workloads onto the same hardware.
- Immutable infrastructure: Never patch a running container; rebuild and redeploy. Rollback = run the old image.
DevOps 02
Kubernetes
Kubernetes (K8s) is a container orchestrator. It decides where to run containers, restarts failed ones, scales them up/down, and manages networking between them.
Key Concepts
- Pod: Smallest deployable unit. One or more containers sharing a network namespace and storage. Usually one container per pod.
- Deployment: Declares desired state: "run 5 replicas of this pod." K8s ensures 5 are always running. Rolling updates without downtime.
- Service: Stable network endpoint for a set of pods. Load balances across pods. Pods come and go; the Service IP stays constant.
- Ingress: HTTP routing. Routes external traffic to the right service based on URL path or hostname.
- ConfigMap / Secret: Inject configuration and secrets into pods without baking them into images.
- HPA (Horizontal Pod Autoscaler): Automatically scales pod count based on CPU/memory/custom metrics.
- Namespace: Logical isolation within a cluster. Separate namespaces for prod, staging, each team.
In system design interviews: mention K8s for container orchestration, auto-healing (restarts failed pods), rolling deployments, and autoscaling. Don't dive into kubectl commands; focus on the architectural capabilities it provides.
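The Deployment and Service concepts above fit together as paired manifests. An illustrative sketch only; the app name, image, and ports are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 5              # desired state: K8s keeps 5 pods running
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0   # hypothetical image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web               # stable endpoint for whatever pods match this label
  ports:
    - port: 80
      targetPort: 8080
```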
DevOps 03
CI/CD Pipelines
Developer pushes code to main branch
↓
CI Pipeline triggers:
1. Pull latest code
2. Install dependencies
3. Run linters (code quality)
4. Run unit tests
5. Run integration tests
6. Build Docker image
7. Push image to registry
8. Security scan (Trivy, Snyk)
↓
CD Pipeline (if all checks pass):
1. Deploy to staging
2. Run smoke tests
3. Run E2E tests
4. Deploy to production (blue-green or canary)
5. Monitor for error rate spike
6. Automatic rollback if error rate > threshold
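The rollback decision in step 6 boils down to a simple comparison. A minimal sketch; the function name and 5% threshold are illustrative assumptions, not a real CD tool's API:

```python
ERROR_RATE_THRESHOLD = 0.05  # roll back if more than 5% of requests fail

def should_rollback(total_requests: int, failed_requests: int) -> bool:
    """Return True when the observed error rate exceeds the rollback threshold."""
    if total_requests == 0:
        return False  # no traffic yet; nothing to judge
    return failed_requests / total_requests > ERROR_RATE_THRESHOLD

# 80 failures out of 1000 requests is an 8% error rate -> roll back
print(should_rollback(1000, 80))   # True
print(should_rollback(1000, 20))   # False: 2% is under the threshold
```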
Blue-Green Deployment
- Run new version (green) alongside old (blue)
- Switch traffic instantly when ready
- Instant rollback: switch back to blue
- Requires double the infrastructure
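The "instant switch" is just an atomic flip of a router pointer between two warm environments. A toy sketch with illustrative names and versions:

```python
# Two identical environments running different versions
environments = {"blue": "v1.4.2", "green": "v1.5.0"}
live = "blue"   # all traffic currently goes to blue

def switch(current: str) -> str:
    """Flip traffic to the other environment; the old one stays warm for rollback."""
    return "green" if current == "blue" else "blue"

live = switch(live)
print(live, environments[live])    # green v1.5.0
live = switch(live)                # instant rollback: point back at blue
print(live, environments[live])    # blue v1.4.2
```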
Canary Deployment
- Send 1% of traffic to new version
- Monitor error rate, latency, business metrics
- Gradually increase to 100% if healthy
- Automatic rollback on anomaly detection
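The canary flow above is essentially a loop over traffic stages with a health gate at each step. A sketch under assumptions: the stage percentages and the is_healthy() thresholds are illustrative, and a real pipeline would pull metrics from a monitoring system rather than take them as arguments:

```python
STAGES = [1, 5, 25, 50, 100]  # percent of traffic sent to the new version

def is_healthy(error_rate: float, latency_p99_ms: float) -> bool:
    # In practice this would query Prometheus/Datadog/etc. for live metrics.
    return error_rate < 0.01 and latency_p99_ms < 500

def run_canary(metrics_by_stage):
    """Advance through traffic stages; abort (rollback) at the first unhealthy one."""
    for pct, (err, p99) in zip(STAGES, metrics_by_stage):
        if not is_healthy(err, p99):
            return f"rollback at {pct}%"
    return "promoted to 100%"

# Healthy at every stage -> full promotion
print(run_canary([(0.002, 120)] * 5))
# Error spike once the canary sees 25% of traffic -> automatic rollback
print(run_canary([(0.002, 120), (0.003, 130), (0.04, 300), (0, 0), (0, 0)]))
```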
DevOps 04
Observability โ The Three Pillars
Observability is the ability to understand a system's internal state from its external outputs. The three pillars: metrics, logs, and traces. You need all three โ they answer different questions.
Metrics
- Aggregated numerical measurements over time
- CPU%, request rate, error rate, latency p99
- Good for: dashboards, alerting, capacity planning
- Tools: Prometheus, Datadog, CloudWatch
Logs
- Discrete events with context
- "User 123 failed login at 14:32:01"
- Good for: debugging specific incidents
- Tools: ELK Stack, Loki, Splunk
Traces
- End-to-end request journey across services
- Shows latency at each service hop
- Good for: finding bottlenecks in distributed systems
- Tools: Jaeger, Zipkin, AWS X-Ray, Tempo
Golden Signals (Google SRE)
- Latency: Time to serve requests
- Traffic: Requests per second
- Errors: Rate of failed requests
- Saturation: How "full" the service is (CPU, memory, queue depth)
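Three of the four golden signals can be derived directly from a batch of request records. A sketch; the (latency_ms, success) record shape and the 60-second window are illustrative assumptions:

```python
import math

def p99(latencies):
    """Nearest-rank 99th percentile of a list of samples."""
    ordered = sorted(latencies)
    rank = math.ceil(0.99 * len(ordered)) - 1   # 0-based index of the p99 sample
    return ordered[rank]

# 1000 requests in a 60s window: 980 fast successes, 20 slow failures
requests = [(50, True)] * 980 + [(900, False)] * 20
window_seconds = 60

latencies = [lat for lat, _ in requests]
failed = sum(1 for _, ok in requests if not ok)

print("traffic:", round(len(requests) / window_seconds, 1), "req/s")  # 16.7
print("errors:", failed / len(requests))                              # 0.02
print("latency p99:", p99(latencies), "ms")                           # 900 ms
```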
DevOps 05
Autoscaling
- Horizontal Pod Autoscaler (HPA): Adds/removes pods based on CPU, memory, or custom metrics (e.g., Kafka consumer lag). Fast (30s reaction time).
- Vertical Pod Autoscaler (VPA): Adjusts CPU/memory limits for existing pods. Requires restart โ not good for availability-critical services.
- Cluster Autoscaler: Adds/removes nodes from the cluster when pods can't be scheduled. Slower (1โ5 minutes to provision a VM).
- KEDA (Event-driven autoscaling): Scale on Kafka lag, SQS queue depth, HTTP request count. More reactive than HPA for async workloads.
- Predictive autoscaling: Scale before load hits using historical patterns (Black Friday, morning traffic spike). AWS Auto Scaling supports scheduled scaling.
Always pre-warm: sudden traffic spikes (viral content, product launches) can overwhelm autoscaling. For known events, pre-scale manually. Set minimum replicas high enough to handle sudden spikes before autoscaling kicks in.
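The HPA's core scaling rule, per the Kubernetes documentation, is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A minimal sketch; the min/max bounds are illustrative of the minReplicas/maxReplicas settings mentioned above:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=20):
    """Kubernetes HPA scaling rule, clamped to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(desired, max_replicas))

# 5 pods averaging 90% CPU against a 60% target -> ceil(7.5) = 8 pods
print(desired_replicas(5, 90, 60))   # 8
# 5 pods at 20% CPU -> formula says 2, already at the min_replicas floor
print(desired_replicas(5, 20, 60))   # 2
```

Setting min_replicas high enough is exactly the pre-warming advice above: the floor absorbs a sudden spike while the autoscaler catches up.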