DevOps &
Infrastructure
How modern systems are deployed, scaled, monitored, and kept alive in production.
DevOps 01
Docker & Containers
A container packages an application with all its dependencies into a single portable unit. Unlike VMs, containers share the host OS kernel, making them lightweight (MB vs GB) and fast to start (ms vs seconds).
A VM is like shipping your entire house (OS, libraries, app). A container is like a shipping container: a standardised box with just your goods inside. The ship (host OS) handles the physical layer. Containers are interchangeable and stack efficiently.
Docker Architecture
Dockerfile → docker build → Docker Image (immutable layers)
Docker Image → docker run → Container (running instance)
Layers (each instruction creates a layer):
FROM ubuntu:22.04 # Base layer (cached)
RUN apt-get update && apt-get install -y python3 # Python layer (cached if unchanged)
COPY app.py /app/ # App layer (rebuilt on change)
CMD ["python3", "app.py"] # Entrypoint
Layer caching = fast builds. Put frequently changed layers last.
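Applied to a Python service, that rule gives the standard "install dependencies before copying code" pattern. A sketch only; the file names (requirements.txt, app.py) are illustrative:

```dockerfile
FROM python:3.11-slim
# Dependency list changes rarely: copy and install it first so the
# expensive pip layer stays cached across most rebuilds.
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt
# Application code changes often: copy it last so an edit only
# invalidates this cheap final layer.
COPY app.py /app/
CMD ["python", "/app/app.py"]
```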
Why Containers Matter for System Design
- Environment consistency: "Works on my machine" is eliminated. Same container runs in dev, staging, production.
- Fast scaling: Spin up a new container in seconds vs minutes for a VM.
- Resource efficiency: Pack more workloads onto the same hardware.
- Immutable infrastructure: Never patch a running container; rebuild and redeploy. Rollback = run the old image.
DevOps 02
Kubernetes
Kubernetes (K8s) is a container orchestrator. It decides where to run containers, restarts failed ones, scales them up/down, and manages networking between them.
Key Concepts
- Pod: Smallest deployable unit. One or more containers sharing a network namespace and storage. Usually one container per pod.
- Deployment: Declares desired state: "run 5 replicas of this pod." K8s ensures 5 are always running. Rolling updates without downtime.
- Service: Stable network endpoint for a set of pods. Load balances across pods. Pods come and go; the Service IP stays constant.
- Ingress: HTTP routing. Routes external traffic to the right service based on URL path or hostname.
- ConfigMap / Secret: Inject configuration and secrets into pods without baking them into images.
- HPA (Horizontal Pod Autoscaler): Automatically scales pod count based on CPU/memory/custom metrics.
- Namespace: Logical isolation within a cluster. Separate namespaces for prod, staging, each team.
In system design interviews: mention K8s for container orchestration, auto-healing (restarts failed pods), rolling deployments, and autoscaling. Don't dive into kubectl commands; focus on the architectural capabilities it provides.
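The Deployment and Service concepts above fit together as paired manifests. An illustrative sketch only; the app name, image, and ports are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 5              # desired state: K8s keeps 5 pods running
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0   # hypothetical image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web               # stable endpoint for whatever pods match this label
  ports:
    - port: 80
      targetPort: 8080
```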
DevOps 03
CI/CD Pipelines
Developer pushes code to main branch
↓
CI Pipeline triggers:
1. Pull latest code
2. Install dependencies
3. Run linters (code quality)
4. Run unit tests
5. Run integration tests
6. Build Docker image
7. Push image to registry
8. Security scan (Trivy, Snyk)
↓
CD Pipeline (if all checks pass):
1. Deploy to staging
2. Run smoke tests
3. Run E2E tests
4. Deploy to production (blue-green or canary)
5. Monitor for error rate spike
6. Automatic rollback if error rate > threshold
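The rollback decision in step 6 boils down to a simple comparison. A minimal sketch; the function name and 5% threshold are illustrative assumptions, not a real CD tool's API:

```python
ERROR_RATE_THRESHOLD = 0.05  # roll back if more than 5% of requests fail

def should_rollback(total_requests: int, failed_requests: int) -> bool:
    """Return True when the observed error rate exceeds the rollback threshold."""
    if total_requests == 0:
        return False  # no traffic yet; nothing to judge
    return failed_requests / total_requests > ERROR_RATE_THRESHOLD

# 80 failures out of 1000 requests is an 8% error rate -> roll back
print(should_rollback(1000, 80))   # True
print(should_rollback(1000, 20))   # False: 2% is under the threshold
```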
Blue-Green Deployment
- Run new version (green) alongside old (blue)
- Switch traffic instantly when ready
- Instant rollback: switch back to blue
- Requires double the infrastructure
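The "instant switch" is just an atomic flip of a router pointer between two warm environments. A toy sketch with illustrative names and versions:

```python
# Two identical environments running different versions
environments = {"blue": "v1.4.2", "green": "v1.5.0"}
live = "blue"   # all traffic currently goes to blue

def switch(current: str) -> str:
    """Flip traffic to the other environment; the old one stays warm for rollback."""
    return "green" if current == "blue" else "blue"

live = switch(live)
print(live, environments[live])    # green v1.5.0
live = switch(live)                # instant rollback: point back at blue
print(live, environments[live])    # blue v1.4.2
```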
Canary Deployment
- Send 1% of traffic to new version
- Monitor error rate, latency, business metrics
- Gradually increase to 100% if healthy
- Automatic rollback on anomaly detection
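The canary flow above is essentially a loop over traffic stages with a health gate at each step. A sketch under assumptions: the stage percentages and the is_healthy() thresholds are illustrative, and a real pipeline would pull metrics from a monitoring system rather than take them as arguments:

```python
STAGES = [1, 5, 25, 50, 100]  # percent of traffic sent to the new version

def is_healthy(error_rate: float, latency_p99_ms: float) -> bool:
    # In practice this would query Prometheus/Datadog/etc. for live metrics.
    return error_rate < 0.01 and latency_p99_ms < 500

def run_canary(metrics_by_stage):
    """Advance through traffic stages; abort (rollback) at the first unhealthy one."""
    for pct, (err, p99) in zip(STAGES, metrics_by_stage):
        if not is_healthy(err, p99):
            return f"rollback at {pct}%"
    return "promoted to 100%"

# Healthy at every stage -> full promotion
print(run_canary([(0.002, 120)] * 5))
# Error spike once the canary sees 25% of traffic -> automatic rollback
print(run_canary([(0.002, 120), (0.003, 130), (0.04, 300), (0, 0), (0, 0)]))
```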
DevOps 04
Observability โ The Three Pillars
Observability is the ability to understand a system's internal state from its external outputs. The three pillars: metrics, logs, and traces. You need all three โ they answer different questions.
Metrics
- Aggregated numerical measurements over time
- CPU%, request rate, error rate, latency p99
- Good for: dashboards, alerting, capacity planning
- Tools: Prometheus, Datadog, CloudWatch
Logs
- Discrete events with context
- "User 123 failed login at 14:32:01"
- Good for: debugging specific incidents
- Tools: ELK Stack, Loki, Splunk
Traces
- End-to-end request journey across services
- Shows latency at each service hop
- Good for: finding bottlenecks in distributed systems
- Tools: Jaeger, Zipkin, AWS X-Ray, Tempo
Golden Signals (Google SRE)
- Latency: Time to serve requests
- Traffic: Requests per second
- Errors: Rate of failed requests
- Saturation: How "full" the service is (CPU, memory, queue depth)
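Three of the four golden signals can be derived directly from a batch of request records. A sketch; the (latency_ms, success) record shape and the 60-second window are illustrative assumptions:

```python
import math

def p99(latencies):
    """Nearest-rank 99th percentile of a list of samples."""
    ordered = sorted(latencies)
    rank = math.ceil(0.99 * len(ordered)) - 1   # 0-based index of the p99 sample
    return ordered[rank]

# 1000 requests in a 60s window: 980 fast successes, 20 slow failures
requests = [(50, True)] * 980 + [(900, False)] * 20
window_seconds = 60

latencies = [lat for lat, _ in requests]
failed = sum(1 for _, ok in requests if not ok)

print("traffic:", round(len(requests) / window_seconds, 1), "req/s")  # 16.7
print("errors:", failed / len(requests))                              # 0.02
print("latency p99:", p99(latencies), "ms")                           # 900 ms
```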
DevOps 05
Autoscaling
- Horizontal Pod Autoscaler (HPA): Adds/removes pods based on CPU, memory, or custom metrics (e.g., Kafka consumer lag). Fast (30s reaction time).
- Vertical Pod Autoscaler (VPA): Adjusts CPU/memory limits for existing pods. Requires restart โ not good for availability-critical services.
- Cluster Autoscaler: Adds/removes nodes from the cluster when pods can't be scheduled. Slower (1โ5 minutes to provision a VM).
- KEDA (Event-driven autoscaling): Scale on Kafka lag, SQS queue depth, HTTP request count. More reactive than HPA for async workloads.
- Predictive autoscaling: Scale before load hits using historical patterns (Black Friday, morning traffic spike). AWS Auto Scaling supports scheduled scaling.
Always pre-warm: sudden traffic spikes (viral content, product launches) can overwhelm autoscaling. For known events, pre-scale manually. Set minimum replicas high enough to handle sudden spikes before autoscaling kicks in.
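The HPA's core scaling rule, per the Kubernetes documentation, is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A minimal sketch; the min/max bounds are illustrative of the minReplicas/maxReplicas settings mentioned above:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=20):
    """Kubernetes HPA scaling rule, clamped to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(desired, max_replicas))

# 5 pods averaging 90% CPU against a 60% target -> ceil(7.5) = 8 pods
print(desired_replicas(5, 90, 60))   # 8
# 5 pods at 20% CPU -> formula says 2, already at the min_replicas floor
print(desired_replicas(5, 20, 60))   # 2
```

Setting min_replicas high enough is exactly the pre-warming advice above: the floor absorbs a sudden spike while the autoscaler catches up.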