What makes a system reliable

  • Not about preventing all failures, but about continuing to work correctly when failures happen

CAP Theorem

  • Establishes a fundamental constraint
  • A distributed system can provide at most 2 of these 3 guarantees

Consistency

  • All nodes see the same data at the same time

Availability

  • Every request to a non-failing node receives a response

Partition Tolerance

  • System continues to operate despite network failures

Since network partitions are unavoidable in distributed systems, we end up having to choose between consistency and availability during partition events

  • E.g. Google Spanner chooses consistency, using atomic clocks to synchronise time across global data centres and maintain linearisable transactions
    • The majority partition remains readable and writable, while minority partitions become read-only
    • Trade-off: the minority partitions lose write access
  • E.g. Amazon's DynamoDB chooses availability: it uses eventual consistency, allowing writes during network partitions and resolving conflicts with last-write-wins based on timestamps
    • Trade-off: reads occasionally return stale data

Eventual Consistency

  • A guarantee that, if updates stop, all replicas will eventually converge to the same state
  • Powerful for large-scale systems
  • Allows writes to complete immediately without waiting for confirmation from all replicas, which improves performance and availability
  • E.g. Amazon allows users to add items even if some servers are temporarily unreachable

How to handle conflicts

  • Last write wins: use timestamps to pick the most recent update
    • Simple, but can silently lose data
  • Conflict-free replicated data types (CRDTs) use mathematical properties to guarantee that all replicas converge to the same state regardless of update order (both last write wins and a simple CRDT are sketched below)
  • Custom logic to resolve conflicts based on business rules
  • DynamoDB typically achieves consistency within milliseconds under normal conditions, making eventual consistency barely noticeable
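
A minimal sketch of the first two strategies in Python: a last-write-wins merge and a grow-only counter (G-Counter), one of the simplest CRDTs. The replica names and values are made up for illustration.

  # Last write wins: keep the value with the newest timestamp; the older write is lost.
  def lww_merge(a, b):
      # Each value is a (timestamp, data) tuple; return the newer one.
      return a if a[0] >= b[0] else b

  newer = lww_merge((1700000001.0, "cart-v2"), (1700000000.0, "cart-v1"))  # keeps "cart-v2"

  # G-Counter: a grow-only counter CRDT. Each replica increments only its own slot;
  # merging takes the element-wise max, so replicas converge regardless of merge order.
  class GCounter:
      def __init__(self, replica_id):
          self.replica_id = replica_id
          self.counts = {}  # replica_id -> count

      def increment(self, amount=1):
          self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

      def merge(self, other):
          for rid, count in other.counts.items():
              self.counts[rid] = max(self.counts.get(rid, 0), count)

      def value(self):
          return sum(self.counts.values())

  # Two replicas accept writes independently during a partition, then converge on merge.
  a, b = GCounter("A"), GCounter("B")
  a.increment(3)
  b.increment(2)
  a.merge(b)
  b.merge(a)
  assert a.value() == b.value() == 5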

Load balancing

  • Distribute requests across multiple servers

Types of load balancing

  1. Layer 4 load balancer
    1. Operates at the transport layer
    2. Makes routing decisions based on IP addresses and TCP/UDP ports
    3. Fast because it does not need to inspect application data
    4. Limited routing intelligence
  2. Layer 7 load balancer
    1. Operates at the application layer
    2. Can examine HTTP headers, URLs and request content to make smarter routing decisions
    3. More computationally expensive
  3. Least connections
    1. Sends requests to the server handling the fewest active connections (see the sketch after this list)
  4. Least time
    1. Factors in how quickly each server responds, avoiding slower servers
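
As an illustration of the least-connections policy, a minimal single-process sketch in Python; the class and server names are hypothetical.

  # Minimal least-connections load balancing policy (single process, illustrative only).
  class LeastConnectionsBalancer:
      def __init__(self, servers):
          # Active connection count per server; server names are hypothetical.
          self.active = {server: 0 for server in servers}

      def acquire(self):
          # Route to the server currently handling the fewest active connections.
          server = min(self.active, key=self.active.get)
          self.active[server] += 1
          return server

      def release(self, server):
          # Call when the request completes so the counts stay accurate.
          self.active[server] -= 1

  balancer = LeastConnectionsBalancer(["app-1", "app-2", "app-3"])
  server = balancer.acquire()  # least-loaded server handles this request
  balancer.release(server)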

Consistent hashing

  • Ensures the same client consistently hits the same server, which is critical for maintaining session state

Problems with horizontally scaled systems

  • Data is spread across multiple nodes, and we need to add or remove nodes without moving massive amounts of data
  • Traditional hashing (e.g. hash(key) mod N) does not work: adding one node changes N and requires remapping almost all data
    • Expensive and disruptive

How consistent hashing solves this

  • Instead of mapping keys directly to nodes, both keys and nodes are placed on a circular hash ring
  • Used by Amazon DynamoDB and Apache Cassandra, allowing efficient horizontal scaling
  1. Hash each node to determine its position on the ring
  2. To find where a key belongs, hash the key and walk clockwise around the ring until hitting the first node
  3. Replicate the data to the next N-1 nodes clockwise on the ring
  4. When a new node is added, it takes ownership of keys from its immediate neighbours
  5. When a node is removed, its keys are redistributed to the next node on the ring
  6. Only requires moving roughly K/N keys instead of nearly all keys (K = number of keys, N = number of nodes); see the sketch below
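
A minimal hash ring sketch in Python, without replication or virtual nodes; real implementations usually place each physical node at many ring positions (virtual nodes) so keys spread more evenly.

  import bisect
  import hashlib

  def ring_hash(key):
      # Map any string to a position on the ring (0 .. 2^32 - 1).
      return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

  class HashRing:
      def __init__(self, nodes=()):
          self.positions = []  # sorted node positions on the ring
          self.nodes = {}      # position -> node name
          for node in nodes:
              self.add_node(node)

      def add_node(self, node):
          pos = ring_hash(node)
          bisect.insort(self.positions, pos)
          self.nodes[pos] = node

      def remove_node(self, node):
          pos = ring_hash(node)
          self.positions.remove(pos)
          del self.nodes[pos]

      def get_node(self, key):
          # Walk clockwise: first node position >= hash(key), wrapping around to 0.
          if not self.positions:
              return None
          idx = bisect.bisect_left(self.positions, ring_hash(key)) % len(self.positions)
          return self.nodes[self.positions[idx]]

  ring = HashRing(["node-a", "node-b", "node-c"])
  owner = ring.get_node("user:42")  # adding or removing a node only remaps ~K/N keys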

Circuit breakers

  • Prevents cascading failures across dependent services when one service in a distributed system fails
  • Monitors the failure rate of calls to the dependent service

Three states

  1. Closed: requests flow normally
  2. Open: requests are blocked immediately without attempting to call the failing service, returning fast failures
  3. Half open: a limited number of test requests are allowed through to check whether the service has recovered

How it works

  • The circuit breaker counts successful and failed requests to a service
  • When the failure rate exceeds a threshold, the circuit breaker trips to the open state
  • This prevents resource exhaustion and gives the failing service time to recover
  • After a timeout period, the circuit breaker moves to half open and allows a few test requests through
  • If the test requests succeed, the circuit breaker closes again and normal operation resumes (a sketch follows)
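
A minimal sketch of this state machine in Python. For simplicity it trips on a count of consecutive failures rather than a failure rate, and the threshold and timeout values are arbitrary examples.

  import time

  class CircuitBreaker:
      def __init__(self, failure_threshold=5, recovery_timeout=30):
          self.failure_threshold = failure_threshold  # consecutive failures before opening
          self.recovery_timeout = recovery_timeout    # seconds to wait before half-open
          self.failures = 0
          self.state = "closed"
          self.opened_at = 0.0

      def call(self, func, *args, **kwargs):
          if self.state == "open":
              if time.monotonic() - self.opened_at >= self.recovery_timeout:
                  self.state = "half_open"  # let a test request through
              else:
                  raise RuntimeError("circuit open: failing fast")
          try:
              result = func(*args, **kwargs)
          except Exception:
              self._record_failure()
              raise
          self._record_success()
          return result

      def _record_success(self):
          # A successful (test) request closes the circuit and resets the count.
          self.failures = 0
          self.state = "closed"

      def _record_failure(self):
          self.failures += 1
          if self.state == "half_open" or self.failures >= self.failure_threshold:
              self.state = "open"
              self.opened_at = time.monotonic()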

Origin

  • Pioneered by Netflix’s Hystrix library
  • Now standard in microservice architectures

Rate limiting

  • Controls how many requests a client can make during a given time window
  • Protects systems from overload and abuse

Algorithms to rate limit

  1. Token bucket algorithm
    1. Accumulates tokens at a fixed rate, up to a maximum capacity
    2. Each request consumes a token
    3. Allows controlled bursts while maintaining an average rate (sketched after this list)
  2. Leaky bucket
    1. Processes requests at a constant rate, smoothing out traffic spikes
    2. Excess requests are queued or dropped
  3. Fixed window
    1. Counts requests in discrete time intervals
    2. Simple but prone to boundary effects such as clients doubling the rate by timing requests around the window boundaries
  4. Sliding window
    1. Uses a rolling average that smooths out problems of fixed windows
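
A minimal single-process token bucket sketch in Python; the rate and capacity values are arbitrary examples.

  import time

  class TokenBucket:
      def __init__(self, rate, capacity):
          self.rate = rate              # tokens added per second (average rate)
          self.capacity = capacity      # maximum burst size
          self.tokens = capacity
          self.last_refill = time.monotonic()

      def allow(self):
          now = time.monotonic()
          # Refill at a fixed rate, capped at the bucket capacity.
          self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
          self.last_refill = now
          if self.tokens >= 1:
              self.tokens -= 1          # each request consumes one token
              return True
          return False                  # out of tokens: reject or queue the request

  limiter = TokenBucket(rate=5, capacity=10)   # ~5 requests/second, bursts of up to 10
  if limiter.allow():
      pass  # handle the request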

Rate limiting in distributed systems

  • More complex: requests cannot simply be counted on a single server; the count needs coordination across multiple instances
  • One solution is a Redis-backed rate limiter, giving a shared counter and consistent limits across the service cluster (sketched below)
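
A sketch of the Redis approach, assuming the redis-py client and a locally running Redis instance: a fixed-window counter shared by every service instance. Production versions often use a Lua script or sorted sets for atomicity and sliding-window behaviour.

  import time
  import redis

  r = redis.Redis(host="localhost", port=6379)

  def allow_request(client_id, limit=100, window_seconds=60):
      # One shared counter per client per time window, visible to all instances.
      window = int(time.time() // window_seconds)
      key = f"ratelimit:{client_id}:{window}"
      count = r.incr(key)
      if count == 1:
          r.expire(key, window_seconds)  # counter expires after the window
      return count <= limit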

Tiered rate limiting

  • Different limits for authenticated vs anonymous users
  • Progressively stricter limits as suspicious patterns are detected

Monitoring

  • Provides visibility into system behaviour and performance
  • Large-scale systems generate terabytes of monitoring data daily
  • Static thresholds work for stable metrics, but fail when traffic patterns change
    • Modern systems use statistical anomaly detection (often ML models) that learns normal patterns and alerts on deviations
    • Composite alerts combine multiple signals to reduce noise, such as alerting only when CPU is high and error rates are increasing and response times are slow (both ideas are sketched below)
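
A minimal sketch of both ideas in Python: a rolling z-score check as a stand-in for learned anomaly detection, and a composite alert condition. The metric names and thresholds are made up for illustration.

  from statistics import mean, stdev

  def is_anomalous(history, latest, z_threshold=3.0):
      # Flag a data point that deviates more than z_threshold standard deviations
      # from the recent history of the metric.
      if len(history) < 10:
          return False
      sigma = stdev(history)
      if sigma == 0:
          return False
      return abs((latest - mean(history)) / sigma) > z_threshold

  def should_alert(cpu_pct, error_rate, p99_latency_ms):
      # Composite alert: fire only when several signals degrade together,
      # which cuts noise compared with alerting on any single metric.
      return cpu_pct > 90 and error_rate > 0.05 and p99_latency_ms > 1000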

4 key signal types:

  • Metrics
    • Time series
    • Numerical data: CPU usage, request rate, error count
  • Logs
    • Structured records of discrete events with contextual information
  • Traces
    • End-to-end request flows, showing how a single request moves through the distributed system
  • Events
    • Significant occurrences like deployments or config changes

Effective alerting

  • Balances two concerns: catching problems quickly without being overwhelmed by false alarms