What makes a system reliable

  • Not about preventing all failures, but about continuing to work correctly when failures happen

CAP Theorem

  • Establishes a fundamental constraint
  • A distributed system can provide at most 2 of these 3 guarantees

Consistency

  • All nodes see the same data at the same time

Availability

  • Every request to a non-failing node receives a response

Partition Tolerance

  • System continues to operate despite network failures

Since network partitions are unavoidable in distributed systems, we end up having to choose between consistency and availability during partition events

  • E.g. Google Spanner chooses consistency, using atomic clocks to synchronise time across global data centres and maintain linearisable transactions
    • The majority partition remains readable and writable, while minority partitions become read-only
    • Trade-off: the minority partitions lose write access
  • E.g. Amazon's DynamoDB chooses availability: it uses eventual consistency, allowing writes during network partitions and resolving conflicts with last-write-wins based on timestamps
    • Trade-off: reads occasionally return stale data

Eventual Consistency

  • A guarantee that, if updates stop, all replicas will eventually converge to the same state
  • Powerful for large-scale systems
  • Allows writes to complete immediately without waiting for confirmation from all replicas, which improves performance and availability
  • E.g. Amazon allows users to add items even if some servers are temporarily unreachable

How to handle conflicts

  • Last write wins: use timestamps to pick the most recent update
    • Simple, but can silently lose data
  • Conflict-free replicated data types (CRDTs) use mathematical properties to guarantee that all replicas converge to the same state regardless of update order (both last write wins and a simple CRDT are sketched below)
  • Custom logic to resolve conflicts based on business rules
  • DynamoDB typically achieves consistency within milliseconds under normal conditions, making eventual consistency barely noticeable
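
A minimal sketch of the first two strategies in Python: a last-write-wins merge and a grow-only counter (G-Counter), one of the simplest CRDTs. The replica names and values are made up for illustration.

  # Last write wins: keep the value with the newest timestamp; the older write is lost.
  def lww_merge(a, b):
      # Each value is a (timestamp, data) tuple; return the newer one.
      return a if a[0] >= b[0] else b

  newer = lww_merge((1700000001.0, "cart-v2"), (1700000000.0, "cart-v1"))  # keeps "cart-v2"

  # G-Counter: a grow-only counter CRDT. Each replica increments only its own slot;
  # merging takes the element-wise max, so replicas converge regardless of merge order.
  class GCounter:
      def __init__(self, replica_id):
          self.replica_id = replica_id
          self.counts = {}  # replica_id -> count

      def increment(self, amount=1):
          self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

      def merge(self, other):
          for rid, count in other.counts.items():
              self.counts[rid] = max(self.counts.get(rid, 0), count)

      def value(self):
          return sum(self.counts.values())

  # Two replicas accept writes independently during a partition, then converge on merge.
  a, b = GCounter("A"), GCounter("B")
  a.increment(3)
  b.increment(2)
  a.merge(b)
  b.merge(a)
  assert a.value() == b.value() == 5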

Load balancing

  • Distribute requests across multiple servers

Types of load balancing

  1. Layer 4 load balancer
    1. Operates at the transport layer
    2. Makes routing decisions based on IP addresses and TCP/UDP ports
    3. Fast because it does not need to inspect application data
    4. Limited routing intelligence
  2. Layer 7 load balancer
    1. Operates at the application layer
    2. Can examine HTTP headers, URLs and request content to make smarter routing decisions
    3. More computationally expensive
  3. Least connections
    1. Sends requests to the server handling the fewest active connections (see the sketch after this list)
  4. Least time
    1. Factors in how quickly each server responds, avoiding slower servers
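
As an illustration of the least-connections policy, a minimal single-process sketch in Python; the class and server names are hypothetical.

  # Minimal least-connections load balancing policy (single process, illustrative only).
  class LeastConnectionsBalancer:
      def __init__(self, servers):
          # Active connection count per server; server names are hypothetical.
          self.active = {server: 0 for server in servers}

      def acquire(self):
          # Route to the server currently handling the fewest active connections.
          server = min(self.active, key=self.active.get)
          self.active[server] += 1
          return server

      def release(self, server):
          # Call when the request completes so the counts stay accurate.
          self.active[server] -= 1

  balancer = LeastConnectionsBalancer(["app-1", "app-2", "app-3"])
  server = balancer.acquire()  # least-loaded server handles this request
  balancer.release(server)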

Consistent hashing

  • Ensures the same client consistently hits the same server, which is critical for maintaining session state

Problems with horizontally scaled systems

  • Data is spread across multiple nodes, and we need to add or remove nodes without moving massive amounts of data
  • Traditional hashing (e.g. hash(key) mod N) does not work: adding one node changes N and requires remapping almost all data
    • Expensive and disruptive

How consistent hashing solves this

  • Instead of mapping keys directly to nodes, both keys and nodes are placed on a circular hash ring
  • Used by Amazon DynamoDB and Apache Cassandra, allowing efficient horizontal scaling
  1. Hash each node to determine its position on the ring
  2. To find where a key belongs, hash the key and walk clockwise around the ring until hitting the first node
  3. Replicate the data to the next N-1 nodes clockwise on the ring
  4. When a new node is added, it takes ownership of keys from its immediate neighbours
  5. When a node is removed, its keys are redistributed to the next node on the ring
  6. Only requires moving roughly K/N keys instead of nearly all keys (K = number of keys, N = number of nodes); see the sketch below
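
A minimal hash ring sketch in Python, without replication or virtual nodes; real implementations usually place each physical node at many ring positions (virtual nodes) so keys spread more evenly.

  import bisect
  import hashlib

  def ring_hash(key):
      # Map any string to a position on the ring (0 .. 2^32 - 1).
      return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

  class HashRing:
      def __init__(self, nodes=()):
          self.positions = []  # sorted node positions on the ring
          self.nodes = {}      # position -> node name
          for node in nodes:
              self.add_node(node)

      def add_node(self, node):
          pos = ring_hash(node)
          bisect.insort(self.positions, pos)
          self.nodes[pos] = node

      def remove_node(self, node):
          pos = ring_hash(node)
          self.positions.remove(pos)
          del self.nodes[pos]

      def get_node(self, key):
          # Walk clockwise: first node position >= hash(key), wrapping around to 0.
          if not self.positions:
              return None
          idx = bisect.bisect_left(self.positions, ring_hash(key)) % len(self.positions)
          return self.nodes[self.positions[idx]]

  ring = HashRing(["node-a", "node-b", "node-c"])
  owner = ring.get_node("user:42")  # adding or removing a node only remaps ~K/N keys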

Circuit breakers

  • Prevents cascading failures across dependent services when one service in a distributed system fails
  • Monitors the failure rate of calls to the dependent service

Three states

  1. Closed: requests flow normally
  2. Open: requests are blocked immediately without attempting to call the failing service, returning fast failures
  3. Half open: a limited number of test requests are allowed through to check whether the service has recovered

How it works

  • The circuit breaker counts successful and failed requests to a service
  • When the failure rate exceeds a threshold, the circuit breaker trips to the open state
  • This prevents resource exhaustion and gives the failing service time to recover
  • After a timeout period, the circuit breaker moves to half open and allows a few test requests through
  • If the test requests succeed, the circuit breaker closes again and normal operation resumes (a sketch follows)
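
A minimal sketch of this state machine in Python. For simplicity it trips on a count of consecutive failures rather than a failure rate, and the threshold and timeout values are arbitrary examples.

  import time

  class CircuitBreaker:
      def __init__(self, failure_threshold=5, recovery_timeout=30):
          self.failure_threshold = failure_threshold  # consecutive failures before opening
          self.recovery_timeout = recovery_timeout    # seconds to wait before half-open
          self.failures = 0
          self.state = "closed"
          self.opened_at = 0.0

      def call(self, func, *args, **kwargs):
          if self.state == "open":
              if time.monotonic() - self.opened_at >= self.recovery_timeout:
                  self.state = "half_open"  # let a test request through
              else:
                  raise RuntimeError("circuit open: failing fast")
          try:
              result = func(*args, **kwargs)
          except Exception:
              self._record_failure()
              raise
          self._record_success()
          return result

      def _record_success(self):
          # A successful (test) request closes the circuit and resets the count.
          self.failures = 0
          self.state = "closed"

      def _record_failure(self):
          self.failures += 1
          if self.state == "half_open" or self.failures >= self.failure_threshold:
              self.state = "open"
              self.opened_at = time.monotonic()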

Origin

  • Pioneered by Netflix’s Hystrix library
  • Now standard in microservice architectures

Rate limiting

  • Controls how many requests a client can make during a given time window
  • Protects systems from overload and abuse

Algorithms to rate limit

  1. Token bucket algorithm
    1. Accumulates tokens at a fixed rate, up to a maximum capacity
    2. Each request consumes a token
    3. Allows controlled bursts while maintaining an average rate (sketched after this list)
  2. Leaky bucket
    1. Processes requests at a constant rate, smoothing out traffic spikes
    2. Excess requests are queued or dropped
  3. Fixed window
    1. Counts requests in discrete time intervals
    2. Simple but prone to boundary effects such as clients doubling the rate by timing requests around the window boundaries
  4. Sliding window
    1. Uses a rolling average that smooths out problems of fixed windows
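
A minimal single-process token bucket sketch in Python; the rate and capacity values are arbitrary examples.

  import time

  class TokenBucket:
      def __init__(self, rate, capacity):
          self.rate = rate              # tokens added per second (average rate)
          self.capacity = capacity      # maximum burst size
          self.tokens = capacity
          self.last_refill = time.monotonic()

      def allow(self):
          now = time.monotonic()
          # Refill at a fixed rate, capped at the bucket capacity.
          self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
          self.last_refill = now
          if self.tokens >= 1:
              self.tokens -= 1          # each request consumes one token
              return True
          return False                  # out of tokens: reject or queue the request

  limiter = TokenBucket(rate=5, capacity=10)   # ~5 requests/second, bursts of up to 10
  if limiter.allow():
      pass  # handle the request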

Rate limiting in distributed systems

  • More complex: requests cannot simply be counted on a single server; the count needs coordination across multiple instances
  • One solution is a Redis-backed rate limiter, giving a shared counter and consistent limits across the service cluster (sketched below)
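
A sketch of the Redis approach, assuming the redis-py client and a locally running Redis instance: a fixed-window counter shared by every service instance. Production versions often use a Lua script or sorted sets for atomicity and sliding-window behaviour.

  import time
  import redis

  r = redis.Redis(host="localhost", port=6379)

  def allow_request(client_id, limit=100, window_seconds=60):
      # One shared counter per client per time window, visible to all instances.
      window = int(time.time() // window_seconds)
      key = f"ratelimit:{client_id}:{window}"
      count = r.incr(key)
      if count == 1:
          r.expire(key, window_seconds)  # counter expires after the window
      return count <= limit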

Tiered rate limiting

  • Different limits for authenticated vs anonymous users
  • Progressively stricter limits as suspicious patterns are detected

Monitoring

  • Provides visibility into system behaviour and performance
  • Large-scale systems generate terabytes of monitoring data daily
  • Static thresholds work for stable metrics, but fail when traffic patterns change
    • Modern systems use statistical anomaly detection (often ML models) that learns normal patterns and alerts on deviations
    • Composite alerts combine multiple signals to reduce noise, such as alerting only when CPU is high and error rates are increasing and response times are slow (both ideas are sketched below)
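
A minimal sketch of both ideas in Python: a rolling z-score check as a stand-in for learned anomaly detection, and a composite alert condition. The metric names and thresholds are made up for illustration.

  from statistics import mean, stdev

  def is_anomalous(history, latest, z_threshold=3.0):
      # Flag a data point that deviates more than z_threshold standard deviations
      # from the recent history of the metric.
      if len(history) < 10:
          return False
      sigma = stdev(history)
      if sigma == 0:
          return False
      return abs((latest - mean(history)) / sigma) > z_threshold

  def should_alert(cpu_pct, error_rate, p99_latency_ms):
      # Composite alert: fire only when several signals degrade together,
      # which cuts noise compared with alerting on any single metric.
      return cpu_pct > 90 and error_rate > 0.05 and p99_latency_ms > 1000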

4 key signal types:

  • Metrics
    • Time series
    • Numerical data: CPU usage, request rate, error count
  • Logs
    • Structured records of discrete events with contextual information
  • Traces
    • End-to-end request flows, showing how a single request moves through the distributed system
  • Events
    • Significant occurrences like deployments or config changes

Effective alerting

  • Balances two concerns: catching problems quickly without being overwhelmed by false alarms