Time and Clocks in Distributed Systems

Topics Covered

Network Time Protocol (NTP)

The NTP Stratum Hierarchy

How NTP Keeps Running: Polling and Discipline

How NTP Syncs a Clock: The Four-Timestamp Algorithm

Slewing vs Stepping

Where NTP Falls Short for Distributed Systems

Lamport Logical Clocks

The Three Rules

A Concrete Example

The One-Way Guarantee (and Its Limitation)

Total Ordering with Tie-Breaking

Where Lamport Clocks Appear in Practice

Limitations in Practice

Vector Clocks

The Data Structure

A Worked Example with Three Nodes

Comparing Vectors: Three Possible Relationships

Conflict Detection in Dynamo-Style Stores

The O(N) Overhead Problem

Hybrid Logical Clocks (HLC)

How HLC Works

A Concrete HLC Example

Why HLC Over the Alternatives

HLC in CockroachDB

The Bigger Picture: A Spectrum of Approaches

Every machine in a distributed system has its own hardware clock, and every hardware clock drifts. A typical quartz crystal oscillator drifts 10-20 parts per million — roughly 0.6-1.2 milliseconds per minute, or up to about 1.7 seconds per day. In a 100-node cluster, after just one hour without synchronization, clocks can disagree by over a hundred milliseconds. NTP exists to keep these clocks roughly aligned with UTC, typically within a few milliseconds on a LAN and tens of milliseconds over the internet.
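The drift arithmetic is easy to sanity-check. A quick sketch (the `drift` helper is just for illustration):

```python
# Convert a drift rate in parts per million into drift per minute and per day.
def drift(ppm: float) -> tuple[float, float]:
    rate = ppm * 1e-6                           # dimensionless frequency error
    per_minute_ms = round(rate * 60 * 1000, 3)  # ms gained or lost per minute
    per_day_s = round(rate * 86_400, 3)         # seconds gained or lost per day
    return per_minute_ms, per_day_s

print(drift(10))  # (0.6, 0.864) -> 0.6 ms/min, ~0.9 s/day
print(drift(20))  # (1.2, 1.728) -> 1.2 ms/min, ~1.7 s/day
```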

[Diagram: NTP four-timestamp synchronization between client and server]

The NTP Stratum Hierarchy

NTP organizes time sources in layers called strata, where each layer is one hop further from the ultimate time reference:

  • Stratum 0 — the reference clocks themselves: GPS receivers, cesium atomic clocks, radio clocks (like DCF77 or WWVB). These are not directly network-accessible — they connect to Stratum 1 servers via serial or PPS (pulse per second) interfaces.
  • Stratum 1 — servers directly attached to Stratum 0 devices. These are the most accurate network time sources, with sub-microsecond accuracy to the reference clock. Organizations like NIST, the US Naval Observatory, and Google operate Stratum 1 servers.
  • Stratum 2 — servers that synchronize from Stratum 1. Most corporate NTP servers and public pools (pool.ntp.org) operate at this level. Accuracy is typically 1-10ms.
  • Stratum 3 and beyond — each additional hop adds latency and uncertainty. Your application servers typically sit at Stratum 3 or 4, with accuracy of 1-50ms depending on network conditions.

The key takeaway: by the time your application server gets its time from NTP, the accuracy is in the low millisecond range. This is good enough for logging and monitoring, but not good enough to determine which of two events happened first when the events are less than 10ms apart on different machines.

How NTP Keeps Running: Polling and Discipline

NTP is not a one-time sync — it continuously polls time servers and adjusts the local clock. A typical NTP client polls its configured servers every 64-1024 seconds (the interval adapts based on clock stability). Each poll returns a fresh offset estimate.

The NTP clock discipline algorithm combines multiple poll results using a phase-locked loop (PLL) that filters out network jitter and gradually steers the local clock. The algorithm maintains a running estimate of both the clock offset (how far off the clock is right now) and the clock frequency error (how fast the clock is drifting). Correcting both means the clock stays accurate between polls rather than drifting back immediately.

On Linux, the chrony implementation has largely replaced the original ntpd. Chrony converges faster after boot, handles intermittent network connectivity better (important for laptops and VMs), and achieves tighter synchronization in most environments. AWS, GCP, and Azure all provide dedicated NTP endpoints (169.254.169.123 on AWS) with sub-millisecond accuracy within their networks.

How NTP Syncs a Clock: The Four-Timestamp Algorithm

NTP estimates the clock offset between client and server using four timestamps:

  1. T1 — the client records its local time when it sends the NTP request.
  2. T2 — the server records its local time when it receives the request.
  3. T3 — the server records its local time when it sends the response.
  4. T4 — the client records its local time when it receives the response.

From these four values, NTP computes:

  • Round-trip delay = (T4 - T1) - (T3 - T2). This subtracts the server's processing time from the total elapsed time, giving the network transit time.
  • Clock offset = ((T2 - T1) + (T3 - T4)) / 2. This estimates how far ahead or behind the client clock is relative to the server, assuming symmetric network delay.

The assumption of symmetric delay (same latency in both directions) is the main source of error. On asymmetric paths (common in WAN links where upload and download travel different routes), the offset estimate can be wrong by half the asymmetry. NTP clients mitigate this by querying multiple servers, discarding outliers (the intersection algorithm), and averaging the remaining estimates.
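The two formulas translate directly into code. A minimal sketch (timestamps as floats in seconds; the values below are invented for illustration):

```python
def ntp_offset_delay(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    """Clock offset and round-trip delay from NTP's four timestamps (seconds)."""
    delay = (t4 - t1) - (t3 - t2)          # total elapsed minus server processing
    offset = ((t2 - t1) + (t3 - t4)) / 2   # assumes symmetric network delay
    return offset, delay

# Illustrative values: the server's clock runs 50 ms ahead of the client's,
# with 4 ms of total network transit time.
offset, delay = ntp_offset_delay(100.000, 100.052, 100.053, 100.005)
print(round(offset, 3), round(delay, 3))  # 0.05 0.004
```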

Slewing vs Stepping

When NTP detects a clock offset, it has two correction strategies:

Slewing — for small offsets (under 128ms by default), NTP gradually adjusts the clock rate. Instead of jumping, the clock speeds up or slows down slightly until it converges with the server. This is safe for applications because timestamps remain monotonically increasing — no backward jumps. The tradeoff is speed: slewing a 100ms offset at the maximum slew rate of 500 parts per million takes about 200 seconds.
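The slew-time figure is just offset divided by slew rate (a throwaway calculation, not chrony or ntpd code):

```python
# Time needed to slew away an offset at a given maximum slew rate.
def slew_seconds(offset_s: float, slew_ppm: float = 500) -> float:
    return offset_s / (slew_ppm * 1e-6)

print(slew_seconds(0.100))  # ~200 s to correct a 100 ms offset at 500 ppm
```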

Stepping — for large offsets (over 128ms), NTP jumps the clock to the correct time instantly. This is necessary because slewing a 500ms offset would take many minutes. But stepping is dangerous: the clock can jump backward, causing timestamps to go backward too. Log entries can appear out of order, lease-based locks can expire at unexpected times, and distributed databases can assign lower timestamps to later events.

Common Pitfall

Clock steps are dangerous in distributed systems. If NTP steps a clock backward, a log entry at 12:00:01.000 can be followed by one at 12:00:00.500. Distributed databases using wall-clock timestamps for ordering can assign a lower timestamp to a later event, violating causality. Lease-based locks can appear valid on the client when they have already expired on the server. This is exactly why logical clocks exist — they provide ordering guarantees that physical clocks cannot.

Where NTP Falls Short for Distributed Systems

NTP keeps clocks within a few milliseconds of each other, but it cannot guarantee ordering. If Event A on Node 1 has timestamp 12:00:00.005 and Event B on Node 2 has timestamp 12:00:00.003, you cannot conclude that B happened before A — the clocks could easily be 5ms apart, meaning A might have happened first despite having the higher timestamp.

This uncertainty has real consequences in production:

  • Distributed databases — CockroachDB, Spanner, and YugabyteDB all need to order transactions across nodes. NTP alone is not accurate enough to determine which transaction committed first when they are within milliseconds of each other.
  • Monitoring and debugging — log timestamps from different servers can appear out of order, making cross-service request tracing confusing. A log entry from Service A that triggered a call to Service B might have a later timestamp than Service B's log entry for handling that request.
  • Lease-based locks — if a distributed lock expires at time T on the lock server, but the client's clock is behind by 10ms, the client thinks it still holds the lock when the server has already released it to another client. Two processes now believe they hold the same lock.

Google's TrueTime solves this differently: instead of a point estimate, TrueTime returns a time interval [earliest, latest] with a bounded uncertainty (typically under 7ms). Spanner uses "commit wait" — it waits out the uncertainty interval before committing, guaranteeing that the commit timestamp is after all concurrent transactions. Most systems cannot afford the GPS and atomic clock hardware that TrueTime requires, which is why logical and hybrid clocks exist as software-only alternatives.

NTP tries to synchronize physical clocks and gets close but never perfect. Lamport's insight (1978) was to abandon wall-clock time entirely and instead assign numbers to events that respect causality. If event A caused event B — meaning information from A could have influenced B — then A's timestamp is guaranteed to be smaller than B's. This is the happens-before relationship, and understanding it is fundamental to reasoning about ordering in distributed systems.

[Diagram: Lamport clock incrementing across message sends between two processes]

The Three Rules

Each process maintains a single integer counter L. The entire algorithm is three rules:

Rule 1 — Local event: Increment the counter. L = L + 1

Rule 2 — Send a message: Increment the counter and attach it to the message. L = L + 1; send(message, L)

Rule 3 — Receive a message with timestamp ts: Take the max of local and received, then increment. L = max(L, ts) + 1

That is the entire algorithm. The max on receive is the crucial operation — it ensures the receiver's clock jumps ahead to at least match the sender's, preserving the causal chain across processes.
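The three rules translate almost line-for-line into code (a sketch; the class and method names are my own). The demo at the bottom replays the P/Q exchange worked through in the next section:

```python
class LamportClock:
    """One integer per process; Lamport's three rules."""
    def __init__(self) -> None:
        self.time = 0

    def local_event(self) -> int:
        self.time += 1                       # Rule 1
        return self.time

    def send(self) -> int:
        self.time += 1                       # Rule 2: increment, attach to message
        return self.time

    def receive(self, ts: int) -> int:
        self.time = max(self.time, ts) + 1   # Rule 3: max with sender, then increment
        return self.time

# Replay the P/Q exchange.
p, q = LamportClock(), LamportClock()
p.local_event()          # L_P = 1
m = p.send()             # L_P = 2; message M carries timestamp 2
q.local_event()          # Q works independently: L_Q = 1
q.receive(m)             # max(1, 2) + 1 = 3
n = q.send()             # L_Q = 4; reply N carries timestamp 4
p.receive(n)             # max(2, 4) + 1 = 5
print(p.time, q.time)    # 5 4
```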

A Concrete Example

Consider two processes P and Q:

  1. P does a local event. L_P goes from 0 to 1.
  2. P sends message M to Q. L_P increments to 2. The message carries timestamp 2.
  3. Q has been doing its own work independently. L_Q is at 1.
  4. Q receives message M with timestamp 2. Q computes max(1, 2) + 1 = 3. L_Q is now 3.
  5. Q sends a reply N to P. L_Q increments to 4. The message carries timestamp 4.
  6. P receives reply N with timestamp 4. P computes max(2, 4) + 1 = 5. L_P is now 5.

The causal chain is preserved: the send of M (L = 2) happened before the receive of M (L = 3), the receive of M happened before the send of N (L = 4), and the send of N happened before the receive of N (L = 5). Each causal step has a strictly increasing timestamp.

The One-Way Guarantee (and Its Limitation)

Lamport clocks provide a one-way implication:

  • If A happens-before B, then L(A) < L(B) — this is guaranteed by the algorithm.
  • If L(A) < L(B), A might or might not have happened before B — the converse is NOT guaranteed.

Why? Because two completely independent processes can have different Lamport values without any causal relationship. If Process P has L = 10 from doing local work, and Process Q has L = 3 from doing less work, and they have never communicated, then L(P's event) > L(Q's event) but there is no causal relationship between them. The timestamps reflect the amount of local activity, not causation.

This asymmetry is the fundamental limitation of Lamport clocks. If you see L(X) = 5 and L(Y) = 7, you know that X did not happen after Y (because that would require L(X) > L(Y)). But you cannot conclude that X caused Y — they might be concurrent events on independent processes.

Key Insight

Lamport clocks answer 'did A happen before B?' with only a one-way guarantee. If L(A) < L(B), you know A did not happen AFTER B, but you cannot tell whether A CAUSED B or they are independent. To definitively detect that two events are concurrent — neither caused the other — you need vector clocks, which track per-process progress instead of a single shared counter.

Total Ordering with Tie-Breaking

Lamport timestamps can impose a total order on all events by using the pair (L, process_id) as the sort key. If two events have the same Lamport timestamp, the process ID breaks the tie deterministically. This total order is consistent with causality (causal events are correctly ordered) but gives concurrent events an arbitrary (though deterministic) ordering.
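The tie-break is literally a lexicographic sort on the pair (a sketch with hypothetical events):

```python
# Hypothetical events tagged (name, lamport_timestamp, process_id).
events = [
    ("write-x", 3, "B"),
    ("write-y", 3, "A"),   # same Lamport timestamp as write-x
    ("read-x",  1, "C"),
]

# Total order: sort by (L, process_id); the process id breaks the L = 3 tie.
total_order = sorted(events, key=lambda e: (e[1], e[2]))
print([e[0] for e in total_order])  # ['read-x', 'write-y', 'write-x']
```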

This is the basis of Lamport's distributed mutual exclusion algorithm: each process broadcasts its lock request with a Lamport timestamp, maintains a priority queue sorted by (timestamp, process_id), and enters the critical section only when its request is at the front of the queue and it has received acknowledgments with higher timestamps from all other processes.

Where Lamport Clocks Appear in Practice

  • ZooKeeper — ZXIDs (transaction IDs) follow Lamport semantics. Each transaction increments the ZXID. Followers adopt the leader's ZXID when replicating, preserving causal ordering across the cluster.
  • Raft consensus — log indices and term numbers function as logical timestamps. Each new entry increments the log index. Leader election increments the term. The combination (term, index) provides a total order consistent with causality.
  • Distributed databases — many use monotonically increasing logical timestamps for transaction ordering, ensuring that if transaction T1 causally preceded T2 (e.g., T2 reads T1's write), then T1 gets a lower timestamp.

Limitations in Practice

Lamport clocks are elegant but limited. They cannot detect concurrency — given L(X) = 5 and L(Y) = 7, you cannot tell whether X caused Y or they are independent. They also cannot detect gaps in the causal chain. If you receive a message with L = 10 but your clock is at L = 2, you know you missed some events, but you do not know which ones or how many. The clock jumps forward to preserve the causal relationship but provides no information about what happened in between.

For systems that need only a total order consistent with causality (consensus algorithms, replicated state machines), Lamport clocks are sufficient and efficient. For systems that need to detect conflicts between independent writers (distributed key-value stores, collaborative editing), vector clocks are necessary.

The simplicity of Lamport clocks — just one integer per process — makes them practical for high-throughput systems where vector clocks (one integer per process per event) would be too expensive.

Lamport clocks tell you "if A caused B, then L(A) < L(B)." Vector clocks tell you the full story: if V(A) < V(B), then A definitively happened before B. And if neither V(A) <= V(B) nor V(B) <= V(A), the events are definitively concurrent — neither caused the other. This bidirectional reasoning is what makes vector clocks the right tool for conflict detection in distributed storage systems like Amazon Dynamo and Riak.

[Diagram: Vector clocks tracking causality across three nodes]

The Data Structure

For a system with N processes, each process maintains a vector (array) of N integer counters. Process i's vector V_i tracks how many events process i has seen from each process, including itself. The rules mirror Lamport clocks but operate on vectors:

  • Local event at process i: increment own entry. V_i[i] += 1
  • Send from process i: increment own entry and attach the full vector to the message. V_i[i] += 1; send(message, V_i)
  • Receive at process j with message vector V_msg: merge component-wise, then increment own entry. V_j[k] = max(V_j[k], V_msg[k]) for all k; V_j[j] += 1

The element-wise max on receive is what gives vector clocks their power — after receiving, process j's vector reflects everything it knows from its own history AND everything the sender knew.
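The three vector rules can be sketched as a small class (names are my own; a fixed process count is assumed). The demo replays the first steps of the three-node example in the next section:

```python
class VectorClock:
    """One counter per process; this node owns index i of an n-entry vector."""
    def __init__(self, n: int, i: int) -> None:
        self.v = [0] * n
        self.i = i

    def local_event(self) -> list[int]:
        self.v[self.i] += 1                  # increment own entry
        return list(self.v)

    def send(self) -> list[int]:
        self.v[self.i] += 1                  # increment, then attach a copy
        return list(self.v)

    def receive(self, msg: list[int]) -> list[int]:
        self.v = [max(a, b) for a, b in zip(self.v, msg)]  # component-wise max
        self.v[self.i] += 1                  # then increment own entry
        return list(self.v)

# P1 writes, then P2 sees the write during replication and writes itself.
p1, p2 = VectorClock(3, 0), VectorClock(3, 1)
w1 = p1.local_event()       # [1, 0, 0]
p2.receive(w1)              # merge then increment -> [1, 1, 0]
print(p2.local_event())     # [1, 2, 0]
```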

A Worked Example with Three Nodes

Consider three nodes P1, P2, P3 in a key-value store, each starting with vector [0,0,0]:

  1. P1 writes key K. P1 increments its own entry: V_P1 = [1,0,0]. The write is stored with this vector.
  2. P2 reads P1's write (during replication). P2 merges vectors: max([0,0,0], [1,0,0]) = [1,0,0], then increments its own entry: V_P2 = [1,1,0]. P2 has now "seen" P1's write.
  3. P2 writes a new value to key K. P2 increments: V_P2 = [1,2,0]. This write causally follows P1's write because P2 saw it (V_P2[0] >= V_P1[0]).
  4. Meanwhile, P3 independently writes to key K without having seen any of P1 or P2's writes. P3 increments: V_P3 = [0,0,1].

Now compare P2's version [1,2,0] with P3's version [0,0,1]: P2[0] = 1 > P3[0] = 0, but P2[2] = 0 < P3[2] = 1. Neither dominates — these writes are concurrent. The system must keep both versions and let the application resolve the conflict.

Compare P1's version [1,0,0] with P2's version [1,2,0]: P1[0] = 1 <= P2[0] = 1, P1[1] = 0 <= P2[1] = 2, P1[2] = 0 <= P2[2] = 0, with at least one strictly less. So P1 < P2 — P1's version is causally older and can be safely discarded.
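The dominance test used in these comparisons is a pair of component-wise checks (a sketch):

```python
def compare(a: list[int], b: list[int]) -> str:
    """Classify two vector clocks: before, after, concurrent, or equal."""
    a_le_b = all(x <= y for x, y in zip(a, b))
    b_le_a = all(y <= x for x, y in zip(a, b))
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b; a's version can be discarded
    if b_le_a:
        return "after"       # b happened before a
    return "concurrent"      # neither dominates: keep both versions

print(compare([1, 2, 0], [0, 0, 1]))  # concurrent
print(compare([1, 0, 0], [1, 2, 0]))  # before
```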

Comparing Vectors: Three Possible Relationships

Given two event vectors V(A) and V(B), exactly one of three relationships holds:

  • V(A) < V(B) — every component of V(A) is less than or equal to the corresponding component of V(B), with at least one strictly less. This means A happened before B with certainty.
  • V(B) < V(A) — B happened before A, by the same component-wise comparison.
  • Concurrent — some components of V(A) are greater and some are less than V(B). Neither event caused the other. They occurred independently.

This third case — detecting concurrency — is exactly what Lamport clocks cannot do. It is the key to correct conflict resolution in distributed data stores.

Conflict Detection in Dynamo-Style Stores

Amazon Dynamo (and its open-source descendants Riak and Voldemort) uses vector clocks for versioning. Each stored object carries a vector clock as metadata. The conflict detection process works as follows:

On a write: The client reads the current value along with its vector clock V. The client makes changes and sends the updated value to the coordinating node. The coordinator increments its own component in V, producing V'. The new version is stored with V'.

On a read with multiple versions: If replicas have diverged (due to concurrent writes during a network partition), a read may return multiple versions with their vector clocks V1, V2:

  • If V1 < V2: version 1 is strictly older — it can be safely discarded. Version 2 supersedes it.
  • If V1 and V2 are concurrent (neither dominates): both versions are returned to the client as siblings. The application must resolve the conflict.

For a shopping cart, the resolution might be taking the union of items from both versions. For a user profile, it might be taking the version with the later wall-clock timestamp (last-write-wins on concurrent updates). The important thing is that the system detects the conflict rather than silently discarding one version.
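A client-side union merge for the shopping-cart case might look like this (a sketch, not Dynamo's or Riak's actual API):

```python
# Merge concurrent shopping-cart siblings by set union (client-side resolution).
def resolve_cart(siblings: list[set[str]]) -> set[str]:
    merged: set[str] = set()
    for cart in siblings:
        merged |= cart      # union keeps every item that either version saw
    return merged

print(sorted(resolve_cart([{"book", "pen"}, {"book", "mug"}])))  # ['book', 'mug', 'pen']
```

Note the union can resurrect a deleted item if one sibling removed it while another still holds it — the known "deleted item reappears in the cart" anomaly from the Dynamo paper.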

Key Insight

Vector clocks answer the question Lamport clocks cannot: are these two events truly independent? When Dynamo detects concurrent vector clocks on two versions of the same key, it knows with certainty that neither version saw the other write. Without vector clocks, the system would either silently discard one version (data loss) or always keep both versions (unnecessary conflict resolution even when one clearly supersedes the other).

The O(N) Overhead Problem

The major drawback of vector clocks is that each vector has one entry per process. In a cluster with 1,000 nodes, every event carries a 1,000-element vector. Every comparison examines all 1,000 elements. For a database handling millions of operations per second, this metadata overhead becomes significant.

Practical mitigations:

  • Scope to the replica set — Dynamo replicates each key across only 3-5 nodes. The vector clock for a key has entries only for those 3-5 nodes, not the entire cluster. This keeps vectors small and comparison fast.
  • Pruning — discard vector entries that have not been updated for a long time (e.g., 24 hours). This risks losing causality information for old events but keeps vector size bounded. Riak uses this approach with configurable age and size limits.
  • Dotted version vectors — an optimization used by Riak that distinguishes between a node's causal context (what it has seen) and the specific event it created. This reduces vector growth when multiple concurrent writes come from the same client coordinator.

For small, fixed replica sets (3-5 nodes per key), vector clock overhead is negligible — 3-5 integers per object version. For systems with hundreds of participants per data item, vector clocks become impractical and hybrid logical clocks (constant space regardless of cluster size) are preferred.

Logical clocks give you causal ordering but lose all connection to real-world time — you cannot say "this transaction committed at 2:03 PM." Physical clocks give you real-world time but cannot guarantee ordering across nodes. Hybrid Logical Clocks (HLC) combine both worlds: each timestamp is a pair (physical_time, logical_counter) that stays close to wall-clock time while preserving causal ordering. This is the approach used by CockroachDB, YugabyteDB, and other modern distributed databases that need both human-readable timestamps and correct transaction ordering without specialized hardware.

How HLC Works

Each node maintains two values: pt (the last physical time used) and lc (a logical counter for disambiguation).

On a local event or message send:

  1. Read the current wall-clock time now from the system clock.
  2. If now > pt: the physical clock has advanced. Set pt = now, lc = 0. The timestamp is (now, 0) — a clean, real-time timestamp.
  3. If now <= pt: the physical clock has not advanced (or has drifted backward). Keep pt unchanged, set lc = lc + 1. The timestamp is (pt, lc) — same physical time, bumped logical counter for monotonicity.

The intuition: if real time moves forward, use it directly and reset the logical counter. If it does not (because of clock skew or two events in the same millisecond), keep the physical component stable and use the logical counter to maintain strict monotonicity.

On receiving a message with HLC (msg_pt, msg_lc):

  1. Compute new_pt = max(now, msg_pt, pt) — the physical component advances to the maximum of local wall-clock time, received physical time, and the node's last physical component.
  2. The logical counter depends on which value won the max: if new_pt equals both the local pt and msg_pt, set lc = max(lc, msg_lc) + 1; if it equals only the local pt, increment lc; if it equals only msg_pt, set lc = msg_lc + 1; and if the wall clock alone advanced past everything, reset lc to 0.

The result: HLC timestamps never go backward on any node, and if event X causally precedes event Y (through message passing), then HLC(X) < HLC(Y) lexicographically.

A Concrete HLC Example

Suppose two nodes A and B, both with NTP-synced clocks:

  1. Node A does an event at wall time 1000. Since 1000 > pt (which was 0), set pt = 1000, lc = 0. Timestamp: (1000, 0).
  2. Node A does another event, wall time still reads 1000 (same millisecond). Since 1000 <= pt, keep pt = 1000 and increment lc to 1. Timestamp: (1000, 1). Two events in the same millisecond get distinct, ordered timestamps.
  3. Node B has wall time 998 (2ms behind A due to NTP skew). Node B receives a message from A carrying (1000, 1). new_pt = max(998, 1000, 0) = 1000. Since new_pt equals msg_pt but not B's previous pt, set lc = msg_lc + 1 = 2. Timestamp: (1000, 2).
  4. Later, Node B's wall time advances to 1003. Node B does a local event. Since 1003 > pt (1000), set pt = 1003, lc = 0. Timestamp: (1003, 0). The physical clock caught up, so the logical counter resets.

Notice that the physical component stays close to real time (never more than a few ms off) while the logical counter handles same-millisecond events and clock skew. This is what makes HLC timestamps useful for both ordering and human readability.
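The send and receive rules, including the receive-side case analysis, fit in a small class. This sketch (class and method names are my own, following the published HLC algorithm) replays the four steps above:

```python
class HLC:
    """Hybrid logical clock: timestamps are (physical_time, logical_counter)."""
    def __init__(self) -> None:
        self.pt = 0  # last physical time used
        self.lc = 0  # logical counter for same-timestamp disambiguation

    def now(self, wall: int) -> tuple[int, int]:
        """Local event or message send; `wall` is the current wall-clock reading."""
        if wall > self.pt:
            self.pt, self.lc = wall, 0   # real time advanced: use it, reset counter
        else:
            self.lc += 1                 # clock stalled or went backward: bump counter
        return (self.pt, self.lc)

    def receive(self, wall: int, msg_pt: int, msg_lc: int) -> tuple[int, int]:
        """Merge an incoming timestamp (msg_pt, msg_lc) into the local clock."""
        new_pt = max(wall, msg_pt, self.pt)
        if new_pt == self.pt == msg_pt:
            self.lc = max(self.lc, msg_lc) + 1   # tie on both sides
        elif new_pt == self.pt:
            self.lc += 1                         # local history dominates
        elif new_pt == msg_pt:
            self.lc = msg_lc + 1                 # sender's timestamp dominates
        else:
            self.lc = 0                          # wall clock dominates: clean reset
        self.pt = new_pt
        return (self.pt, self.lc)

# Replay the four-step example: A at wall time 1000, B lagging 2 ms behind.
a, b = HLC(), HLC()
print(a.now(1000))          # (1000, 0)
t = a.now(1000)             # same millisecond -> (1000, 1)
print(t)
print(b.receive(998, *t))   # B's wall clock reads 998 -> (1000, 2)
print(b.now(1003))          # B's clock catches up -> (1003, 0)
```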

Why HLC Over the Alternatives

Pure NTP timestamps — human-readable and cheap, but clocks can skew by milliseconds. Two transactions committed 1ms apart might get reversed timestamps if clocks are 5ms apart. No causal guarantee whatsoever.

Pure Lamport clocks — perfect causal ordering, but timestamps bear no relation to wall-clock time. A Lamport timestamp of 47382 tells you nothing about when the event occurred. Debugging and compliance ("when was this record modified?") become hard.

Vector clocks — detect both causality and concurrency, but O(N) space per event where N is the number of participating processes. For a 100-node database cluster handling millions of transactions, attaching a 100-element vector to every transaction and MVCC version is prohibitively expensive.

HLC — the physical component is real wall-clock time (within NTP accuracy), so timestamps are human-readable. The logical component handles ties and small skew, preserving causal ordering. Space is constant: just two integers per timestamp, regardless of cluster size. The trade-off: HLC provides a total order, not concurrency detection. It cannot tell you whether two events are concurrent (like vector clocks can) — only that one has a lower timestamp.

HLC in CockroachDB

CockroachDB uses HLC timestamps as the foundation of its MVCC (Multi-Version Concurrency Control) engine:

  • Every write is tagged with an HLC timestamp at commit time.
  • Reads at timestamp T return the latest version of each key with HLC less than or equal to T.
  • If transaction T1 writes a value and transaction T2 subsequently reads it, the HLC receive rule guarantees HLC(T1) < HLC(T2), preserving read-after-write consistency across nodes without GPS or atomic clock hardware.

CockroachDB enforces a maximum clock skew of 500ms between nodes (configurable via --max-offset). If a node's clock is more than 500ms ahead or behind the cluster, it is rejected from the cluster until it re-syncs. This bounded skew ensures the logical counter only needs to absorb small amounts of clock disagreement, not arbitrary divergence.

When a CockroachDB transaction encounters a value with a higher HLC timestamp than expected (indicating possible clock skew), it can restart the transaction at the higher timestamp rather than returning stale data. This "read refreshing" mechanism handles the edge cases where clock skew would otherwise cause a transaction to miss a concurrent write. The 500ms bound limits how far back in time a restart might need to go, bounding the performance impact.

YugabyteDB uses a similar HLC approach with the same bounded-skew assumption. Both databases demonstrate that HLC provides a practical middle ground: correct causal ordering for transaction processing, human-readable timestamps for debugging and compliance, and constant-space metadata — all without specialized hardware.

The Bigger Picture: A Spectrum of Approaches

The clocking mechanisms in this lesson form a spectrum from simple-but-fuzzy to precise-but-expensive:

  • NTP only — fast, cheap, real-world time, but no ordering guarantee across nodes. Fine for logging, monitoring, and human-facing timestamps.
  • Lamport clocks — one integer, causal ordering guaranteed, but no real-world time and no concurrency detection. Used in consensus algorithms (Raft, ZooKeeper).
  • Vector clocks — N integers, full causality and concurrency detection. Used in eventually-consistent stores with small replica sets (Dynamo, Riak).
  • HLC — two integers, causal ordering plus wall-clock fidelity. The practical sweet spot for distributed databases (CockroachDB, YugabyteDB) that need correct ordering without specialized hardware.
  • TrueTime — GPS + atomic clocks, bounded uncertainty interval, "commit wait" for linearizability. The gold standard (Google Spanner) but requires specialized hardware most organizations cannot afford.

Each system picks the mechanism that matches its ordering requirements and operational constraints. The common thread across all of these approaches: physical time alone is never sufficient for correct distributed ordering. Choosing the right clock type means understanding your system's specific tradeoff between ordering precision, metadata cost, and operational simplicity.