My Solution for Design a Distributed Locking System with Score: 9/10
by utopia4715
System requirements
Functional:
I will focus on a simple lock in this assignment.
A lock has a unique ID: a 32-character random string.
Client A requests and acquires a lock by making a remote call.
Client B requests a lock. Since the lock is already held by A, B would not get it. B's request returns an error.
Client A releases the lock.
Client B can now request and acquire the lock.
Non-Functional:
- The system should have a response time lower than 10ms for all API requests.
- The system should prevent or detect deadlocks.
- The system should be resilient to the failure of individual servers.
Capacity estimation
No numbers were given. I assume:
- Up to 1,000 clients may be trying to acquire a lock per second.
- There can be about 100 distinct locks.
API design
- acquire_lock(client_id, lock_id, TTL)
TTL indicates how long the lock is held before it expires, in milliseconds. The default would be some large value, e.g., 1 hour or 1 day.
-> returns success or failure
Note that the call does not block: if the lock has already been taken, the request fails immediately.
- release_lock(client_id, lock_id)
Client releases the lock.
- refresh_lock(client_id, lock_id, TTL)
The client can extend the TTL of a lock it holds by calling this API.
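To make the API concrete, here is a minimal sketch of what a client-facing interface could look like. Everything beyond the three operations above (the class and response type names, the field names) is an assumption for illustration.

# Hypothetical client-side view of the lock API; names and types are assumed.
from dataclasses import dataclass

@dataclass
class LockResponse:
    success: bool
    expires_at_ms: int | None = None  # when the current lease ends, if known

class LockClient:
    DEFAULT_TTL_MS = 60 * 60 * 1000  # default lease of 1 hour, per the spec above

    def acquire_lock(self, client_id: str, lock_id: str,
                     ttl_ms: int = DEFAULT_TTL_MS) -> LockResponse:
        """Non-blocking: fails immediately if the lock is already held."""
        ...

    def release_lock(self, client_id: str, lock_id: str) -> LockResponse:
        """Releases the lock only if client_id is the current owner."""
        ...

    def refresh_lock(self, client_id: str, lock_id: str, ttl_ms: int) -> LockResponse:
        """Extends the lease on a lock the client already holds."""
        ...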
Database design
Because of the low-latency requirement, I assume the main data store to be an in-memory cache, e.g., Redis, Memcached, or a local cache.
Among those, Redis provides good features:
- High performance
- Persistence to disk
- Atomic operations
Redis would hold this data structure per lock:
{
  lock_id,    // 32-byte GUID
  owner_id,   // ID of the client who currently holds the lock
  expires_at  // time at which the lease expires
}
Each item comes out to about 64 bytes. For 100 locks that is only 6.4 KB, and even at 100x the lock count it is still 640 KB, which easily fits on one Redis instance.
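As a rough sketch, assuming redis-py, each lock could be stored as a Redis hash keyed by lock ID, with the lease enforced by Redis's own key expiry. The key naming scheme and field names are my assumptions.

import time

import redis

r = redis.Redis()

def store_lock(lock_id: str, owner_id: str, ttl_ms: int) -> None:
    key = f"lock:{lock_id}"  # assumed key naming scheme
    r.hset(key, mapping={
        "owner_id": owner_id,
        "expires_at": int(time.time() * 1000) + ttl_ms,
    })
    r.pexpire(key, ttl_ms)  # Redis evicts the record when the lease lapses

Delegating expiry to PEXPIRE means expired locks disappear on their own, so the Lock Service never hands out a lock based on a stale record. (These two calls are not atomic on their own; an atomic variant appears in the detailed design below.)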
High-level design
See the diagram.
Request flows
Clients send acquire_lock() and release_lock() requests to the API Gateway, which forwards them to the Lock Service.
The Lock Service uses the cache as its data store.
Detailed component design
Lock Service
1. When the Lock Service receives an acquire request, it first checks whether a lock object for that lock ID exists in the cache.
2. If one exists, it returns "lock is already taken, try again later". It can include the time the lock is expected to be released.
3. If none exists, it creates the lock object: {lock_id, owner_id, expiration_time}
4. Before returning, the Lock Service reads back the cache object created in step (3). If it carries the caller's owner_id, it returns success to the client.
Step (4) is crucial when more than one client asks for the same lock at the same time. For example, Clients A and B request the same lock concurrently; neither sees a cache object, so both try to create one. Both writes appear to succeed, but only one value survives, say with owner_id = Client A. In step (4), the read-back for B's request shows A as the owner, so the service returns failure to B. Note that this check-then-verify flow still has a small race window (B's write could land after A's read-back, letting both verifications pass); making the create itself atomic, as sketched below, closes it.
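Because Redis's atomic operations are already on the feature list above, the create-if-absent step can be a single atomic call (SET with NX and PX), which removes the need for read-back verification entirely. A minimal sketch assuming redis-py, with the owner stored as the key's value and the key naming assumed:

import redis

r = redis.Redis()

def acquire_lock(client_id: str, lock_id: str, ttl_ms: int) -> bool:
    # SET ... NX PX: create the key only if it does not exist, with a TTL,
    # as one atomic step, so two racing clients cannot both win.
    return r.set(f"lock:{lock_id}", client_id, nx=True, px=ttl_ms) is not None

# Releasing must check ownership and delete atomically; a short Lua script
# run via EVAL gives us that in one step.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""

def release_lock(client_id: str, lock_id: str) -> bool:
    return r.eval(RELEASE_SCRIPT, 1, f"lock:{lock_id}", client_id) == 1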
Trade offs/Tech choices
Partitioning
Although one Redis server can hold all the locks, it would be a good idea to run multiple instances of the Lock Service and the cache servers, for fault tolerance and response time.
The cache servers should be partitioned by lock ID, because each lock must live on exactly one cache server. We can use consistent hashing so that no single server takes too much load, and so that few locks move when servers are added or removed (see the sketch below).
Lock Service can be partitioned by lock ID or client ID.
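A minimal sketch of consistent hashing for routing lock IDs to cache servers; the node names, virtual-node count, and hash choice are illustrative assumptions:

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes: list[str], vnodes: int = 64):
        # Place several virtual points per server on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, lock_id: str) -> str:
        # The first ring point clockwise from the lock's hash owns the lock.
        idx = bisect.bisect(self._keys, self._hash(lock_id)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
ring.node_for("3b8f0c2a")  # every request for a given lock_id hits the same shard

Adding or removing a server moves only the locks whose hashes fall next to that server's points, rather than reshuffling everything.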
Failure scenarios/bottlenecks
Deadlock
Let's look at a simple deadlock scenario:
Client A takes Lock 1. It now wants to take Lock 2.
Client B takes Lock 2. It now wants to take Lock 1.
We can take several approaches to prevent this from happening:
a. Ordering. Make sure clients take locks in a fixed order, e.g., increasing lock ID. This would prevent Client B from taking Lock 2 before Lock 1, avoiding the deadlock. Pro: a clean solution. Con: clients must act according to the protocol.
b. Timeout. Make sure clients do not keep trying forever. For example, Client B can try to take a lock for 1 minute and give up after that, which would let Client A take Lock 2. Pro: a simple solution. Con: clients must act according to the protocol.
c. Maintain a graph of which clients hold and want which locks. A cycle in the graph indicates a deadlock; once detected, we can force expiration of one of the locks involved. Pro: efficient, since it acts only when there actually is a deadlock. Con: we need a mechanism for the server to notify a client that its lock is being expired.
As we are trying to build a generic system, the timeout-based approach is simple and generally useful (a client-side sketch follows). If we later need something more sophisticated, we can invest in solutions like the graph-based approach.
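As a client-side sketch of the timeout approach, assuming the acquire_lock call from the API section (the retry interval and the 1-minute deadline are illustrative):

import time

def acquire_with_deadline(client, client_id: str, lock_id: str, ttl_ms: int,
                          deadline_s: float = 60.0, retry_s: float = 0.1) -> bool:
    # Retry the non-blocking acquire, but give up after deadline_s so that
    # two clients waiting on each other's locks cannot block forever.
    give_up_at = time.monotonic() + deadline_s
    while time.monotonic() < give_up_at:
        if client.acquire_lock(client_id, lock_id, ttl_ms).success:
            return True
        time.sleep(retry_s)
    return False  # caller should release locks it already holds and back off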
Out of Order Requests
If requests are received out of order, suboptimal outcomes can occur. For example, if a release_lock() is received before its matching acquire_lock(), the system would ignore the release_lock() and hold the lock until it expires. To avoid this, requests should carry sequence numbers; the Lock Service can keep recent request history for some time and use it to reject stale or out-of-order requests, as sketched below.
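A minimal sketch of the sequence-number guard on the Lock Service side; keeping the counters in a process-local dict is a simplification, since in practice they would live in the cache with a TTL:

# Highest sequence number seen per (client_id, lock_id) pair.
last_seq: dict[tuple[str, str], int] = {}

def is_stale(client_id: str, lock_id: str, seq: int) -> bool:
    key = (client_id, lock_id)
    if seq <= last_seq.get(key, -1):
        return True   # duplicate or out-of-order request: ignore it
    last_seq[key] = seq
    return False

With this guard, a release_lock() that was issued after an acquire_lock() but delivered before it carries the higher sequence number, so the late-arriving acquire_lock() is recognized as stale and rejected.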
Future improvements
Since distributed synchronization is such a critical component, I would try using formal methods to check the correctness of the design and implementation. There are well-established formal specification and model-checking tools, such as TLA+.