Codemia | Master System Design Interviews Through Active Practice

My Solution for Design a Weather Reporting System with Score: 8/10

by nebula_jubilee499

System requirements

The assumption here is that this is a mobile application and not a web or computer application. Another assumption is that an External API is hosting the information regarding weather around specific locations.

Functional:

Must be able to provide the most recent weather forecast for the upcoming 7-day period for any given location
If the location is based on the user's current position, GPS access is necessary
Users can possibly subscribe to a series of different locations (cities) shown in the main feed, in addition to a possible Home location
Push notifications about impending severe weather conditions are sent both to users inside the affected area and possibly to individuals who subscribe to weather updates for those areas.

Non-Functional:

Performance

Latency should be low, except for calls that have to go out to the external weather API to retrieve data for a given location. The external weather API must not be overwhelmed with requests, so a rate limiter will be required.
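As a rough illustration of the rate limiter mentioned above, here is a minimal per-user sliding-window sketch. The class name, the `limit` and `window_s` parameters, and the injectable `clock` are all assumptions made for this example; a production setup would more likely enforce the limit at the gateway or in a shared store such as Redis.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user in any `window_s`-second window."""

    def __init__(self, limit, window_s, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so the behaviour can be tested
        self._hits = defaultdict(deque)  # user_id -> timestamps of accepted requests

    def allow(self, user_id):
        now = self.clock()
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: reject without recording the attempt
        hits.append(now)
        return True
```

Each user gets an independent window, so one user refreshing aggressively does not consume another user's allowance.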

Availability

The application should always be ready to respond to requests and to push real-time data to users based on their subscribed and home locations.

Capacity estimation

To put realistic bounds on this problem, it can be assumed that competing applications with similar functionality already exist. As such, despite having a worldwide user base, our application can be said to have roughly 1 million daily active users (this is based on the Weather Channel application stating 35 million active users in its app description, and that application has been around for a long while).

Given the 1 million daily active users, incoming requests will need to be rate limited as mentioned earlier. To do this, we can limit each user to a maximum of about 10 requests every minute (this includes batch requests, where a user refreshing the feed may be fetching up to 20 subscribed locations at once).

The math then becomes:

10 requests/minute

10 reqs/min * 60 mins/hour = 600 reqs/hour

600 reqs/hour * 24 hours/day = 14,400 reqs/day

Thus a user will have a max of 14,400 reqs/day to leverage. Scaled up to our assumed 1 million users, that becomes ~14.4 billion requests/day in the worst case.
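The worst-case arithmetic can be checked with a few lines of Python. This sketch assumes the 10 requests/minute cap used in the calculation above and adds a per-second figure, which is not in the original estimate but is often easier to size servers against.

```python
REQS_PER_MIN = 10               # per-user cap used in the arithmetic above
DAILY_ACTIVE_USERS = 1_000_000  # assumed daily active user base

reqs_per_user_per_day = REQS_PER_MIN * 60 * 24
worst_case_per_day = reqs_per_user_per_day * DAILY_ACTIVE_USERS
worst_case_per_second = worst_case_per_day / 86_400  # seconds in a day

print(reqs_per_user_per_day)         # 14400
print(worst_case_per_day)            # 14400000000
print(round(worst_case_per_second))  # 166667
```

This is an upper bound that assumes every user hits the cap all day; real traffic would be far lower.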

API design

Given that we are leveraging an external API, it would be ideal to use REST calls for typical requests, alongside sockets or SDKs like Firebase for emergency notifications.

The primary API call would be the following REST API endpoint:

GET /batch-weather

{
  "locations": [
    {
      "city": "",
      "country": ""
    }
  ]
}

This endpoint can be used to obtain data for a series of locations depending on how many places a user has subscribed to.

A secondary endpoint can be used to retrieve weather for a specific location that a user is looking up, or for the user's direct current location.

GET /weather?long=""&lat=""

This endpoint takes the coordinates of whatever location the user is searching for directly.
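To make the two request shapes concrete, here is a small client-side sketch that builds the batch payload and the single-location query string. The helper names `batch_weather_body` and `weather_url` are hypothetical and exist only for this example; the field and parameter names are the ones shown above.

```python
import json
from urllib.parse import urlencode


def batch_weather_body(locations):
    """Serialize (city, country) pairs into the /batch-weather payload."""
    return json.dumps(
        {"locations": [{"city": city, "country": country} for city, country in locations]}
    )


def weather_url(lat, long):
    """Build the single-location lookup path with URL-encoded coordinates."""
    return "/weather?" + urlencode({"long": long, "lat": lat})


print(batch_weather_body([("Lisbon", "PT"), ("Tokyo", "JP")]))
print(weather_url(lat=51.5, long=-0.12))  # /weather?long=-0.12&lat=51.5
```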

Database design

While the weather data we wish to retrieve will all be provided by the external API mentioned in our assumptions, we still need to minimize the number of calls made directly to that API, since each call adds to the cost of using it. To do so, it would be best to leverage a cache that holds commonly requested location data. An LRU cache with a fairly short time to live fits here, since weather data can shift at a moment's notice.

The cache can be either memcached or Redis; either one can work here. The cache stores key-value pairs, with the key built from longitude + latitude or city + country and the value being the response that was retrieved.
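The following is an in-memory sketch of the caching idea, not Redis or memcached client code. The key formats (`geo:` and `city:` prefixes), the coordinate rounding precision, and the class name are assumptions chosen for illustration; rounding is one possible way to let nearby lookups share an entry.

```python
import time
from collections import OrderedDict


def coord_key(lat, long, precision=2):
    """Round coordinates so nearby lookups share one cache entry."""
    return f"geo:{round(lat, precision)}:{round(long, precision)}"


def city_key(city, country):
    """Normalize case and whitespace so 'Paris' and ' paris ' share a key."""
    return f"city:{city.strip().lower()}:{country.strip().lower()}"


class TTLLRUCache:
    """LRU cache whose entries also expire after `ttl_s` seconds."""

    def __init__(self, capacity, ttl_s, clock=time.monotonic):
        self.capacity = capacity
        self.ttl_s = ttl_s
        self.clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value), oldest first

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self.clock() + self.ttl_s, value)
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
```

With Redis, the same effect would typically come from setting an expiry on each key and choosing an LRU eviction policy, rather than from application code like this.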

High-level design

Review HLD

Request flows

See Sequence Diagram

Detailed component design

The most intricate part of this application is the Weather Retrieval Service. This is because it scans response data, whether it comes from the cache or from the external API, to detect a state of emergency. Once the state of emergency ends, a follow-up notification must be sent, along with relevant data from the affected location, to all users in the area and to those subscribed to that topic.
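The start and end notifications described above amount to comparing the set of locations in an emergency between two checks. A minimal sketch follows; the function name and the `"city,country"` location identifiers are assumptions for this example, and how an emergency is detected in the external API's response is left out because that response format is not specified here.

```python
def emergency_transitions(previously_active, currently_active):
    """Compare two polls and report which locations need a notification."""
    started = sorted(currently_active - previously_active)  # send "emergency began"
    ended = sorted(previously_active - currently_active)    # send "all clear" follow-up
    return started, ended


before = {"miami,us", "havana,cu"}
after = {"havana,cu", "nassau,bs"}
started, ended = emergency_transitions(before, after)
print(started)  # ['nassau,bs']
print(ended)    # ['miami,us']
```

Locations present in both sets produce no notification, which avoids re-alerting users on every poll while an emergency is ongoing.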

To keep track of this, locations that are undergoing emergencies should have reserved slots in the cache based on priority of an event and some form of interval polling will be required in order to keep track of the given situation. This traffic should also be separated from the usual requests and marked as high priority.

Due to the high volume of incoming requests, it may also be important to leverage a message queue like Kafka to keep track of all incoming weather requests, so that they are not lost if the application fails or crashes at any point. Kafka does not prioritize messages within a topic, so priority would have to be approximated, for example with a separate high-priority topic for emergency traffic that consumers drain first while regular requests wait their turn. The polling period for these emergencies should still be long enough that regular traffic can flow as well, which prevents bottlenecking to some extent.
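The "emergencies pass through first" ordering can be pictured with a small in-memory priority queue. This is a stand-in for illustration only and is not Kafka client code; the class name and the two priority levels are assumptions.

```python
import heapq
import itertools

EMERGENCY, REGULAR = 0, 1  # lower number is served first


class PriorityRequestQueue:
    """Serve emergency requests before regular ones, FIFO within each level."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps arrival order per priority

    def push(self, request, priority=REGULAR):
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


q = PriorityRequestQueue()
q.push("refresh feed A")
q.push("hurricane poll", priority=EMERGENCY)
q.push("refresh feed B")
print([q.pop() for _ in range(len(q))])
# ['hurricane poll', 'refresh feed A', 'refresh feed B']
```

A strict ordering like this can starve regular requests if emergency traffic never stops, which is why the emergency polling interval matters.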

Trade offs/Tech choices

Tech choices regarding Kafka were explained in the previous section. Having the queue there will help to keep track of requests and ensure continuity when network loss or system failure occurs (ideally not on the Kafka host that the queue runs on).

Failure scenarios/bottlenecks


The external API that we depend on to retrieve our weather information is a critical bottleneck and a single point of failure. If it goes down at any point, we cannot retrieve any more data. The best that can be done in such a case is to keep a more permanent copy of the cached data, so that the information there remains available until the API is reachable again.
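One way to sketch the "more permanent copy" fallback is to keep a second store that is written on every successful fetch and only read when the external API fails. The function name, the `fetch` callable, and the use of plain dicts for both stores are assumptions for this example.

```python
def get_weather(key, fetch, fresh_cache, stale_store):
    """Return (data, is_stale); fall back to the last known value on failure."""
    cached = fresh_cache.get(key)
    if cached is not None:
        return cached, False
    try:
        data = fetch(key)  # call to the external weather API (hypothetical callable)
    except Exception:
        if key in stale_store:
            return stale_store[key], True  # serve old data, flagged as stale
        raise  # nothing to fall back on
    fresh_cache[key] = data
    stale_store[key] = data  # long-lived copy, never expired by the short TTL
    return data, False
```

Returning the `is_stale` flag lets the client show the user that the data may be out of date instead of presenting it as current.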

Another possible bottleneck is our LB itself which may find itself flooded if traffic suddenly surges to a point where the LB and rate limiter can no longer keep up.

Future improvements

Horizontal scaling could be made easier in the future by splitting data retrieval by region. For example, we can route the majority of weather requests from the EU to an instance of our service hosted in that region of the world, instead of sending all traffic to a single location from which the data flows.
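Region-based routing could start out as a simple lookup from the caller's region to a regional deployment. The region codes, the hostnames (on the reserved `.example` domain), and the default region below are all made up for illustration.

```python
# Hypothetical regional deployments of the weather service.
REGION_ENDPOINTS = {
    "eu": "https://eu.weather.example",
    "us": "https://us.weather.example",
    "apac": "https://apac.weather.example",
}
DEFAULT_REGION = "us"  # assumed fallback when the region is unknown


def endpoint_for(region):
    """Pick the regional endpoint, falling back to the default region."""
    return REGION_ENDPOINTS.get(region.lower(), REGION_ENDPOINTS[DEFAULT_REGION])


print(endpoint_for("EU"))       # https://eu.weather.example
print(endpoint_for("unknown"))  # https://us.weather.example
```

In practice this decision is often made before the request reaches the application, for example with geo-aware DNS or at the load balancer.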