System requirements


Functional:

  • The parking lot handles different vehicle types (cars and motorcycles)
  • Return count of free spaces real time
  • Enter the parking lot and receive a ticket. A ticket has an id, time entry.
  • Given a parking slot, park in it. The system accounts for the now-occupied lot.
  • Given a ticket id, exit the parking lot and get a receipt of the cost
  • Support different cost schemes per duration and time of day. Free for less than 15 minutes, max per day $20, and cost per half hour from 7am until 7pm, and from 7pm to 7am. Monthly/year pass holders.



Non-Functional:

List non-functional requirements for the system...

  • Easy to change price scheme in the future (evolvability).
  • Capacity of 1e5 parking slots.
  • 1e3 entry and 1e3 exist cars per hour average, 2e3 peak for each
  • Very high availability (people need to go in and out of the parking lot all the time). In case of system failure, possible mitigation is to open the entry and exit barriers until the system is restored.


Capacity estimation

Estimate the scale of the system you are going to design...

  • 1e5 parking slots. Each slot can take id (2 bytes = 2^16 bits ~ good for 64K slots), type (1 byte, car, small, motorcycle, electric stations, etc.), status (1 byte - free, occupied, unavailable (broken, reserved)) = 4 bytes. Total 4e5 Bytes < 1MB
  • 2e3 entry and exit per hour ~ 33 rpm ~ 0.5 rps. Peak ~1 rps. Same for payment



API design

Define what APIs are expected from the system...

  • Get number of free spaces per type
  • Enter the parking lot and receive a ticket. A ticket has an id, and time of entry
  • Given a ticket id, see cost, pay, get a receipt of the cost, exit the parking lot




Database design

Database Choice: Key tables: ParkingSpots (id: primary key decimal, garage: foreign key to Garage, description: varchar, status: varchar, type: varchar). Reservation: id: primary key decimal, user: foreign key to User, garage_id: fk to Garage, status: varchar (booked, paid), cost: decimal. Also a Garage: id: primary key decimal, name: varchar, rate: decimal. We can also have User: id: primary key decimal, email: varchar, name: varchar (optional)). Finally, we can also have a Vehicle table to ease returning customers and vehicles: plate: primary key varchar, type: varchar (small, etc.).



High-level design

As a high level, the user interacts via a mobile or webapp for reservations, which communicates to the backend service. For walk-ins, we can have a station with the web interface that the user can interact with. The backend is composed of a server (we can add more later for scalability and reliability if needed), and a database. given the small scale, and because we want high consistency (we don't want to have double booking for example), we should use a relational database like PostgresSQL or Oracle. For peak times, we can have multiple webservers and also read replicas for the database with high consistency. For example; UI -> webserver -> DB.




Request flows

Components and Flow: The user requests the number of spots available. The ui makes a web request (let's use REST) to the server. The server queries the database and aggregates the data, which it then sends to the UI in a Rest response. After verifying that there is a spot available that fits its car, the user selects a spot type and hits book. The ui makes a rest request, e.g. /book?type=large, to the webserver. The server makes a sql transaction to the db which effectively makes a reservation (and entry in the db) and changes the status to the parking spot to booked. The webserver returns a 200 rest response with the reservation id if successful, or with a message that the booking could not be made (possibly due to a race condition for example).


When the user arrives, he checks the intended spot on the UI and follows the signals of the garage to find its assigned spot. The UI will show clear descriptions with photos to get the section where the spot is located. Once in place, the system detects that a car parked on the spot and marks it busy automatically, to make it frictionless for the user. Alternatively, the UI could prompt the user to confirm. There is the unahppy path where the spot is occupied by a unexpected car, we could handle this with an option in the UI for the user to signal a problem so garage personal arrives to the spot. For walk-ins: the user uses the garage UI to get a ticket, then proceeds to park on a section of the garage that is reservations-free. Alternatively, the system can choose a spot after prompting the user to select its car type.




Detailed component design


The database will have a periodic job that will run an sql that will mark as free/occupied the spots with no active bookings. Oh I forgot, let me add a start and end time for each Reservation. The transaction will have a SQL query to select a spot that is free with a type according to the vehicle type. Then, another SQL will create a reservation with such spot, and another SQL query will mark the spot as booked. Everything happens in a transaction and the transaction is committed if all good, otherwise the transaction gets rolled back and nothing changes in the db.


We can add cron jobs that will automatically cancel reservations that are not used within a predefined time after their start. This can be configured in the Reservations table with a field expires_at: timestamp which is based on the start time and some configuration of the garage table via a reservations_expire_after: number of seconds from the start (decimal). "walk-in user parks in a reserved spot". This is certainly a problem. I propose to select another available (not booked) spot for the current reservation. This should be available as the system only allow reservations and walk-ins if there are spots available.


I would consider caching which I would add at the webservers levels. The amount of data to cache is very small. This way we save on complexity of the system and gains speed and reduce calls to the database. We can have a cache expiration rate of a 10seconds, meaning that we would perform very little amount of reads at the database. For write transactions, it could become a bottleneck. However, the amount of load is low and one write master should handle the load of a few thousand transactions per second. This would be a bit slower during peak times, but ensure high consistency. We can also partition or shard the data in multiple databases, for example by parking spot ids. It is important to use a partitioning scheme that distributes the load among all the db servers. We could also add indexes to speed reads per garage for example.


Trade offs/Tech choices

The garage will have physical sensors for each spot which will continuously check if there is a car in the spot and will trigger events. Depending of the budget, we can have redundancy, i.e. two sensors per spot to increase reliability. The system will automatically check the status of sensors by sensors' sending heartbeats. Another option is to have cameras. These would be a tradeoff, on one hand they allow new features such as car plate reading or security, but are more expensive to have, run and privacy concerns. "overwhelm the garage" when buying a walk-in, the system verifies that there are enough walk-in spots availables, or assign a free spot in the bookable section.

If a user overstays, the system will charge the user the appropriate amount upon user's exiting based on the garage rate and a penalty fee. There is the risk that the user overstays and the spot got booked for after the reservation was supposed to end. In that case, we handle it the same as when a user parks in the wrong spot. We can potentially tow a car if it's there after a preset limit. We could extend the system to send notifications to the users when their reservations are expiring and periodically after that.



Failure scenarios/bottlenecks

Physical Detection and Reliability: Some sensor failures are detected by the redundancy of two sensors in the same spot, while others will be detected by regular physical maintenance. circling endlessly: the walkin will choose another walkin spot which will get marked as occupied automatically by the system. 2. Database Transactions: yes, using select for update. For example: begin; select spot_id from ParkingSpot where status="free", garage=garage_id, type=vehicle_size; update ParkingSpot set status='booked'; end;. overstays: The reservations table will have two fields: check_in: timstamp, check_out: timestamp, which reflect the actual time the user checked in and out. These are used to compute the total amount with rates and fees etc. For payment failure, the user will contact garage assistance via an button or a person at the checkout place. In which case the user can pay to the person or have a pending charge. We will discuss with the thrid party payment service such as squareup to ensure payment, even when failed, for a fee. 3. Scalability and Cache: a sudden spike would not lead to double booking because the cache is only used to show available spots. However, upon actual booking, the transaction works based on the consistent view of the db, and thus the real-time data, not the cached. The transaction will fail, and a few seconds later, the cache will get refreshed, and users will see the refreshed data. "sharding by spot IDs": the spots would be distributed evenly among all db shards as the sharding is not based on garage, but on spot ids. The downside is that the reads will also have to be done on multiple db shard servers.


I would have a variable window for cache expiration, which would reduce down to 1 second if there number of free spots for a type fell below a threshold. sharding: for reading, we already have eventual consistency both due to read replicas and cache at the webservers layer. For sharding key for read replicas, we will use the spot_ids. The reason is to distribute the reads among the db servers. Using a garage_id has the disadvantage that the same server would get all the load for a peak event for a given garage, while the other servers would be idle. 4. Edge Cases and Fault Tolerance: to avoid such frustration, we add a feature so the user can extend the expiration of the booking before it expires, or after but before the spot is assigned to another reservation. The system only detects the mismatch by the user signalling that their spot is busy. There will be spots available as the system only allows people in if there are spots availables. If by some reason there is not, the personal of the garage will assist, possible have some emergency spots that are reserved for these situations. 5. Monitoring and Reliability: We add a logging datastore such as cloudwatch where the webservers store the logs. In particular the webservers will log 1. RTTs metrics of each API request/response, 2. http response codes (2xx success, 4xx, 5xx etc.). Then we will have dashboards showing the aggregated metrics (e.g. p99.99 for each api), and number of failures for reliability (availability) and scalability. We use timeseries system, dashboard, and alarms based on these metrics. We also add other metrics from the webservers (disk space, cpu load), the databases (same), and database RTT times.




Future improvements


Edge Cases: the garage will have a few spots reserved for unforeseen circumstances like the user trying to extend an expired reservation when there are no more spots available. The user will be able to extend their reservation and the system will exceptionally use one of this reserved spots. "emergency spots": yes, we will monitor the emergency spots use via the sensors and queries to the db. This will allow us to have insigths on their use to detect either system design flaws or even abuse. 5. Monitoring and Post-Mortem: the post-mortem analysis involves a comprehensive look at metrics for the webservers and dbs. We identify the source of the delay: if the latency at the db increased, the problem seems to be there, maybe caused by some dbs internal processes, otherwise, if the RTTs for requests from the perspective of the webserver increased, the problem can be in these requests (maybe a network issue or large returned data), otherwise it can be a the webserver level, maybe some bug, large data, or host issues (e.g. java garbage collection, full hdd, etc). "double-booked users": first we need to detect the problem as soon as possible. This can be done with some anti-entropy queries that check for double-booking or similar undesired data state. This will alert us as soon as it happens. Second, we have cron jobs that fix this issue automatically, by reassigning the spots. Then we address the root cause by researching what happened and where (db queries logic, servers' logic, etc.) and fix accordingly.