Requirements
Functional Requirements:
- Users search for movies/events by location, date, genre
- Browse theaters, showtimes, and seat maps
- Select seats (with real-time availability)
- Hold seats temporarily during checkout
- Process payments and issue e-tickets
- Cancellation and refund flow
- Admin panel for theater partners to manage listings, inventory, pricing
Non-Functional Requirements:
- High concurrency: thousands of users competing for the same seats (e.g., a blockbuster release)
- Low latency on seat availability checks
- Strong consistency on booking — no double-selling a seat
- High availability (targeting 99.99%)
- Eventual consistency is acceptable for search/catalog but not for seat reservation
Scale Estimates:
- ~50M monthly active users, ~5M daily
- Peak: 50k+ concurrent seat selection requests for a hot show
- ~500k bookings/day, ~200M tickets/year
- Average payload per booking: ~2KB → modest storage, but high write throughput during spikes
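As a rough back-of-envelope check on these figures, the sketch below works out storage and average write rate. It assumes decimal units (2KB = 2,000 bytes) and a uniform spread of bookings over the day, which real traffic will not follow; the function names are illustrative only.

```python
# Back-of-envelope sizing from the estimates above.
BOOKINGS_PER_DAY = 500_000
PAYLOAD_BYTES = 2_000            # ~2KB per booking, decimal units assumed
SECONDS_PER_DAY = 24 * 60 * 60


def daily_storage_gb() -> float:
    """Raw booking payload written per day, in GB."""
    return BOOKINGS_PER_DAY * PAYLOAD_BYTES / 1e9


def yearly_storage_gb() -> float:
    """Raw booking payload written per year, in GB."""
    return daily_storage_gb() * 365


def average_bookings_per_second() -> float:
    """Average write rate if bookings were spread evenly over the day."""
    return BOOKINGS_PER_DAY / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{daily_storage_gb():.1f} GB/day")
    print(f"{yearly_storage_gb():.0f} GB/year")
    print(f"{average_bookings_per_second():.1f} bookings/s on average")
```

The average comes out to only a handful of bookings per second, while the stated peak is 50k+ concurrent seat selections for a single show. The gap between the two is why the design is sized for spikes rather than for the daily average.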
API Design
Search & Browse
GET /movies?city=&date=&genre= → movie listings
GET /movies/{id}/showtimes?date=&venue_id= → available showtimes
GET /showtimes/{id}/seats → seat map with real-time availability
Booking Flow
POST /showtimes/{id}/hold body: { seat_ids: [...] } → temporarily locks seats, returns a hold_token with a TTL (say 7 min)
POST /bookings body: { hold_token, payment_info } → confirms booking, charges payment
DELETE /bookings/{id} → cancellation
Admin
POST /venues/{id}/showtimes → create showtime
PUT /showtimes/{id}/pricing → update pricing tiers
High-Level Design
User — id, name, email, phone, payment_methods
Movie/Event — id, title, genre, language, duration, rating, poster_url
Venue (Theater) — id, name, location (lat/lng), city, screens
Screen — id, venue_id, seat_map (rows × cols with categories like Gold/Silver)
Showtime — id, movie_id, screen_id, start_time, pricing_tiers
Seat Inventory — showtime_id + seat_id → status (AVAILABLE / HELD / BOOKED)
Booking — id, user_id, showtime_id, seats[], total_price, status, payment_id, created_at
Payment — id, booking_id, method, amount, status, gateway_txn_id
1. Keeping Browsing Fast During Peak Times
The browsing path (search → movie page → showtimes → seat map) is read-heavy and tolerant of slight staleness, which gives us a lot of room to optimize.
Multi-layer caching strategy:
At the edge, CDN caches all static assets like movie posters, theater logos, and the seat map SVG templates. These rarely change and can have long TTLs. Closer to the application, we put a reverse proxy cache (like Varnish or NGINX cache) in front of the Search Service for queries like "movies in Mumbai today." These results change maybe once a day when new showtimes are published, so a 5–10 minute TTL is perfectly fine.
For seat availability specifically, Redis holds a bitmap or hash per showtime where each seat maps to a status. When a user opens the seat map, the read hits Redis, not Postgres. This serves the vast majority of requests at sub-millisecond latency.
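To make the "bitmap per showtime" idea concrete, here is a minimal in-process sketch of a packed seat map. It assumes 2 bits per seat (three statuses do not fit in one bit) and a seat index derived from the screen layout; both are assumptions, not something fixed by the design. In Redis the resulting bytes would live in one string value per showtime; no Redis client is used here.

```python
from enum import IntEnum


class SeatStatus(IntEnum):
    AVAILABLE = 0
    HELD = 1
    BOOKED = 2


BITS_PER_SEAT = 2                       # assumption: 3 statuses need 2 bits
SEATS_PER_BYTE = 8 // BITS_PER_SEAT
SEAT_MASK = 0b11


def new_seat_map(seat_count: int) -> bytearray:
    """Allocate a packed map with every seat AVAILABLE (all zero bits)."""
    return bytearray((seat_count + SEATS_PER_BYTE - 1) // SEATS_PER_BYTE)


def set_status(seat_map: bytearray, seat_index: int, status: SeatStatus) -> None:
    """Overwrite the 2-bit slot belonging to one seat."""
    byte_index, slot = divmod(seat_index, SEATS_PER_BYTE)
    shift = slot * BITS_PER_SEAT
    cleared = seat_map[byte_index] & ~(SEAT_MASK << shift) & 0xFF
    seat_map[byte_index] = cleared | (int(status) << shift)


def get_status(seat_map: bytearray, seat_index: int) -> SeatStatus:
    """Read the 2-bit slot belonging to one seat."""
    byte_index, slot = divmod(seat_index, SEATS_PER_BYTE)
    return SeatStatus((seat_map[byte_index] >> (slot * BITS_PER_SEAT)) & SEAT_MASK)
```

With this encoding a 300-seat screen takes 75 bytes, so the whole availability overlay for a showtime can be returned in a single small read.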
Isolating read and write paths:
Postgres replicas handle all read queries for the browsing flow — movie listings, showtime details, venue info. The primary only handles the critical booking writes. This means even if the booking path is under heavy write load during a popular release, the browsing experience for users looking at other movies is completely unaffected.
Elasticsearch for search:
Movie and showtime search goes through Elasticsearch, which is horizontally scalable. For a spike like a major release, we can pre-warm the ES cache by indexing the new showtimes ahead of time. Geo-queries for "theaters near me" use ES geo_distance filters, which are fast and don't touch Postgres at all.
Graceful degradation:
If Redis is overloaded, we can serve a slightly older snapshot of seat availability with a banner saying "availability may be approximate — exact status confirmed at checkout." The user still gets a fast experience, and correctness is enforced at the hold step.
2. User Flow from Search to Payment
Let me trace the complete journey:
Search & Discovery → User opens the app, we detect their city (GPS or saved preference). They see trending movies, filtered by language/genre. This is served from Elasticsearch with results cached at the API gateway layer. The user taps a movie and sees the detail page — poster, synopsis, reviews, all served from CDN + cache.
Showtime Selection → User picks a date. We fetch showtimes for that movie in their city, grouped by theater. This hits a Postgres read replica or a cached response. The user picks a showtime, say "PVR Juhu, 7:30 PM."
Seat Map & Selection → The seat map loads. The layout template (which seats exist, their categories) is static per screen and cached aggressively. The availability overlay comes from Redis — a lightweight call that returns a status array. The user taps seats; the client optimistically marks them as "selected" locally.
Hold Request → User hits "Proceed to Pay." The client sends POST /showtimes/{id}/hold with the selected seat IDs. The Booking Service attempts the atomic Postgres UPDATE. If successful, the user enters the payment flow with a 7-minute countdown displayed prominently. If the seats were already taken, the seat map refreshes with updated availability and the user picks again. This is the moment of truth — it's synchronous, consistent, and fast (single-digit millisecond DB operation).
Payment → The Payment Service first places an authorization hold on the user's card via the payment gateway. No money moves yet. On auth success, we confirm the booking — seats transition from HELD to BOOKED, and we capture the payment. The user sees a confirmation screen with a booking ID.
Post-booking (async) → A booking confirmation event hits Kafka. The Notification Service sends email and SMS. The Ticket Generation Service creates a QR code PDF and stores it in S3, linking it to the booking. The user can access their e-ticket from the app immediately.
The key UX insight is that the user only experiences one "moment of friction" — the hold request. Everything before it is fast cached reads, and everything after it is async.
3. Ensuring Stale Cache Doesn't Sell a Seat
This is the most important correctness question. The short answer: the cache is never the source of truth for selling — it's only a visual hint.
The cache's role is explicitly limited:
When a user views the seat map, Redis shows approximate availability. It might show a seat as available that was actually booked 2 seconds ago. That's fine, because no seat is actually sold based on what Redis says. The moment the user tries to book, the hold request goes to Postgres, which is the single source of truth.
The Postgres atomic check is the gate:
sql
UPDATE seat_inventory
SET status = 'HELD', hold_token = :token, hold_expires_at = NOW() + interval '7 min'
WHERE showtime_id = :sid
AND seat_id = ANY(:seats)
AND status = 'AVAILABLE';
-- If affected_rows < requested_count → rollback, return failure
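This runs inside a transaction, and the UPDATE takes a row-level lock on each matching seat. Even if 500 users all see seat A1 as "available" in the cache and click it simultaneously, exactly one of them gets the hold: under Postgres's default READ COMMITTED isolation, each competing UPDATE waits for the row lock, re-checks status = 'AVAILABLE' against the committed row, and matches nothing. The other 499 get an immediate failure response and a refreshed seat map.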
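The all-or-nothing behavior of the hold can be sketched end to end as follows. To stay self-contained, this uses Python's built-in sqlite3 as a stand-in for Postgres, so it shows the "count affected rows, else roll back" logic but not real multi-writer contention. The helper names (create_inventory, try_hold, seat_status) and the IN (...) form of the seat filter are assumptions made for this sketch.

```python
import sqlite3
from typing import Sequence


def create_inventory(conn: sqlite3.Connection, showtime_id: str,
                     seat_ids: Sequence[str]) -> None:
    """Create the demo table and seed every seat as AVAILABLE."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS seat_inventory (
               showtime_id TEXT NOT NULL,
               seat_id TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'AVAILABLE',
               hold_token TEXT,
               hold_expires_at REAL,
               PRIMARY KEY (showtime_id, seat_id))"""
    )
    conn.executemany(
        "INSERT INTO seat_inventory (showtime_id, seat_id) VALUES (?, ?)",
        [(showtime_id, seat_id) for seat_id in seat_ids],
    )
    conn.commit()


def try_hold(conn: sqlite3.Connection, showtime_id: str, seat_ids: Sequence[str],
             token: str, now: float, ttl_seconds: float = 420.0) -> bool:
    """Hold all requested seats or none of them."""
    wanted = list(dict.fromkeys(seat_ids))      # drop duplicate seat ids
    if not wanted:
        return False
    placeholders = ", ".join("?" for _ in wanted)
    cur = conn.execute(
        "UPDATE seat_inventory "
        "SET status = 'HELD', hold_token = ?, hold_expires_at = ? "
        f"WHERE showtime_id = ? AND seat_id IN ({placeholders}) "
        "AND status = 'AVAILABLE'",
        [token, now + ttl_seconds, showtime_id, *wanted],
    )
    if cur.rowcount == len(wanted):
        conn.commit()
        return True
    conn.rollback()                             # partial match: release everything
    return False


def seat_status(conn: sqlite3.Connection, showtime_id: str, seat_id: str) -> str:
    row = conn.execute(
        "SELECT status FROM seat_inventory WHERE showtime_id = ? AND seat_id = ?",
        (showtime_id, seat_id),
    ).fetchone()
    return row[0]
```

For example, if one user holds A1 and A2 and a second user then asks for A2 and A3, the second request matches only A3, the counts disagree, and the rollback leaves A3 AVAILABLE for someone else.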
Cache invalidation on writes:
After every successful hold or booking, we synchronously update Redis before returning the response to the user. So the window of staleness is very small — typically under 100ms. For users loading the seat map after that point, they see the correct state.
Defense in depth:
Even if Redis update fails (network blip), the worst case is that another user sees a false-available seat and tries to book it, only to get a clean rejection from Postgres. No double-sell ever happens. The invariant is maintained at the database layer, not the cache layer.
Think of it like an airline seat map — the website might show a seat as available, but the actual reservation system catches conflicts at booking time. The cache is an optimization for UX, not a mechanism for correctness.
4. Core Booking Functions During Outages
The philosophy here is to identify what must work (seat holds and payment capture) versus what can degrade gracefully (search, notifications, analytics).
Database resilience:
Postgres runs in a primary-replica configuration with synchronous replication to at least one standby. If the primary fails, we do an automated failover (using Patroni or AWS RDS Multi-AZ). The booking path is unavailable for maybe 15–30 seconds during failover — we show users a "please retry" message rather than silently failing. This is acceptable because a wrong answer (double-sell) is far worse than a brief pause.
Payment gateway failures:
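This is the most common real-world outage scenario. We handle it with a multi-gateway strategy: Stripe as primary, Razorpay or Adyen as fallback. If the primary gateway fails outright, the Payment Service automatically routes to the backup. Idempotency keys ensure that if a charge was actually processed but we didn't get the response, a retry against the same gateway doesn't double-charge. Because a key is only meaningful to the gateway that received it, an ambiguous timeout should be resolved first (by retrying or querying the primary, or voiding the attempt) before the payment is sent to a different gateway.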
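The retry-safety that idempotency keys provide can be illustrated with a small in-memory model. FakeGateway below is a hypothetical stand-in, not any real gateway's client; real gateways expose the same idea through their own APIs (Stripe, for example, accepts an Idempotency-Key request header). The key format is also an assumption.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeGateway:
    """Hypothetical in-memory gateway that honors idempotency keys."""
    charges: Dict[str, dict] = field(default_factory=dict)
    drop_next_response: bool = False

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "txn_id": str(uuid.uuid4()),
                "amount_cents": amount_cents,
            }
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost after the charge was recorded")
        return self.charges[idempotency_key]


def charge_with_retry(gateway: FakeGateway, booking_id: str,
                      amount_cents: int, max_attempts: int = 3) -> dict:
    """Retry on timeout, reusing one key so the charge happens at most once."""
    key = f"booking-{booking_id}"      # assumption: one key per booking
    for _ in range(max_attempts):
        try:
            return gateway.charge(key, amount_cents)
        except TimeoutError:
            continue
    raise TimeoutError(f"gateway did not respond after {max_attempts} attempts")
```

The important detail is that the key is derived from the booking, not generated per attempt. A fresh key on each retry would defeat the protection.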
If all payment gateways are down, the seats remain in HELD status. The user sees "payment processing delayed." We enqueue the payment attempt in a persistent queue (Kafka or SQS) and process it when the gateway recovers — as long as it's within the hold window.
Redis outage:
If Redis goes down, the seat map endpoint falls back to reading directly from Postgres replicas. Latency goes from ~2ms to ~20ms. Noticeable but not broken. The hold and booking path doesn't depend on Redis at all — it always goes to Postgres primary.
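A minimal sketch of that fallback on the read path is shown below. The two reader callables are placeholders for whatever Redis and replica clients the service actually uses, and treating ConnectionError as the "cache is down" signal is an assumption.

```python
from typing import Callable, Dict, Optional, Tuple

SeatMap = Dict[str, str]    # seat_id -> status


def read_seat_map(
    showtime_id: str,
    read_cache: Callable[[str], Optional[SeatMap]],
    read_replica: Callable[[str], SeatMap],
) -> Tuple[SeatMap, str]:
    """Prefer the cache; on a miss or a cache outage, read from a replica.

    Returns the seat map plus the name of the source that served it,
    which is useful for metrics on how often the fallback is taken.
    """
    try:
        cached = read_cache(showtime_id)
    except ConnectionError:
        cached = None           # cache unavailable: treat it like a miss
    if cached is not None:
        return cached, "cache"
    return read_replica(showtime_id), "replica"
```

Only the display path changes here. The hold request never goes through this function, so a cache outage cannot affect which seats are sold.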
Notification/ticket generation failures:
These are fully async and idempotent. The Booking Service writes each event to an outbox table in Postgres in the same transaction as the booking itself (transactional outbox pattern), and a poller relays outbox rows to Kafka. If Kafka is down, events simply accumulate in the outbox and are published when it recovers. The user's booking is confirmed regardless; they just might get their email confirmation a few minutes late.
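The outbox idea can be sketched as follows, again with sqlite3 standing in for Postgres and a plain callable standing in for the Kafka producer. This sketch writes every event to the outbox, which is the usual form of the pattern; the table layout and the function names are assumptions.

```python
import json
import sqlite3
from typing import Callable


def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE bookings (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )


def confirm_booking(conn: sqlite3.Connection, booking_id: str) -> None:
    """Write the booking and its event in one transaction."""
    event = json.dumps({"type": "booking_confirmed", "booking_id": booking_id})
    with conn:      # commits both inserts, or rolls both back on error
        conn.execute(
            "INSERT INTO bookings (id, status) VALUES (?, 'CONFIRMED')",
            (booking_id,),
        )
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))


def relay_once(conn: sqlite3.Connection, publish: Callable[[str], None]) -> int:
    """Publish pending events in order; stop at the first broker failure."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        try:
            publish(payload)
        except ConnectionError:
            break   # broker down: leave the rest for the next poll
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

If the process crashes between publishing and marking the row, the event is sent again on the next poll. Delivery is therefore at-least-once, which is why the consumers need to be idempotent.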
Circuit breakers and bulkheads:
Each downstream dependency (payment gateway, notification service, Elasticsearch) is wrapped in a circuit breaker. If a dependency starts failing, the breaker opens and we stop sending traffic to it immediately, preventing cascade failures. Bulkheads ensure that a spike in the booking path doesn't starve the search path of resources — they run on separate thread pools and separate service instances.
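For illustration, a very small circuit breaker might look like the sketch below. The thresholds, the class name, and the single-trial "half-open" behavior are assumptions for this example; a production service would more likely use an existing resilience library than hand-roll one.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency is failing; call skipped")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each gateway or search request in breaker.call(...) and treat CircuitOpenError as the signal to use the fallback path (backup gateway, cached results, or a retry message).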
The priority hierarchy during an outage is:
Protect booking integrity (never double-sell) above all else, then keep the payment flow working, then keep browsing functional even if degraded, and finally let notifications and analytics catch up later. Everything below the booking layer is designed to be eventually consistent and retry-safe.
Detailed Component Design
Step 1: User views seat map → GET /showtimes/{id}/seats → Booking Service reads from Redis (seat status cache). Fast, but slightly stale is okay here — it's just a visual hint.
Step 2: User selects seats and hits "Book" → POST /showtimes/{id}/hold { seat_ids: [A1, A2] } → Booking Service attempts an atomic operation:
sql
-- Postgres with row-level locking
UPDATE seat_inventory
SET status = 'HELD', hold_token = :token, hold_expires_at = NOW() + interval '7 min'
WHERE showtime_id = :sid AND seat_id IN ('A1','A2') AND status = 'AVAILABLE';
-- Check affected rows == requested count, else rollback
This is the serialization point. Only one user wins. The loser gets a "seats unavailable" response immediately. Redis is updated after the write.
Step 3: User completes payment within hold window → POST /bookings { hold_token, payment } → Payment Service charges the card → on success, Booking Service transitions seats from HELD → BOOKED. → Event published to Kafka → Notification Service sends confirmation email/SMS + Ticket Gen Service creates QR code PDF.
Step 4: Hold expires (user abandoned) → A background Hold Expiry Worker runs every 30s scanning for expired holds and flips them back to AVAILABLE. Alternatively, use Redis key TTL as a trigger.
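The sweep in Step 4 could be as small as one guarded UPDATE. The sketch below uses sqlite3 in place of Postgres, a simplified single-showtime table, and numeric timestamps; the function names and demo rows are assumptions.

```python
import sqlite3


def create_demo_inventory(conn: sqlite3.Connection) -> None:
    """Seed one expired hold, one live hold, and one booked seat."""
    conn.execute(
        "CREATE TABLE seat_inventory ("
        "seat_id TEXT PRIMARY KEY, status TEXT NOT NULL, "
        "hold_token TEXT, hold_expires_at REAL)"
    )
    conn.executemany(
        "INSERT INTO seat_inventory VALUES (?, ?, ?, ?)",
        [
            ("A1", "HELD", "tok-1", 100.0),
            ("A2", "HELD", "tok-2", 500.0),
            ("A3", "BOOKED", None, None),
        ],
    )
    conn.commit()


def release_expired_holds(conn: sqlite3.Connection, now: float) -> int:
    """Flip expired HELD seats back to AVAILABLE; return how many were released."""
    with conn:
        cur = conn.execute(
            "UPDATE seat_inventory "
            "SET status = 'AVAILABLE', hold_token = NULL, hold_expires_at = NULL "
            "WHERE status = 'HELD' AND hold_expires_at <= ?",
            (now,),
        )
        return cur.rowcount


def status_of(conn: sqlite3.Connection, seat_id: str) -> str:
    return conn.execute(
        "SELECT status FROM seat_inventory WHERE seat_id = ?", (seat_id,)
    ).fetchone()[0]
```

The status = 'HELD' guard matters: a seat that was booked just before the sweep runs no longer matches, so the worker can never undo a confirmed booking.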
Handling hot showtimes (the "Avengers problem"):
- A single showtime's seats fit in a few KB, so row-level locking on a single Postgres primary is viable up to a point.
- For extreme spikes, shard seat inventory by showtime_id — each showtime's seats live on one shard, so the lock contention is isolated.
- Add a virtual waiting room (queue-based admission) for mega-popular releases: users get a random position, admitted in batches. This converts a stampede into a controlled flow.
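The waiting-room idea from the last bullet can be sketched in a few lines: shuffle the arrivals into a random order, then admit fixed-size batches. The function names and the batch size are illustrative assumptions; a real waiting room would also need tokens, timeouts, and persistence.

```python
import random
from typing import Iterator, List, Optional


def assign_positions(user_ids: List[str],
                     rng: Optional[random.Random] = None) -> List[str]:
    """Return the users in a random queue order (index 0 is admitted first)."""
    rng = rng or random.Random()
    queue = list(user_ids)      # copy so the caller's list is untouched
    rng.shuffle(queue)
    return queue


def admit_in_batches(queue: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive batches of users to let through to seat selection."""
    for start in range(0, len(queue), batch_size):
        yield queue[start:start + batch_size]
```

Random positions remove the incentive to hammer the site at the exact opening second, and the batch size becomes the knob that caps how many users hit the hold endpoint at once.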
Redis usage:
- Seat availability cache per showtime (refreshed on every write). Reads hit Redis; writes go to Postgres first and then update the Redis entry.
- Hold token TTLs as a backup expiration mechanism.
Search:
- Elasticsearch indexes movies, venues, showtimes. Updated asynchronously via CDC (Change Data Capture) from Postgres.
- Geo queries for "theaters near me" using ES geo_point.
Payments:
- Idempotency keys on every payment request to handle retries safely.
- Two-phase: authorize during hold, capture on confirmation. If hold expires, release the auth.
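The authorize, capture, and release lifecycle can be modeled as a tiny state machine. This is a sketch under assumed names (PaymentAuthorization, settle); it does not call any real gateway, and the rule "capture only while the hold is live" is this example's reading of the two-phase bullet above.

```python
from enum import Enum


class AuthState(Enum):
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    VOIDED = "voided"


class PaymentAuthorization:
    """Hypothetical model of a card authorization placed for a seat hold."""

    def __init__(self, amount_cents: int) -> None:
        self.amount_cents = amount_cents
        self.state = AuthState.AUTHORIZED

    def capture(self) -> None:
        if self.state is not AuthState.AUTHORIZED:
            raise ValueError(f"cannot capture from state {self.state.value}")
        self.state = AuthState.CAPTURED

    def void(self) -> None:
        if self.state is not AuthState.AUTHORIZED:
            raise ValueError(f"cannot void from state {self.state.value}")
        self.state = AuthState.VOIDED


def settle(auth: PaymentAuthorization, hold_expires_at: float, now: float) -> AuthState:
    """Capture if the seat hold is still live; otherwise release the authorization."""
    if now < hold_expires_at:
        auth.capture()
    else:
        auth.void()
    return auth.state
```

Because both transitions are only legal from AUTHORIZED, a late capture after a void (or a second capture) fails loudly instead of moving money twice.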
Availability & fault tolerance:
- Postgres primary-replica with synchronous replication for the booking path.
- If Redis goes down, fall back to Postgres reads (slower but correct).
- Payment failures → seats stay HELD until TTL expires, user can retry.
- Circuit breaker on payment gateway calls.
Observability:
- Track key metrics: seat hold success rate, payment conversion rate, hold expiry rate, p99 latency on the booking path.
- Alert on double-book anomalies (should be zero, ever).