My Solution for Design a Log Collection and Analysis System with Score: 4/10

by pulse_drift615

System requirements


Functional:

  • Applications should be able to send logs to the system.
  • Users should be able to query individual logs as well as run aggregations over them.
  • Users should be able to specify regexes for extracting metrics and attributes from logs.
  • Users should be able to set alerts on metrics extracted from logs.
  • We are NOT required to collect logs from mobile applications or other clients; this system is for server-side applications.


Non-Functional:


  • Applications should be able to send logs to the collection system with high availability; logs should not be lost.
  • Strict ordering of logs across endpoints is not needed.
  • Latency of at most 60 seconds between log ingestion and the log being available for query.
  • Scale of 50,000 logs per second.




Capacity estimation

Estimate the scale of the system you are going to design...
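A rough back-of-envelope, assuming an average log line of about 1 KB (an assumption, not given in the requirements): 50,000 logs/s × 1 KB ≈ 50 MB/s of ingest, or roughly 4.3 TB/day of raw logs. With seven days of raw retention in Kafka that is on the order of 30 TB before replication; the extracted metrics and attributes stored in Cassandra should be a fraction of that.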






API design


POST /log # to send logs into the system

{
  "timestamp": ...,
  "severity": ...,
  "hostname": ...,
  "service": ...,
  "message": ...
}
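A minimal example of an application pushing one log event to this endpoint, assuming a JSON body and a hypothetical collector URL (all values are illustrative):

```python
import requests

log_event = {
    "timestamp": "2024-05-01T12:00:00Z",
    "severity": "ERROR",
    "hostname": "web-42",
    "service": "checkout",
    "message": "request to /checkout completed in 87 ms",
}

# The collector URL is an assumption; in practice this call would sit behind
# batching and retries in a logging agent rather than one HTTP request per log.
resp = requests.post("https://logs.example.com/log", json=log_event, timeout=5)
resp.raise_for_status()
```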


# To manage configured rules for log parsing.

POST /rule -> rule-id

{
  "service": ...,
  "regex_rule": ...   # regex whose capture groups define the metrics and attributes to extract
}
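As an illustration of such a rule, a hypothetical regex using named capture groups, where each group name becomes an extracted attribute or metric:

```python
import re

# Hypothetical rule for a request-latency log line; "endpoint" becomes an
# attribute and "latency_ms" a metric.
rule = re.compile(r"request to (?P<endpoint>\S+) completed in (?P<latency_ms>\d+) ms")

match = rule.search("request to /checkout completed in 87 ms")
if match:
    print(match.groupdict())  # {'endpoint': '/checkout', 'latency_ms': '87'}
```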

DELETE /rule?rule-id=:rule-id

GET /rules[?service=:service] -> Rules[]


GET /logs?startTime=:start&endTime=:end&service=:service[&cursor=:cursor][&hostname=:hostname]


POST /query

{
  "where_clause": [...],   # list of lists of query expressions (an OR of ANDs)
  "select_clause": [...]   # list of attributes, including aggregations
}
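To make the shape of the body concrete, a hypothetical query; the field and operator names below are assumptions about the expression format:

```python
# (service = "checkout" AND severity = "ERROR")
# OR (service = "payments" AND latency_ms > 500),
# selecting the hostname and a count aggregation.
query = {
    "where_clause": [
        [{"attr": "service", "op": "=", "value": "checkout"},
         {"attr": "severity", "op": "=", "value": "ERROR"}],
        [{"attr": "service", "op": "=", "value": "payments"},
         {"attr": "latency_ms", "op": ">", "value": 500}],
    ],
    "select_clause": ["hostname", "count(*)"],
}
```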



Database design

Logs (emitted by applications)

  • Timestamp
  • Severity
  • Hostname
  • Service
  • Message


Log-Schemas (configured)

  • rule-id (primary key)
  • service
  • regex rule (including attributes/metrics to extract)
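
As a sketch of how the extracted logs could be laid out in Cassandra (the partitioning scheme, keyspace, and column names below are assumptions, not part of the original design):

```python
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("logging")  # hypothetical host/keyspace

# Partitioning by (service, day) with timestamp clustering is an assumption; it
# bounds partition size and matches the time-range queries in GET /logs. In
# practice a tie-breaker column (e.g. a timeuuid) would be added so two logs
# with the same timestamp do not overwrite each other.
session.execute("""
    CREATE TABLE IF NOT EXISTS logs (
        service    text,
        day        date,
        ts         timestamp,
        hostname   text,
        severity   text,
        message    text,
        attributes map<text, text>,   -- regex-extracted attributes/metrics
        PRIMARY KEY ((service, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")
```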



High-level design

Endpoints send logs into Kafka, where they are persisted in their raw form for up to seven days. These logs are picked up from Kafka by log processors, which apply the parsing rules from the rules database to extract metrics and attributes from each log message. Those metrics and attributes are stored in Cassandra.
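A minimal sketch of one log processor, assuming kafka-python and the DataStax cassandra-driver; the topic name, consumer group, and hard-coded rule are hypothetical, and in the real system the rules would be loaded from the Postgres rules DB:

```python
import json
import re
from datetime import datetime
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

# Hypothetical parsing rules keyed by service; normally loaded from Postgres.
RULES = {
    "checkout": [re.compile(r"request to (?P<endpoint>\S+) completed in (?P<latency_ms>\d+) ms")],
}

session = Cluster(["cassandra-host"]).connect("logging")
insert = session.prepare(
    "INSERT INTO logs (service, day, ts, hostname, severity, message, attributes) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)"
)

consumer = KafkaConsumer(
    "raw-logs",                        # hypothetical topic name
    bootstrap_servers=["kafka:9092"],
    group_id="log-processors",         # consumer group spreads partitions across workers
    value_deserializer=lambda v: json.loads(v),
)

for record in consumer:
    log = record.value
    attributes = {}
    for rule in RULES.get(log["service"], []):
        m = rule.search(log["message"])
        if m:
            attributes.update(m.groupdict())
    ts = datetime.fromisoformat(log["timestamp"].replace("Z", "+00:00"))
    session.execute(insert, (
        log["service"], ts.date(), ts,
        log["hostname"], log["severity"], log["message"], attributes,
    ))
```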

On the query side, clients configure parsing rules via the front end, and these rules are stored in a Postgres database. Clients can also issue queries, which are served by the query engine by reading from Cassandra and performing the appropriate aggregations according to the select and where clauses.

We are using Postgres to store the rules. This does not require very high scale: we don't expect more than a few hundred rules, and rules change rarely.

We are using Kafka to persist the raw logs and to decouple ingestion from processing in case of spikes. The log processors are organized into consumer groups that pick up these raw logs and store them in Cassandra. We create secondary indexes in Cassandra, as defined by the rules DB, to make querying easier. Queries are issued by the front end using CQL, including processing of the where and select clauses.
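
To illustrate how a GET /logs request could map onto CQL, a sketch assuming the (service, day) partitioning from the table above; this is one possible translation, not the only one:

```python
from datetime import timedelta

def fetch_logs(session, service, start, end, hostname=None):
    """Serve GET /logs by querying each (service, day) partition in the time range.

    Assumes the table sketched earlier. Hostname filtering relies on a secondary
    index (and may still need ALLOW FILTERING or client-side filtering,
    depending on the index setup).
    """
    rows = []
    day = start.date()
    while day <= end.date():
        cql = ("SELECT ts, hostname, severity, message, attributes "
               "FROM logs WHERE service = %s AND day = %s AND ts >= %s AND ts <= %s")
        params = [service, day, start, end]
        if hostname is not None:
            cql += " AND hostname = %s"
            params.append(hostname)
        rows.extend(session.execute(cql, params))
        day += timedelta(days=1)
    return rows
```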





Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...






Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...






Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...






Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.






Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?