Loading a Model Only Once in FastAPI
Introduction
If a FastAPI endpoint loads a machine-learning model on every request, each call pays the full load cost in latency, and concurrent requests can each hold a separate copy in memory. The standard pattern is to load the model once during application startup, store it in application state, and have request handlers reuse that shared object. One caveat: “once” means once per worker process, not once across an entire multi-process deployment.
Why Per-Request Loading Is the Wrong Design
A model load often involves disk I/O, deserialization, and memory allocation. Doing that inside the request handler means every call pays the startup cost again.
This is the pattern you want to avoid:
It works for a quick demo, but it does not scale. Even moderate traffic turns the model loader into the bottleneck.
Load the Model During Startup
FastAPI gives you a startup phase. Use it to initialize long-lived resources such as models, tokenizers, database pools, or vector indexes.
A clean modern approach is the lifespan hook:
Now the model is loaded once when the application starts and reused for every request handled by that process.
Use Application State Instead of a Bare Global
You will see examples that place the model in a module-level global variable. That can work, but app.state is usually cleaner because the ownership is explicit and easier to test.
A dependency wrapper can make handlers simpler:
This pattern is especially useful when several endpoints need the same model or when you want to override dependencies in tests.
Be Precise About Workers
If you run Uvicorn or Gunicorn with multiple workers, each worker is a separate process. Each process loads its own copy of the model during startup.
That means:
- --workers 1 loads one copy
- --workers 4 loads four copies
This is usually the correct behavior, but it matters for memory planning. A large model that fits once may not fit four times.
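The memory arithmetic can be sketched as a quick back-of-the-envelope check. The function names, the 3 GB model size, the 8 GB host, and the 1 GB headroom are all hypothetical, and the estimate assumes each worker holds a full private copy of the model:

```python
def estimated_model_memory_gb(model_size_gb: float, workers: int) -> float:
    """Rough estimate: one full copy of the model per worker process."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return model_size_gb * workers


def fits_in_memory(
    model_size_gb: float,
    workers: int,
    available_gb: float,
    headroom_gb: float = 1.0,  # assumed allowance for everything besides the model
) -> bool:
    """True if all model copies plus the headroom fit in available memory."""
    needed = estimated_model_memory_gb(model_size_gb, workers) + headroom_gb
    return needed <= available_gb


# Hypothetical numbers: a 3 GB model on a host with 8 GB of RAM.
one_worker_ok = fits_in_memory(3.0, workers=1, available_gb=8.0)    # 3 + 1 <= 8
four_workers_ok = fits_in_memory(3.0, workers=4, available_gb=8.0)  # 12 + 1 > 8
```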
If you need sharing across requests inside one process, startup state solves the problem. If you need sharing across processes, that is a deployment and systems-design question rather than something FastAPI itself answers.
Keep Prediction Code Thread-Safe
A shared model instance is only safe if inference is safe under concurrent requests. Many common libraries, such as scikit-learn, are generally fine for read-only prediction, but thread safety depends on the library and how you use it.
If the model mutates internal state during inference, or if you attach mutable caches without synchronization, shared access can become a bug source.
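If a model does mutate state during inference, one simple option is to serialize access with a lock. The sketch below uses a hypothetical `StatefulModel` whose call counter is updated with a racy read-modify-write, and a hypothetical `LockedModel` wrapper that guards it with `threading.Lock`:

```python
import threading


class StatefulModel:
    """Hypothetical model that mutates internal state during inference."""

    def __init__(self) -> None:
        self.calls = 0

    def predict(self, features: list[float]) -> float:
        current = self.calls
        self.calls = current + 1  # read-modify-write: unsafe without a lock
        return sum(features)


class LockedModel:
    """Wrapper that lets only one thread run inference at a time."""

    def __init__(self, inner: StatefulModel) -> None:
        self.inner = inner
        self._lock = threading.Lock()

    def predict(self, features: list[float]) -> float:
        with self._lock:
            return self.inner.predict(features)
```

The lock trades throughput for safety, because requests queue behind each other during inference. It is only worth adding when the library is not already safe for concurrent prediction.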
A lightweight prediction service often ends up looking like this:
The endpoint stays thin because startup already handled the expensive initialization.
Test Startup Behavior Explicitly
Do not assume the model is loaded just because the code looks correct. Run the app locally and verify that startup succeeds before the first request arrives. If model loading fails, FastAPI should fail fast rather than serving broken requests.
This is another reason to keep model loading out of the handler itself. Startup failure is much easier to diagnose than sporadic request-time failure.
Common Pitfalls
A common mistake is loading the model inside the endpoint and then wondering why latency spikes under load. The fix is architectural, not a micro-optimization.
Another issue is using a global variable without understanding worker processes. In development you may see one model load, then deploy with multiple workers and suddenly memory usage multiplies.
Developers also sometimes perform long blocking startup work and assume the application is ready before the lifespan hook finishes. FastAPI is not ready until startup completes successfully.
Finally, a shared model object is not automatically safe just because it is shared. Verify the prediction library’s concurrency behavior before assuming one instance can serve all requests safely.
Summary
- Load the model during FastAPI startup, not inside each request handler.
- Store the model in app.state and retrieve it from the request or a dependency.
- Expect one model instance per worker process, not one per entire deployment.
- Keep prediction handlers thin and use startup to perform the expensive initialization.
- Check concurrency and memory behavior before assuming the shared model design is production-ready.

