Loading a Model Only Once in FastAPI

Introduction

If a FastAPI endpoint loads a machine-learning model on every request, each call pays the full cost of reading and deserializing the model, so latency climbs and memory is allocated and freed repeatedly. The standard pattern is to load the model once during application startup, store it in application state, and have request handlers reuse that shared object. The main caveat is that “once” means once per worker process, not once across an entire multi-process deployment.

Why Per-Request Loading Is the Wrong Design

A model load often involves disk I/O, deserialization, and memory allocation. Doing that inside the request handler means every call pays the startup cost again.

This is the pattern you want to avoid:

```python
from fastapi import FastAPI
import joblib

app = FastAPI()

@app.post("/predict")
def predict(features: list[float]):
    # Anti-pattern: the model is read from disk and deserialized on every request.
    model = joblib.load("model.joblib")
    return {"prediction": model.predict([features]).tolist()}
```

It works for a quick demo, but it does not scale. Even moderate traffic turns the model loader into the bottleneck.
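
To make the cost difference concrete, here is a minimal sketch that counts how many times a model is deserialized under each design. It uses only the standard library. The "model" is a hypothetical dictionary of weights stored with `pickle`, standing in for a real estimator saved with joblib, and the names `load_model`, `handle_per_request`, and `make_shared_handler` are illustrative rather than part of any framework.

```python
import pickle
import tempfile
from pathlib import Path

stats = {"loads": 0}  # how many times the "model" was read from disk


def load_model(path):
    """Deserialize the model and record that a load happened."""
    stats["loads"] += 1
    with open(path, "rb") as f:
        return pickle.load(f)


def predict(model, features):
    """Toy inference: a dot product with the stored weights."""
    return sum(w * x for w, x in zip(model["weights"], features))


def handle_per_request(path, features):
    # Anti-pattern: every call pays for disk I/O and deserialization.
    return predict(load_model(path), features)


def make_shared_handler(path):
    model = load_model(path)  # loaded once, then captured by the closure

    def handle(features):
        return predict(model, features)

    return handle


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    model_path.write_bytes(pickle.dumps({"weights": [0.5, 0.25]}))

    for _ in range(100):
        handle_per_request(model_path, [1.0, 2.0])
    per_request_loads = stats["loads"]

    stats["loads"] = 0
    shared_handler = make_shared_handler(model_path)
    for _ in range(100):
        shared_handler([1.0, 2.0])
    shared_loads = stats["loads"]

print(per_request_loads, shared_loads)  # 100 1
```

With a real model, each of those 100 loads can take anywhere from milliseconds to seconds, which is the overhead the startup pattern removes.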

Load the Model During Startup

FastAPI gives you a startup phase. Use it to initialize long-lived resources such as models, tokenizers, database pools, or vector indexes.

A clean modern approach is the lifespan hook:

```python
from contextlib import asynccontextmanager
import joblib
from fastapi import FastAPI, Request

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: runs once per process, before any request is served.
    app.state.model = joblib.load("model.joblib")
    yield
    # Shutdown: drop the reference so the model can be garbage-collected.
    app.state.model = None

app = FastAPI(lifespan=lifespan)

@app.post("/predict")
def predict(features: list[float], request: Request):
    # Reuse the instance that was loaded at startup.
    model = request.app.state.model
    prediction = model.predict([features]).tolist()
    return {"prediction": prediction}
```

Now the model is loaded once when the application starts and reused for every request handled by that process.
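
The ordering that a lifespan hook guarantees can be seen without a web server. The sketch below is a simplified illustration, not FastAPI itself: `run_fake_server` is a hypothetical stand-in for the ASGI server that drives the context manager, and a plain dictionary takes the place of the loaded model.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace

events = []


@asynccontextmanager
async def lifespan(app):
    events.append("startup: load model")
    app.state.model = {"name": "demo-model"}  # placeholder for joblib.load(...)
    yield  # the application serves requests while suspended here
    events.append("shutdown: release model")
    app.state.model = None


async def run_fake_server():
    # Hypothetical stand-in for the server entering and leaving the lifespan.
    app = SimpleNamespace(state=SimpleNamespace())
    async with lifespan(app):
        events.append(f"serving with {app.state.model['name']}")
    return app


fake_app = asyncio.run(run_fake_server())
print(events)
# ['startup: load model', 'serving with demo-model', 'shutdown: release model']
```

Everything before `yield` runs before the first request, and everything after it runs on shutdown, which is why the load belongs in the first half.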

Use Application State Instead of a Bare Global

You will see examples that place the model in a module-level global variable. That can work, but app.state is usually cleaner because ownership is explicit and the model is easier to replace in tests.

A dependency wrapper can make handlers simpler:

```python
from fastapi import Depends, Request

# `app` is the FastAPI instance created with the lifespan hook.

def get_model(request: Request):
    # Single place that knows where the model lives.
    return request.app.state.model

@app.post("/predict")
def predict(features: list[float], model=Depends(get_model)):
    return {"prediction": model.predict([features]).tolist()}
```

This pattern is especially useful when several endpoints need the same model or when you want to override dependencies in tests.
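
One practical consequence is that the handler body no longer cares where the model comes from, so it can be exercised with a stand-in. The sketch below calls a handler-shaped function directly with a hypothetical `StubModel`. In a running FastAPI app, the analogous mechanism is the `app.dependency_overrides` mapping, which swaps `get_model` for a replacement during tests.

```python
class StubModel:
    """Hypothetical test double that returns a fixed prediction."""

    def __init__(self, value):
        self.value = value
        self.calls = []

    def predict(self, rows):
        self.calls.append(rows)  # record inputs so the test can inspect them
        return [self.value for _ in rows]


def predict(features, model):
    # Same shape as the endpoint body: the model arrives as an argument.
    return {"prediction": list(model.predict([features]))}


stub = StubModel(value=0.75)
result = predict([1.0, 2.0, 3.0], model=stub)
print(result)  # {'prediction': [0.75]}
```

No model file is needed, so a test like this stays fast and does not depend on the real artifact being present.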

Be Precise About Workers

If you run Uvicorn or Gunicorn with multiple workers, each worker is a separate process. Each process loads its own copy of the model during startup.

That means:

--workers 1 loads one copy
--workers 4 loads four copies

This is usually the correct behavior, but it matters for memory planning. A large model that fits once may not fit four times.
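
A back-of-the-envelope calculation helps here. The sketch below assumes each worker holds a full private copy of the model plus a fixed per-process overhead. The figures (a 3 GB model, an 8 GB budget, 0.5 GB of overhead) are made up for illustration, and real usage depends on the library and on how the workers are started.

```python
def total_model_memory_gb(model_gb, workers):
    """Memory used by model copies alone, one copy per worker."""
    return model_gb * workers


def max_workers_for_budget(model_gb, budget_gb, overhead_gb_per_worker=0.5):
    """Largest worker count whose copies fit in the memory budget."""
    per_worker = model_gb + overhead_gb_per_worker
    return int(budget_gb // per_worker)


print(total_model_memory_gb(3.0, 4))     # 12.0
print(max_workers_for_budget(3.0, 8.0))  # 2
```

In this example four workers would need 12 GB for the model alone, while an 8 GB machine leaves room for only two.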

If you need sharing across requests inside one process, startup state solves the problem. If you need sharing across processes, that becomes a deployment and systems-design question rather than a FastAPI question.

Keep Prediction Code Thread-Safe

A shared model instance is only safe if inference is safe under concurrent requests. Models from many common libraries, such as scikit-learn, are generally fine for read-only prediction, but thread safety depends on the library and how you use it.

If the model mutates internal state during inference, or if you attach mutable caches without synchronization, shared access can become a bug source.
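
If a model does mutate state during inference, one simple option is to serialize access with a lock. The sketch below is one possible approach under that assumption, not a recommendation for every library. `StatefulModel` is a hypothetical model whose `predict` performs a non-atomic read-modify-write, and `LockedModel` is an illustrative wrapper.

```python
import threading


class StatefulModel:
    """Hypothetical model that mutates internal state during inference."""

    def __init__(self):
        self.calls = 0

    def predict(self, rows):
        current = self.calls      # read-modify-write: not atomic on its own
        self.calls = current + 1
        return [sum(row) for row in rows]


class LockedModel:
    """Wrapper that lets only one thread run predict at a time."""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def predict(self, rows):
        with self._lock:
            return self._model.predict(rows)


inner = StatefulModel()
shared = LockedModel(inner)


def worker():
    for _ in range(1000):
        shared.predict([[1.0, 2.0]])


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(inner.calls)  # 8000
```

The trade-off is throughput: the lock makes inference sequential within the process, so it is worth using only when the library actually requires it.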

A lightweight prediction service often ends up looking like this:

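```python
import numpy as np
from fastapi import FastAPI, Request

# `app` is the FastAPI instance created with the lifespan hook.

@app.post("/predict")
def predict(features: list[float], request: Request):
    model = request.app.state.model
    # Shape the input as a single row: (1, n_features).
    vector = np.array(features, dtype=float).reshape(1, -1)
    value = model.predict(vector)[0]
    # Convert the NumPy scalar to a plain float for JSON serialization.
    return {"prediction": float(value)}
```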

The endpoint stays thin because startup already handled the expensive initialization.

Test Startup Behavior Explicitly

Do not assume the model is loaded just because the code looks correct. Run the app locally and verify that startup succeeds before the first request arrives. If model loading fails, the application should fail to start rather than come up and serve broken requests.

This is another reason to keep model loading out of the handler itself. Startup failure is much easier to diagnose than sporadic request-time failure.
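
The fail-fast behavior can be sketched with the standard library alone. In the example below, `load_model` is a hypothetical loader that raises when the file is missing, and `start` is a stand-in for the server entering the lifespan. The file name `does-not-exist.joblib` is deliberately a path that is assumed not to exist.

```python
import asyncio
from contextlib import asynccontextmanager
from pathlib import Path
from types import SimpleNamespace


def load_model(path):
    """Hypothetical loader; a real app might call joblib.load here."""
    if not Path(path).is_file():
        raise RuntimeError(f"model file not found: {path}")
    return Path(path).read_bytes()


@asynccontextmanager
async def lifespan(app):
    app.state.model = load_model(app.state.model_path)  # raises on failure
    yield
    app.state.model = None


async def start(model_path):
    # Stand-in for the server: requests are only reachable inside the block.
    app = SimpleNamespace(state=SimpleNamespace(model_path=model_path))
    async with lifespan(app):
        return "started"


try:
    outcome = asyncio.run(start("does-not-exist.joblib"))
except RuntimeError as exc:
    outcome = f"startup failed: {exc}"

print(outcome)  # startup failed: model file not found: does-not-exist.joblib
```

Because the error is raised before `yield`, the failure surfaces once, at startup, with a clear message instead of on some later request.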

Common Pitfalls

A common mistake is loading the model inside the endpoint and then wondering why latency spikes under load. The fix is architectural, not a micro-optimization.

Another issue is using a global variable without understanding worker processes. In development you may see one model load, then deploy with multiple workers and suddenly memory usage multiplies.

Developers also sometimes perform long blocking startup work and assume the application is ready before the lifespan hook finishes. The application is not ready to serve requests until the startup half of the hook completes successfully.

Finally, a shared model object is not automatically safe just because it is shared. Verify the prediction library’s concurrency behavior before assuming one instance can serve all requests safely.

Summary

Load the model during FastAPI startup, not inside each request handler.
Store the model in app.state and retrieve it from the request or a dependency.
Expect one model instance per worker process, not one per entire deployment.
Keep prediction handlers thin and use startup to perform the expensive initialization.
Check concurrency and memory behavior before assuming the shared model design is production-ready.