Deploying Keras Models via Google Cloud ML

Introduction

Shipping a Keras model to Google Cloud is more than uploading a file and waiting for predictions. A usable deployment needs a reproducible export, a serving environment that matches training expectations, and a release process that can be validated and rolled back. The names of Google Cloud services have evolved over time, but the deployment discipline behind them has not.

Export a Model That Can Actually Be Served

The safest starting point is a local training script that produces a SavedModel artifact consistently. That means fixed preprocessing, known input order, and a recorded TensorFlow version. If the training notebook and the serving endpoint transform data differently, the endpoint may be healthy while every prediction is wrong.

python

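import numpy as np
import tensorflow as tf
from tensorflow import keras

# Fix the seeds so repeated runs produce the same data and initial weights.
np.random.seed(7)
tf.random.set_seed(7)

# Synthetic data: four features; the label is 1 when the first two sum above 1.0.
x = np.random.rand(1000, 4).astype("float32")
y = (x[:, 0] + x[:, 1] > 1.0).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=3, batch_size=32, verbose=0)

# Record the TensorFlow version alongside the export.
print("TensorFlow version:", tf.__version__)

# With TensorFlow 2.14 (Keras 2), a path without an extension is written as a SavedModel.
# On Keras 3 (TensorFlow 2.16+), use model.export("saved_model_v1") instead.
model.save("saved_model_v1")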

Treat that export as a release artifact. Keep metadata with it, including feature order, label meaning, code revision, and training data window. Those details matter as much as the weights once the model is in production.
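One way to keep that metadata with the artifact is a small JSON file written at export time. The sketch below is an illustration, not a required format: the helper names `build_metadata` and `write_metadata`, the file name `release_metadata.json`, the feature names, and every value shown are assumptions.

```python
import json
import tempfile
from pathlib import Path


def build_metadata(feature_order, label_meaning, code_revision,
                   training_window, tensorflow_version):
    """Collect the release details that should travel with the export."""
    return {
        "feature_order": list(feature_order),
        "label_meaning": label_meaning,
        "code_revision": code_revision,
        "training_data_window": training_window,
        "tensorflow_version": tensorflow_version,
    }


def write_metadata(target_dir, metadata, filename="release_metadata.json"):
    """Write the metadata as JSON into target_dir and return the file path."""
    path = Path(target_dir) / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return path


# Illustrative values only; replace them with the real release details.
metadata = build_metadata(
    feature_order=["f0", "f1", "f2", "f3"],  # hypothetical feature names
    label_meaning="1 when f0 + f1 > 1.0, else 0",
    code_revision="<git commit hash>",
    training_window="<start date> to <end date>",
    tensorflow_version="2.14.0",  # in a training script, record tf.__version__
)

# A temporary directory stands in for the release folder in this demo.
with tempfile.TemporaryDirectory() as release_dir:
    metadata_path = write_metadata(release_dir, metadata)
    print(metadata_path.read_text())
```

In practice the file would be uploaded to the same Cloud Storage prefix as the export, so that anyone restoring the model finds both together.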

Upload the Artifact and Register It

After exporting the model, place it in Cloud Storage so the managed serving platform can read it. The exact commands depend on whether your project still uses the older Cloud ML Engine (later AI Platform) tooling or the newer Vertex AI workflow, but the core steps are the same: upload the artifact, register the model, and choose a compatible serving image. The commands below use the Vertex AI form.

bash

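# Placeholder values; replace them with your own project, region, and bucket.
PROJECT_ID="my-project"
REGION="us-central1"
BUCKET="my-ml-artifacts"

# Copy the exported SavedModel directory to Cloud Storage.
gsutil -m cp -r saved_model_v1 "gs://${BUCKET}/models/saved_model_v1"

# Register the model; the serving image should match the TensorFlow version used for training.
gcloud ai models upload \
  --project="${PROJECT_ID}" \
  --region="${REGION}" \
  --display-name="keras-demo-v1" \
  --artifact-uri="gs://${BUCKET}/models/saved_model_v1" \
  --container-image-uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-14:latest"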

The most common failure here is a mismatch: the wrong region, the wrong container runtime, or a model format that does not match the serving stack. When registration fails, verify the artifact contents first instead of guessing at IAM or networking issues.
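A quick local check of the export directory can rule out the most basic format problems before upload. The following is a minimal sketch that only looks for the standard SavedModel layout (`saved_model.pb` plus a `variables/` directory); the function name `check_saved_model_dir` is hypothetical, and the check does not prove that the model loads or that its signature is what the serving image expects.

```python
import tempfile
from pathlib import Path


def check_saved_model_dir(export_dir):
    """Return a list of problems found in a local SavedModel directory."""
    root = Path(export_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "saved_model.pb").is_file():
        problems.append("missing saved_model.pb")
    variables = root / "variables"
    if not variables.is_dir():
        problems.append("missing variables/ directory")
    else:
        if not (variables / "variables.index").is_file():
            problems.append("missing variables/variables.index")
        if not any(variables.glob("variables.data-*")):
            problems.append("missing variables/variables.data-* shards")
    return problems


# Demo against a deliberately incomplete fake export in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp) / "saved_model_v1"
    (export / "variables").mkdir(parents=True)
    (export / "saved_model.pb").write_bytes(b"")
    for problem in check_saved_model_dir(export):
        print(problem)
```

An empty list means only that the layout looks right. TensorFlow's `saved_model_cli show` command can then be used to inspect the serving signature itself.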

Deploy to an Endpoint and Test Immediately

A registered model is still not serving traffic. You need an endpoint and a deployment step that binds the model version to compute resources. Once the model is deployed, send known requests before you let real users hit it.

bash

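# Assumes PROJECT_ID and REGION are still set from the upload step.
# Create the endpoint and note the endpoint ID printed in the output.
gcloud ai endpoints create \
  --project="${PROJECT_ID}" \
  --region="${REGION}" \
  --display-name="keras-demo-endpoint"

# Replace ENDPOINT_ID and MODEL_ID with the IDs from the create and upload commands.
# --traffic-split=0=100 sends all traffic to the model being deployed (key 0 refers to it).
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --project="${PROJECT_ID}" \
  --region="${REGION}" \
  --model=MODEL_ID \
  --display-name="keras-demo-v1" \
  --machine-type="n1-standard-2" \
  --traffic-split=0=100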

Smoke tests should check more than HTTP success. Confirm input shape, output schema, value ranges, and latency. A binary classifier that suddenly returns constant scores is a deployment failure even if the API itself returns 200.

python

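import requests

# REGION, PROJECT, ENDPOINT_ID, and TOKEN are placeholders; substitute your own values.
# An access token can be obtained with `gcloud auth print-access-token`.
url = (
    "https://REGION-aiplatform.googleapis.com/v1/projects/PROJECT"
    "/locations/REGION/endpoints/ENDPOINT_ID:predict"
)

# One instance with the four features in training order.
payload = {"instances": [[0.1, 0.2, 0.3, 0.4]]}
response = requests.post(
    url,
    headers={"Authorization": "Bearer TOKEN"},
    json=payload,  # serializes the body and sets Content-Type: application/json
    timeout=30,
)
response.raise_for_status()  # fail loudly on non-2xx responses
print(response.json())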
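The checks described above can be turned into a small validation function that runs on the parsed response. This is a sketch under stated assumptions: the helper name `check_predictions` is hypothetical, and it assumes the response body carries a `predictions` list in which each entry is either a bare score or a one-element list, which depends on how the model's serving signature is defined.

```python
def check_predictions(body, expected_count):
    """Return a list of problems found in a parsed prediction response."""
    predictions = body.get("predictions") if isinstance(body, dict) else None
    if not isinstance(predictions, list):
        return ["response has no 'predictions' list"]
    problems = []
    if len(predictions) != expected_count:
        problems.append(
            f"expected {expected_count} predictions, got {len(predictions)}"
        )
    scores = []
    for item in predictions:
        # Assumption: each prediction is a bare number or a one-element list.
        value = item[0] if isinstance(item, list) and len(item) == 1 else item
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"unexpected prediction shape: {item!r}")
            continue
        if not 0.0 <= value <= 1.0:
            problems.append(f"score out of range: {value}")
        scores.append(float(value))
    # A sigmoid classifier that returns the same score for different inputs is suspect.
    if len(scores) > 1 and max(scores) - min(scores) < 1e-6:
        problems.append("all scores are identical")
    return problems


# Hand-written example bodies, not real endpoint output.
good = {"predictions": [[0.12], [0.91]]}
constant = {"predictions": [[0.5], [0.5]]}
print(check_predictions(good, expected_count=2))
print(check_predictions(constant, expected_count=2))
```

Latency is not covered by this function. With `requests`, the `response.elapsed` attribute gives a rough per-request timing that can be compared against a budget.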

Build for Versioning and Rollback

Most deployment pain shows up after the first successful launch. Models need version retention, rollback instructions, and monitoring for drift or latency regressions. Keep at least one previous good artifact available and document the exact endpoint or model ID needed to restore it. When a new model underperforms, you want an operational rollback, not an emergency retraining session.
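An operational rollback can be as simple as shifting traffic back to a model that is still deployed on the endpoint. The sketch below only computes the new traffic mapping; the function name `rollback_split` and the deployed model IDs are hypothetical, and it assumes the previous good model has not been undeployed. The result has the same ID-to-percentage shape as the `--traffic-split` flag used at deployment time.

```python
def rollback_split(current_split, previous_good_id):
    """Return a traffic split that sends all traffic to the previous good model."""
    if previous_good_id not in current_split:
        raise KeyError(f"{previous_good_id} is not deployed on this endpoint")
    return {
        model_id: (100 if model_id == previous_good_id else 0)
        for model_id in current_split
    }


def format_split(split):
    """Render the mapping as comma-separated ID=PERCENT pairs."""
    return ",".join(f"{model_id}={percent}" for model_id, percent in split.items())


# Hypothetical deployed model IDs: "1111" is the previous release, "2222" the new one.
current = {"1111": 0, "2222": 100}
restored = rollback_split(current, previous_good_id="1111")
print(format_split(restored))
```

Keeping the previous model deployed at zero traffic costs compute, so this is a trade-off: faster rollback in exchange for paying for an idle replica.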

It is also worth separating model quality checks from infrastructure checks. Accuracy validation belongs in the release process before traffic shifts. Serving health, request volume, latency, and error rates belong in runtime monitoring after deployment.
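The separation can be made concrete by keeping the two kinds of checks in different functions that run at different times. This is an illustrative sketch: the function names, the thresholds (`max_drop`, `p95_limit_ms`, `max_error_rate`), and the sample numbers are all assumptions, not recommended values.

```python
def quality_gate(candidate_accuracy, baseline_accuracy, max_drop=0.01):
    """Release-time check: the candidate may trail the baseline by at most max_drop."""
    return candidate_accuracy >= baseline_accuracy - max_drop


def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def runtime_problems(latencies_ms, error_count, request_count,
                     p95_limit_ms=200.0, max_error_rate=0.01):
    """Runtime check: return a list of serving-health problems."""
    if request_count == 0 or not latencies_ms:
        return ["no traffic observed"]
    problems = []
    observed_p95 = p95(latencies_ms)
    if observed_p95 > p95_limit_ms:
        problems.append(f"p95 latency {observed_p95} ms exceeds {p95_limit_ms} ms")
    error_rate = error_count / request_count
    if error_rate > max_error_rate:
        problems.append(f"error rate {error_rate:.3f} exceeds {max_error_rate}")
    return problems


# Before shifting traffic: compare holdout accuracy against the current model.
print(quality_gate(candidate_accuracy=0.93, baseline_accuracy=0.92))

# After deployment: evaluate a window of observed requests.
print(runtime_problems([40, 55, 61, 300], error_count=0, request_count=4))
```

A model can pass one check and fail the other, which is the reason to report them separately.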

Common Pitfalls

Teams often break deployments by changing feature order between training and serving, choosing the wrong serving image, or skipping smoke tests against known examples. Another recurring problem is treating the model artifact as self-explanatory and failing to store the metadata needed to reproduce or roll back the release.
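One way to guard against feature-order drift is to define the order once and build every request from named features. The sketch below assumes hypothetical feature names `f0` to `f3` for the four model inputs and a hypothetical helper `to_instance`; the same constant would be imported by both the training and the client code.

```python
FEATURE_ORDER = ("f0", "f1", "f2", "f3")  # hypothetical names, in training order


def to_instance(record, feature_order=FEATURE_ORDER):
    """Turn a dict of named features into a list in the training order."""
    missing = [name for name in feature_order if name not in record]
    unexpected = [name for name in record if name not in feature_order]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return [float(record[name]) for name in feature_order]


# The dict order does not matter; the output always follows FEATURE_ORDER.
record = {"f3": 0.4, "f0": 0.1, "f2": 0.3, "f1": 0.2}
payload = {"instances": [to_instance(record)]}
print(payload)
```

Rejecting missing or unexpected names turns a silent ordering bug into an immediate error on the client side.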

Summary

Export a reproducible SavedModel and keep its metadata with the artifact.
Upload and register the model with a serving runtime that matches the framework version.
Deploy to an endpoint and validate predictions with known requests before real traffic.
Separate runtime health monitoring from model-quality validation.
Keep previous model versions and rollback steps ready before each release.