Selecting a node size for a GKE Kubernetes cluster

Introduction

Choosing a node size for GKE is really a scheduling and cost problem, not just a VM-picking problem. You want nodes that fit your pods efficiently, leave room for system overhead, and fail gracefully when a node disappears. A good node size comes from workload data, not intuition.

Start with Pod Requests

The most useful inputs are:

CPU request per pod
memory request per pod
ephemeral storage request per pod
expected replica count
per-node overhead from DaemonSets and system components

If these numbers are wrong, node sizing will also be wrong. In both GKE Standard and Autopilot, resource requests drive placement decisions.

Here is a simple deployment example:

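```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 6
  # apps/v1 requires a selector that matches the pod template labels.
  # The "app: api" label is an illustrative choice.
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: us-docker.pkg.dev/example/api:latest
          resources:
            # Requests are what the scheduler uses to place the pod.
            requests:
              cpu: "500m"
              memory: "1Gi"
            # Limits cap what the container may consume at runtime.
            limits:
              cpu: "1"
              memory: "2Gi"
```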

That tells you much more about the right node size than a generic "small app" label ever will.
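As a rough illustration, the aggregate demand of the Deployment above can be computed directly from its requests. This is a minimal sketch; the function name `total_requests` is made up for this example, and the units (millicores and MiB) are a convenience choice.

```python
def total_requests(replicas: int, cpu_request_millicores: int, memory_request_mib: int):
    """Return total requested (CPU millicores, memory MiB) across all replicas."""
    return replicas * cpu_request_millicores, replicas * memory_request_mib


# Values taken from the Deployment above: 6 replicas, 500m CPU, 1Gi memory each.
total_cpu_m, total_mem_mib = total_requests(6, 500, 1024)

print(f"CPU: {total_cpu_m / 1000} vCPU, memory: {total_mem_mib / 1024} GiB")
```

The workload needs about 3 vCPU and 6 GiB of schedulable capacity before any per-node overhead or headroom is added.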

Match the Machine Family to the Workload

The exact machine families available depend on region and product support, so the stable way to reason about node size is by workload shape:

Workload Shape	What to Favor	Why
General web and API services	General-purpose nodes	Balanced CPU and memory
CPU-bound workers	Compute-optimized nodes	Better CPU density
In-memory services	Memory-optimized nodes	More RAM per vCPU
GPU or accelerator workloads	Accelerator-backed nodes	Specialized hardware support

Choose the category first, then choose the size within that category.
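One way to put a number on "workload shape" is the ratio of requested memory to requested CPU. The sketch below is only illustrative: the cut-off values `low` and `high` are hypothetical defaults chosen for this example, not figures published for any machine family, so compare the ratio against the machine types actually offered in your region.

```python
def memory_per_vcpu(cpu_request_millicores: int, memory_request_mib: int) -> float:
    """Return requested memory in GiB per requested vCPU."""
    if cpu_request_millicores <= 0:
        raise ValueError("CPU request must be positive")
    vcpus = cpu_request_millicores / 1000
    gib = memory_request_mib / 1024
    return gib / vcpus


def workload_shape(ratio_gib_per_vcpu: float, low: float = 1.5, high: float = 6.0) -> str:
    """Classify a workload by its memory-to-CPU ratio.

    low and high are hypothetical cut-offs for illustration only.
    """
    if ratio_gib_per_vcpu < low:
        return "compute-optimized"
    if ratio_gib_per_vcpu > high:
        return "memory-optimized"
    return "general-purpose"


# A pod requesting 500m CPU and 1Gi memory asks for 2 GiB per vCPU.
ratio = memory_per_vcpu(500, 1024)
print(ratio, workload_shape(ratio))
```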

Few Large Nodes Versus Many Small Nodes

Node size trades per-node efficiency against the impact of losing a node:

Larger nodes reduce per-node overhead.
Smaller nodes reduce blast radius during failures or upgrades.
Tiny nodes can waste resources on logging, monitoring, and CNI overhead.
Very large nodes can be harder to fill efficiently and can cause bigger disruption when one node is drained.

In practice, moderate-size nodes are often the safest starting point unless you have a clear reason to bias in one direction.
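The blast-radius side of this trade-off can be estimated with simple arithmetic. The sketch below assumes replicas are spread as evenly as possible across nodes, which the scheduler does not guarantee without something like topology spread constraints. The function name is invented for this example.

```python
import math


def share_lost_on_node_failure(replicas: int, node_count: int) -> float:
    """Worst-case fraction of replicas lost when one node fails.

    Assumes replicas are spread as evenly as possible across nodes.
    """
    if replicas <= 0 or node_count <= 0:
        raise ValueError("replicas and node_count must be positive")
    # The fullest node holds the rounded-up share of replicas.
    worst_node_replicas = math.ceil(replicas / node_count)
    return worst_node_replicas / replicas


# Compare a few node counts for a 6-replica service.
for nodes in (2, 3, 6):
    share = share_lost_on_node_failure(6, nodes)
    print(f"{nodes} nodes: up to {share:.0%} of replicas lost")
```

With two large nodes, a single failure can remove half of the service. With six small nodes, it removes one replica.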

A Practical Sizing Workflow

Suppose each pod requests:

500m CPU
1Gi memory

and each node also has:

DaemonSet overhead
kube-system overhead
some reserved headroom for rolling updates and burst

The workflow is:

1. Estimate how many pods you want per node.
2. Multiply that count by the per-pod requests, then add per-node overhead.
3. Avoid planning to 100 percent utilization; keep the remainder as headroom.
4. Check whether losing one node would evict too much workload at once.

If one node failure would knock out a large share of your service, the node size is probably too large for that workload.
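The workflow can also be run in the other direction: take a candidate node size and see how many pods fit. The sketch below does that under stated assumptions. The allocatable, overhead, and headroom figures are placeholders made up for this example, not measured GKE values; real numbers come from inspecting your own nodes (for example with `kubectl describe node`). The `CandidateNode` type and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class CandidateNode:
    allocatable_cpu_m: int    # allocatable CPU in millicores
    allocatable_mem_mib: int  # allocatable memory in MiB
    overhead_cpu_m: int       # DaemonSet + kube-system CPU requests
    overhead_mem_mib: int     # DaemonSet + kube-system memory requests
    headroom_percent: int     # share kept free for rollouts and bursts


def pods_per_node(node: CandidateNode, pod_cpu_m: int, pod_mem_mib: int) -> int:
    """Return how many pods fit after subtracting overhead and headroom."""
    keep = 100 - node.headroom_percent
    usable_cpu = (node.allocatable_cpu_m - node.overhead_cpu_m) * keep // 100
    usable_mem = (node.allocatable_mem_mib - node.overhead_mem_mib) * keep // 100
    if usable_cpu <= 0 or usable_mem <= 0:
        return 0
    # The scarcer resource decides how many pods fit.
    return min(usable_cpu // pod_cpu_m, usable_mem // pod_mem_mib)


def nodes_needed(replicas: int, per_node: int) -> int:
    """Return the node count required to place all replicas."""
    if per_node <= 0:
        raise ValueError("no pods fit on this node size")
    return math.ceil(replicas / per_node)


# Placeholder figures for illustration only.
node = CandidateNode(3900, 13000, 400, 1000, 20)
fit = pods_per_node(node, pod_cpu_m=500, pod_mem_mib=1024)
count = nodes_needed(6, fit)
print(f"{fit} pods per node, {count} nodes for 6 replicas")
```

With these placeholder numbers, CPU is the limiting resource: five pods fit per node, so six replicas need two nodes. If the scheduler fills the first node, one failure could take out five of six replicas, which is the kind of result that suggests a smaller node size or explicit spreading.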

Node Pools Usually Beat One Global Node Size

Many clusters should not have one "best" node size. If you run mixed workloads, separate node pools are cleaner.

Examples:

API services on general-purpose nodes
memory-heavy workers on memory-optimized nodes
batch jobs on a cheaper autoscaled pool

This lets you tune autoscaling, labels, taints, and cost strategy independently.

Other Constraints People Forget

The per-node pod limit (maximum pods per node) can be reached before CPU or memory runs out, especially when pods are small.
Local storage and disk throughput can become bottlenecks.
DaemonSets consume resources on every node, so very small nodes can be inefficient.
Cluster autoscaler helps, but it does not fix bad requests or bad workload separation.
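To see how pod density can become the limiting factor, the sketch below compares three ceilings for one node: CPU fit, memory fit, and the pod limit. All inputs are hypothetical; `max_pods_per_node` in particular depends on how the cluster and node pool are configured, so check your own settings rather than relying on the value used here.

```python
def binding_constraint(
    usable_cpu_m: int,
    usable_mem_mib: int,
    pod_cpu_m: int,
    pod_mem_mib: int,
    max_pods_per_node: int,
    system_pods: int,
):
    """Return (constraint name, pod count) for whichever limit is hit first."""
    limits = {
        "cpu": usable_cpu_m // pod_cpu_m,
        "memory": usable_mem_mib // pod_mem_mib,
        # DaemonSet and system pods also count against the pod limit.
        "pod_density": max_pods_per_node - system_pods,
    }
    name = min(limits, key=limits.get)
    return name, limits[name]


# Hypothetical large node running many small pods (50m CPU, 128 MiB each).
name, pods = binding_constraint(
    usable_cpu_m=15000,
    usable_mem_mib=56000,
    pod_cpu_m=50,
    pod_mem_mib=128,
    max_pods_per_node=110,  # example value; confirm for your node pool
    system_pods=10,
)
print(name, pods)
```

In this example the node could hold 300 pods by CPU, but the pod limit stops it at 100, so most of the CPU would sit idle unless the limit is raised or the node is made smaller.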

Common Pitfalls

Choosing based on VM specs without first measuring pod requests.
Packing nodes too tightly and leaving no room for upgrades or transient spikes.
Using one node pool for very different workloads.
Optimizing only for hourly price instead of total cluster efficiency.

Summary

Start from pod requests, not machine labels.
Match the machine family to the workload shape.
Balance fewer large nodes against more small nodes.
Use node pools when workloads differ materially.
Leave headroom for system overhead, scaling, and disruption.