How can I compute tensors in PyTorch efficiently?

Introduction

Efficient tensor computation in PyTorch means using GPU acceleration, avoiding unnecessary copies, leveraging vectorized operations over Python loops, and managing memory carefully. The biggest performance wins come from moving tensors to GPU with .to(device), replacing explicit loops with broadcasting and torch.einsum, using in-place operations where safe, and enabling mixed-precision training. Understanding these patterns is essential for training large models in reasonable time.

Move Tensors to GPU

python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create tensor directly on GPU
x = torch.randn(1000, 1000, device=device)

# Move existing tensor to GPU
y = torch.randn(1000, 1000)
y = y.to(device)  # Returns a new tensor on GPU

# All operations on GPU tensors run on GPU
z = x @ y  # Matrix multiplication on GPU, much faster for large matrices

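For large tensor operations, GPU computation can be 10-100x faster than CPU thanks to massive parallelism; the actual gain depends on the hardware, the dtype, and the problem size. For small tensors, kernel-launch and transfer overhead can make the GPU slower than the CPU.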
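Speedup figures like these are best measured on your own hardware. The following is a minimal timing sketch, not a rigorous benchmark; the helper name `time_matmul` and the matrix size are illustrative assumptions. It synchronizes before reading the clock because CUDA kernels are launched asynchronously, so timing without synchronization would measure only the launch.

```python
import time
import torch

def time_matmul(device: torch.device, n: int = 512, repeats: int = 5) -> float:
    """Return the average seconds per n-by-n matmul on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    a @ b  # warm-up: the first call pays one-time initialization costs
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the timed kernels to finish
    return (time.perf_counter() - start) / repeats

cpu_seconds = time_matmul(torch.device("cpu"))
print(f"cpu:  {cpu_seconds:.6f}s per matmul")
if torch.cuda.is_available():
    gpu_seconds = time_matmul(torch.device("cuda"))
    print(f"cuda: {gpu_seconds:.6f}s per matmul")
```

Increasing `n` should widen the gap in favor of the GPU, while very small `n` may favor the CPU.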

Vectorized Operations Over Loops

python
import time
import torch

# SLOW: Python loop
x = torch.randn(10000)
start = time.perf_counter()
result = torch.zeros(10000)
for i in range(10000):
    result[i] = x[i] ** 2 + x[i] * 3
print(f"Loop: {time.perf_counter() - start:.4f}s")  # ~0.05s (machine-dependent)

# FAST: Vectorized
start = time.perf_counter()
result = x ** 2 + x * 3
print(f"Vectorized: {time.perf_counter() - start:.4f}s")  # ~0.0001s (machine-dependent)

Vectorized operations dispatch to optimized C++/CUDA kernels. Python loops add interpreter overhead per element.
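Loops that contain an `if` can usually be vectorized too. As a small sketch (the piecewise function here is an arbitrary example, not from the original), `torch.where` selects per element between two tensors that are each computed for the whole input:

```python
import torch

x = torch.linspace(-2.0, 2.0, steps=9)

# Loop version: one Python-level branch per element
looped = torch.empty_like(x)
for i in range(x.numel()):
    looped[i] = x[i] ** 2 if x[i] > 0 else -x[i]

# Vectorized version: both branches are computed for every element,
# then torch.where picks one per element according to the mask
vectorized = torch.where(x > 0, x ** 2, -x)

print(torch.allclose(looped, vectorized))
```

Note that both branches are evaluated everywhere, so this trades some redundant arithmetic for the removal of per-element interpreter overhead.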

Broadcasting

Broadcasting automatically expands tensors to compatible shapes without copying data:

python
import torch

# Add a bias vector to every row of a matrix
matrix = torch.randn(1000, 512)  # (1000, 512)
bias = torch.randn(512)          # (512,)
result = matrix + bias            # Broadcasting: (1000, 512) + (512,) → (1000, 512)

# No need to expand manually: broadcasting handles it.
# expand() only creates a view, so this is equivalent, just more verbose:
# result = matrix + bias.unsqueeze(0).expand(1000, 512)
# repeat() materializes a full copy, so this is slower and uses more memory:
# result = matrix + bias.repeat(1000, 1)
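Broadcasting also works between two tensors that each have a size-1 dimension. The sketch below uses it to compute all pairwise Euclidean distances between two small point sets (the names `points_a` and `points_b` are illustrative) and compares the result against `torch.cdist`:

```python
import torch

points_a = torch.randn(5, 3)   # 5 points in 3D
points_b = torch.randn(7, 3)   # 7 points in 3D

# (5, 1, 3) - (1, 7, 3) broadcasts to (5, 7, 3)
diff = points_a.unsqueeze(1) - points_b.unsqueeze(0)
distances = diff.pow(2).sum(dim=-1).sqrt()   # (5, 7)

# Reference result from the built-in pairwise distance function
reference = torch.cdist(points_a, points_b)
print(torch.allclose(distances, reference, atol=1e-4))
```

The inputs are never copied, but the intermediate `diff` tensor is materialized at the full broadcast shape, so for large point sets a dedicated routine such as `torch.cdist` is likely the better choice.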

torch.einsum for Complex Operations

python
import torch

# Batch matrix multiplication
A = torch.randn(32, 100, 64)   # (batch, seq, hidden)
B = torch.randn(32, 64, 128)   # (batch, hidden, out)

# Using einsum: the index string spells out the contraction
result = torch.einsum('bij,bjk->bik', A, B)  # (32, 100, 128)

# Equivalent, but the shapes are implicit
result = torch.bmm(A, B)

# Attention score computation
queries = torch.randn(32, 8, 100, 64)   # (batch, heads, seq, dim)
keys = torch.randn(32, 8, 100, 64)

scores = torch.einsum('bhqd,bhkd->bhqk', queries, keys)
# (32, 8, 100, 100): attention matrix

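einsum lowers the contraction to optimized matrix-multiplication kernels (BLAS on CPU, cuBLAS on GPU), so it usually performs on par with the equivalent bmm or matmul call. Its main advantage is readability: the index string states the contraction directly, with no manual transpose or reshape steps.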
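To make the index notation concrete, here is a small check (with shapes reduced for speed) that the attention-score einsum matches the explicit transpose-and-matmul form:

```python
import torch

queries = torch.randn(2, 4, 10, 16)   # (batch, heads, seq, dim)
keys = torch.randn(2, 4, 10, 16)

# Contract over d; q and k index the query and key positions
scores_einsum = torch.einsum("bhqd,bhkd->bhqk", queries, keys)

# Same computation with an explicit transpose of the last two dimensions
scores_matmul = queries @ keys.transpose(-2, -1)

print(scores_einsum.shape)
print(torch.allclose(scores_einsum, scores_matmul, atol=1e-4))
```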

In-Place Operations

python
import torch

x = torch.randn(1000)

# Out-of-place: creates a new tensor (more memory)
x = x + 1
x = torch.relu(x)

# In-place: modifies tensor directly (saves memory)
x.add_(1)          # x += 1
x.relu_()          # x = relu(x)
x.clamp_(min=0)    # x = clamp(x, min=0)
x.mul_(0.5)        # x *= 0.5

In-place operations (trailing _) save memory by not allocating a new tensor. However, they can break autograd if the tensor is needed for gradient computation. Use them primarily in inference or for tensors not requiring gradients.
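The autograd failure mode can be reproduced in a few lines. In this sketch, `exp` keeps its output for the backward pass, so overwriting that output in place makes the later `backward()` call raise a `RuntimeError`; the out-of-place version works as expected.

```python
import torch

x = torch.randn(4, requires_grad=True)

y = x.exp()   # exp saves its output to compute the gradient later
y.add_(1)     # the in-place edit overwrites that saved output

try:
    y.sum().backward()
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)   # autograd detects the modified tensor

# Out-of-place version: the saved output of exp stays intact
z = x.exp()
z = z + 1
z.sum().backward()

print(error_message is not None)
print(torch.allclose(x.grad, x.detach().exp()))
```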

Mixed Precision Training

python
import torch
from torch.cuda.amp import GradScaler  # newer releases also offer torch.amp.GradScaler("cuda")

model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()

for batch in dataloader:
    inputs, targets = batch[0].to(device), batch[1].to(device)

    optimizer.zero_grad()

    # Forward pass under autocast: eligible ops (e.g. matmul, conv) run in float16
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Backward pass on the scaled loss so small gradients do not underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # Unscales gradients; skips the step on inf/NaN
    scaler.update()          # Adjusts the scale factor for the next iteration

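Mixed precision runs eligible operations in float16 during the forward and backward passes, while the model weights and optimizer updates stay in float32. On GPUs with hardware float16 support (such as Tensor Cores) this can approach a 2x speedup and a substantial memory saving for matmul- and convolution-heavy models, though the gain depends on the model and hardware.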
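The reason for scaling the loss can be shown with plain tensors. In this sketch, a small value (standing in for a gradient) becomes zero when cast to float16, but survives if it is multiplied by a scale factor first and divided again in float32. The factor 65536 (2^16) matches the default initial scale of `GradScaler`; the value 1e-8 is just an illustrative choice.

```python
import torch

tiny_grad = torch.tensor(1e-8)                    # float32 value below float16's range
underflowed = tiny_grad.to(torch.float16)         # rounds to zero in float16

scale = 65536.0                                   # 2**16
scaled = (tiny_grad * scale).to(torch.float16)    # now representable in float16
recovered = scaled.to(torch.float32) / scale      # unscale in float32

print(underflowed.item())
print(recovered.item())
```

`GradScaler` applies the same idea to the loss, so every gradient is scaled through the chain rule, and it unscales the gradients before the optimizer step.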

Efficient Data Loading

python
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,        # Parallel data loading
    pin_memory=True,      # Faster CPU→GPU transfer
    persistent_workers=True,  # Keep workers alive between epochs
    prefetch_factor=2,    # Prefetch 2 batches per worker
)

# Non-blocking transfer
for batch in dataloader:
    inputs = batch[0].to(device, non_blocking=True)
    targets = batch[1].to(device, non_blocking=True)
    # GPU processes previous batch while next batch transfers

Together, pin_memory=True and non_blocking=True enable asynchronous CPU-to-GPU transfers, so copying the next batch can overlap with computation on the current one.
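One practical detail: `persistent_workers` and `prefetch_factor` are only valid when `num_workers > 0`, and `DataLoader` raises a `ValueError` otherwise. The sketch below wraps this in a hypothetical helper, `make_loader`, that adds those options only when worker processes are used and enables pinned memory only when a GPU is present. It runs with `num_workers=0` on a synthetic dataset so it works anywhere.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(dataset, batch_size: int = 64, num_workers: int = 0) -> DataLoader:
    """Build a DataLoader, adding worker-only options when workers are used."""
    kwargs = dict(
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),  # pinning only helps GPU transfers
    )
    if num_workers > 0:
        kwargs["persistent_workers"] = True    # invalid when num_workers == 0
        kwargs["prefetch_factor"] = 2          # invalid when num_workers == 0
    return DataLoader(dataset, **kwargs)

# Synthetic data: 256 samples with 8 features and a binary label
dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))
loader = make_loader(dataset)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

batch_shapes = []
for inputs, targets in loader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    batch_shapes.append(tuple(inputs.shape))

print(len(batch_shapes), batch_shapes[0])
```

When calling the helper with `num_workers > 0` from a script, place the loop under `if __name__ == "__main__":` on platforms that start worker processes with spawn.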

Avoid Unnecessary Gradient Tracking

python
# During inference, disable gradient computation
with torch.no_grad():
    predictions = model(test_input)
    # No autograd graph is built, so less memory is used and overhead drops

# Factory functions create tensors without gradient tracking by default;
# requires_grad=False only makes that explicit
embedding = torch.randn(10000, 300, requires_grad=False)

# Detach tensors from computation graph when storing
cached_features = model.encode(x).detach()
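The effect of these two tools is visible on the output tensors themselves. This sketch uses a small `nn.Linear` as a stand-in model and inspects `requires_grad` and `grad_fn` for a tracked, an untracked, and a detached result:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)   # stand-in for a real model
x = torch.randn(8, 16)

tracked = model(x)                 # graph is recorded for backward
with torch.no_grad():
    untracked = model(x)           # no graph is recorded
detached = tracked.detach()        # same data, cut off from the graph

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn is None)
print(detached.requires_grad, detached.data_ptr() == tracked.data_ptr())
```

`detach()` returns a tensor that shares storage with the original, so it costs no extra memory for the data, but it lets the graph behind `tracked` be freed once `tracked` itself goes out of scope.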

Efficient Matrix Operations

python
# Use torch.matmul or @ for matrix multiplication
C = A @ B                     # Recommended
C = torch.matmul(A, B)       # Equivalent
C = torch.mm(A, B)           # 2D only, no broadcasting

# Batch matrix multiply (A and B must be 3D here)
C = torch.bmm(A, B)          # 3D: (batch, m, k) @ (batch, k, n)

# Fused operations when available
# addmm: C = beta*bias + alpha*(A @ B) in a single call (beta = alpha = 1 by default)
C = torch.addmm(bias, A, B)
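As a quick check of what `torch.addmm` computes, the sketch below compares it with the unfused expression, first with the default coefficients and then with explicit `beta` and `alpha` (the shapes are arbitrary small examples):

```python
import torch

bias = torch.randn(32)     # broadcast across the 10 output rows
A = torch.randn(10, 64)
B = torch.randn(64, 32)

fused = torch.addmm(bias, A, B)     # 1.0 * bias + 1.0 * (A @ B)
unfused = bias + A @ B

scaled_fused = torch.addmm(bias, A, B, beta=0.5, alpha=2.0)
scaled_unfused = 0.5 * bias + 2.0 * (A @ B)

print(torch.allclose(fused, unfused, atol=1e-4))
print(torch.allclose(scaled_fused, scaled_unfused, atol=1e-4))
```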

Memory Management

python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Free GPU memory: drop the last reference, then release cached blocks
del large_tensor
torch.cuda.empty_cache()  # Returns unused cached memory to the driver

# Check GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Gradient checkpointing: trade compute for memory
class LargeModel(nn.Module):
    # layer1..layer3 are assumed to be defined in __init__ (not shown)
    def forward(self, x):
        # Recompute activations during backward instead of storing
        x = checkpoint(self.layer1, x, use_reentrant=False)
        x = checkpoint(self.layer2, x, use_reentrant=False)
        return self.layer3(x)
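Checkpointing changes when activations are computed, not the result. The following sketch (with a tiny `nn.Sequential` block as a stand-in for a large layer) runs the same backward pass with and without `checkpoint` and confirms that the input gradients agree:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
x = torch.randn(4, 16, requires_grad=True)

# Plain forward/backward: intermediate activations are stored
block(x).sum().backward()
plain_grad = x.grad.clone()

# Reset gradients before the second run
x.grad = None
block.zero_grad()

# Checkpointed forward/backward: activations inside block are recomputed
checkpoint(block, x, use_reentrant=False).sum().backward()
checkpoint_grad = x.grad.clone()

print(torch.allclose(plain_grad, checkpoint_grad))
```

The memory saving only becomes noticeable for blocks with large activations; for a model this small, the recomputation is pure overhead.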

torch.compile (PyTorch 2.0+)

python
# Compile model for optimized execution
model = torch.compile(model)

# Backend and mode options (alternatives; pick one call)
model = torch.compile(model, backend="inductor")      # Default backend, good for most cases
model = torch.compile(model, mode="reduce-overhead")   # Minimize overhead
model = torch.compile(model, mode="max-autotune")      # Maximum optimization (slow compile)

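torch.compile traces the model, fuses operations, optimizes memory access patterns, and with the default Inductor backend generates kernels automatically (Triton kernels on GPU, C++ on CPU). Compilation happens on the first call, so the first iteration is noticeably slower than the ones that follow.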

Common Pitfalls

  • CPU-GPU data transfer in loops: Moving small tensors between CPU and GPU in a loop kills performance. Batch operations on GPU and transfer only results back to CPU.
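  • Using Python lists instead of tensors: Collecting [tensor1, tensor2, ...] and calling torch.stack() at the end makes an extra copy of all the data. When the output size is known, pre-allocate with torch.zeros(n, ...) and write whole rows or batches into it. Avoid element-by-element index assignment, which runs one small operation per element and can be slower than a single torch.stack().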
  • .item() in training loops: loss.item() synchronizes CPU and GPU. Call it every N steps for logging, not every step.
  • Forgetting model.eval() during inference: Without eval(), batch norm and dropout still run in training mode, giving wrong results and wasting computation.
  • Using torch.tensor() inside a loop: Each call allocates a new tensor and copies the Python data into it, plus a host-to-device transfer if a GPU device is given. Create constants once outside the loop and reuse them.
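For the .item() pitfall, one common pattern is to accumulate the loss as a tensor on the device and convert it to a Python number only when logging. A minimal sketch, assuming a hypothetical helper `summarize_losses` and a list of precomputed scalar losses in place of a real training loop:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def summarize_losses(step_losses, log_every):
    """Return the mean loss of each window of log_every steps."""
    running = torch.zeros((), device=device)   # accumulator stays on the device
    logged = []
    for step, loss in enumerate(step_losses, start=1):
        running += loss.detach()               # no synchronization here
        if step % log_every == 0:
            logged.append(running.item() / log_every)  # sync only when logging
            running.zero_()
    return logged

# Stand-in losses 1.0 .. 6.0; a real loop would produce these from the model
step_losses = [torch.tensor(float(i), device=device) for i in range(1, 7)]
logged = summarize_losses(step_losses, log_every=3)
print(logged)  # [2.0, 5.0]
```

With `log_every=3`, the six steps trigger two synchronizations instead of six.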

Summary

  • Move tensors to GPU with .to(device); large operations can run 10-100x faster, depending on hardware and problem size
  • Use vectorized operations and broadcasting instead of Python loops
  • Use torch.einsum for readable, efficient multi-dimensional operations
  • Enable mixed precision (autocast + GradScaler) for up to roughly 2x training speed on GPUs with float16 hardware support
  • Set pin_memory=True and num_workers > 0 in DataLoader for faster data loading
  • Use torch.no_grad() during inference and torch.compile() (PyTorch 2.0+) for automatic optimization