Introduction
Efficient tensor computation in PyTorch means using GPU acceleration, avoiding unnecessary copies, leveraging vectorized operations over Python loops, and managing memory carefully. The biggest performance wins come from moving tensors to GPU with .to(device), replacing explicit loops with broadcasting and torch.einsum, using in-place operations where safe, and enabling mixed-precision training. Understanding these patterns is essential for training large models in reasonable time.
Move Tensors to GPU
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create tensor directly on GPU
x = torch.randn(1000, 1000, device=device)

# Move existing tensor to GPU
y = torch.randn(1000, 1000)
y = y.to(device)  # Returns a new tensor on GPU

# All operations on GPU tensors run on GPU
z = x @ y  # Matrix multiplication on GPU, much faster for large inputs
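For large tensor operations, GPU computation is often 10-100x faster than CPU thanks to massive parallelism. The actual speedup depends on the hardware, tensor size, and data type, and small tensors may see no gain because of kernel-launch and transfer overhead.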
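Speedup figures like these are easy to mis-measure, because CUDA kernels launch asynchronously and a naive timer can stop before the work is done. The following is a minimal sketch of a fairer measurement; the helper name time_matmul and the sizes are illustrative choices, not part of the original example.

```python
import time
import torch

def time_matmul(device: torch.device, n: int = 512, repeats: int = 5) -> float:
    """Return the average seconds per (n x n) matmul on `device`."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    _ = a @ b  # warm-up run so one-time setup cost is not measured
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        _ = a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the timed kernels to finish
    return (time.perf_counter() - start) / repeats

cpu_time = time_matmul(torch.device("cpu"))
print(f"CPU: {cpu_time:.6f}s per matmul")
if torch.cuda.is_available():
    gpu_time = time_matmul(torch.device("cuda"))
    print(f"GPU: {gpu_time:.6f}s per matmul")
```

Without the torch.cuda.synchronize() calls, the GPU number would mostly reflect how long it takes to queue the kernels rather than to run them.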
Vectorized Operations Over Loops
# SLOW: Python loop
import time

x = torch.randn(10000)
start = time.time()
result = torch.zeros(10000)
for i in range(10000):
    result[i] = x[i] ** 2 + x[i] * 3
print(f"Loop: {time.time() - start:.4f}s")  # ~0.05s

# FAST: vectorized
start = time.time()
result = x ** 2 + x * 3
print(f"Vectorized: {time.time() - start:.4f}s")  # ~0.0001s
Vectorized operations dispatch to optimized C++/CUDA kernels. Python loops add interpreter overhead per element.
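Loops that contain a branch can usually be vectorized as well. As a small sketch (the piecewise function here is made up for illustration), torch.where evaluates both branches for the whole tensor and then selects per element:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

# Loop version: one Python-level branch per element
looped = torch.empty_like(x)
for i in range(x.numel()):
    looped[i] = x[i] ** 2 if x[i] > 0 else -x[i]

# Vectorized version: both branches are computed for the whole tensor,
# then torch.where picks one value per element based on the mask
vectorized = torch.where(x > 0, x ** 2, -x)

print(torch.equal(looped, vectorized))
```

Computing both branches does some redundant arithmetic, but for simple element-wise expressions that is typically far cheaper than per-element interpreter overhead.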
Broadcasting
Broadcasting automatically expands tensors to compatible shapes without copying data:
# Add a bias vector to every row of a matrix
matrix = torch.randn(1000, 512)  # (1000, 512)
bias = torch.randn(512)          # (512,)
result = matrix + bias           # Broadcasting: (1000, 512) + (512,) → (1000, 512)

# No need to expand manually; broadcasting handles it.
# expand() returns a view, so this is equivalent but more verbose:
# result = matrix + bias.unsqueeze(0).expand(1000, 512)
# repeat() materializes a full copy, so this is slower and uses more memory:
# result = matrix + bias.unsqueeze(0).repeat(1000, 1)
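The "without copying data" behavior can be checked directly. This short sketch compares expand(), which returns a view, with repeat(), which materializes a copy; the variable names are illustrative.

```python
import torch

bias = torch.randn(512)

expanded = bias.unsqueeze(0).expand(1000, 512)  # view: no data copied
repeated = bias.unsqueeze(0).repeat(1000, 1)    # real copy of every row

# A stride of 0 along dim 0 means every row aliases the same memory
print(expanded.stride())
print(expanded.data_ptr() == bias.data_ptr())   # shares storage with bias
print(repeated.data_ptr() == bias.data_ptr())   # owns separate storage
```

Broadcasting in an expression such as matrix + bias behaves like the expand() case: the smaller operand is never physically replicated.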
torch.einsum for Complex Operations
# Batch matrix multiplication
A = torch.randn(32, 100, 64)   # (batch, seq, hidden)
B = torch.randn(32, 64, 128)   # (batch, hidden, out)

# Using einsum: clear and efficient
result = torch.einsum('bij,bjk->bik', A, B)  # (32, 100, 128)

# Equivalent but less readable
result = torch.bmm(A, B)

# Attention score computation
queries = torch.randn(32, 8, 100, 64)  # (batch, heads, seq, dim)
keys = torch.randn(32, 8, 100, 64)

scores = torch.einsum('bhqd,bhkd->bhqk', queries, keys)
# (32, 8, 100, 100): attention matrix
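einsum typically lowers each contraction to batched matrix multiplication calls (BLAS on CPU, cuBLAS on GPU), so its performance is usually comparable to writing the permutes and bmm by hand while keeping the index pattern explicit.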
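To make the index notation concrete, the sketch below checks both einsum patterns from this section against their explicit equivalents on small, arbitrarily chosen shapes.

```python
import torch

torch.manual_seed(0)

# 'bij,bjk->bik' contracts over j for each batch b: a batched matmul
A = torch.randn(4, 10, 6)
B = torch.randn(4, 6, 8)
via_einsum = torch.einsum('bij,bjk->bik', A, B)
via_bmm = torch.bmm(A, B)

# 'bhqd,bhkd->bhqk' contracts over d: queries times keys transposed
q = torch.randn(4, 2, 10, 6)
k = torch.randn(4, 2, 10, 6)
scores_einsum = torch.einsum('bhqd,bhkd->bhqk', q, k)
scores_matmul = q @ k.transpose(-2, -1)

print(torch.allclose(via_einsum, via_bmm, atol=1e-5))
print(torch.allclose(scores_einsum, scores_matmul, atol=1e-5))
```

Any index that appears in the inputs but not after the arrow is summed over; indices that appear on both sides are kept.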
In-Place Operations
# Out-of-place: creates a new tensor (more memory)
x = x + 1
x = torch.relu(x)

# In-place: modifies tensor directly (saves memory)
x.add_(1)        # x += 1
x.relu_()        # x = relu(x)
x.clamp_(min=0)  # x = clamp(x, min=0)
x.mul_(0.5)      # x *= 0.5
In-place operations (trailing _) save memory by not allocating a new tensor. However, they can break autograd if the tensor is needed for gradient computation. Use them primarily in inference or for tensors not requiring gradients.
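The autograd caveat is easiest to see in a failing case. In this sketch, exp() saves its output for the backward pass, so modifying that output in place makes backward() raise an error; the tensor names are illustrative.

```python
import torch

a = torch.randn(3, requires_grad=True)
b = a.exp()   # exp saves its output to compute the gradient later
b.add_(1)     # in-place edit overwrites that saved output

try:
    b.sum().backward()
    failed = False
except RuntimeError as err:
    failed = True
    print("autograd rejected the in-place edit:", type(err).__name__)

# Out-of-place alternative: the saved output of exp stays intact
a2 = torch.randn(3, requires_grad=True)
c = a2.exp() + 1
c.sum().backward()
print(a2.grad)
```

PyTorch detects the modification through a version counter on the tensor, so the failure is an explicit error rather than a silently wrong gradient.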
Mixed Precision Training
from torch.cuda.amp import autocast, GradScaler

model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()

for batch in dataloader:
    inputs, targets = batch[0].to(device), batch[1].to(device)

    optimizer.zero_grad()

    # Forward pass under autocast: eligible ops run in float16,
    # which is often faster and uses less activation memory
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Backward pass with gradient scaling to avoid float16 underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
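Mixed precision runs eligible operations (such as matrix multiplications and convolutions) in float16 during the forward and backward passes, while the weights and their updates stay in float32. GradScaler multiplies the loss by a scale factor so that small gradients do not underflow in float16, then unscales them before the optimizer step. On GPUs with Tensor Cores this can roughly double training speed; the gain depends on the model and hardware.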
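A quick way to see what autocast does is to inspect dtypes. The sketch below uses torch.autocast and falls back to bfloat16 on CPU when no GPU is present; that fallback is an assumption made so the example runs anywhere, not part of the training loop above.

```python
import torch

if torch.cuda.is_available():
    device_type, low_dtype = "cuda", torch.float16
else:
    # Assumption for portability: CPU autocast is commonly used with bfloat16
    device_type, low_dtype = "cpu", torch.bfloat16

weight = torch.randn(64, 32, device=device_type)
inputs = torch.randn(8, 64, device=device_type)

with torch.autocast(device_type=device_type, dtype=low_dtype):
    out = inputs @ weight  # matmul is on autocast's lower-precision list

print(weight.dtype)  # the stored tensor is left in float32
print(out.dtype)     # the matmul result is in the lower-precision dtype
```

Autocast casts inputs per operation and leaves the stored tensors untouched, which is why the optimizer still updates float32 weights.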
Efficient Data Loading
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,            # Parallel data loading
    pin_memory=True,          # Faster CPU→GPU transfer
    persistent_workers=True,  # Keep workers alive between epochs
    prefetch_factor=2,        # Prefetch 2 batches per worker
)

# Non-blocking transfer
for batch in dataloader:
    inputs = batch[0].to(device, non_blocking=True)
    targets = batch[1].to(device, non_blocking=True)
    # GPU processes previous batch while next batch transfers
pin_memory=True + non_blocking=True enables asynchronous CPU-to-GPU transfers, overlapping data loading with computation.
Avoid Unnecessary Gradient Tracking
# During inference, disable gradient computation
with torch.no_grad():
    predictions = model(test_input)
    # Faster and uses less memory: no gradient graph is built

# For specific tensors that never need gradients
# (requires_grad=False is already the default for new tensors)
embedding = torch.randn(10000, 300, requires_grad=False)

# Detach tensors from computation graph when storing
cached_features = model.encode(x).detach()
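The effect of disabling gradient tracking shows up on the output tensors. This sketch uses a small stand-in nn.Linear model (an assumption for illustration) and also shows torch.inference_mode, a stricter variant of no_grad.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # stand-in model for illustration
x = torch.randn(2, 16)

tracked = model(x)  # normal forward pass: builds a graph
with torch.no_grad():
    untracked = model(x)  # no graph is recorded
with torch.inference_mode():
    inference_out = model(x)  # also skips autograd bookkeeping

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn)
print(inference_out.is_inference())
```

Tensors created under inference_mode cannot later be used in operations that autograd records, so no_grad remains the safer choice when outputs may feed back into training code.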
Efficient Matrix Operations
# Use torch.matmul or @ for matrix multiplication
C = A @ B               # Recommended
C = torch.matmul(A, B)  # Equivalent
C = torch.mm(A, B)      # 2D only, no broadcasting

# Batch matrix multiply (here A and B must be 3D)
C = torch.bmm(A, B)     # 3D: (batch, m, k) @ (batch, k, n)

# Fused operations when available
# addmm: out = beta*bias + alpha*(A @ B) in a single call (2D A and B)
C = torch.addmm(bias, A, B)
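As a check on what addmm computes, the following sketch compares it with the unfused expression on small, arbitrarily chosen shapes, including the optional beta and alpha scale factors.

```python
import torch

torch.manual_seed(0)
bias = torch.randn(5)   # broadcast across the rows of A @ B
A = torch.randn(3, 4)
B = torch.randn(4, 5)

fused = torch.addmm(bias, A, B)   # beta=1, alpha=1 by default
unfused = bias + A @ B

# beta scales the added tensor, alpha scales the matrix product
scaled = torch.addmm(bias, A, B, beta=0.5, alpha=2.0)
scaled_reference = 0.5 * bias + 2.0 * (A @ B)

print(torch.allclose(fused, unfused, atol=1e-6))
print(torch.allclose(scaled, scaled_reference, atol=1e-6))
```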
Memory Management
import torch.nn as nn

# Free GPU memory
del large_tensor
torch.cuda.empty_cache()  # Releases unused cached blocks back to the driver

# Check GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Gradient checkpointing: trade compute for memory
from torch.utils.checkpoint import checkpoint

class LargeModel(nn.Module):
    # layer1, layer2, and layer3 are assumed to be defined in __init__
    def forward(self, x):
        # Recompute activations during backward instead of storing
        x = checkpoint(self.layer1, x, use_reentrant=False)
        x = checkpoint(self.layer2, x, use_reentrant=False)
        return self.layer3(x)
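Checkpointing changes when activations are computed, not what the gradients are. The sketch below checks this on a small stand-in block (an assumption for illustration), using the non-reentrant variant selected by use_reentrant=False.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 8))
x = torch.randn(4, 8, requires_grad=True)

# Regular forward/backward: intermediate activations are stored
block(x).sum().backward()
plain_grad = x.grad.clone()

# Reset gradients before the second run
x.grad = None
block.zero_grad()

# Checkpointed forward/backward: activations inside `block` are
# recomputed during backward instead of being kept in memory
checkpoint(block, x, use_reentrant=False).sum().backward()
checkpointed_grad = x.grad.clone()

print(torch.allclose(plain_grad, checkpointed_grad))
```

The memory saving only becomes noticeable for large blocks, and each checkpointed segment costs one extra forward pass during backward.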
torch.compile (PyTorch 2.0+)
# Compile model for optimized execution
model = torch.compile(model)

# Backend and mode options
model = torch.compile(model, backend="inductor")      # Default, good for most cases
model = torch.compile(model, mode="reduce-overhead")  # Minimize per-call overhead
model = torch.compile(model, mode="max-autotune")     # Maximum optimization (slow compile)
torch.compile traces the model, fuses operations, optimizes memory access patterns, and generates optimized kernels automatically. The first call is slower because compilation happens at that point.
Common Pitfalls
CPU-GPU data transfer in loops: Moving small tensors between CPU and GPU in a loop kills performance. Batch operations on GPU and transfer only results back to CPU.
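Using Python lists instead of tensors: Appending tensors to a list ([tensor1, tensor2, ...]) and calling torch.stack() at the end allocates every piece separately and then copies them all again. When the final shape is known, pre-allocate with torch.zeros(n, ...) and write into slices, or better, compute the whole result with one vectorized operation.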
.item() in training loops: loss.item() synchronizes CPU and GPU. Call it every N steps for logging, not every step.
Forgetting model.eval() during inference: Without eval(), batch norm and dropout still run in training mode, giving wrong results and wasting computation.
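Using torch.tensor() inside a loop: Each call allocates a new tensor and copies the data from Python objects, plus a host-to-device transfer if a GPU device is specified. Pre-allocate outside the loop or build the tensor once from the full data.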
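The .item() pitfall above can be avoided by accumulating the loss on the device and synchronizing only when logging. This is a minimal sketch: log_every is a hypothetical interval, and a random tensor stands in for a real training loss.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
log_every = 10  # hypothetical logging interval
running_loss = torch.zeros((), device=device)
logged = []

for step in range(1, 31):
    loss = torch.rand((), device=device)  # stand-in for a real training loss
    running_loss += loss.detach()         # stays on the device, no sync
    if step % log_every == 0:
        # .item() is the only CPU-GPU synchronization point
        logged.append(running_loss.item() / log_every)
        running_loss.zero_()

print(logged)
```

The same pattern applies to metrics such as accuracy: keep the running totals as tensors and convert them to Python numbers only when they are reported.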
Summary
Move tensors to GPU with .to(device); large operations often run 10-100x faster, depending on hardware and tensor size
Use vectorized operations and broadcasting instead of Python loops
Use torch.einsum for readable, efficient multi-dimensional operations
Enable mixed precision (autocast + GradScaler) for up to roughly 2x training speed on GPUs with Tensor Cores
Set pin_memory=True and num_workers > 0 in DataLoader for faster data loading
Use torch.no_grad() during inference and torch.compile() (PyTorch 2.0+) for automatic optimization