Using torch.autograd.grad with Tensors in PyTorch

Introduction

torch.autograd.grad is the right tool when you need explicit gradients of tensors with respect to selected inputs, rather than populating .grad fields via loss.backward(). It is widely used in meta-learning, gradient penalties, physics-informed models, and higher-order optimization. Many errors arise from missing requires_grad, non-scalar outputs without grad_outputs, or forgetting create_graph=True when second derivatives are needed. This article explains when to use autograd.grad, how to structure calls safely, and how to avoid common graph-lifecycle mistakes.

Core Sections

1. Basic usage for scalar outputs

If the output is a scalar, no extra arguments are needed:

python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # scalar

(gx,) = torch.autograd.grad(y, x)
print(gx)  # tensor([4., 6.])

Unlike backward(), autograd.grad returns the gradients as a tuple and leaves x.grad untouched; if you want the result stored in x.grad, you have to assign it yourself.
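One way to see the difference is to inspect x.grad after each kind of call. The following is a minimal sketch on the same toy tensor; the names grad_field_after_grad and grad_field_after_backward are illustrative only.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

# autograd.grad returns the gradient and does not write to x.grad.
(gx,) = torch.autograd.grad((x ** 2).sum(), x)
grad_field_after_grad = x.grad  # still None at this point

# backward() returns nothing and accumulates into x.grad instead.
(x ** 2).sum().backward()
grad_field_after_backward = x.grad.clone()

print(gx, grad_field_after_grad, grad_field_after_backward)
```

Because nothing is accumulated, there is also no need to zero any .grad field between repeated autograd.grad calls.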

2. Non-scalar outputs require grad_outputs

For a non-scalar output, pass grad_outputs: an upstream gradient tensor with the same shape as the output.

python
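import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
out = x ** 3  # vector output, one entry per input element
v = torch.tensor([1.0, 1.0])  # upstream gradient, same shape as out

(gx,) = torch.autograd.grad(out, x, grad_outputs=v)
print(gx)  # tensor([3., 12.])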

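Think of this as a vector-Jacobian product: the result is v^T J, where J is the Jacobian of out with respect to x. Reverse-mode autograd computes this product directly without ever materializing J.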
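To check what grad_outputs does, the result can be compared against an explicitly built Jacobian. This is a small sketch, assuming a toy function f and a non-uniform vector v chosen only for illustration; torch.autograd.functional.jacobian is used here purely as a reference and would be wasteful for large inputs.

```python
import torch
from torch.autograd.functional import jacobian

def f(t):
    # Toy element-wise function (hypothetical, for illustration).
    return t ** 3

x = torch.tensor([1.0, 2.0], requires_grad=True)
v = torch.tensor([1.0, 0.5])

# Reverse mode: v is multiplied against the Jacobian from the left.
(gx,) = torch.autograd.grad(f(x), x, grad_outputs=v)

# Reference: full Jacobian J[i, j] = d f_i / d x_j, here diag(3 * x**2).
J = jacobian(f, x.detach())
vjp = v @ J

print(gx, vjp)
```

Both tensors should agree, which shows that autograd.grad weights each output's gradient by the matching entry of v and sums the results.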

3. Higher-order gradients

Set create_graph=True when the gradient itself needs to be differentiated.

python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 4

(g1,) = torch.autograd.grad(y, x, create_graph=True)  # dy/dx = 4 * x**3
(g2,) = torch.autograd.grad(g1, x)                     # d2y/dx2 = 12 * x**2
print(g1.item(), g2.item())  # 32.0 48.0

Without create_graph=True, the returned gradient carries no graph history, so a second call to autograd.grad on it raises a RuntimeError.
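The failure mode is easy to reproduce. The sketch below repeats the same computation without create_graph=True and records what happens; has_history and second_call_failed are illustrative names, and the exact error message may differ between PyTorch versions.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 4

# Without create_graph=True the result is a plain tensor with no history.
(g1,) = torch.autograd.grad(y, x)
has_history = g1.requires_grad  # False

try:
    torch.autograd.grad(g1, x)
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print("second derivative failed:", err)
```

Checking g1.requires_grad is a cheap way to confirm that a first-order gradient is still differentiable before building on it.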

4. Multiple inputs and optional gradients

You can differentiate with respect to several tensors at once:

python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)
z = a * b + b ** 2

ga, gb = torch.autograd.grad(z, (a, b))  # dz/da = b, dz/db = a + 2 * b
print(ga, gb)  # tensor(5.) tensor(12.)

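If an input may not contribute to the output, pass allow_unused=True. Otherwise autograd.grad raises an error for that input. With the flag set, the gradient for an unused input is returned as None, so check for None before doing arithmetic with it.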
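A minimal sketch of handling an unused input follows. Here b deliberately plays no part in z, and replacing None with zeros is one possible convention rather than the only reasonable one; gb_safe is an illustrative name.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)
z = a ** 2  # b does not take part in the computation

ga, gb = torch.autograd.grad(z, (a, b), allow_unused=True)

# gb is None here, so substitute zeros before using it numerically.
gb_safe = torch.zeros_like(b) if gb is None else gb
print(ga, gb, gb_safe)
```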

5. autograd.grad versus backward

Use backward for standard training loops where parameter grads should accumulate in .grad. Use autograd.grad for precise control and intermediate gradient computations.

For example, the gradient penalty in WGAN-GP is typically computed with autograd.grad: it takes the gradient of the critic's output with respect to interpolated inputs and penalizes the deviation of its norm from 1.
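As a rough sketch of that pattern, the function below computes a gradient penalty for a batch of 2-D inputs. The helper name gradient_penalty, the nn.Linear stand-in critic, and the tensor shapes are all assumptions made for illustration, not part of any particular WGAN-GP implementation.

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake):
    # Random interpolation between real and fake samples, one weight per sample.
    eps = torch.rand(real.size(0), 1)
    interp = (eps * real + (1 - eps) * fake).detach().requires_grad_(True)
    scores = critic(interp)

    # create_graph=True keeps the penalty differentiable w.r.t. critic weights.
    (grads,) = torch.autograd.grad(
        scores,
        interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,
    )
    # Penalize deviation of each sample's gradient norm from 1.
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

critic = nn.Linear(4, 1)  # stand-in critic, for illustration only
real, fake = torch.randn(8, 4), torch.randn(8, 4)
penalty = gradient_penalty(critic, real, fake)
penalty.backward()  # gradients of the penalty reach critic.weight.grad
```

Note the division of labor: autograd.grad produces the input gradients as an intermediate value, and backward() then handles the ordinary parameter update path.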

6. Memory and graph lifecycle management

Retaining the autograd graph keeps its saved intermediate tensors alive, so memory use can grow quickly. Avoid unnecessary retain_graph=True; use it only if the same graph is reused for multiple backward passes. Note that create_graph=True retains the graph by default as well. Drop references to large tensors promptly, and profile memory when enabling higher-order derivatives.
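The sketch below shows one way to scope retain_graph to exactly the passes that need it: the flag is set on every pass except the last, so the graph is freed as soon as it is no longer required. The variable names are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = (x ** 2).sum()

# First pass keeps the graph alive because a second pass is planned.
(g_first,) = torch.autograd.grad(y, x, retain_graph=True)
# Second pass is the last one, so the graph is allowed to be freed.
(g_second,) = torch.autograd.grad(y, x)

try:
    torch.autograd.grad(y, x)  # saved tensors were released by the second pass
    third_call_failed = False
except RuntimeError:
    third_call_failed = True

print(g_first, g_second, third_call_failed)
```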

Validation and production readiness

A reliable solution should include explicit validation and observability, not just a working snippet. Add representative test inputs for normal flow, malformed input, and boundary values so behavior is stable under change. Where timing or throughput matters, keep a small benchmark scenario and run it after refactors to catch accidental slowdowns early. If external systems are involved, include retry, timeout, and failure-path tests to verify the system degrades gracefully rather than hanging or failing silently.

Operationally, document assumptions close to the implementation: dependency versions, environment requirements, timezone or locale expectations, and any platform-specific behavior. Add structured logs for key decision points and failures so production incidents are diagnosable without reproducing every condition locally. For teams, define a minimal rollout checklist that covers backward compatibility, monitoring alerts, and rollback steps. These checks reduce incidents caused by integration gaps, which are more common than syntax errors in real deployments.

Common Pitfalls

  • Forgetting requires_grad=True on tensors you want gradients for.
  • Calling autograd.grad on vector outputs without grad_outputs.
  • Omitting create_graph=True when computing higher-order derivatives.
  • Misusing retain_graph=True, causing memory growth.
  • Expecting .grad fields to be populated automatically like backward.
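The first two pitfalls both surface as a RuntimeError, which makes them straightforward to reproduce and recognize. This is a small sketch with illustrative flag names; the exact error text may vary between PyTorch versions.

```python
import torch

# Pitfall: the input was created without requires_grad=True.
x = torch.tensor([2.0, 3.0])
try:
    torch.autograd.grad((x ** 2).sum(), x)
    missing_requires_grad_failed = False
except RuntimeError:
    missing_requires_grad_failed = True

# Fix: enable gradient tracking, then rebuild the computation.
x.requires_grad_(True)

# Pitfall: a vector output passed without grad_outputs.
try:
    torch.autograd.grad(x ** 2, x)
    missing_grad_outputs_failed = False
except RuntimeError:
    missing_grad_outputs_failed = True

# Fix: supply an upstream gradient with the same shape as the output.
out = x ** 2
(gx,) = torch.autograd.grad(out, x, grad_outputs=torch.ones_like(out))
print(gx)
```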

Summary

torch.autograd.grad provides fine-grained gradient control for advanced workflows. Use it for explicit derivative queries, vector-Jacobian products, and higher-order terms, while reserving backward() for routine parameter updates. Correct handling of grad_outputs, create_graph, and graph retention is essential for both correctness and memory stability. With these patterns, autograd becomes a reliable tool beyond standard training loops.
