What is the gradient of torch.floor() in PyTorch?

Introduction

The torch.floor() function rounds each element of a tensor down to the nearest integer. Its gradient is zero almost everywhere because the floor function is a step function — flat between integers with discontinuous jumps at integer values. PyTorch's autograd returns a gradient of zero for torch.floor(), which means gradients do not flow through it during backpropagation.

The Floor Function and Its Derivative

Mathematically, floor(x) is a piecewise constant function. Its derivative is:

  • 0 for all non-integer values of x (the function is flat)
  • Undefined at integer values (the function has jump discontinuities)

PyTorch defines the gradient as 0 everywhere, including at integer points:

python
import torch

x = torch.tensor([1.7, 2.0, 3.5, -0.3], requires_grad=True)
y = torch.floor(x)
y.sum().backward()

print(x.grad)  # tensor([0., 0., 0., 0.])

Why Zero Gradients Are a Problem

When torch.floor() appears in a computation graph, it blocks gradient flow along that path. Any parameter whose only route to the loss passes through a floor operation receives a zero gradient and cannot be updated by gradient descent:

python
import torch

x = torch.tensor([2.7], requires_grad=True)

# Gradient flows normally without floor
y1 = x * 3
y1.backward()
print(x.grad)  # tensor([3.])

# Reset the accumulated gradient before the second backward pass
x.grad.zero_()

# Floor kills the gradient
y2 = torch.floor(x) * 3
y2.backward()
print(x.grad)  # tensor([0.]): gradient is blocked

The same problem affects every rounding operation (floor, ceil, round, trunc), and discrete operations like argmax are not differentiable either.
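
The claim about the other rounding operations can be checked directly. The following is a minimal sketch; the input values and the `rounding_grads` dictionary are chosen only for illustration.

```python
import torch

x = torch.tensor([1.7, -0.3, 2.5], requires_grad=True)

# Each rounding op is piecewise constant, so autograd reports a zero gradient.
rounding_grads = {}
for name, op in [("floor", torch.floor), ("ceil", torch.ceil),
                 ("round", torch.round), ("trunc", torch.trunc)]:
    (grad,) = torch.autograd.grad(op(x).sum(), x)
    rounding_grads[name] = grad
    print(name, grad)  # tensor([0., 0., 0.]) for every op

# argmax returns integer indices, which are not attached to the graph at all.
idx = torch.argmax(x)
print(idx.dtype, idx.requires_grad)  # torch.int64 False
```

The rounding operations stay in the graph but contribute zeros, while argmax drops out of the graph entirely because its output is an integer tensor.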

Workaround 1: Straight-Through Estimator (STE)

The most common workaround is the Straight-Through Estimator, which uses the floor function in the forward pass but passes gradients through as if floor were the identity function:

python
import torch

class STEFloor(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.floor(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Pass gradient straight through (as if floor = identity)
        return grad_output

ste_floor = STEFloor.apply

x = torch.tensor([2.7], requires_grad=True)
y = ste_floor(x) * 3
y.backward()
print(x.grad)  # tensor([3.]): gradient flows through

The STE is widely used in quantization-aware training, binary neural networks, and discrete optimization.
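
To make the quantization use case concrete, here is a rough sketch of "fake quantization" built on a floor STE. The helper name `fake_quantize`, the scale of 0.1, and the int8-style range are assumptions made for this example, not a description of any particular library's quantization scheme.

```python
import torch

class STEFloor(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.floor(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # straight-through: treat floor as identity

def fake_quantize(w, scale=0.1, qmin=-128, qmax=127):
    # Hypothetical helper: snap w onto an integer grid, then map back to floats.
    q = STEFloor.apply(w / scale + 0.5)  # floor(v + 0.5) rounds to the nearest level
    q = torch.clamp(q, qmin, qmax)       # gradient is zero where clamping is active
    return q * scale

w = torch.tensor([0.234, -0.571, 20.0], requires_grad=True)
w_q = fake_quantize(w)
w_q.sum().backward()

print(w_q)     # approximately [0.2, -0.6, 12.7]
print(w.grad)  # approximately [1., 1., 0.]
```

The first two weights receive a gradient of about 1 even though they were snapped to the grid. The third weight saturates at the clamp limit, so its gradient is zero for a different reason: clamp, not floor.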

Workaround 2: Soft Floor Approximation

Replace the hard floor with a differentiable approximation:

python
import torch

def soft_floor(x, temperature=10.0):
    """Differentiable approximation of floor using a sigmoid step."""
    # floor(x) ≈ round(x) - 1 + sigmoid(temperature * (x - round(x)))
    # The step is centered on the nearest integer; a larger temperature
    # gives a sharper step that is closer to the true floor.
    nearest = torch.round(x).detach()
    return nearest - 1 + torch.sigmoid(temperature * (x - nearest))

# Simpler: keep the exact floor value and use a straight-through gradient
def ste_floor_simple(x):
    """Floor with straight-through gradient."""
    frac = x - torch.floor(x)
    # Forward value is floor(x); only the bare x term carries gradient.
    return x - frac.detach()

The same straight-through idea written as a single expression with detach(), along with a gradient check:

python
def floor_ste(x):
    """Floor in forward, identity in backward."""
    return x + (torch.floor(x) - x).detach()

x = torch.tensor([2.7], requires_grad=True)
y = floor_ste(x) * 3
y.backward()
print(x.grad)  # tensor([3.])
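
Unlike the STE, a soft floor trades accuracy for gradient signal. The sketch below illustrates that trade-off with one possible soft floor: a sigmoid step centered on the nearest integer. The function name `soft_floor_sketch` and this specific form are assumptions for illustration; other smooth approximations behave similarly.

```python
import torch

def soft_floor_sketch(x, temperature=10.0):
    # Assumed form: a sigmoid step centered on the nearest integer.
    nearest = torch.round(x).detach()
    return nearest - 1 + torch.sigmoid(temperature * (x - nearest))

x = torch.tensor([2.1, 2.5, 2.9], requires_grad=True)

for temperature in (5.0, 20.0, 80.0):
    y = soft_floor_sketch(x, temperature)
    (grad,) = torch.autograd.grad(y.sum(), x)
    # Higher temperature: values approach floor(x) = 2, gradients shrink
    # everywhere except very close to an integer.
    print(temperature, y.detach(), grad)
```

A low temperature gives usable gradients but a poor approximation of the floor; a high temperature gives a near-exact floor whose gradient is close to zero away from integers. The STE avoids this tuning at the cost of a biased gradient.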

Workaround 3: Gumbel-Softmax for Categorical Outputs

If you use floor to create discrete categories, consider Gumbel-Softmax instead:

python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)

# Differentiable approximation of categorical sampling
soft_sample = F.gumbel_softmax(logits, tau=0.5, hard=False)
# hard=True gives one-hot in forward, soft gradients in backward
hard_sample = F.gumbel_softmax(logits, tau=0.5, hard=True)
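
As a quick check that gradients reach the logits even with a one-hot forward pass, the sketch below uses the hard sample to select from a vector of per-category values. The `values` tensor and the seed are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # sampling is random; the seed is only for repeatability
logits = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
values = torch.tensor([10.0, 20.0, 30.0])  # hypothetical per-category payoffs

hard_sample = F.gumbel_softmax(logits, tau=0.5, hard=True)
picked = (hard_sample * values).sum()  # selects one payoff in the forward pass
picked.backward()

print(hard_sample)  # a one-hot vector; which entry is 1 depends on the sample
print(logits.grad)  # populated: gradients come from the underlying soft sample
```

The forward pass makes a discrete choice, yet `logits.grad` is filled in, which is the same forward/backward split the STE uses for floor.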

Real-World Use Cases

| Use Case | Why Floor Is Needed | Workaround |
|---|---|---|
| Quantization-aware training | Discretize weights to int8/int4 | STE |
| Pixel coordinate mapping | Map continuous coords to pixel grid | STE or bilinear interpolation |
| Binning/histograms | Assign values to bins | Soft binning with sigmoid |
| Integer arithmetic in networks | Enforce integer constraints | STE + clamp |
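
For the binning row, one way to sketch "soft binning with sigmoid" is to approximate each bin's indicator function as a difference of two sigmoids. The helper `soft_bin_membership`, the bin edges, and the temperature below are assumptions for illustration only.

```python
import torch

def soft_bin_membership(x, edges, temperature=20.0):
    # Hypothetical helper: soft membership of each value in each bin
    # [edges[i], edges[i+1]). A difference of two sigmoids approximates
    # the bin's 0/1 indicator function.
    x = x.unsqueeze(-1)
    left = torch.sigmoid(temperature * (x - edges[:-1]))
    right = torch.sigmoid(temperature * (x - edges[1:]))
    return left - right

edges = torch.tensor([0.0, 1.0, 2.0, 3.0])
x = torch.tensor([0.5, 1.5, 2.9], requires_grad=True)

membership = soft_bin_membership(x, edges)  # shape: (3 values, 3 bins)
soft_hist = membership.sum(dim=0)           # differentiable histogram counts
soft_hist[2].backward()                     # gradient of the last bin's count

print(membership.detach())
print(x.grad)
```

With unit-width bins starting at 0, the most likely bin for each value matches floor(x), but the count in each bin now responds smoothly when a value moves toward a bin edge.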

Common Pitfalls

  • Discontinuity: torch.floor() is discontinuous at integers and has zero gradient everywhere else. If every path from a parameter to the loss passes through floor, training on that parameter stalls completely.
  • Gradient flow: In a larger model, trace which parameters sit upstream of a floor. Parameters with another path to the loss still train; those without one silently stop updating.
  • STE bias: The straight-through estimator introduces bias because the forward and backward passes use different functions. This can cause training instability with large learning rates.
  • Numerical edge cases: Values very close to an integer may floor differently than expected because of floating-point representation. In float32, for example, 2.9999999 is stored as exactly 3.0, so it floors to 3 rather than 2. Add a small epsilon offset if needed.
  • Double backward: A custom autograd function supports higher-order gradients only if its backward is itself written with differentiable tensor operations. Verify the behavior explicitly if you need second derivatives through an STE.
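
The numerical edge case above is easy to reproduce. This short sketch compares the same literal in float32 and float64; the variable names are chosen only for illustration.

```python
import torch

near_three = 2.9999999  # differs from 3 by 1e-7

as_float32 = torch.tensor(near_three, dtype=torch.float32)
as_float64 = torch.tensor(near_three, dtype=torch.float64)

# float32 cannot represent this value: the spacing between floats near 3 is
# about 2.4e-7, so it is stored as exactly 3.0 before floor ever runs.
print(torch.floor(as_float32))  # tensor(3.)
print(torch.floor(as_float64))  # tensor(2., dtype=torch.float64)
```

The floor itself is exact in both cases; the difference comes entirely from how the input was rounded when it was stored.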

Summary

  • torch.floor() has a gradient of zero everywhere — it blocks backpropagation
  • Use the Straight-Through Estimator (STE) to pass gradients through floor in the backward pass
  • The x + (torch.floor(x) - x).detach() pattern is the simplest STE implementation
  • For categorical/discrete outputs, consider Gumbel-Softmax instead of floor
  • All rounding operations (ceil, round, trunc) have the same zero-gradient issue
