Loss Scaling Free - Updated

For years, the solution to this instability was loss scaling. If you have ever trained a model in FP16, you’ve likely tweaked a "loss scaling factor," agonizing over whether to set it to a static value or let the training framework adjust it dynamically.

To understand why we needed loss scaling in the first place, we have to look at the IEEE 754 standard for floating-point numbers.
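To make the range problem concrete, here is a minimal sketch that rounds ordinary Python floats to IEEE 754 half precision using the standard library's `struct` module. The helper name `to_fp16` and the sample values are illustrative assumptions, not part of any framework.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 range; hardware would produce inf.
        return float("inf") if x > 0 else float("-inf")

tiny_grad = 1e-8        # smaller than FP16's smallest subnormal (about 5.96e-8)
large_value = 70000.0   # larger than FP16's largest finite value (65504)

print(to_fp16(tiny_grad))    # the value is flushed to zero (underflow)
print(to_fp16(large_value))  # the value becomes infinity (overflow)
print(to_fp16(0.1))          # representable, but only approximately
```

Small gradients are the usual casualty: anything that rounds to zero here contributes nothing to the weight update.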

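To combat underflow, engineers introduced loss scaling. The concept is simple: multiply the loss by a large constant before the backward pass so that small gradients land inside FP16's representable range, then divide the gradients by the same constant before the optimizer step.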
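The following sketch illustrates loss scaling as it is commonly implemented: by the chain rule, multiplying the loss by a factor multiplies every gradient by the same factor, so a gradient that would underflow in FP16 survives and can be unscaled in higher precision. The gradient value, the scale of 1024, and the `to_fp16` helper are assumptions chosen for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

true_grad = 1e-8   # hypothetical gradient, too small for FP16
scale = 1024.0     # illustrative factor; a power of two keeps the scaling exact

unscaled = to_fp16(true_grad)        # stored directly in FP16: lost
scaled = to_fp16(true_grad * scale)  # gradient of the scaled loss: survives
recovered = scaled / scale           # unscale in higher precision before the update

print(unscaled)   # zero, the gradient has vanished
print(recovered)  # close to the true gradient, up to FP16 rounding error
```

The cost is the tuning burden: too small a factor still underflows, while too large a factor pushes big gradients past 65504 and produces infinities.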

| Format | Exponent Bits | Mantissa Bits | Dynamic Range (approx) |
|--------|---------------|---------------|------------------------|
| FP16 | 5 | 10 | 5.96e-8 to 65504 |
| BF16 | 8 | 7 | 1.18e-38 to 3.4e38 |
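The ranges in the table follow directly from the bit widths. The sketch below derives them with the standard IEEE 754 style formulas; the function name `format_range` is hypothetical. Note that the FP16 lower bound in the table is the smallest subnormal, while the BF16 lower bound is the smallest normal value.

```python
def format_range(exponent_bits: int, mantissa_bits: int) -> dict:
    """Derive the representable range of a binary float format from its bit widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    # Largest finite value: all mantissa bits set, largest non-reserved exponent.
    max_finite = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    # Smallest normal value: mantissa 1.0 with the smallest non-reserved exponent.
    min_normal = 2.0 ** (1 - bias)
    # Smallest subnormal value: only the lowest mantissa bit set.
    min_subnormal = min_normal * 2.0 ** -mantissa_bits
    return {
        "max_finite": max_finite,
        "min_normal": min_normal,
        "min_subnormal": min_subnormal,
    }

fp16 = format_range(exponent_bits=5, mantissa_bits=10)
bf16 = format_range(exponent_bits=8, mantissa_bits=7)

print("FP16:", fp16)
print("BF16:", bf16)
```

BF16 trades three mantissa bits for three exponent bits, which is why it covers roughly the same range as FP32 at lower precision.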

    # Define the model
    model = nn.Sequential([...])

Loss scaling free training is not magic: it’s BF16 replacing FP16. If your hardware supports BF16, you should default to loss scaling free training. If you are stuck with FP16, keep the scaler. The future is BF16/FP8 with per-block scaling, making manual loss scaling a relic for most users.
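The decision rule above can be sketched as a small helper. `PrecisionPlan` and `choose_precision` are hypothetical names introduced here; in PyTorch, the `bf16_supported` flag could plausibly come from `torch.cuda.is_bf16_supported()`, but that wiring is an assumption and is left out so the example runs without a framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PrecisionPlan:
    dtype: str             # reduced-precision format to train in
    use_loss_scaler: bool  # whether a loss scaler should wrap the backward pass

def choose_precision(bf16_supported: bool) -> PrecisionPlan:
    """Pick a mixed-precision setup following the rule of thumb in the text."""
    if bf16_supported:
        # BF16 has roughly FP32's exponent range, so no loss scaling is needed.
        return PrecisionPlan(dtype="bfloat16", use_loss_scaler=False)
    # FP16's narrow range still needs a scaler to avoid gradient underflow.
    return PrecisionPlan(dtype="float16", use_loss_scaler=True)

print(choose_precision(bf16_supported=True))
print(choose_precision(bf16_supported=False))
```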


During training, the loss value of a neural network can vary greatly, especially when using large batch sizes or complex models. This can cause issues with the gradients, leading to: