Notes from the Wired

AdamW: Decoupled Weight Decay Regularization

October 27, 2025 | 436 words | 3min read

Paper Title: Decoupled Weight Decay Regularization

Link to Paper: https://arxiv.org/abs/1711.05101

Date: November 14, 2017

Paper Type: Optimizer, Learning Techniques, Deep Learning

Short Abstract: This paper introduces the AdamW optimizer, an improvement on the Adam optimizer that additionally incorporates weight decay.

1. Introduction

Adaptive gradient-based optimizers such as AdaGrad, RMSProp, and Adam have become the default choice for training feed-forward and recurrent neural networks. Still, state-of-the-art results on many image-classification benchmarks, such as CIFAR-10 and CIFAR-100, are often obtained with SGD with momentum.

Why is that?

The authors of this paper suggest that one reason lies in the way L₂ regularization and weight decay are used in conjunction with these optimizers. They show that a major reason SGD outperforms Adam is that L₂ regularization, commonly applied as a stand-in for weight decay in both methods, works much better with SGD. Adam, they argue, should instead use genuine (decoupled) weight decay.

Their analysis leads to the following key observations: L₂ regularization and weight decay are not identical for adaptive gradient methods, L₂ regularization is not effective in Adam, and decoupled weight decay is effective for both SGD and Adam.

2. Weight Decay and Decoupled Weight Decay

Weight decay works as follows:

$$ \theta_{t+1} = (1 - \lambda)\theta_t - \alpha \nabla f_t(\theta_t) $$

where \( \lambda \) defines the rate of weight decay per step, \( \alpha \) is the learning rate, and \( \nabla f_t(\theta_t) \) is the gradient of the loss at step \( t \).

For SGD, this is equivalent to L₂ regularization, but only if the L₂ coefficient is set to \( \lambda / \alpha \). The equivalence therefore couples the effective regularization strength to the learning rate, as the sketch below illustrates.
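To make the coupling concrete, here is a minimal numerical sketch (a toy example of my own, not code from the paper) checking that an SGD step with an L₂ coefficient of \( \lambda / \alpha \) matches an SGD step with weight decay \( \lambda \):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)   # current parameters theta_t
grad = rng.normal(size=5)    # gradient of the loss f_t at theta_t
alpha = 0.1                  # learning rate
lam = 0.01                   # weight decay rate per step

# SGD with weight decay: theta_{t+1} = (1 - lambda) * theta_t - alpha * grad
step_decay = (1 - lam) * theta - alpha * grad

# SGD with L2 regularization: the penalty gradient (lambda / alpha) * theta
# is added to the loss gradient before the usual update.
step_l2 = theta - alpha * (grad + (lam / alpha) * theta)

print(np.allclose(step_decay, step_l2))  # True
```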

The authors propose decoupled weight decay, where the decay is applied directly to the weights, separately from the gradient-based update. This decouples \( \lambda \) (weight decay) from \( \alpha \) (learning rate).
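Below is a minimal sketch of one update step, written as my own illustration rather than taken from the paper's code, contrasting Adam with an L₂ penalty against Adam with decoupled weight decay (AdamW). The function and argument names (adam_step, l2, decoupled_decay) are mine; the paper additionally multiplies both terms of the AdamW update by a schedule multiplier \( \eta_t \), which this sketch fixes at 1:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=0.0, decoupled_decay=0.0):
    """One Adam update; l2 > 0 mimics Adam + L2, decoupled_decay > 0 mimics AdamW."""
    if l2 > 0.0:
        # Adam + L2: the penalty gradient is folded into grad and will be
        # rescaled by the adaptive per-parameter step sizes below.
        grad = grad + l2 * theta
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled_decay > 0.0:
        # AdamW: the decay is applied to the weights outside the adaptive
        # update, so every weight shrinks at the same relative rate.
        theta = theta - decoupled_decay * theta
    return theta, m, v
```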

What is the intuitive difference between weight decay and L₂ regularization? With L₂ regularization, the penalty gradient is added to the loss gradient and, in Adam, gets rescaled by the adaptive per-parameter step sizes, so weights with historically large gradients end up being regularized less. Decoupled weight decay instead shrinks every weight at the same rate, regardless of its gradient history.

3. Experiments

In the paper's experiments on CIFAR-10 and a downsampled ImageNet, decoupling the weight decay makes the optimal decay factor much less dependent on the learning rate, and AdamW substantially closes Adam's generalization gap to SGD with momentum.

4. Conclusion

AdamW was shown, both through the paper's analysis and empirically, to outperform Adam with L₂ regularization, demonstrating improved generalization and a more separable hyperparameter search over learning rate and weight decay.
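In practice, decoupled weight decay is exposed directly in common frameworks; for example, PyTorch ships torch.optim.AdamW, whose weight_decay argument is the decoupled \( \lambda \) rather than an L₂ coefficient. The values below are arbitrary placeholders, not recommendations from the paper:

```python
import torch

model = torch.nn.Linear(10, 2)  # any model would do here
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```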
