Adam: A Method for Stochastic Optimization
October 27, 2025 | 330 words | 2min read
Paper Title: Adam: A Method for Stochastic Optimization
Link to Paper: https://arxiv.org/abs/1412.6980
Date: 22 Dec 2014
Paper Type: Optimizer, Learning Techniques, Deep Learning
Short Abstract: In this paper, the famous Adam optimizer is introduced: a first-order, gradient-based optimization method based on adaptive estimates of lower-order moments of the gradient. It generalizes well across different architectures and tasks and outperforms many earlier optimizers.
1. Introduction
Many problems in science and engineering can be formulated as the optimization of a scalar objective function, i.e., as its maximization or minimization.
If this objective function is differentiable, then gradient descent can be used, which is relatively efficient since it only requires first-order derivatives.
Many of these functions can also be stochastic — for example, they might be defined on subsets (mini-batches) of data. In such cases, stochastic gradient descent (SGD) can be used for improved efficiency. The objective function can also be noisy, such as when using techniques like dropout.
This paper focuses on high-dimensional, stochastic, differentiable objective functions, where higher-order optimization methods perform poorly due to the curse of dimensionality.
In light of this, the paper proposes Adam, a first-order optimization method that requires little memory. The method computes individual adaptive learning rates for each parameter from estimates of the first and second moments of the gradients. As such, it combines the advantages of three earlier methods (sketched in code after this list):
- RMSprop: per-parameter learning rates based on a running average of squared gradients
- Adagrad: adapting the learning rate using past gradients, which works well for sparse gradients
- Momentum: accelerating convergence by smoothing the gradient with a moving average
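To make the combination concrete, here is a rough NumPy sketch of the two update rules Adam inherits; the function names and hyperparameter values are illustrative, not from the paper.

```python
import numpy as np

def momentum_step(theta, grad, m, lr=0.01, beta=0.9):
    """Momentum: smooth the gradient with an exponential moving average."""
    m = beta * m + (1 - beta) * grad
    return theta - lr * m, m

def rmsprop_step(theta, grad, v, lr=0.001, beta=0.9, eps=1e-8):
    """RMSprop: scale each parameter's step by a running average of squared gradients."""
    v = beta * v + (1 - beta) * grad**2
    return theta - lr * grad / (np.sqrt(v) + eps), v
```

Adam runs both moving averages at once, using the first to choose the step direction and the second to scale the step per parameter.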
2. Method
The algorithm maintains exponential moving averages $m_t$ of the gradient and $v_t$ of the squared gradient, where the hyperparameters $\beta_1, \beta_2 \in [0, 1)$ control the exponential decay rates of these moving averages.
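Written out, the update rules from the paper (Algorithm 1) are

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon),
\end{aligned}
$$

where $g_t$ is the gradient at step $t$. Because $m_0 = v_0 = 0$, the raw averages are biased toward zero early in training, which the bias-corrected estimates $\hat{m}_t$ and $\hat{v}_t$ compensate for. A minimal NumPy sketch of a single update step, using the paper's default hyperparameters (not the authors' implementation):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t starts at 1), following Algorithm 1 of the paper."""
    m = beta1 * m + (1 - beta1) * grad        # EMA of the gradient
    v = beta2 * v + (1 - beta2) * grad**2     # EMA of the squared gradient
    m_hat = m / (1 - beta1**t)                # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```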

3. Experiments
To empirically evaluate their optimizer, the authors tested it on different model classes, e.g., logistic regression, multilayer neural networks, and convolutional neural networks, and on datasets such as MNIST, IMDB movie reviews (sentiment classification), and CIFAR-10.
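As a minimal sketch of this kind of experiment (an assumption on my part, using today's PyTorch API rather than the authors' original code), logistic regression on MNIST trained with Adam looks like this:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# MNIST digits, used as flattened 784-dimensional vectors
train_data = datasets.MNIST(root="data", train=True, download=True,
                            transform=transforms.ToTensor())
loader = DataLoader(train_data, batch_size=128, shuffle=True)

model = nn.Linear(28 * 28, 10)  # logistic regression: one linear layer + softmax loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # paper's default betas/eps
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for images, labels in loader:
        logits = model(images.view(images.size(0), -1))
        loss = loss_fn(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```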


4. Conclusion
The introduced Adam optimizer is computationally efficient and performs well: its convergence is analyzed theoretically via a regret bound, and its practical effectiveness is demonstrated empirically.