Adam: The Optimization Algorithm That Made LLMs Practical
Before Adam: Why Training Neural Networks Was So Difficult
Imagine you're hiking down a mountain in dense fog. You can only see the slope directly beneath your feet. The obvious strategy is simple: take one step downhill. This is essentially what gradient descent does.
For a neural network, the gradient tells us which direction reduces the loss. The update rule is simply:
θ = θ - η ∇L
Where:
- θ = model parameters
- η = learning rate
- ∇L = gradient of the loss with respect to the parameters
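To make the rule concrete, here is a minimal sketch in plain Python. The toy loss L(θ) = (θ - 3)², the function names, and the learning rate of 0.1 are all assumptions chosen for illustration, not values from any real training setup.

```python
def loss_grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2 (an assumed example)."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, lr=0.1, steps=50):
    """Repeatedly apply theta = theta - lr * gradient."""
    for _ in range(steps):
        theta = theta - lr * loss_grad(theta)
    return theta


theta_final = gradient_descent(theta=0.0)
print(theta_final)  # ends up close to the minimum at theta = 3
```

On this smooth one-dimensional bowl, the loop settles near the minimum without any trouble.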
Simple. Unfortunately, the loss landscapes of real neural networks don't resemble smooth mountains. Instead, they're more like:
- Steep cliffs
- Long narrow valleys
- Flat plateaus
- Noisy terrain
- Millions, or billions, of dimensions
The gradient also changes from one mini-batch to the next, because we compute it on only a small sample of the data at a time. This is stochastic gradient descent (SGD). Instead of walking smoothly downhill, it's like trying to descend while someone randomly shakes the mountain beneath you.
Momentum: Giving Optimization Some Inertia
Researchers realized that humans don't instantly stop and change direction every step. Instead, we build momentum. Optimization borrowed the same idea. Instead of only following today's gradient, momentum also remembers previous gradients.
Imagine pushing a heavy shopping cart. If you push in roughly the same direction repeatedly, it builds speed. Small bumps don't immediately change its path.
Mathematically:
v_t = β v_{t-1} + (1 - β) g_t
Where:
- g_t = today's gradient
- v_t = the accumulated velocity
- β = how much of the previous velocity is kept (between 0 and 1)
Now updates become:
θ = θ - η v_t
This smooths noisy updates and helps escape shallow local irregularities. Momentum was already a huge improvement. But another problem remained.
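A small simulation can illustrate the smoothing effect. This is a sketch under stated assumptions: the "true" gradient of 1.0, the Gaussian noise standing in for mini-batch noise, and β = 0.9 are made-up values for illustration only.

```python
import random
import statistics


def momentum_velocities(gradients, beta=0.9):
    """Return every v_t, where v_t = beta * v_{t-1} + (1 - beta) * g_t."""
    v = 0.0
    velocities = []
    for g in gradients:
        v = beta * v + (1.0 - beta) * g
        velocities.append(v)
    return velocities


random.seed(0)
# Hypothetical noisy gradients: a true slope of 1.0 plus mini-batch noise.
gradients = [1.0 + random.gauss(0.0, 1.0) for _ in range(1000)]
velocities = momentum_velocities(gradients)

# Skip the first 100 steps, where v_t is still warming up from its zero start.
grad_spread = statistics.pstdev(gradients[100:])
vel_spread = statistics.pstdev(velocities[100:])
vel_mean = statistics.mean(velocities[100:])
print(grad_spread, vel_spread, vel_mean)
```

The velocity should hover near the true slope while fluctuating far less than the raw gradients, which is the inertia the shopping cart analogy describes.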
Different Parameters Learn at Different Speeds
Suppose you're training a neural network with 500 million parameters. Some parameters receive gradients like 0.00002. Others receive gradients like 45. Using one global learning rate becomes problematic.
If the learning rate is large enough for the tiny gradients to make progress, the updates driven by the large gradients overshoot and can blow up. If it's safe for the large gradients, the parameters with tiny gradients barely move. It's like paying every employee in a company exactly the same bonus regardless of performance, seniority, or role. Some people are overpaid. Others barely notice the reward.
Optimization needs to adapt individually. This insight led to algorithms like AdaGrad and RMSProp. Adam combined the best ideas from both.
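The single-learning-rate dilemma is easy to reproduce on a toy problem. The sketch below assumes a hypothetical two-parameter loss L = 0.5 * (100 * p0² + 0.01 * p1²), so one parameter sees steep gradients and the other sees tiny ones. The curvatures and learning rates are illustrative choices, not values from the text.

```python
def run_gd(lr, steps=100, curvatures=(100.0, 0.01)):
    """Plain gradient descent on L = 0.5 * sum(c_i * p_i^2), starting at p_i = 1."""
    params = [1.0, 1.0]
    for _ in range(steps):
        grads = [c * p for c, p in zip(curvatures, params)]
        params = [p - lr * g for p, g in zip(params, grads)]
    return params


# A rate that is safe for the steep parameter barely moves the flat one.
cautious = run_gd(lr=0.001)
# A larger rate makes the steep parameter overshoot further on every step.
aggressive = run_gd(lr=0.03)

print("cautious:", cautious)
print("aggressive:", aggressive)
```

With the cautious rate, the steep parameter converges while the flat one stays almost where it started. With the aggressive rate, the steep parameter diverges. No single value serves both.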
Adam: Combining Momentum and Adaptive Learning Rates
In 2014, Diederik P. Kingma and Jimmy Ba introduced Adam in the paper "Adam: A Method for Stochastic Optimization." The idea is beautifully elegant.
Adam maintains two running statistics for every parameter:
- First moment (m_t): the average gradient. Think of this as momentum.
- Second moment (v_t): the average squared gradient. Think of this as measuring how "volatile" or "uncertain" this parameter's updates have been.
The update becomes approximately:
θ = θ - η * m_t / (√v_t + ε)
This produces a fascinating behavior. If a parameter consistently receives huge gradients, its denominator becomes larger, making future updates smaller. If another parameter consistently receives tiny gradients, its denominator stays small, allowing relatively larger updates. Every parameter effectively receives its own personalized learning rate.
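Here is a rough sketch of that behavior, using the simplified update above (no bias correction). It feeds a constant gradient to a single parameter and reports the size of the final step. The function name is hypothetical, and the hyperparameter values (learning rate 0.001, decay rates 0.9 and 0.999, ε = 1e-8) are commonly used defaults assumed here for illustration.

```python
import math


def last_update_size(grad, steps=200, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Feed the same gradient for `steps` steps; return the size of the last update."""
    m, v = 0.0, 0.0
    for _ in range(steps):
        m = beta1 * m + (1.0 - beta1) * grad          # running average of gradients
        v = beta2 * v + (1.0 - beta2) * grad * grad   # running average of squared gradients
    return lr * abs(m) / (math.sqrt(v) + eps)


step_large = last_update_size(grad=45.0)
step_small = last_update_size(grad=0.00002)
sgd_ratio = 45.0 / 0.00002  # how different the raw gradient sizes are

print(step_large, step_small, sgd_ratio)
```

The two gradients differ by a factor of more than two million, yet the resulting step sizes come out nearly identical, because each gradient is divided by its own running scale.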
A Back-of-the-Envelope Example
Suppose two parameters have identical momentum:
- Parameter A: Average gradient = 2, Average squared gradient = 100
- Parameter B: Average gradient = 2, Average squared gradient = 4
Ignoring ε, Parameter A's update is scaled by 2 / √100 = 0.2, while Parameter B's is scaled by 2 / √4 = 1 (both before multiplying by η). Even though both parameters have the same average gradient, Adam takes a step five times larger for Parameter B because its gradients have historically been far less volatile. This automatic scaling is one reason Adam trains deep networks so effectively.
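The same arithmetic in a few lines of Python, for anyone who wants to plug in other numbers. The dictionary layout and variable names are just illustrative.

```python
import math

# (average gradient, average squared gradient) for each parameter
stats = {"A": (2.0, 100.0), "B": (2.0, 4.0)}

# Scaled step before multiplying by the learning rate, ignoring epsilon.
scaled = {name: m / math.sqrt(v) for name, (m, v) in stats.items()}
ratio = scaled["B"] / scaled["A"]

print(scaled, ratio)
```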
Why Bias Correction Exists
There's one subtle issue. At the beginning of training, both moving averages start at zero. That means early estimates are biased toward zero. Kingma and Ba introduced bias correction:
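m̂_t = m_t / (1 - β₁ᵗ) and v̂_t = v_t / (1 - β₂ᵗ)
Here β₁ and β₂ are the decay rates of the first and second moment averages, t is the step count, and the corrected values m̂_t and v̂_t take the place of m_t and v_t in the update. These corrections rapidly remove the startup bias, and they fade away as t grows because β₁ᵗ and β₂ᵗ shrink toward zero. It's a small mathematical trick that has a surprisingly large practical impact during the first optimization steps.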
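Putting the pieces together, a single Adam step for one scalar parameter might look like the sketch below. The function name and the hyperparameter defaults (learning rate 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are assumptions for illustration. Real implementations apply the same arithmetic to whole tensors.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter. The step count t starts at 1."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment (squared gradients)
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Very first step, starting from zero-initialized moments, with a gradient of 5.0.
theta1, m1, v1 = adam_step(theta=0.0, grad=5.0, m=0.0, v=0.0, t=1)

uncorrected_ratio = m1 / math.sqrt(v1)  # what the step would be scaled by without correction
first_move = abs(theta1 - 0.0)          # actual distance moved with correction

print(uncorrected_ratio, first_move)
```

On this first step, the uncorrected ratio m_t / √v_t comes out at roughly 3.16 instead of 1, so the raw estimates would distort the step size. With the correction, the parameter moves by almost exactly the learning rate.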
Why Adam Became So Important for Deep Learning
Consider training GPT-style models. Modern LLMs easily contain 7 billion parameters, 70 billion parameters, or over a trillion parameters in some research systems. Every optimization step updates every trainable parameter. Even a modest training run might execute hundreds of thousands of optimization steps.
That means Adam performs on the order of billions of parameters × hundreds of thousands of steps. For a 7-billion-parameter model trained for, say, 300,000 steps, that is roughly 2 quadrillion individual parameter updates over the course of training. Without stable optimization, training would often diverge or learning would become painfully slow, and GPU time costing thousands or even millions of dollars would be wasted.
Optimization isn't merely a mathematical curiosity. It's an operations and economics problem. A 10% improvement in convergence speed on a multi-million-dollar training run can translate into hundreds of thousands of dollars in savings, shorter experimentation cycles, and faster scientific progress. Faster convergence also means researchers can iterate on model architectures more quickly, reducing the opportunity cost of long training jobs.
Adam became popular because it usually works well with relatively little hyperparameter tuning. Researchers could spend less time adjusting learning rates and more time exploring new model architectures. That practicality accelerated progress across computer vision, speech recognition, recommendation systems, and eventually large language models.
Adam Isn't Perfect
As influential as Adam has been, researchers have also identified limitations. Some studies found that vanilla stochastic gradient descent can produce models that generalize better on certain vision tasks. Others observed convergence issues under specific theoretical settings.
As LLMs grew larger, practitioners developed variants such as:
- AdamW (decoupled weight decay)
- Adafactor (reduced memory footprint)
- Lion (sign-based optimization)
In fact, many modern Transformer implementations train with AdamW, which separates weight decay from Adam's adaptive updates and often improves regularization. Engineering rarely ends with one perfect algorithm. Instead, progress comes through continual refinement.
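To see what "decoupled" means, here is a hedged sketch comparing two ways of adding weight decay to a single scalar parameter. The function names are hypothetical, and scaling the decay term by the learning rate is one common convention that is assumed here. The example takes one step with a zero loss gradient so that only the decay acts.

```python
import math


def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return Adam's normalized step direction plus the updated moments."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(theta, grad, m, v, t, lr, weight_decay):
    """Adam with an L2 penalty folded into the gradient (so it gets rescaled)."""
    direction, m, v = adam_direction(grad + weight_decay * theta, m, v, t)
    return theta - lr * direction, m, v


def adamw_step(theta, grad, m, v, t, lr, weight_decay):
    """Decoupled decay: shrink the weight directly, outside the adaptive scaling."""
    direction, m, v = adam_direction(grad, m, v, t)
    return theta - lr * direction - lr * weight_decay * theta, m, v


# One step from theta = 10 with a zero loss gradient, for two decay strengths.
shrink = {}
for wd in (0.01, 0.1):
    l2_theta, _, _ = adam_l2_step(10.0, 0.0, 0.0, 0.0, t=1, lr=0.01, weight_decay=wd)
    w_theta, _, _ = adamw_step(10.0, 0.0, 0.0, 0.0, t=1, lr=0.01, weight_decay=wd)
    shrink[wd] = (10.0 - l2_theta, 10.0 - w_theta)
    print(wd, shrink[wd])
```

In this contrived single step, the L2 variant shrinks the weight by about the same amount whatever the decay coefficient is, because Adam's normalization rescales the penalty. The decoupled variant shrinks it in proportion to the coefficient, which makes the decay strength easier to reason about.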
The Bigger Lesson
When the Transformer paper appeared in 2017, attention deservedly captured the headlines. But Transformers alone weren't enough. Modern deep learning stands on layers of innovations:
- Better hardware
- Larger datasets
- Improved initialization
- Normalization methods
- Residual connections
- Efficient optimizers
Adam is one of those foundational technologies. It's rarely discussed outside machine learning circles, yet it quietly powers the optimization of billions of parameters every day. Sometimes the biggest breakthroughs aren't new model architectures. Sometimes they're simply better ways of taking the next step downhill.
Did Adam fundamentally change deep learning, or was it simply the optimizer that happened to arrive at the right time? I'd love to hear your thoughts. And if you've trained neural networks yourself, have you ever switched away from Adam and seen better results?