
Activation Functions: Why Non-Linearity Is Everything

The Linearity Collapse

There's a proof worth knowing: if you stack linear transformations without any non-linearity between them, the entire network is equivalent to a single linear transformation. Ten layers, a hundred layers, a thousand - they all collapse to one matrix multiply. Activation functions are what prevent this collapse.

The linearity collapse, demonstrated:

import numpy as np
W1 = np.random.randn(4, 4)
W2 = np.random.randn(4, 4)
W3 = np.random.randn(4, 4)
W_collapsed = W3 @ W2 @ W1
x = np.random.randn(4)
out_deep = W3 @ (W2 @ (W1 @ x))  # apply the three layers one at a time
out_shallow = W_collapsed @ x
print(np.allclose(out_deep, out_shallow))  # True - three layers = one layer

The three layers have zero additional expressive power over one. Adding a non-linear function between each layer breaks this.
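
To see the collapse break, here is a small sketch with hand-picked 2×2 weights (illustrative values, not taken from any real model) and a ReLU between the layers. With the non-linearity in place, the collapsed matrix W3 @ W2 @ W1 no longer reproduces the network's output for this input.

```python
import numpy as np

# Hand-picked illustrative weights so the arithmetic is easy to follow.
W1 = np.array([[1.0, -2.0], [3.0, 1.0]])
W2 = np.array([[2.0, 1.0], [-1.0, 1.0]])
W3 = np.array([[1.0, 1.0], [0.0, -1.0]])
x = np.array([1.0, 1.0])

def relu(z):
    return np.maximum(0, z)

# Purely linear stack: identical to the single collapsed matrix.
out_linear = W3 @ (W2 @ (W1 @ x))
out_collapsed = (W3 @ W2 @ W1) @ x

# Same weights, with ReLU applied between the layers.
out_relu = W3 @ relu(W2 @ relu(W1 @ x))

print(out_linear)                              # [ 7. -5.]
print(out_relu)                                # [ 8. -4.]
print(np.allclose(out_linear, out_collapsed))  # True
print(np.allclose(out_relu, out_collapsed))    # False
```

The first layer produces [-1, 4]; ReLU zeroes the negative entry, and from that point on the two computations diverge.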

Sigmoid: the original, and its problems

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x_val in [-5, -2, 0, 2, 5]:
    g = sigmoid_grad(x_val)
    print(f"x={x_val:3d} gradient={g:.6f}")
x= -5 gradient=0.006648
x= -2 gradient=0.104994
x=  0 gradient=0.250000
x=  2 gradient=0.104994
x=  5 gradient=0.006648

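At x=±5, the gradient is about 38× smaller than at x=0. Even at its peak the sigmoid gradient is only 0.25, so backpropagating through 10 sigmoid layers scales the gradient by at most 0.25^10 ≈ 10^-6 before the weights are even considered - the vanishing gradient problem.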
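
A back-of-the-envelope sketch of that compounding: the sigmoid gradient never exceeds 0.25, so the product of one such factor per layer bounds how much gradient can survive the activations alone. This ignores the weight matrices, which can shrink or grow the gradient further, so treat it as an upper bound on the activation contribution and not a prediction for a real network. The helper name `best_case_scale` is made up for this example.

```python
# Best case: every unit sits at x = 0, where the sigmoid gradient peaks at 0.25.
MAX_SIGMOID_GRAD = 0.25

def best_case_scale(depth):
    """Upper bound on the gradient factor contributed by `depth` sigmoid layers."""
    return MAX_SIGMOID_GRAD ** depth

for depth in (1, 5, 10, 20):
    print(f"depth={depth:2d}  max gradient factor={best_case_scale(depth):.3e}")
```

At depth 10 the bound is already below one millionth, and units away from x=0 only make it smaller.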

ReLU: the surprisingly effective fix

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)

The gradient for positive inputs is exactly 1. Gradients don't shrink as they pass through ReLU on the positive side. Deep networks could finally be trained.

The cost: neurons whose inputs are consistently negative receive zero gradient - the "dying ReLU" problem. In practice this matters less than you'd think.

# Leaky ReLU: small gradient for negatives
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
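
The dying-ReLU case and the Leaky ReLU remedy can be sketched with a handful of pre-activations that are all negative. `leaky_relu_grad` is a helper introduced here for illustration, and `relu_grad` is repeated so the block runs on its own.

```python
import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    # Hypothetical helper: derivative of leaky_relu (1 for positives, alpha otherwise).
    return np.where(x > 0, 1.0, alpha)

# A unit whose pre-activation is negative for every example in the batch.
z = np.array([-3.0, -1.0, -0.5])

print(relu_grad(z))        # [0. 0. 0.]
print(leaky_relu_grad(z))  # [0.01 0.01 0.01]
```

With plain ReLU this unit receives no gradient at all on this batch, so its weights cannot move; with the leaky variant a small signal still gets through.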

GELU: what GPT uses

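GELU is x × Φ(x), where Φ is the standard normal CDF. It behaves like a smooth ReLU that dips slightly below zero for small negative inputs, and it is commonly computed with a tanh approximation: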

GELU(x) ≈ 0.5 × x × (1 + tanh(√(2/π) × (x + 0.044715 × x³)))

def gelu(x):
    # tanh approximation of GELU, as in the formula above
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

for x_val in [-0.5, -0.2, 0.0, 0.2, 0.5]:
    r = max(0, x_val)
    g = gelu(np.array([x_val]))[0]
    print(f"x={x_val:4.1f} ReLU={r:.4f} GELU={g:.4f}")
x=-0.5 ReLU=0.0000 GELU=-0.1543
x=-0.2 ReLU=0.0000 GELU=-0.0841
x= 0.0 ReLU=0.0000 GELU=0.0000
x= 0.2 ReLU=0.2000 GELU=0.1159
x= 0.5 ReLU=0.5000 GELU=0.3457

The smoothness makes optimization slightly easier. GPT-2 and BERT both use GELU.
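
How close is the tanh formula to the exact x × Φ(x)? A quick check, using `math.erf` from the standard library to build Φ. The names `gelu_exact` and `gelu_tanh` are chosen here for illustration.

```python
import math
import numpy as np

def gelu_exact(x):
    # x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return np.array([0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

def gelu_tanh(x):
    # tanh approximation with the 0.044715 cubic coefficient
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-5.0, 5.0, 1001)
max_abs_diff = np.max(np.abs(gelu_exact(xs) - gelu_tanh(xs)))
print(f"max |exact - tanh| on [-5, 5]: {max_abs_diff:.2e}")
```

On this grid the two versions differ by less than 1e-3, so the choice between them is mostly a matter of speed and of matching whatever the original model was trained with.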

SwiGLU: what modern models use

SwiGLU is the activation used in the feed-forward blocks of LLaMA, Mistral, and many other recent large models:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.W = nn.Linear(d_model, d_ff, bias=False)
        self.V = nn.Linear(d_model, d_ff, bias=False)
        self.out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        gate = F.silu(self.W(x))  # SiLU = x * sigmoid(x)
        content = self.V(x)
        return self.out(gate * content)

One linear projection gates whether the other passes through - more expressive than a simple element-wise non-linearity.
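
The gating can be seen element-wise in a tiny NumPy sketch. The gate pre-activations and content values below are made up for illustration; in the module above they would come from `self.W(x)` and `self.V(x)`.

```python
import numpy as np

def silu(z):
    # SiLU (Swish): z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

gate_pre = np.array([-10.0, 0.0, 10.0])  # illustrative outputs of the gate projection
content = np.array([3.0, 3.0, 3.0])      # illustrative outputs of the content projection

gated = silu(gate_pre) * content
for g, c, out in zip(gate_pre, content, gated):
    print(f"gate_pre={g:6.1f}  content={c:.1f}  gated={out:8.4f}")
```

A strongly negative gate pre-activation suppresses the content almost entirely, a gate pre-activation of zero blocks it exactly, and a strongly positive one lets it through scaled by roughly the gate value.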

Gradient flow comparison

def test_gradient_flow(activation_fn, depth=20, seed=0):
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        layers.extend([nn.Linear(64, 64), activation_fn()])
    model = nn.Sequential(*layers)
    x = torch.randn(16, 64, requires_grad=True)
    out = model(x).sum()
    out.backward()
    return x.grad.abs().mean().item()

activations = {"ReLU": nn.ReLU, "Sigmoid": nn.Sigmoid, "GELU": nn.GELU, "SiLU": nn.SiLU}
for name, act in activations.items():
    grad = test_gradient_flow(act, depth=20)
    print(f"{name:<10}: input gradient magnitude = {grad:.6f}")
ReLU      : input gradient magnitude = 0.003241
Sigmoid   : input gradient magnitude = 0.000001
GELU      : input gradient magnitude = 0.004817
SiLU      : input gradient magnitude = 0.004923

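In this run, the gradient reaching the input through 20 sigmoid layers is thousands of times smaller than with the other activations. ReLU, GELU, and SiLU are all in the same ballpark - the gap between them matters far less than the gap from sigmoid. Exact magnitudes depend on the seed, the layer width, and the weight initialization.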

Summary

| Function | Where used | Key property |
|---|---|---|
| Sigmoid | Old networks | Saturates; vanishing gradients |
| ReLU | CNNs, MLPs | Simple; gradient=1 for positives |
| GELU | GPT-2, BERT | Smooth; slight negative outputs |
| SiLU/Swish | Modern models | Smooth; slightly better performance |
| SwiGLU | LLaMA, Mistral | Expressive gating mechanism |

The progression follows one thread: keep gradients alive through many layers, give the network enough expressive power, don't overcomplicate what works.

This is part of an ongoing series on AI internals. Full article with more context at machina.chat/blog.
