DEV Community 1h ago

One "+x" That Made 100-Layer Networks Trainable: ResNet Skip Connections

Deep networks have a cruel paradox. In theory, more layers should never hurt - the extra ones could just learn to pass their input through unchanged. In practice, before 2015, stacking more plain layers made networks worse: in the ResNet paper's CIFAR-10 experiments, a 56-layer plain net had higher training error than a 20-layer one. Higher training error rules out overfitting; the deeper net was simply harder to optimise. Gradients tended to shrink on their way back to the early layers, and the optimiser couldn't even find that "do nothing" identity mapping. ResNet fixed it with an almost absurdly small change.

The residual reformulation

Instead of asking a block to learn a full mapping H(x), ask it to learn the residual F(x) = H(x) − x, and add the input back:

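import torch.nn.functional as F  # F.relu is PyTorch's functional ReLU, not the residual F(x)

def forward(self, x):
    return F.relu(x + self.f(x))  # y = relu(x + F(x)); self.f is the residual branch  <- "x +" is the skip connection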

If the ideal mapping is close to identity, F(x) just needs to be near zero - trivial to learn (push the weights toward 0). The block only learns the correction on top of passing the input through.
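To make the "near zero means identity" point concrete, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). The helper names `plain_block` and `residual_block` are illustrative, not from the paper, and the block is reduced to one linear layer plus ReLU. With all weights at zero, the plain block wipes out its input, while the residual block passes it through unchanged.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def plain_block(weights, x):
    # Has to learn the whole mapping H(x) on its own.
    return relu(linear(weights, x))

def residual_block(weights, x):
    # Learns only the residual F(x); the input is added back before the ReLU.
    fx = linear(weights, x)
    return relu([a + b for a, b in zip(x, fx)])

# Non-negative input, like activations coming out of an earlier ReLU.
x = [0.5, 2.0, 1.0]
zero_weights = [[0.0] * 3 for _ in range(3)]

print(plain_block(zero_weights, x))     # the signal is gone
print(residual_block(zero_weights, x))  # the input comes back unchanged
```

The identity only holds exactly here because the input is already non-negative; a negative entry would be clipped by the final ReLU.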

Why the +1 saves the gradient

Differentiate the block (ignoring the final ReLU for simplicity): d(x + F(x))/dx = 1 + F'(x). Backprop multiplies one such factor per block. Even when F'(x) is tiny, each factor stays near 1 instead of near 0 - so the product doesn't collapse:

plain: dL/dx1 = dL/dxN * product of F'(z) -> 0 (for sigmoid, each F' <= 0.25)
residual: dL/dx1 = dL/dxN * product of (1 + F'(z)) -> stays O(1) while each F' is small

The identity path is a gradient highway straight back to the earliest layers.
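A quick numeric sketch of the two products, under a deliberately crude assumption: every block has the same small scalar derivative F'(x) = 0.01. The function name `backprop_factor` and the chosen values are hypothetical, picked only to illustrate the scaling.

```python
def backprop_factor(f_prime, depth, residual):
    """Multiply the per-block local derivatives met on the way back to the input."""
    factor = 1.0
    for _ in range(depth):
        # Plain block contributes F'(x); residual block contributes 1 + F'(x).
        factor *= (1.0 + f_prime) if residual else f_prime
    return factor

depth = 50
f_prime = 0.01  # assumption: every block has the same small derivative

plain = backprop_factor(f_prime, depth, residual=False)
res = backprop_factor(f_prime, depth, residual=True)

print(f"plain:    {plain:.3e}")  # collapses toward zero
print(f"residual: {res:.3f}")    # stays within a small constant of 1
```

The "+1" prevents collapse, but it does not bound the product from above: if F'(x) were large at every block, the residual product would grow instead. In this sketch the factor stays well-behaved only because F' is small.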

Projection shortcuts

When a block changes the feature dimensions (for example, a conv that halves the spatial size and doubles the channels), x and F(x) no longer have the same shape, so you can't add them. Put a 1×1 conv on the skip, with a stride that matches the downsampling, to project x into the new shape first - the "projection shortcut" from the paper. Most shortcuts are plain identity; only dimension-changing ones need this.
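The shape bookkeeping can be sketched in plain Python with nested lists laid out as [channel][row][column]. Everything here is illustrative: `conv1x1`, `add`, and the weights in `w_proj` are made-up stand-ins (a real network would learn the projection weights), and `fx` is a zero placeholder for the residual branch output. The final ReLU is left out to keep the focus on shapes.

```python
def shape(t):
    """Return (channels, height, width) of a nested-list feature map."""
    return (len(t), len(t[0]), len(t[0][0]))

def conv1x1(x, weight, stride):
    """1x1 convolution: at every kept pixel, mix input channels with `weight`
    (a list of out_channels rows, each of length in_channels)."""
    in_ch, h, w = shape(x)
    return [
        [
            [sum(weight[o][c] * x[c][i][j] for c in range(in_ch))
             for j in range(0, w, stride)]
            for i in range(0, h, stride)
        ]
        for o in range(len(weight))
    ]

def add(a, b):
    """Element-wise sum; refuses mismatched shapes, like a tensor library would."""
    if shape(a) != shape(b):
        raise ValueError(f"shape mismatch: {shape(a)} vs {shape(b)}")
    return [[[p + q for p, q in zip(ra, rb)]
             for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

# Input: 2 channels, 4x4 pixels (channel 0 is all 1.0, channel 1 is all 2.0).
x = [[[float(c + 1)] * 4 for _ in range(4)] for c in range(2)]
# Placeholder for F(x) after a block that halves the size and doubles the channels.
fx = [[[0.0] * 2 for _ in range(2)] for _ in range(4)]
# Hypothetical projection weights: 4 output channels x 2 input channels.
w_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

shortcut = conv1x1(x, w_proj, stride=2)  # project x into F(x)'s shape
y = add(shortcut, fx)                    # now the skip connection is legal
print(shape(x), "->", shape(shortcut))
```

Calling `add(x, fx)` directly raises the `ValueError`, which is the mismatch the projection exists to resolve.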

The impact

With residual blocks, the 2015 ResNet paper trained 152-layer networks - about 8× deeper than the 19-layer VGG nets that preceded it - and won the ILSVRC 2015 ImageNet classification challenge. Deeper finally meant better again. And skip connections are now everywhere: ResNets, U-Nets, and every Transformer block (x + Sublayer(x)). The same +1 quietly keeps gradients healthy inside modern LLMs.

See a plain net vs a ResNet at the same depth, gradient-by-gradient: https://dev48v.infy.uk/dl/day22-resnet-skip-connections.html
