Monish Keswani | Conv → BatchNorm → ReLU: Why This Order is Standard in CNNs

When designing convolutional neural networks (CNNs), a very common pattern appears repeatedly:

Conv → BatchNorm → ReLU

At first glance this ordering might seem arbitrary, but it is actually grounded in both mathematical reasoning and empirical results from modern deep learning architectures such as ResNet and DenseNet.

This blog explains why this ordering works well and why alternative orderings are usually avoided.

1. The Typical CNN Block

A common CNN building block looks like this:

Input
  ↓
Convolution
  ↓
Batch Normalization
  ↓
ReLU Activation
  ↓
Output

In PyTorch this is usually implemented as:

nn.Sequential(
    nn.Conv2d(...),
    nn.BatchNorm2d(...),
    nn.ReLU(inplace=True)
)

To understand why this order is used, we need to examine what each component does.

2. What the Convolution Layer Produces

A convolution layer performs a linear transformation of the input.

Mathematically:

\[y = Wx + b\]

Where:

\(x\) = input feature map
\(W\) = convolution kernel
\(b\) = bias
\(y\) = linear output

This output contains both positive and negative values.

3. Role of Batch Normalization

Batch Normalization (BN) normalizes the activations using batch statistics:

\[\hat{y} = \frac{y - \mu}{\sigma}\]

Where:

\(\mu\) = batch mean
\(\sigma\) = batch standard deviation

Then BN applies learnable parameters:

\[z = \gamma \hat{y} + \beta\]

This step:

stabilizes the distribution of activations
improves gradient flow
allows larger learning rates
speeds up convergence

Importantly, BN expects inputs with both positive and negative values, which is exactly what the convolution produces.

4. Role of ReLU

ReLU (Rectified Linear Unit) introduces non-linearity:

\[ReLU(x) = \max(0, x)\]

This means:

negative values → 0
positive values → unchanged

ReLU enables neural networks to learn non-linear functions, which is essential for deep learning.

5. Why the Order Conv → BN → ReLU Works Well

5.1 BN Normalizes Linear Activations

Since convolution produces a linear output, BN can properly normalize its distribution.

This keeps activations centered around zero with controlled variance.

5.2 ReLU Receives Well-Conditioned Inputs

If the input to ReLU is zero-centered, roughly half the values will be positive.

This leads to:

better neuron utilization
healthier gradient flow
fewer dead neurons

5.3 Training Stability

This ordering:

Conv → BN → ReLU

helps prevent:

exploding gradients
unstable activation distributions
slow convergence

As a result, most modern architectures use this sequence.

6. What Happens if We Use Conv → ReLU → BN?

Consider the alternative:

Conv → ReLU → BatchNorm

This creates several problems.

Problem 1: ReLU Destroys Information

After ReLU:

all negative values = 0

The distribution becomes heavily skewed.

Problem 2: BN Sees Distorted Statistics

BatchNorm now receives inputs that are:

mostly positive
highly sparse
non-symmetric

This makes normalization less meaningful.

Problem 3: Empirical Performance

In practice, models using:

Conv → ReLU → BN

tend to train less reliably.

7. Special Case: Pre-activation ResNet

A variant of ResNet introduced a different ordering:

BN → ReLU → Conv

Full block:

x → BN → ReLU → Conv → BN → ReLU → Conv → +

This “pre-activation” design improves:

gradient flow
training stability in very deep networks

However, this is a special architecture design, not the general CNN block.

8. Summary

The most widely used CNN ordering is:

Conv → BatchNorm → ReLU

because:

Convolution produces linear activations.
BatchNorm normalizes them properly.
ReLU then applies non-linearity to well-conditioned inputs.

Alternative orderings generally reduce the effectiveness of BatchNorm.

Final Rule of Thumb

For most CNN architectures:

Conv → BatchNorm → ReLU

This ordering provides stable training, good gradient flow, and strong empirical performance.