Conv β BatchNorm β ReLU: Why This Order is Standard in CNNs
An explanation of why the Conv β BatchNorm β ReLU ordering is standard in CNNs, grounded in mathematical reasoning and empirical results from modern deep learning architectures.
When designing convolutional neural networks (CNNs), a very common pattern appears repeatedly:
Conv β BatchNorm β ReLU
At first glance this ordering might seem arbitrary, but it is actually grounded in both mathematical reasoning and empirical results from modern deep learning architectures such as ResNet and DenseNet.
This blog explains why this ordering works well and why alternative orderings are usually avoided.
1. The Typical CNN Block
A common CNN building block looks like this:
Input
β
Convolution
β
Batch Normalization
β
ReLU Activation
β
Output
In PyTorch this is usually implemented as:
nn.Sequential(
nn.Conv2d(...),
nn.BatchNorm2d(...),
nn.ReLU(inplace=True)
)
To understand why this order is used, we need to examine what each component does.
2. What the Convolution Layer Produces
A convolution layer performs a linear transformation of the input.
Mathematically:
\[y = Wx + b\]Where:
- \(x\) = input feature map
- \(W\) = convolution kernel
- \(b\) = bias
- \(y\) = linear output
This output contains both positive and negative values.
3. Role of Batch Normalization
Batch Normalization (BN) normalizes the activations using batch statistics:
\[\hat{y} = \frac{y - \mu}{\sigma}\]Where:
- \(\mu\) = batch mean
- \(\sigma\) = batch standard deviation
Then BN applies learnable parameters:
\[z = \gamma \hat{y} + \beta\]This step:
- stabilizes the distribution of activations
- improves gradient flow
- allows larger learning rates
- speeds up convergence
Importantly, BN expects inputs with both positive and negative values, which is exactly what the convolution produces.
4. Role of ReLU
ReLU (Rectified Linear Unit) introduces non-linearity:
\[ReLU(x) = \max(0, x)\]This means:
negative values β 0
positive values β unchanged
ReLU enables neural networks to learn non-linear functions, which is essential for deep learning.
5. Why the Order Conv β BN β ReLU Works Well
5.1 BN Normalizes Linear Activations
Since convolution produces a linear output, BN can properly normalize its distribution.
This keeps activations centered around zero with controlled variance.
5.2 ReLU Receives Well-Conditioned Inputs
If the input to ReLU is zero-centered, roughly half the values will be positive.
This leads to:
- better neuron utilization
- healthier gradient flow
- fewer dead neurons
5.3 Training Stability
This ordering:
Conv β BN β ReLU
helps prevent:
- exploding gradients
- unstable activation distributions
- slow convergence
As a result, most modern architectures use this sequence.
6. What Happens if We Use Conv β ReLU β BN?
Consider the alternative:
Conv β ReLU β BatchNorm
This creates several problems.
Problem 1: ReLU Destroys Information
After ReLU:
all negative values = 0
The distribution becomes heavily skewed.
Problem 2: BN Sees Distorted Statistics
BatchNorm now receives inputs that are:
- mostly positive
- highly sparse
- non-symmetric
This makes normalization less meaningful.
Problem 3: Empirical Performance
In practice, models using:
Conv β ReLU β BN
tend to train less reliably.
7. Special Case: Pre-activation ResNet
A variant of ResNet introduced a different ordering:
BN β ReLU β Conv
Full block:
x β BN β ReLU β Conv β BN β ReLU β Conv β +
This βpre-activationβ design improves:
- gradient flow
- training stability in very deep networks
However, this is a special architecture design, not the general CNN block.
8. Summary
The most widely used CNN ordering is:
Conv β BatchNorm β ReLU
because:
- Convolution produces linear activations.
- BatchNorm normalizes them properly.
- ReLU then applies non-linearity to well-conditioned inputs.
Alternative orderings generally reduce the effectiveness of BatchNorm.
Final Rule of Thumb
For most CNN architectures:
Conv β BatchNorm β ReLU
This ordering provides stable training, good gradient flow, and strong empirical performance.