concepts

A running log of concepts I learn, organized by date.

08 Mar 2026

Dropout

Status: evergreen · Source: paper

Paper: Srivastava, Hinton, et al. (2014)

  • Motivation: Prevents “co-adaptation” (units relying on the presence of specific other units); training with dropout approximates an ensemble of $2^n$ thinned networks.
  • Equation: with drop rate $p$, $r^{(l)} \sim \text{Bernoulli}(1-p)$; $\tilde{y}^{(l)} = r^{(l)} * y^{(l)}$.
  • Training (Inverted): $a_{train} = \frac{a \cdot r}{1-p}$ (keeps the expected activation equal to $a$).
  • Inference: $a_{test} = a$ (all neurons active; no scaling needed, since the inverted scaling was applied at training time).
  • Pros: Robust generalization; negligible ($O(n)$) per-layer overhead; reduces overfitting in deep/wide nets.
  • Cons: Slows convergence (roughly 2–3x more training time, per the paper); requires tuning the dropout rate $p$.
  • Impl: nn.Dropout(p) (PyTorch) or layers.Dropout(p) (Keras), where p is the drop probability.
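The train/inference asymmetry above can be sketched in a few lines of NumPy. This is a hedged illustration, not the framework implementation: the function name `dropout` and its signature are made up here, with `p` as the drop probability to match the nn.Dropout(p) convention.

```python
import numpy as np

def dropout(a, p, training, rng=None):
    """Inverted dropout sketch. p is the drop probability.

    Training: zero each activation with probability p, scale survivors
    by 1/(1-p) so the expected activation equals the input.
    Inference: identity (no mask, no scaling).
    """
    if not training or p == 0.0:
        return a
    rng = rng or np.random.default_rng()
    mask = rng.random(a.shape) >= p        # keep with probability 1-p
    return a * mask / (1.0 - p)

# Expectation is preserved: over many units, the mean of the masked,
# rescaled activations stays close to the original mean.
rng = np.random.default_rng(0)
a = np.ones(1000)
out = dropout(a, p=0.5, training=True, rng=rng)
print(out.mean())                               # close to 1.0
print(dropout(a, p=0.5, training=False) is a)   # True: identity at inference
```

Because the scaling happens during training, inference is a plain forward pass, which is exactly why no weight rescaling is needed at test time.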