concepts

A running log of concepts I learn, organized by date.

08 Mar 2026

Dropout

Status: evergreen · Source: paper

Paper: Srivastava, Hinton, et al. (2014)

  • Motivation: Prevents “co-adaptation” (units relying on the presence of specific other units); training with dropout approximates an ensemble of $2^n$ thinned networks.
  • Equation: with drop rate $p$, $r^{(l)} \sim \text{Bernoulli}(1-p)$; $\tilde{y}^{(l)} = r^{(l)} * y^{(l)}$.
  • Training (Inverted): $a_{train} = \frac{a \cdot r}{1-p}$ (keeps the expected activation equal to $a$).
  • Inference: $a_{test} = a$ (all neurons active; no scaling needed, since the inverted scaling was applied at training time).
  • Pros: Robust generalization; negligible ($O(n)$) per-layer overhead; reduces overfitting in deep/wide nets.
  • Cons: Slows convergence (roughly 2–3x more training time, per the paper); requires tuning the dropout rate $p$.
  • Impl: nn.Dropout(p) (PyTorch) or layers.Dropout(p) (Keras), where p is the drop probability.
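The train/inference asymmetry above can be sketched in a few lines of NumPy. This is a hedged illustration, not the framework implementation: the function name `dropout` and its signature are made up here, with `p` as the drop probability to match the nn.Dropout(p) convention.

```python
import numpy as np

def dropout(a, p, training, rng=None):
    """Inverted dropout sketch. p is the drop probability.

    Training: zero each activation with probability p, scale survivors
    by 1/(1-p) so the expected activation equals the input.
    Inference: identity (no mask, no scaling).
    """
    if not training or p == 0.0:
        return a
    rng = rng or np.random.default_rng()
    mask = rng.random(a.shape) >= p        # keep with probability 1-p
    return a * mask / (1.0 - p)

# Expectation is preserved: over many units, the mean of the masked,
# rescaled activations stays close to the original mean.
rng = np.random.default_rng(0)
a = np.ones(1000)
out = dropout(a, p=0.5, training=True, rng=rng)
print(out.mean())                               # close to 1.0
print(dropout(a, p=0.5, training=False) is a)   # True: identity at inference
```

Because the scaling happens during training, inference is a plain forward pass, which is exactly why no weight rescaling is needed at test time.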