This talk by Andrew Gordon Wilson (2019). Another talk by Andrew Gordon Wilson (2022).
The following notes are taken from his slides:
Why Bayesian deep learning?
- Model construction and understanding generalization
- Decision making
- Better point estimates. Marginalization integrates out the weights under the posterior, rather than optimizing for a single setting of them (see the code sketch after this list):
\begin{align*} P(y_*|x_*, y, X) &= \int P(y_*|x_*, w)\, p(w|y, X)\,\mathrm{d}w\\ &\approx \frac{1}{N_{samp}} \sum_i P(y_*|x_*, w_i), \quad w_i \sim p(w|y, X) \end{align*}
- Interpretability, incorporating expert knowledge
- Successful in the second wave of deep learning
- Neural networks are less mysterious under the lens of probability theory
- Bayesian neural networks, by averaging over the posterior, take into account the fact that wide basins of attraction generalize better.
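A minimal sketch of the Monte Carlo marginalization above, assuming we already have samples $w_i$ from the posterior $p(w|y, X)$. The logistic-regression likelihood and all names (`predict_proba`, `posterior_samples`, `x_star`) are placeholders for illustration, not from the talk:

```python
# Bayesian model averaging by Monte Carlo, assuming posterior samples w_1..w_N
# are already available (e.g. from MCMC or a variational approximation).
import numpy as np

def predict_proba(w, x):
    """Likelihood p(y = 1 | x, w) for a single weight vector w (toy model)."""
    return 1.0 / (1.0 + np.exp(-x @ w))

def bayesian_model_average(posterior_samples, x_star):
    """Approximate p(y_* | x_*, y, X) = E_{p(w | y, X)}[p(y_* | x_*, w)]
    by averaging the likelihood over posterior samples of w."""
    preds = [predict_proba(w, x_star) for w in posterior_samples]
    return np.mean(preds, axis=0)

# Toy usage: pretend these draws came from the posterior over a 3-d weight vector.
rng = np.random.default_rng(0)
posterior_samples = rng.normal(size=(100, 3))
x_star = np.array([1.0, -0.5, 2.0])
print(bayesian_model_average(posterior_samples, x_star))
```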
Why not?
- Computationally intractable. BUT all we care about are averages over the posterior; we don't need to keep all the samples to compute those.
- Involves a lot of moving parts.
Some practical stuff
- A more principled take on Fast Geometric Ensembling: stochastic weight averaging (SWA) allows us to compute an approximation in weight space from which we can sample (see the sketch after this list).
- Random low-dimensional subspace:
  - Run SGD with a high learning rate
  - Collect weight snapshots
  - Use the SWA solution (the snapshot average) as the center of the subspace
  - Find the first PCA components of the snapshot deviations from the SWA solution
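A sketch of the recipe above under assumed names: collect SGD snapshots, average them to get the SWA solution, take the top PCA directions of the deviations as the subspace, and sample weights in it. The `run_sgd_epoch` step is a stand-in for a real training loop, and the choice of 5 subspace dimensions is arbitrary:

```python
# SWA center + PCA subspace from SGD snapshots, then sampling in the subspace.
import numpy as np

def run_sgd_epoch(w, lr=0.1, rng=None):
    """Placeholder for one epoch of SGD with a high, constant learning rate."""
    rng = rng or np.random.default_rng()
    return w - lr * rng.normal(size=w.shape)  # stands in for real gradient steps

# 1) Run SGD with a high learning rate and collect snapshots w_1..w_K.
rng = np.random.default_rng(0)
w = rng.normal(size=1000)                      # flattened network weights
snapshots = []
for epoch in range(30):
    w = run_sgd_epoch(w, lr=0.1, rng=rng)
    snapshots.append(w.copy())
snapshots = np.stack(snapshots)                # shape (K, d)

# 2) The SWA solution is the snapshot average; it is the center of the subspace.
w_swa = snapshots.mean(axis=0)

# 3) PCA of the deviations from the SWA solution gives the subspace basis.
deviations = snapshots - w_swa                 # shape (K, d)
_, _, vt = np.linalg.svd(deviations, full_matrices=False)
k = 5
basis = vt[:k]                                 # top-k principal directions, shape (k, d)

# 4) Sample weights by drawing low-dimensional coordinates z and mapping back:
#    w = w_swa + z @ basis.
z = rng.normal(size=k)
w_sample = w_swa + z @ basis
```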
Are these approaches practical?
HMC works better than everything else out of the box.
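HMC is expensive for large networks, but as a reference for what it does, here is a minimal Hamiltonian Monte Carlo sketch on a toy log posterior (the Gaussian target and all names are illustrative, not from the talk):

```python
# Minimal HMC: sample momentum, simulate Hamiltonian dynamics with leapfrog,
# then Metropolis accept/reject. Toy standard-Gaussian posterior over weights.
import numpy as np

def log_post(w):
    return -0.5 * np.sum(w ** 2)

def grad_log_post(w):
    return -w

def hmc_step(w, step_size=0.1, n_leapfrog=20, rng=None):
    rng = rng or np.random.default_rng()
    p = rng.normal(size=w.shape)                 # momentum
    w_new, p_new = w.copy(), p.copy()
    # Leapfrog integration of the Hamiltonian dynamics.
    p_new += 0.5 * step_size * grad_log_post(w_new)
    for _ in range(n_leapfrog - 1):
        w_new += step_size * p_new
        p_new += step_size * grad_log_post(w_new)
    w_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_post(w_new)
    # Metropolis acceptance on the total (potential + kinetic) energy.
    current_h = -log_post(w) + 0.5 * np.sum(p ** 2)
    proposed_h = -log_post(w_new) + 0.5 * np.sum(p_new ** 2)
    if np.log(rng.uniform()) < current_h - proposed_h:
        return w_new
    return w

# Draw a few posterior samples of a 10-d weight vector.
rng = np.random.default_rng(0)
w = np.zeros(10)
samples = []
for _ in range(500):
    w = hmc_step(w, rng=rng)
    samples.append(w.copy())
```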