This is a talk by Andrew Gordon Wilson (2019). There is another talk by Andrew Gordon Wilson (2022).

The following notes are taken from his slides:

Why Bayesian deep learning?

  • Model construction and understanding generalization

  • Decision making

  • Better point estimates: marginalization integrates over the posterior (averaging the weights away) rather than optimizing for a single weight setting (a code sketch of this Monte Carlo average follows this list).

    \begin{align*}
     P(y_* \mid x_*, \mathbf{y}, X) &= \int P(y_* \mid x_*, w)\, p(w \mid \mathbf{y}, X)\,\mathrm{d}w\\
                                    &\approx \frac{1}{N_{samp}} \sum_i P(y_* \mid x_*, w_i), \qquad w_i \sim p(w \mid \mathbf{y}, X)
    \end{align*}
    
  • Interpretability, incorporating expert knowledge

  • Successful in the second wave of DL

  • NNs are less mysterious under the lens of probability theory

  • Bayesian neural networks, by averaging over the posterior, take into account the fact that wide basins of attraction generalize better.
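
A minimal sketch of the Monte Carlo average in the equation above, assuming posterior weight samples are already available and a hypothetical `predict(x, w)` that returns class probabilities for input `x` under weights `w`; this is an illustration, not code from the slides.

```python
import numpy as np

def predictive(x_star, w_samples, predict):
    """Monte Carlo Bayesian model average:
    p(y_* | x_*, D) ~ (1/N_samp) * sum_i p(y_* | x_*, w_i), with w_i ~ p(w | D).

    `predict(x, w)` (hypothetical) returns a vector of class probabilities.
    """
    probs = [predict(x_star, w) for w in w_samples]
    return np.mean(probs, axis=0)
```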

Why not?

  • Computationally intractable. BUT all we care about are averages over the posterior; we don't need to keep all the samples to compute them (see the running-average sketch after this list).
  • Involves a lot of moving parts
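
To make the "we only need averages" point concrete, here is a small sketch of accumulating the Bayesian model average as a running mean, so no weight samples ever need to be stored; `sample_posterior` and `predict` are hypothetical placeholders, not names from the talk.

```python
import numpy as np

def streaming_bma(x_star, sample_posterior, predict, n_samples=30):
    """Accumulate the predictive average online: each posterior sample
    updates the running mean of the predictions and is then discarded."""
    avg = None
    for i, w in enumerate(sample_posterior(n_samples), start=1):
        p = np.asarray(predict(x_star, w))
        avg = p if avg is None else avg + (p - avg) / i  # incremental mean
    return avg
```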

Some practical stuff

A more principled version of Fast Geometric Ensembling

Stochastic weight averaging (SWA) allows us to compute an approximate posterior in weight space from which we can sample:
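
A rough sketch of that idea, assuming flattened weight snapshots collected while running SGD with a high constant learning rate. The SWA solution is the running mean; a SWAG-style diagonal Gaussian additionally tracks the second moment so that weights can be sampled. This is an assumed reconstruction, not the talk's code.

```python
import numpy as np

def fit_swag_diagonal(snapshots):
    """Fit a diagonal Gaussian in weight space from SGD snapshots:
    the mean is the SWA solution, the variance comes from second moments."""
    w = np.stack(snapshots)                  # (n_snapshots, n_params)
    mean = w.mean(axis=0)                    # SWA solution
    var = np.clip((w ** 2).mean(axis=0) - mean ** 2, 1e-12, None)
    return mean, var

def sample_weights(mean, var, rng):
    """Draw one weight vector from the approximate posterior."""
    return mean + np.sqrt(var) * rng.standard_normal(mean.shape)
```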

Random low-dimensional subspace

  • Run SGD with high LR
  • Collect snapshots
  • Use the SWA solution as the weights (i.e. as the center of the subspace)
  • Find the first PCA components of the snapshot deviations from the SWA solution (see the sketch below)
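
A sketch of the subspace construction outlined in the list above, under the assumption that it follows the subspace-inference recipe: center the snapshots at the SWA solution, take the leading PCA directions of the deviations, and then run inference over the low-dimensional coordinates z with w = w_swa + z P. Names are placeholders, not the slides' code.

```python
import numpy as np

def pca_subspace(snapshots, k=5):
    """Build a k-dimensional subspace around the SWA solution from
    flattened SGD weight snapshots."""
    w = np.stack(snapshots)                  # (n_snapshots, n_params)
    w_swa = w.mean(axis=0)                   # SWA solution (subspace center)
    deviations = w - w_swa
    # Leading right-singular vectors = first PCA components of the deviations.
    _, _, vt = np.linalg.svd(deviations, full_matrices=False)
    return w_swa, vt[:k]                     # center and (k, n_params) basis

def weights_from_subspace(w_swa, basis, z):
    """Map low-dimensional coordinates z (shape (k,)) back to weight space."""
    return w_swa + z @ basis
```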

Are these approaches practical?

HMC works better than everything else out of the box.
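
For reference, a minimal sketch of a single HMC step (leapfrog integration followed by a Metropolis accept/reject), assuming the unnormalized log posterior and its gradient are available as callables; on full deep-learning posteriors this is exactly the expensive computation the "computationally intractable" point alludes to. This sketch is not from the slides.

```python
import numpy as np

def hmc_step(w, log_post, grad_log_post, rng, step_size=1e-2, n_leapfrog=20):
    """One Hamiltonian Monte Carlo step with a leapfrog integrator and a
    Metropolis correction; `log_post`/`grad_log_post` are assumed callables."""
    p = rng.standard_normal(w.shape)                 # resample momentum
    w_new, p_new = w.copy(), p.copy()

    # Leapfrog integration of the Hamiltonian dynamics.
    p_new = p_new + 0.5 * step_size * grad_log_post(w_new)
    for _ in range(n_leapfrog - 1):
        w_new = w_new + step_size * p_new
        p_new = p_new + step_size * grad_log_post(w_new)
    w_new = w_new + step_size * p_new
    p_new = p_new + 0.5 * step_size * grad_log_post(w_new)

    # Accept/reject keeps the chain exact despite discretization error.
    h_old = -log_post(w) + 0.5 * np.sum(p ** 2)
    h_new = -log_post(w_new) + 0.5 * np.sum(p_new ** 2)
    return w_new if np.log(rng.uniform()) < h_old - h_new else w
```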