Principle and motivation
We use the same solver for the overdamped Langevin dynamics as for the Metropolis-Adjusted Langevin Algorithm, but we work with an estimator for the gradient of the log-probability density function instead of the gradient itself. There are two reasons why one would want to do that:
- When the amount of data is huge the full gradient can be very expensive to compute;
- When there is a risk to be stuck in a region of the parameter space that is not representative of the typical set; the algorithm has annealing properties.
The exposition follows [Welling & Teh]. At time a subset of data items is sampled, a step size is chosen. The gradient is approximated as:
\begin{equation*}
\hat{\nabla} \log \pi(\theta_{t}) = \nabla \log p(\theta_{t}) + \frac{N}{n} \sum_{i} \nabla \log p(x_{it}|\theta_{t})
\end{equation*}So the proposed update using the Langevin dynamics is:
\begin{align*}
\theta_{t+1} &= \theta_{t} + \frac{\epsilon_{t}}{2}\;\left(\nabla \log p(\theta_{t}) + \frac{N}{n} \sum_{i} \nabla \log p(x_{it}|\theta_{t})\right) + \xi_{t}\\
\xi_t &\sim \operatorname{Normal}\left(0, \epsilon_t\right)
\end{align*}We do not use RMH acceptance/rejection step (which would require evaluation over the whole dataset).
Convergence
References
- Welling & Teh (2011) Bayesian Learning via Stochastic Gradient Langevin Dynamics