Generative image models based on ordinary differential equations can be seen as forms of variational auto-encoders with a partially deterministic inference network. $\newcommand{\coloneqq}{\mathrel{\vcenter{:}}=}$

Variational Auto-encoders

Given a dataset of samples $y$ from an unknown distribution $\pi_1$, generative models attempt to find parameters $\theta$ for a fixed family that maximizes the likelihood $p_\theta(y)$. When the family implicitly marginalizes over a latent variable $z$ so that $\log p_\theta(y) = \log E_{z \sim p} p_\theta(y \vert z)$, maximum likelihood estimation becomes intractable in general. Variational auto-encoders address this problem by maximizing a lower bound on the likelihood: \(\begin{align*} \log p_\theta(y) &= \log E_{z \sim p} p_\theta(y \vert z) \\ &=\log E_{q(z \vert y)} \frac{p(y\vert z) p(z)}{q(z\vert y)} &\text{ (importance sampling)}\\ & \geq E_{q(z \vert y)} \log \frac{p(y\vert z) p(z)}{q(z)} &\text{ (Jensen's inequality)} \end{align*}\)

This quantity $\mathcal{L} \coloneqq E_{q(z \vert y)} \log p(y,z) - \log q(z)$ is known as the evidence lower bound or ELBO. The distribution $q(z\vert y)$ used for importance sampling is usually parameterized by a neural net and is known as the inference network. In what follows, we will implicitly assume that $q$ is a function of neural net parameters $\phi$.

From Variational Auto-encoders to Rectified Flows

Assume that the model’s prior over $z$ factors as $p(z) = p(z_0)\prod_t p(z_{t+\Delta} \vert z_t)$, where $t$ ranges from $0$ to $1$ in increments of $\Delta$ and $z_1$ is observed (taking the place of $y$ in the discussion so far). Instead of fixing the model $p$ and learning parameters for a distribution $q$ to approximate its posterior, with diffusion-based approaches, we do the opposite.. Fix the posterior approximation $q(z_t \vert z_1, z_0)$ to be a Delta distribution at $tz_1 + (1 - t)z_0$ (linearly interpolating between observations). Let $p(z_{t+\Delta} \vert z_t)$ be distributed as $\mathcal{N}(z_t + \Delta f_\theta(z_t, t), \sigma^2/\Delta^2)$, where $f_\theta$ is a neural network parameterized by $\theta$. Under this linear interpolation scheme, $z_{t + \Delta} - z_t = \Delta (z_1 - z_0)$. This means that \(\begin{align*} &E\left[\log p((z_{t+\Delta} \vert z_t)\right] \\ &= E\left[-\frac{1}{2} \frac{(z_{t + \Delta} - [z_t + \Delta f_\theta (z_t, t)])^2}{\sigma^2 \Delta^2}\right] \\ &= E\left[\frac{-1}{2\sigma^2 \Delta} ((z_1 - z_0) - f_\theta (z_t, t))^2\right] \end{align*}\)

We can plug this simplification into the full ELBO, noting that the sum over time can be written as an expectation.

\[E_{z_0} \left[ \log p(z_0) - \log q(z_0\vert z_1) + \sum_t \log p(z_{t+\Delta} \vert z_t) \right] = \\ -KL[q(z_0\vert z_1) \vert p(z_0)] + \frac{-1}{2\sigma^2} E_{t, z_0} \left[ ((z_1 - z_0) - f_\theta (z_t, t))^2 \right]\]

It breaks into two parts: a KL divergence keeping $q(z_0\vert z_1)$ close to $p(z_0)$, and a stochastic squared loss in which both time and latent code are sampled. As $\Delta \to 0$, the distribution over times approaches uniform. This is precisely the loss for the Rectified Flow model for the case of a fixed $q(z_0 \vert z_1)$. Originally, the authors interpreted this loss as a way to learn an ODE $\frac{\partial z}{\partial t} = f(z, t)$ which transports samples $z_0$ to observations $z_1$. By casting the loss in the language of variational auto-encoders, however, we can learn a posterior distribution for $z_0$ at the same time.

Categories:

Updated: