Fitting Gaussians with Missing Observations

statistics
Published

February 5, 2026

using LinearAlgebra, StatsBase, StatsModels, LogExpFunctions, Random, Distributions

Say you want to fit a multivariate Normal distribution to some data.

Random.seed!(0)
TaskLocalRNG()
Σ = rand(LKJ(3, 1))
3×3 Matrix{Float64}:
  1.0       -0.192969  -0.194527
 -0.192969   1.0       -0.846977
 -0.194527  -0.846977   1.0
μ = rand(3)
3-element Vector{Float64}:
 0.2895098423219379
 0.028549977665983994
 0.538639413965653
X = (μ .+ sqrt(Σ) * randn(3, 300))'
300×3 adjoint(::Matrix{Float64}) with eltype Float64:
 -0.258093     0.0699272   0.515201
 -0.0370811   -1.57983     2.67111
  0.135337     0.38011     0.352973
  2.28129     -1.86795     1.51179
  0.642071     1.22412    -0.567141
  0.91815      0.696269   -0.538011
 -1.1201      -0.756545    1.64353
  0.74597      0.452027    0.746104
 -0.928414     2.59154    -1.33622
  1.49796      0.800874   -1.23821
  ⋮                       
 -1.21284     -0.394641    1.30832
  0.865095    -1.35088     1.81753
  0.00223982  -0.316666    1.06955
 -1.11874     -0.0782368   0.910057
 -0.00174339  -1.01046     1.77748
 -1.64443      0.735486    1.0054
 -0.0329084    0.585624   -0.505953
  1.71293     -0.641521    0.970406
  0.260026     0.651592    0.155827

The obvious way to do this is with maximum likelihood estimation. In natural parameter form, the log likelihood is given by

\[ \sum_i \left( x_i^T \Lambda \mu - \frac{1}{2} \text{Tr}(\Lambda x_i x_i^T) \right) - N A(\eta) \] Differentiate with respect to the natural parameters \(\eta = (\Lambda, \Lambda \mu)\) and set the gradient to zero to get

\[ \begin{align*} \frac{1}{N}\sum_i x_i &= E[x] \\ \frac{1}{N}\sum_i x_i x_i^T &= E[xx^T] \end{align*} \]
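In code, this complete-data fit is just moment matching on the simulated `X` (a minimal sketch; `μ̂` and `Σ̂` are names introduced here for illustration only):

μ̂ = vec(mean(X; dims=1))     # sample mean
Σ̂ = cov(X; corrected=false)  # MLE covariance, i.e. normalized by 1/N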

Easy. But now, how might you do this if some subset of the data is missing? We’ll assume that the data isn’t missing completely at random, so that our observed marginals aren’t unbiased estimators of the true marginals.

begin
    # the probability of an entry being missing depends on the data itself,
    # so the values are not missing completely at random
    missing_mask = rand(size(X, 1)) .< logistic.(-1 .+ X * -rand(size(X, 2), 3))
    Xm = convert(Matrix{Union{Float64, Missing}}, X)
    Xm[missing_mask] .= missing
    Xm
end
300×3 Matrix{Union{Missing, Float64}}:
 -0.258093     0.0699272   0.515201
 -0.0370811   -1.57983     2.67111
   missing      missing     missing
  2.28129     -1.86795     1.51179
  0.642071     1.22412    -0.567141
  0.91815      0.696269     missing
 -1.1201      -0.756545    1.64353
  0.74597      0.452027    0.746104
 -0.928414      missing   -1.33622
  1.49796      0.800874   -1.23821
  ⋮                       
 -1.21284     -0.394641    1.30832
  0.865095    -1.35088     1.81753
  0.00223982  -0.316666    1.06955
 -1.11874     -0.0782368   0.910057
   missing      missing     missing
 -1.64443       missing    1.0054
   missing      missing     missing
  1.71293     -0.641521    0.970406
  0.260026     0.651592    0.155827
sum(ismissing.(Xm); dims=1) ./ size(Xm, 1)
1×3 Matrix{Float64}:
 0.25  0.263333  0.24
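Because the missingness depends on the data, the naive per-column mean over only the observed entries is not an unbiased estimate of \(\mu\). A quick way to see the gap (a sketch; the exact numbers depend on the RNG state, so they aren’t reproduced here):

# per-column mean over just the observed entries, for comparison with the true μ
[mean(skipmissing(col)) for col in eachcol(Xm)]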

We’re trying to find a maximum likelihood estimate in the presence of latent variables (in this case, all the unobserved values). It’s a classic situation for the EM algorithm!

At each step of EM, we want to find parameters \(\mu, \Sigma\) that maximize the expected log likelihood, where the expectation is taken with respect to \(p(x_\text{unobserved} \mid x_\text{observed})\) under the current parameter estimates. That’s

\[ \begin{align*} E\left[\sum_i \left( x_i^T \Lambda \mu - \frac{1}{2} \text{Tr}(\Lambda x_i x_i^T) \right) - N A(\eta)\right] = \sum_i \left( E[x_i]^T \Lambda \mu - \frac{1}{2} \text{Tr}(\Lambda E[x_i x_i^T]) \right) - N A(\eta) \end{align*} \]

Take the gradient as before to get

\[ \begin{align*} \mu &= \frac{1}{N}\sum_i E[x_i] \\ \Sigma &= \frac{1}{N}\sum_i E[x_i x_i^T] - \mu \mu^T \end{align*} \]

where the second line uses \(E[x x^T] = \Sigma + \mu \mu^T\).

It remains to work out what \(E[x_i]\) and \(E[x_i x_i^T]\) are. For the components of \(x_i\) that are observed (call this subvector \(x_o\)), \(E[x_o] = x_o\). For the components \(x_u\) that are not observed, the Gaussian conditioning formula gives \(E[x_u \mid x_o] = \mu_u + \Sigma_{uo} \Sigma_{oo}^{-1}(x_o - \mu_o)\).

function E_m1(x, μ, Σ)
    # replace the missing entries of x with their conditional mean given the observed ones
    x2 = copy(x)
    u = ismissing.(x)
    o = .!u
    x2[u] .= μ[u] + Σ[u, o] * (Σ[o, o] \ (x[o] - μ[o]))
    x2
end
E_m1 (generic function with 1 method)
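As an illustration (cheating by plugging in the true \(\mu\) and \(\Sigma\), which we wouldn’t have in practice), row 6 of `Xm` in the printout above has a missing third entry, and `E_m1` fills it with its conditional mean:

# hypothetical usage: impute row 6's missing entry under the true parameters
E_m1(Xm[6, :], μ, Σ)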

The second moment breaks down similarly. \(E[x_i x_i^T]\) can be thought of (up to permutation) as a block matrix:

\[ \begin{bmatrix} x_o x_o^T & x_o x_u^T \\ x_u x_o^T & x_u x_u^T \end{bmatrix} \]

As \(x_o\), and hence \(x_o x_o^T\), is known, that just leaves \(E[x_o x_u^T] = x_o \, E[x_u \mid x_o]^T\) and \(E[x_u x_u^T] = \text{Var}(x_u \mid x_o) + E[x_u \mid x_o] \, E[x_u \mid x_o]^T\), where \(E[x_u \mid x_o]\) is the conditional mean from above. The formula for conditional Gaussians tells us \(\text{Var}(x_u \mid x_o) = \Sigma_{uu} - \Sigma_{uo} \Sigma_{oo}^{-1} \Sigma_{ou}\).

function E_m2(x, μ, Σ)
    x2 = Matrix{Float64}(undef, size(Σ))
    u = ismissing.(x)
    o = .!u
    # conditional mean of the unobserved block given the observed one
    m_u = μ[u] + Σ[u, o] * (Σ[o, o] \ (x[o] - μ[o]))
    x2[o, o] .= x[o] * x[o]'
    x2[o, u] .= x[o] * m_u'
    x2[u, o] .= x2[o, u]'
    # conditional covariance plus outer product of the conditional mean
    x2[u, u] .= Σ[u, u] - Σ[u, o] * (Σ[o, o] \ Σ[o, u]) + m_u * m_u'
    x2
end
E_m2 (generic function with 1 method)
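A quick sanity check (a sketch): for a fully observed row there is nothing to integrate out, so `E_m2` should return the plain outer product:

# row 1 of Xm has no missing entries, so its expected second moment is just x * x'
E_m2(Xm[1, :], μ, Σ) ≈ X[1, :] * X[1, :]'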

Let’s put it all together!

function em_step(X, μ, Σ)
    # E-step: expected moments under the current (μ, Σ);
    # M-step: match the first two moments
    m1 = mean(E_m1(x, μ, Σ) for x in eachrow(X))
    m2 = mean(E_m2(x, μ, Σ) for x in eachrow(X))
    (m1, m2 - m1 * m1')
end
em_step (generic function with 1 method)
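A single step from an arbitrary starting point looks like this (an illustrative call; the starting values are made up):

# one EM update starting from a zero mean and identity covariance
em_step(Xm, zeros(3), Matrix(1.0I, 3, 3))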
function em_alg(X, μ, Σ)
    for _ in 1:100
        μ2, Σ2 = em_step(X, μ, Σ)
        δ = max(maximum(abs.(μ - μ2)), maximum(abs.(Σ2 - Σ)))
        if δ < 1e-5
            return (μ2, Σ2)
        end
        μ, Σ = μ2, Σ2
    end
    (μ, Σ)  # fall back to the last iterate if we hit the iteration cap
end
em_alg (generic function with 1 method)
μ_guess, Σ_guess = em_alg(Xm, randn(3), 1.0I(3))
([0.3165736154417296, 0.001442844333011344, 0.5068980884346564], [0.9352904749796694 -0.14254207406265143 -0.1302467250906596; -0.14254207406265143 0.7845792845571478 -0.6864141697229134; -0.1302467250906596 -0.6864141697229134 0.897912273140797])
(μ, Σ)
([0.2895098423219379, 0.028549977665983994, 0.538639413965653], [1.0 -0.19296912090384166 -0.19452690226681507; -0.19296912090384166 1.0 -0.8469773751344721; -0.19452690226681507 -0.8469773751344721 1.0])

Our guess is pretty close to the true value!
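To put a number on “pretty close”, one quick check (a sketch; the exact values depend on the RNG state):

# worst-case absolute error in the estimated mean and covariance
maximum(abs.(μ_guess .- μ)), maximum(abs.(Σ_guess .- Σ))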