Matching in Observational Studies

A 'matching' quasi-experimental design controls for confounding variables \(x\) by estimating what the control outcomes \(y\) would be if the control population had the same values of \(x\) as the treatment population. To do this, we regress outcomes in the control population on \(x\), then apply this regression model to the treatment population's confounder distribution.
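To make that concrete before the full Bayesian treatment below, here is a toy sketch of the idea on simulated data, with a plain linear regression standing in for the Gaussian process used later (all names and numbers here are illustrative, not part of the analysis below):

import numpy as np

rng = np.random.default_rng(0)

# Simulated toy data: one confounder x affects both the outcome y and
# treatment assignment; the simulated treatment effect is 1.0.
x_control = rng.normal(size=500)
y_control = 2.0 * x_control + rng.normal(size=500)
x_treated = rng.normal(loc=-1.0, size=100)
y_treated = 2.0 * x_treated + 1.0 + rng.normal(size=100)

# Regress control outcomes on the confounder...
X = np.column_stack([np.ones_like(x_control), x_control])
coef, *_ = np.linalg.lstsq(X, y_control, rcond=None)

# ...then apply the control-only model to the treated group's confounder
# values to predict their counterfactual (untreated) outcomes.
y0_hat = np.column_stack([np.ones_like(x_treated), x_treated]) @ coef

print(y_treated.mean() - y0_hat.mean())    # close to the simulated effect, 1.0
print(y_treated.mean() - y_control.mean()) # naive difference, biased by x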

import os
os.environ["JAX_PLATFORMS"] = "cpu"
import numpyro
numpyro.set_host_device_count(4)
import pyro_util
import pandas as pd
import numpy as np
import numpyro.distributions as dist
import arviz as az
import xarray as xr
import seaborn as sns
import preliz as pz
from scipy.stats import zscore
import jax.numpy as jnp

The book Causal Inference: The Mixtape describes the problem of estimating the effect on future earnings of attending a job training program, specifically the National Supported Work Demonstration program established by the US government in the '70s. We want to compare changes in earnings for those who enrolled in the program to changes in earnings of the general US population as reported by the Current Population Survey. By using the matching model above, we can control for the fact that those who enrolled in the program were generally less educated and younger than the overall population. The analysis below deviates substantially from the one described in the book.

cps = pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/cps_mixtape.dta")
nsw = pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/nsw_mixtape.dta")
df = pd.concat([nsw, cps])
df['treated'] = df.treat == 1.0
df['improvement'] = df.re78 - df.re75

This plot shows that the treated population is younger on average.

sns.displot(df, x="age", hue="treated", kind='ecdf');

png

This plot shows that they're less educated as well.

sns.displot(df, x="educ", hue="treated", kind='ecdf');

png

The plot below shows that education also affects improvement; combined with the group differences above, this makes it a confounder, influencing both treatment and outcome.

sns.relplot(df, x="educ", y="improvement", kind="line", errorbar="sd");

png

The same is true for age:

sns.relplot(df, x="age", y="improvement", kind="line", errorbar="sd");

png

As treated participants are more likely to be younger, and younger participants generally showed more improvement, naively taking a difference between the populations' improvements might over-estimate the true effect of the program on the treated population. At the same time, the treatment group was also less educated, and we saw more improvement among more educated people, so a naive difference might equally under-estimate the true effect. We control for both confounders by using a Gaussian process to predict the improvements the treatment population would have had, had they not been treated.

I'll use RBF kernels in the GP. These kernels will have a loose prior over lengthscales, wide enough to account for any of the distances observed in the data.

def lengthscale_params(xs):
    # Pairwise absolute distances between the observed values.
    x_vals = xs.to_numpy()
    differences = np.abs(np.subtract.outer(x_vals, x_vals))
    nz = differences[differences != 0]
    # Find InverseGamma parameters that put 95% of the prior mass between
    # the 2.5th and 97.5th percentiles of the observed distances.
    l_b = np.quantile(nz, 0.025)
    u_b = np.quantile(nz, 0.975)
    prior = pz.InverseGamma()
    pz.maxent(prior, l_b, u_b, 0.95, plot=False)
    return prior.params
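For example, it might be applied to the standardized ages that feed the GP below. This usage is illustrative, not part of the model code; the subsample keeps the quadratic pairwise-distance matrix small.

# Hypothetical usage: lengthscale prior parameters for standardized age.
ages = pd.Series(zscore(df.age)).sample(1000, random_state=0)
alpha, beta = lengthscale_params(ages)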

A quick glance at the improvement distribution suggests that we need to model a zero-improvement outcome as a special case. I'll only sample from the Gaussian process when the latent 'unimproved' Bernoulli \(z\) is false.

sns.displot(df, x="improvement")
<seaborn.axisgrid.FacetGrid at 0x30e99ff70>

png

df["unimproved"] = df.improvement == 0

It seems like a lack of improvement is slightly correlated with education and age.

sns.displot(df, x="educ", hue="unimproved", kind='ecdf')
<seaborn.axisgrid.FacetGrid at 0x30fe2ef70>

png

sns.displot(df, x="age", hue="unimproved", kind='ecdf')
<seaborn.axisgrid.FacetGrid at 0x310f821c0>

png

Together, these observations lead to the following model. Under it, an untreated improvement is exactly zero with probability \(p\) and otherwise drawn from a Normal centered on the GP \(f\), so the expected untreated improvement is \((1 - p) \cdot f\); the effect calculation below uses this expectation.

def matching_model():
    control = df[~df.treated]
    # GP over the standardized confounders, fit on the control population.
    f, f_control = pyro_util.hsgp_rbf(
        control, "zscore(age) + zscore(educ)")
    # Bernoulli model for the probability of zero improvement; `mask` selects
    # the control rows with nonzero improvement.
    unimproved_model, mask = pyro_util.glm(
        control, "unimproved ~ zscore(age) + zscore(educ)", family=dist.Bernoulli)
    sigma = numpyro.sample("sigma", dist.HalfNormal(control.improvement.std()))
    numpyro.sample("y0", dist.Normal(loc=f_control[mask], scale=sigma),
                   obs=control.improvement[mask])
    treated = df[df.treated]
    # Expected untreated improvement is (1 - p_unimproved) * f, evaluating
    # both models at the treated population's confounder values.
    p_unimproved = unimproved_model.at(treated)
    numpyro.deterministic(
        "effect",
        treated.improvement.mean()
        - (f.at(treated) * (1 - p_unimproved)).mean())
mcmc = pyro_util.fit_nuts(matching_model, num_samples=1500)
results = pyro_util.from_numpyro(df, matching_model, mcmc, predictive=False)
az.plot_posterior(results, var_names="effect");

png

The matching model estimates that the job training program led to between 2704 and 2991 in additional inflation-adjusted earnings, which, as we suspected, is lower than the naive difference of means would suggest.

(df[df.treated]['improvement']).mean() - (df[~df.treated]['improvement']).mean()
3587.7637

The GP Interface

In the model above I used the function pyro_util.hsgp_rbf (via a formula-parsing wrapper), whose core I defined as follows:

from collections import namedtuple
from typing import Union

from numpyro import handlers
from numpyro.contrib.hsgp.laplacian import eigenfunctions
from numpyro.contrib.hsgp.spectral_densities import (
    diag_spectral_density_squared_exponential,
)

class hsgp(namedtuple("hsgp", "spd beta ell m")):
    __slots__ = ()
    def at(self, x):
        # Evaluate the GP at inputs x by projecting them onto the Laplacian
        # eigenfunction basis and weighting by the sampled coefficients.
        phi = eigenfunctions(x=x, ell=self.ell, m=self.m)
        return phi @ (self.spd * self.beta)

def hsgp_rbf(
    prefix: str,
    alpha: float,
    ell: float,
    m: int,
    length: Union[float, list[float]]) -> hsgp:
    dim = len(length) if hasattr(length, "__len__") else 1
    # Square root of the RBF kernel's spectral density at the basis frequencies.
    spd = jnp.sqrt(diag_spectral_density_squared_exponential(
            alpha=alpha, length=length, ell=ell, m=m, dim=dim))
    # Sample one coefficient per basis function, namespaced by `prefix` so a
    # model can contain several independent GPs.
    with handlers.scope(prefix=prefix):
        with numpyro.plate("basis", len(spd)):
            beta = numpyro.sample("beta", dist.Normal())
    return hsgp(spd, beta, ell, m)

This allows me to reference the same Gaussian process "f" multiple times within my model above, unlike the hsgp_squared_exponential function currently in NumPyro, which samples and evaluates the GP at a single fixed set of inputs.
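As a sketch of that benefit, a model can sample the GP's coefficients once and evaluate the identical function at several inputs; the hyperparameters and 1-D input arrays here are placeholders:

def two_populations(x_control, x_treated):
    # Sample one set of basis coefficients under the scope "f"...
    f = hsgp_rbf("f", alpha=1.0, ell=3.0, m=20, length=0.5)
    # ...then evaluate the same function at two different sets of inputs.
    f_control = f.at(x_control)
    f_treated = f.at(x_treated)
    numpyro.deterministic("gap", f_treated.mean() - f_control.mean())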