Matching in Observational Studies
A 'matching' quasi-experimental design controls for confounder variables $x$ by estimating what the control outcomes $y$ would be if the control population had the same values of $x$ as the treatment population. To do this, we regress outcomes in the control population on $x$, and apply this regression model to the treatment population's confounder distribution.
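In its simplest form, if $f(x)$ is a model of $\mathbb{E}[y \mid x]$ fit on the control population, the estimated effect is the treated group's mean outcome minus the model's average prediction at the treated units' covariates:

$$\hat{\tau} = \frac{1}{n_1} \sum_{i\,:\,\text{treated}} y_i \;-\; \frac{1}{n_1} \sum_{i\,:\,\text{treated}} f(x_i)$$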
import os
os.environ["JAX_PLATFORMS"] = "cpu"
import numpyro
numpyro.set_host_device_count(4)
import pyro_util
import pandas as pd
import numpy as np
import numpyro.distributions as dist
import arviz as az
import xarray as xr
import seaborn as sns
import preliz as pz
from scipy.stats import zscore
import jax.numpy as jnp
The book Causal Inference: The Mixtape describes the problem of estimating the effect of attending a job training program, specifically the National Supported Work Demonstration program established by the US government in the 1970s, on future earnings. We want to compare changes in earnings for those who enrolled in the program to changes in earnings of the general US population as reported by the Current Population Survey. By using the matching model above, we can control for the fact that those who enrolled in the program were generally less educated and younger than the overall population.
cps = pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/cps_mixtape.dta")
nsw = pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/nsw_mixtape.dta")
df = pd.concat([nsw, cps])
df['treated'] = df.treat == 1.0
df['improvement'] = df.re78 - df.re75
sns.displot(df, x="age", hue="treated", kind='ecdf');
sns.displot(df, x="educ", hue="treated", kind='ecdf');
We'll try to match participants based on education level and age.
xs = ['educ', 'age']
sns.relplot(df, x="educ", y="improvement", kind="line", errorbar="sd");
sns.relplot(df, x="age", y="improvement", kind="line", errorbar="sd");
We can see that these covariates have a noticeable effect on improvement, so accounting for them will be important. Treated participants tend to be younger, and younger participants generally showed more improvement, so naively taking a difference between the populations' improvements might overestimate the true effect of the program. At the same time, the treatment group was also less educated, and we saw more improvement among more educated people, which biases the naive difference in the opposite direction.
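To make those group differences concrete, we can compare covariate means directly (a quick check):

df.groupby("treated")[["age", "educ"]].mean()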
I'll use Gaussian process regression with RBF kernels. Each kernel lengthscale gets a loose InverseGamma prior, wide enough to cover the range of pairwise distances observed in the data.
def lengthscale_params(xs):
    # Fit an InverseGamma lengthscale prior whose central 95% of mass
    # spans the central 95% of nonzero pairwise distances in the data.
    x_vals = xs.to_numpy()
    differences = np.abs(np.subtract.outer(x_vals, x_vals))
    nz = differences[differences != 0]
    l_b = np.quantile(nz, 0.025)
    u_b = np.quantile(nz, 0.975)
    prior = pz.InverseGamma()
    pz.maxent(prior, l_b, u_b, 0.95, plot=False)  # updates `prior` in place
    return prior.params
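As a quick sanity check on this helper, we can fit the prior to a small synthetic sample; this is a hypothetical example, and the returned tuple holds the fitted InverseGamma parameters:

demo = pd.Series(np.random.default_rng(0).normal(size=50))
lengthscale_params(demo)  # fitted InverseGamma parameters (alpha, beta)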
A quick glance at the improvement distribution suggests that we need to model a zero-improvement outcome as a special case.
sns.displot(df, x="improvement");
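To quantify that spike at zero, we can check what share of each group's improvements is exactly zero (a quick check):

df.groupby("treated")["improvement"].apply(lambda s: (s == 0.0).mean())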
I'll only sample from the Gaussian Process if the latent 'nonzero-outcome' Bernoulli $z$ is true.
def matching_model(control_x, treated_x, l_prior, s, y0, y1_mean, ell):
    # One lengthscale per covariate, each with its maxent InverseGamma prior
    with numpyro.plate("ldim", l_prior[0].shape[0]):
        l = numpyro.sample("l", dist.InverseGamma(*l_prior))
    alpha = numpyro.sample("alpha", dist.HalfNormal(s / 3))
    sigma = numpyro.sample("sigma", dist.HalfNormal(s))
    f = pyro_util.hsgp_rbf("f", length=l, alpha=alpha, m=15, ell=ell)
    # Probability that a control outcome is nonzero
    p = numpyro.sample("p", dist.Beta(1, 1))
    mask = y0 != 0.0
    with numpyro.plate("control", control_x.shape[0]):
        numpyro.sample("z", dist.Bernoulli(p), obs=mask)
    # Condition the GP only on the nonzero control outcomes
    with numpyro.plate("nonzero", int(mask.sum())):
        numpyro.sample("y0", dist.Normal(loc=f.at(control_x[mask]),
                                         scale=sigma), obs=y0[mask])
    numpyro.deterministic("effect", y1_mean - (f.at(treated_x) * p).mean())
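The final line relies on the zero-inflated structure: the expected control outcome at covariates $x$ is $\mathbb{E}[y \mid x] = p \, f(x)$, since the outcome is exactly zero with probability $1 - p$ and centered at $f(x)$ otherwise. Averaging that quantity over the treated units' covariates gives the counterfactual control mean, which is subtracted from the treated group's observed mean improvement.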
dfz = zscore(df[xs])  # standardize the covariates
control_x = jnp.array(dfz[~df.treated].values)
treated_x = jnp.array(dfz[df.treated].values)
s = df.improvement[~df.treated].std()
ell = 2 * dfz[xs].abs().max().to_numpy()[:, None]  # HSGP boundary: twice the largest |x| per covariate
y0 = df[~df.treated].improvement.to_numpy()
y1_mean = df[df.treated].improvement.to_numpy().mean()
l_prior = np.array([lengthscale_params(dfz.loc[~df.treated, x]) for x in xs]).T  # shape (2, len(xs))
def model(predictive=False):
matching_model(control_x, treated_x, l_prior, s, y0, y1_mean, ell)
mcmc = pyro_util.fit_nuts(model, num_samples=1500)
results = pyro_util.from_numpyro(df, model, mcmc, predictive=False)
az.plot_posterior(results, var_names="effect");
The matching model estimates that the job training program led to between 2704 and 2991 dollars in additional inflation-adjusted earnings, which, as we suspected, is lower than a naive difference of means would suggest.
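The interval can also be extracted numerically rather than read off the plot, for example with ArviZ's HDI helper:

az.hdi(results, var_names=["effect"])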
(df[df.treated]['improvement']).mean() - (df[~df.treated]['improvement']).mean()
3587.7637
The GP Interface
In the model above I used the function pyro_util.hsgp_rbf, which I defined as follows (the eigenfunction and spectral-density helpers come from numpyro.contrib.hsgp):
from collections import namedtuple
from typing import Union

from numpyro import handlers
from numpyro.contrib.hsgp.laplacian import eigenfunctions
from numpyro.contrib.hsgp.spectral_densities import (
    diag_spectral_density_squared_exponential,
)

class hsgp(namedtuple("hsgp", "spd beta ell m")):
    __slots__ = ()

    def at(self, x):
        # Project x onto the Laplacian eigenfunction basis and weight each
        # basis function by its spectral density times its sampled coefficient
        phi = eigenfunctions(x=x, ell=self.ell, m=self.m)
        return phi @ (self.spd * self.beta)

def hsgp_rbf(
        prefix: str,
        alpha: float,
        ell: float,
        m: int,
        length: Union[float, list[float]]) -> hsgp:
    dim = len(length) if hasattr(length.__class__, "__len__") else 1
    # Square root of the RBF kernel's spectral density at the first m frequencies
    spd = jnp.sqrt(diag_spectral_density_squared_exponential(
        alpha=alpha, length=length, ell=ell, m=m, dim=dim))
    # Scope the basis coefficients under `prefix` so several GPs can coexist
    with handlers.scope(prefix=prefix):
        with numpyro.plate("basis", len(spd)):
            beta = numpyro.sample("beta", dist.Normal())
    return hsgp(spd, beta, ell, m)
This allows me to reference the same Gaussian process "f" multiple times within my model above, unlike the hsgp_squared_exponential function currently within NumPyro.
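For illustration, here's a minimal hypothetical model that scores the same latent GP against two different sets of inputs; the alpha, length, and ell values are placeholders. The key point is that hsgp_rbf samples the basis coefficients once, while .at is a purely deterministic projection, so repeated evaluations share the same function draw.

def shared_gp_model(x_a, x_b, y_a=None, y_b=None):
    # One latent function, evaluated at two different input sets
    f = hsgp_rbf("f", alpha=1.0, length=1.0, m=15, ell=3.0)
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    numpyro.sample("y_a", dist.Normal(f.at(x_a), sigma), obs=y_a)
    numpyro.sample("y_b", dist.Normal(f.at(x_b), sigma), obs=y_b)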