Other articles


  1. Bayesian Power Analysis for A/B/n Tests

    This post highlights a Bayesian approach to sample size estimation in A/B/n testing. Say we're trying to test which variant of an email message generates the highest response rate from a population. We consider \(k\) different messages and send out \(n\) emails for each message. After we wait …

    read more
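
    The full post is behind the link, but as a rough illustration of the kind of simulation-based power estimate it describes, here is a minimal Beta-Binomial sketch. The variant count, response rates, sample size, and decision threshold below are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical setup: three email variants with assumed true response rates.
    true_rates = np.array([0.10, 0.10, 0.13])   # the third variant is the true winner
    n_per_arm = 1500                            # candidate sample size per variant
    n_sims = 2000                               # simulated experiments
    n_draws = 4000                              # posterior draws per experiment

    wins = 0
    for _ in range(n_sims):
        # Simulate one experiment: response counts per arm under the assumed rates.
        successes = rng.binomial(n_per_arm, true_rates)
        # A Beta(1, 1) prior gives a Beta posterior for each arm's response rate.
        post = rng.beta(1 + successes[:, None],
                        1 + n_per_arm - successes[:, None],
                        size=(len(true_rates), n_draws))
        # Posterior probability that the truly best arm beats every other arm.
        p_best = np.mean(post.argmax(axis=0) == true_rates.argmax())
        wins += p_best > 0.95   # would we confidently declare the right winner?

    print(f"estimated power at n={n_per_arm}: {wins / n_sims:.2f}")
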
  2. Finding Common Topics

    How do you find thematic clusters in a large corpus of text documents? The techniques baked into sklearn (e.g. nonnegative matrix factorization, LDA) give you some intuition about common themes. But contemporary NLP has largely moved on from bag-of-words representations. We can do better with some transformer models!

    For …

    read more
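
    As a sketch of the embed-then-cluster idea (not necessarily the pipeline the post uses), one could swap the bag-of-words counts for transformer sentence embeddings and cluster those. The model name and the choice of k-means below are assumptions for illustration.

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    docs = [
        "The quarterly earnings beat analyst expectations.",
        "Central banks signalled another interest rate hike.",
        "The new midfielder scored twice in the derby.",
        "Injuries forced the coach to rotate the squad.",
    ]

    # Dense sentence embeddings instead of bag-of-words counts.
    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed off-the-shelf model
    embeddings = model.encode(docs, normalize_embeddings=True)

    # Cluster the embeddings; each cluster is a candidate "common topic".
    labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
    for doc, label in zip(docs, labels):
        print(label, doc)
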
  3. Matching in Observational Studies

    A 'matching' quasi-experimental design controls for confounding variables \(x\) by estimating what the control outcomes \(y\) would be if the control population had the same values of \(x\) as the treatment population. To do this, we regress outcomes in the control population on \(x\), and apply this regression model to …

    read more
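
    A minimal sketch of the regression-adjustment step described above, on made-up data where the covariates \(x\) confound a binary treatment \(t\) and the outcome \(y\):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 2000
    x = rng.normal(size=(n, 2))
    t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))   # treatment assignment depends on x
    y = 1.0 + x @ np.array([0.5, -0.3]) + 2.0 * t + rng.normal(scale=0.5, size=n)

    # Regress control outcomes on x, then predict what controls "would have"
    # looked like with the treatment group's covariate values.
    control_model = LinearRegression().fit(x[t == 0], y[t == 0])
    counterfactual = control_model.predict(x[t == 1])

    # Treated outcomes minus the model's predictions estimates the effect of
    # treatment on the treated (about 2.0 with this simulated data).
    print("estimated effect:", (y[t == 1] - counterfactual).mean())
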
  4. Finite Basis Gaussian Processes

    By Mercer's theorem, every positive definite kernel \(k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) that we might want to use in a Gaussian Process corresponds to some inner product \(\langle \phi(x), \phi(y) \rangle\), where \(\phi : \mathcal{X} \to \mathcal{V}\) maps our inputs into …

    read more
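
    One concrete instance of such an explicit feature map is random Fourier features for the RBF kernel, where \(k(x, y) \approx \langle \phi(x), \phi(y) \rangle\) for a finite-dimensional \(\phi\). The post may use a different finite basis, so treat this as an illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    d, num_features, lengthscale = 1, 2000, 1.0

    # Sample frequencies from the RBF kernel's spectral density, plus random phases.
    omega = rng.normal(scale=1.0 / lengthscale, size=(num_features, d))
    phase = rng.uniform(0, 2 * np.pi, size=num_features)

    def phi(x):
        # x: (n, d) -> (n, num_features) explicit finite-dimensional feature map
        return np.sqrt(2.0 / num_features) * np.cos(x @ omega.T + phase)

    x = rng.normal(size=(5, d))
    approx = phi(x) @ phi(x).T                                # inner products of features
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    exact = np.exp(-0.5 * sq_dists / lengthscale**2)          # true RBF kernel matrix
    print(np.abs(approx - exact).max())                       # shrinks as num_features grows
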
  5. Finite Particle Approximations

    Say you have a discrete distribution \(\pi\) that you want to approximate with a small number of weighted particles. Intuitively, it seems like the best choice of particles would be the outputs of highest probability under \(\pi\), and that the relative weights of these particles should be the same …

    read more
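
    Here is that "intuitive" baseline written out as code: keep the \(k\) highest-probability outcomes and renormalize their weights (whether that really is the best \(k\)-particle approximation is the question the teaser sets up).

    import numpy as np

    rng = np.random.default_rng(0)
    pi = rng.dirichlet(np.ones(20))      # some discrete distribution over 20 outcomes
    k = 5

    top = np.argsort(pi)[-k:]            # indices of the k most probable outcomes
    weights = pi[top] / pi[top].sum()    # keep their relative weights, renormalized

    # Measure how far the k-particle approximation is from pi in total variation.
    approx = np.zeros_like(pi)
    approx[top] = weights
    print("TV distance:", 0.5 * np.abs(pi - approx).sum())
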
  6. Nearest Neighbor Gaussian Processes

    In a k-Nearest Neighbor Gaussian Process, we assume that the input points \(x\) are ordered in such a way that \(f(x_i)\) is independent of \(f(x_j)\) whenever …

    read more
  7. Fast SLAM

    This notebook looks at a technique for simultaneous localization (finding the position of a robot) and mapping (finding the positions of any obstacles), abbreviated as SLAM. In this model, the probability distribution for the robot's trajectory \(x_{1:t}\) is represented with a set of weighted particles. Let the weight …

    read more
  8. Mapping with Gaussian Conditioning

    For a robot to navigate autonomously, it needs to learn the locations of any potential obstacles around it. One of the standard ways to do this is with an algorithm known as EKF-SLAM. SLAM stands for "simultaneous localization and mapping", as the algorithm must simultaneously find out where the robot …

    read more
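
    This isn't the post's EKF-SLAM code; it's just a small sketch of the Gaussian-conditioning identity that such a mapping step builds on, applied to a made-up joint covariance.

    import numpy as np

    def condition(mu, Sigma, obs_idx, obs):
        # Mean and covariance of the unobserved block of a joint Gaussian,
        # given observed values for the block indexed by obs_idx.
        hid = np.setdiff1d(np.arange(len(mu)), obs_idx)
        S_hh = Sigma[np.ix_(hid, hid)]
        S_ho = Sigma[np.ix_(hid, obs_idx)]
        S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
        gain = S_ho @ np.linalg.inv(S_oo)
        mu_cond = mu[hid] + gain @ (obs - mu[obs_idx])
        Sigma_cond = S_hh - gain @ S_ho.T
        return mu_cond, Sigma_cond

    # Toy example: entries 0-1 play the role of a landmark position, entries 2-3
    # the role of a correlated measurement; conditioning on the measurement
    # tightens the landmark estimate.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 4))
    mu, Sigma = np.zeros(4), A @ A.T + np.eye(4)   # random SPD covariance
    mu_c, Sigma_c = condition(mu, Sigma, np.array([2, 3]), np.array([0.5, -0.2]))
    print(mu_c, np.diag(Sigma_c))
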
  9. Conjugate Computation

    This post is about a technique that allows us to use variational message passing on models where the likelihood doesn't have a conjugate prior. There will be a lot of Jax code snippets to make everything as concrete as possible.

    The Math

    Say \(X\) comes from a distribution with density …

    read more
  10. Sparse Variational Gaussian Processes

    This notebook introduces the Fully Independent Training Conditional (FITC) sparse variational Gaussian process model. You shouldn't need any prior knowledge about Gaussian processes; it's enough to know how to condition and marginalize finite-dimensional Gaussian distributions. I'll assume you know about variational inference and Pyro, though.

    import pyro
    import pyro.distributions …
    read more
  11. Differential Equations Refresher

    In my freshman year of college, I took an introductory differential equations class. That was nine years ago. I've forgotten pretty much everything, so I thought I'd review a little, trying to generalize the techniques along the way. I'll use summation notation throughout, and write \(\frac{\partial^n}{\partial x …

    read more
  12. Fun with Likelihood Ratios

    Say you're trying to maximize a likelihood \(p_{\theta}(x)\), but you only have an unnormalized version \(\hat{p_{\theta}}\) for which \(p_{\theta}(x) = \frac{\hat{p_\theta}(x)}{N_\theta}\). How do you pick \(\theta\)? Well, you can rely on the magic of self-normalized importance sampling.

    read more
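
    As a numerical sketch of that idea (not the post's code): for an unnormalized exponential-family density, the intractable term in the log-likelihood gradient is an expectation under \(p_\theta\), which self-normalized importance sampling approximates with weighted draws from a simple proposal. The density, proposal, and step size below are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_p_hat(x, theta):
        # Unnormalized log-density: p_hat_theta(x) = exp(theta * x - x**4),
        # whose normalizer N_theta has no closed form.
        return theta * x - x**4

    def snis_grad(x_obs, theta, n_samples=20_000):
        # d/dtheta log p_theta(x_obs) = x_obs - E_{p_theta}[X]; approximate the
        # expectation with self-normalized importance sampling from q = N(0, 2^2).
        xs = rng.normal(scale=2.0, size=n_samples)
        log_q = -0.5 * (xs / 2.0) ** 2 - np.log(2.0 * np.sqrt(2 * np.pi))
        log_w = log_p_hat(xs, theta) - log_q          # unnormalized log-weights
        w = np.exp(log_w - log_w.max())
        w /= w.sum()                                  # the self-normalization step
        return x_obs - np.sum(w * xs)

    # Crude stochastic gradient ascent on the (approximate) log-likelihood of one
    # observation; theta drifts toward the value where E_{p_theta}[X] matches x_obs.
    theta, x_obs = 0.0, 0.8
    for _ in range(300):
        theta += 0.1 * snis_grad(x_obs, theta)
    print("theta after SNIS gradient ascent:", theta)
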