This blog post is incomplete and currently being written.

Mixing Time Bounds via Functional Analysis

Problem Statement. Let $\mu$ be a probability distribution on some state space $\Omega$, and let $\mathsf P$ be a Markov chain on $\Omega$ which is reversible w.r.t. $\mu$. Our goal is to study the total variation mixing time of $\mathsf P$, which controls how efficient it is to use $\mathsf P$ as a sampler. The classical approaches contain the path coupling methods and the Poincaré Inequalities/spectral gap methods to control the mixing time $T_{\text{mix}}$. Both of these are instantiations of a much more general strategy.

Theorem 1. Show that some other measure of “distance” between probability measures $\mathscr D(\cdot \| \cdot)$ contracts under every application of $\mathsf P$. In other words, for some $0 < \alpha < 1$ which is not too small, we have

$$ \mathscr{D}(\nu\mathsf{P}\parallel\mu)\leq(1-\alpha)\cdot\mathscr{D}(\nu\parallel\mu),\quad\forall\textit{ probability measures }\nu\textit{ on }\Omega.\tag{1} $$

Fact 2. If (1) holds for some $\mathscr D(\cdot \| \cdot)$ such that $\mathscr D(\nu \| \mu) \leq\epsilon$ implies $\|\mu − \nu\|_{\text{TV}} \leq O(\epsilon^c )$ for some $c > 0$, then

$$ \mathrm{T}{\mathrm{mix}}(\epsilon)\leq O\left(\frac1\alpha\log\left(\frac{\max{x\in\Omega}\mathscr{D}(\delta_x\parallel\mu)}\epsilon\right)\right) $$

Of course, if $\mathscr D(\cdot \| \cdot)$ is total variation distance itself, then we immediately get rapid mixing. However, TV-distance isn’t very easy to work with in general, and it very well could decay in a highly irregular manner. So, we typically pick a nicer “smoother” $\mathscr D(\cdot\|\cdot)$ such that (1) holds and we make quantifiable progress in every single step.(Actually this is exactly we did when considering sampling from some spin systems). Here are two examples:

Example 3(Bubley–Dyer Path Coupling). If we endow $\Omega$ with the structure of an undirected graph $(\Omega, E)$, then we can take $\mathscr D(\nu \| \mu)$ to be the Wasserstein distance (or transportation distance)

$$ \mathscr{W}1(\mu,\nu)\stackrel{\mathsf{def}}{=}\inf\xi\mathbb{E}_{(x,y)\thicksim\xi}[\mathrm{dist}(x,y)], $$

w.r.t. the shortest path metric $\text{dist}(x, y)$ in $(\Omega, E)$, where the infimum is over all couplings $\xi$ of $\mu, \nu$. By composing couplings along shortest paths, to show $\mathscr W_1(\mu\mathsf P, \nu\mathsf P) \leq (1−\alpha)·\mathscr W_1(\mu, \nu)$, it suffices to prove that for every pair of neighboring states $(x,y)\in E\subseteq\binom{\Omega}{2}$, we have $\mathscr{W}_1(\delta_x\mathsf{P},\delta_y\mathsf{P})\leq1{-}\alpha$. This dramatically simplifies the task of proving mixing time upper bounds. In this context, the number $\alpha$ is sometimes called the coarse Ricci curvature (or Ollivier–Ricci curvature) of the measure metric space $(\Omega, E, \mathsf P)$ [Oll09]. We also have the trivial bound $\mathscr{W}_1(\mu\mathsf{P},\nu\mathsf{P})\leq\text{diam}(\Omega,E)$. Note that since $\text{dist}(\cdot, \cdot)$ takes values in $\N$, we always have $\mathscr{W}1(\mu,\nu)\geq\left\|\mu-\nu\right\|{\mathsf{TV}}$. Furthermore, if we took $E=\binom{\Omega}{2}$ so that $\text{dist}(\cdot,\cdot)$ becomes the discrete metric, then $\mathscr W_1$ is exactly TV-distance. However, if $\Omega$ has product structure for instance, we can do much better by using Hamming distance.

Example 4(Poincaré Inequality). If we take $\mathscr D(\nu \| \mu)$ to be $\chi^2(\nu\parallel\mu)=\mathrm{Var}_\mu\left(\frac{d\nu}{d\mu}\right)$, then contraction (1) follows from the Poincaré Inequality

$$ \gamma\cdot\mathrm{Var}\mu(f)\leq\mathcal{E}\mathsf{P}(f,f),\quad\forall f:\Omega\to\mathbb{R}, $$

where recall

$$ \mathcal{E}{\mathsf{P}}(f,f)=\langle f,(\mathsf{Id}-\mathsf{P})f\rangle\mu=\frac12\sum_{x,y\in\Omega}\mu(x)\mathsf{P}(x\to y)\cdot(f(x)-f(y))^2 $$

is the Dirichlet form of $\mathsf P$. The best choice of $\gamma$ is the (absolute) spectral gap of $\mathsf P$. Combining this with the comparison $\|\mu − ν\|^2_{\text{TV}} ≤\frac1 4 \chi^2 (\nu \|\mu)$ implies rapid mixing assuming a Poincaré Inequality holds with a good $\gamma$.