Switching State Space Models and Structured Variational Inference

*(Figure: graphical model of a switching state space model, a hidden Markov chain over regimes $\mathbf{z}_{1:T}$ coupled with a linear-Gaussian chain over states $\mathbf{x}_{1:T}$ emitting observations $\mathbf{y}_{1:T}$.)*

Used in a broad range of areas from robotics to finance, State Space Models (a.k.a. Kalman filters) assume a stationary generative process for temporal data. Some temporal data, on the other hand, are generated by regimes that switch over time. For example, a mechanical part of a wind turbine could be tracked well by a single Kalman filter if the wind speed were constant. In practice, however, wind turbines must be designed so that the behaviour of mechanical parts is segmented according to the speed and direction of the wind for the sake of efficient energy conversion, so detecting the wind regime is crucial in wind turbine design. Switching State Space Models (SSSMs) address this kind of non-stationary behaviour with a mixture of Kalman filters. Depicted in the figure above, an SSSM is a composition of a Kalman filter and a hidden Markov model (HMM). To be more precise, the generative model is:

\[p(\mathbf{y}_{1:T},\mathbf{x}_{1:T},\mathbf{z}_{1:T}) = p(\mathbf{z}_1)p(\mathbf{x}_1|\mathbf{z}_1)p(\mathbf{y}_1|\mathbf{x}_1,\mathbf{z}_1)\prod\limits_{t=2}^{T} p(\mathbf{z}_t|\mathbf{z}_{t-1})p(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{z}_t)p(\mathbf{y}_t|\mathbf{x}_t,\mathbf{z}_t).\]

Let’s take a closer look at the conditional distributions, starting with the continuous latent variables:

\[p(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{z}_t) = \mathcal{N}(\mathbf{x}_t;\mathbf{A}_t\mathbf{x}_{t-1},\mathbf{V}_t) = \prod\limits_{k=1}^{K} \mathcal{N}(\mathbf{x}_t;\mathbf{A}^k\mathbf{x}_{t-1},\mathbf{V}^k)^{[z_t=k]}\]

where $\mathbf{A}_t \in \{ \mathbf{A}^1,…,\mathbf{A}^K \}$ and $\mathbf{V}_t \in \{ \mathbf{V}^1,…,\mathbf{V}^K \}$. Here $z_t$ is a scalar taking values in $\{1,\dots,K\}$ that indicates a discrete choice over model parameters, $[z_t=k]$ denotes the indicator function, and $\mathbf{z}_t$ is the one-hot encoded representation of $z_t$. Similarly, the likelihood function is

\[p(\mathbf{y}_t|\mathbf{x}_t,\mathbf{z}_t) = \mathcal{N}(\mathbf{y}_t;\mathbf{B}_t\mathbf{x}_t,\mathbf{W}_t) = \prod\limits_{k=1}^{K} \mathcal{N}(\mathbf{y}_t;\mathbf{B}^k\mathbf{x}_t,\mathbf{W}^k)^{[z_t=k]}\]

where $\mathbf{B}_t \in \{ \mathbf{B}^1,…,\mathbf{B}^K \}$ and $\mathbf{W}_t \in \{ \mathbf{W}^1,…,\mathbf{W}^K \}$. The conditional distribution of the discrete latent variables is

\[p(\mathbf{z}_t|\mathbf{z}_{t-1}) = \mathcal{Cat}(\mathbf{M}\mathbf{z}_{t-1})\]

where the transition matrix $\mathbf{M}$ consists of non-negative entries and its columns sum to one. If our task also includes system identification, we can work in a fully Bayesian setting and introduce prior distributions on the model parameters. For $\mathbf{M}$, the conjugate prior is defined on its columns as follows: $p(\mathbf{M}_{:,k}) = \mathcal{Dir}(\alpha^{k})$. The prior distribution allows us to inject our knowledge about the model into the inference stage. For example, frequent regime changes can be disfavored by making the $k^{th}$ element of $\alpha^{k}$ larger than its other elements, and the overall magnitude of $\alpha^{k}$'s elements reflects how certain we are about this prior knowledge. Priors on $\mathbf{A}, \mathbf{B}, \mathbf{V}, \mathbf{W}$ can be defined in a similar fashion, but for the sake of simplicity we assume they are known or carefully designed by field experts.
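
To make the generative process concrete, here is a minimal NumPy sketch that draws a sequence from the model above. The dimensions, the zero-mean Gaussian prior on $\mathbf{x}_1$, and the function and argument names are illustrative assumptions, not part of the model definition.

```python
import numpy as np

def sample_sssm(T, A, B, V, W, M, pi0, rng=None):
    """Sample (y, x, z) from the SSSM generative model.

    A, B, V, W : lists of K per-regime parameters (A^k, B^k, V^k, W^k).
    M          : (K, K) transition matrix; M[j, k] = p(z_t = j | z_{t-1} = k).
    pi0        : (K,) initial regime distribution p(z_1).
    """
    rng = np.random.default_rng() if rng is None else rng
    K, dx, dy = len(A), A[0].shape[0], B[0].shape[0]
    z = np.zeros(T, dtype=int)
    x = np.zeros((T, dx))
    y = np.zeros((T, dy))

    z[0] = rng.choice(K, p=pi0)
    x[0] = rng.multivariate_normal(np.zeros(dx), V[z[0]])            # p(x_1 | z_1), zero-mean prior assumed
    y[0] = rng.multivariate_normal(B[z[0]] @ x[0], W[z[0]])          # p(y_1 | x_1, z_1)
    for t in range(1, T):
        z[t] = rng.choice(K, p=M[:, z[t - 1]])                       # p(z_t | z_{t-1}) = Cat(M z_{t-1})
        x[t] = rng.multivariate_normal(A[z[t]] @ x[t - 1], V[z[t]])  # p(x_t | x_{t-1}, z_t)
        y[t] = rng.multivariate_normal(B[z[t]] @ x[t], W[z[t]])      # p(y_t | x_t, z_t)
    return y, x, z
```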

As one might imagine, exact inference in such a model is intractable. So we resort to variational bound optimization with some relaxations in the recognition distribution. The most relaxed choice of recognition model is the one in which all random variables are fully factorized. However, it is better practice to stick to the original model factorization as much as possible in order to keep the variational objective tight to the log-likelihood. The following recognition distribution takes the model structure into account and enables iterative updates of the posterior approximations by breaking the loops in the model:

\[q(\mathbf{x}_{1:T},\mathbf{z}_{1:T},\mathbf{M}) = q_x(\mathbf{x}_{1:T})q_z(\mathbf{z}_{1:T})q_M(\mathbf{M})\]

The above recognition distribution leads to the following iterative update rule for $q_x(\mathbf{x}_{1:T})$:

\[\begin{align} q_x(\mathbf{x}_{1:T}) &\propto \exp(\mathbb{E}_{q_{z}q_{M}}(\log p(\mathbf{y}_{1:T},\mathbf{x}_{1:T},\mathbf{z}_{1:T}))) \\ &\propto \prod\limits_{t=1}^{T}{\exp(\mathbb{E}_{q_{z}}(\log p(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{z}_t)))\exp(\mathbb{E}_{q_{z}}(\log p(\mathbf{y}_t|\mathbf{x}_t,\mathbf{z}_t)))} \end{align}\]

The discrete variable $\mathbf{z}_t$ is directly tied to the model parameters at time $t$, so the expectations with respect to $q_{z}$ reduce to expectations of the model parameters, as follows:

\[\begin{align} \left< \mathbf{A}_t \right> &= \sum\limits_{k=1}^{K} q_z(z_t=k)\mathbf{A}^k \\ \left< \mathbf{B}_t \right> &= \sum\limits_{k=1}^{K} q_z(z_t=k)\mathbf{B}^k \\ \left< \mathbf{V}_t \right> &= \sum\limits_{k=1}^{K} q_z(z_t=k)\mathbf{V}^k \\ \left< \mathbf{W}_t \right> &= \sum\limits_{k=1}^{K} q_z(z_t=k)\mathbf{W}^k \\ \end{align}\]
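
As a sketch of what this update looks like in practice (assuming `qz` is a $T \times K$ array holding $q_z(z_t=k)$ and `m0`, `P0` parameterize a Gaussian prior on $\mathbf{x}_1$): the expectations are convex combinations of the per-regime parameters, and belief propagation over the resulting time-varying linear-Gaussian chain amounts to a Kalman filter followed by an RTS smoother. The helper names and the explicit matrix inverses are illustrative choices, not prescribed by the model.

```python
import numpy as np

def expected_params(qz, params):
    """<P_t> = sum_k q_z(z_t = k) P^k for each t; params is a list [P^1, ..., P^K]."""
    return np.einsum('tk,kij->tij', qz, np.stack(params))

def kalman_smoother(y, A_bar, B_bar, V_bar, W_bar, m0, P0):
    """Belief propagation for q_x with time-varying expected parameters.

    Returns smoothed means m_s, covariances P_s, and lag-one cross-covariances C,
    from which the moments <x_t x_t^T> and <x_t x_{t+1}^T> needed later follow.
    """
    T, dx = y.shape[0], m0.shape[0]
    m_f = np.zeros((T, dx)); P_f = np.zeros((T, dx, dx))   # filtered moments
    m_p = np.zeros((T, dx)); P_p = np.zeros((T, dx, dx))   # one-step predictions

    for t in range(T):                                      # forward (filtering) pass
        if t == 0:
            m_p[t], P_p[t] = m0, P0                         # prior on x_1 (an assumption)
        else:
            m_p[t] = A_bar[t] @ m_f[t - 1]
            P_p[t] = A_bar[t] @ P_f[t - 1] @ A_bar[t].T + V_bar[t]
        S = B_bar[t] @ P_p[t] @ B_bar[t].T + W_bar[t]       # innovation covariance
        K = P_p[t] @ B_bar[t].T @ np.linalg.inv(S)          # Kalman gain
        m_f[t] = m_p[t] + K @ (y[t] - B_bar[t] @ m_p[t])
        P_f[t] = P_p[t] - K @ B_bar[t] @ P_p[t]

    m_s, P_s = m_f.copy(), P_f.copy()                       # backward (smoothing) pass
    C = np.zeros((T - 1, dx, dx))                           # cov(x_t, x_{t+1}) under q_x
    for t in range(T - 2, -1, -1):
        J = P_f[t] @ A_bar[t + 1].T @ np.linalg.inv(P_p[t + 1])   # smoother gain
        m_s[t] = m_f[t] + J @ (m_s[t + 1] - m_p[t + 1])
        P_s[t] = P_f[t] + J @ (P_s[t + 1] - P_p[t + 1]) @ J.T
        C[t] = J @ P_s[t + 1]
    return m_s, P_s, C
```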

Replacing the model parameters with their expected values as above, the marginals $q_{x}(\mathbf{x}_t)$ and the pairwise marginals $q_{x}(\mathbf{x}_{t},\mathbf{x}_{t+1})$ of interest can easily be calculated via belief propagation. Similarly, the iterative update of $q_z(\mathbf{z}_{1:T})$ is written as follows:

\[q_z(\mathbf{z}_{1:T}) \propto \prod\limits_{t=1}^{T} \exp(\left< \log p(\mathbf{z}_t|\mathbf{z}_{t-1},\mathbf{M}) \right>_{q_M}) \exp(\left< \log p(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{z}_t) \right>_{q_x(\mathbf{x}_{t},\mathbf{x}_{t-1})}) \exp(\left< \log p(\mathbf{y}_t|\mathbf{x}_t,\mathbf{z}_t) \right>_{q_x(\mathbf{x}_{t})})\]

One can see that the above recognition factorization corresponds to a hidden Markov model (HMM) with the following transition and emission probabilities:

\[\exp(\left< \log p(\mathbf{z}_t|\mathbf{z}_{t-1},\mathbf{M}) \right>_{q_M}) = \mathcal{Cat}( \exp(\left< \log \mathbf{M} \right>) \mathbf{z}_{t-1})\]

\[\begin{align} \exp(\left< \log p(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{z}_t) \right>_{q_x(\mathbf{x}_{t},\mathbf{x}_{t-1})}) &\propto \prod\limits_{k=1}^{K} \left( \exp\left[ - 0.5 \left< (\mathbf{x}_t - \mathbf{A}^k\mathbf{x}_{t-1})^T \mathbf{V}^{k^{-1}} (\mathbf{x}_t - \mathbf{A}^k\mathbf{x}_{t-1}) \right> \right] \right)^{[z_t=k]} \\ &\propto \prod\limits_{k=1}^{K} \left( \exp\left[ - 0.5 \left(\text{tr}\left(\mathbf{V}^{k^{-1}} \left< \mathbf{x}_t \mathbf{x}_t^T \right> \right) -\text{tr}\left(\mathbf{A}^{k^T}\mathbf{V}^{k^{-1}} \left< \mathbf{x}_t \mathbf{x}_{t-1}^T \right> \right) - \text{tr}\left(\mathbf{V}^{k^{-1}} \mathbf{A}^{k} \left< \mathbf{x}_{t-1} \mathbf{x}_t^T \right> \right) + \text{tr}\left(\mathbf{A}^{k^T}\mathbf{V}^{k^{-1}} \mathbf{A}^{k} \left< \mathbf{x}_{t-1} \mathbf{x}_{t-1}^T \right> \right) \right) \right] \left|\mathbf{V}^{k} \right|^{-0.5} \right)^{[z_t=k]} \end{align}\]

\[\exp(\left< \log p(\mathbf{y}_t|\mathbf{x}_{t},\mathbf{z}_t) \right>_{q_x(\mathbf{x}_{t})}) \propto \prod\limits_{k=1}^{K} \left( \exp\left[ - 0.5 \left< (\mathbf{y}_t - \mathbf{B}^k\mathbf{x}_{t})^T \mathbf{W}^{k^{-1}} (\mathbf{y}_t - \mathbf{B}^k\mathbf{x}_{t}) \right> \right] \right)^{[z_t=k]}\]
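
As a quick sketch of how these potentials might be consumed in practice, here is a log-space forward-backward pass. It assumes `log_emis[t, k]` already holds the sum of the two expected log-likelihood terms above for regime $k$ at time $t$, `log_trans` holds $\left<\log \mathbf{M}\right>$ with columns indexing the previous regime, and `log_pi0` is the log initial regime distribution; all names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_emis, log_trans, log_pi0):
    """Marginals and pairwise marginals of q_z for the induced HMM (log-space)."""
    T, K = log_emis.shape
    alpha = np.zeros((T, K))                       # forward messages
    beta = np.zeros((T, K))                        # backward messages
    alpha[0] = log_pi0 + log_emis[0]
    for t in range(1, T):
        alpha[t] = log_emis[t] + logsumexp(alpha[t - 1][None, :] + log_trans, axis=1)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(beta[t + 1][:, None] + log_trans + log_emis[t + 1][:, None], axis=0)

    log_gamma = alpha + beta                       # log q_z(z_t = k), unnormalized
    log_gamma -= logsumexp(log_gamma, axis=1, keepdims=True)
    # log q_z(z_t = k, z_{t+1} = j), indexed [t, j, k], unnormalized
    log_xi = (alpha[:-1, None, :] + log_trans[None, :, :]
              + (log_emis[1:] + beta[1:])[:, :, None])
    log_xi -= logsumexp(log_xi, axis=(1, 2), keepdims=True)
    return np.exp(log_gamma), np.exp(log_xi)       # gamma: (T, K), xi: (T-1, K, K)
```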

Analogous to the continuous hidden variables, the marginal probabilities of the discrete latent variables can be computed via exactly this kind of forward-backward message passing. Remember that we defined a Dirichlet prior on the columns of the transition matrix $\mathbf{M}$ to incorporate our prior beliefs into the model specification; for example, a prior with large diagonal entries disfavors frequently switching regimes. The posterior update of $\mathbf{M}$ is then

\[q_M(\mathbf{M}_{:,k}) = \mathcal{Dir}\left(\mathbf{\alpha}^k + \sum\limits_{t=1}^{T-1}q_z(z_t=k,z_{t+1})\right)\]

where the sum runs over the pairwise marginals and yields a vector indexed by the values of $z_{t+1}$, i.e. the expected counts of transitions out of regime $k$.
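In code, this update just adds the expected transition counts to the prior pseudo-counts. A small sketch, assuming `xi` is the $(T-1) \times K \times K$ array of pairwise marginals from the forward-backward pass above (indexed `[t, j, k]` for a transition $k \to j$) and `alpha0` is the $K \times K$ matrix whose $k^{th}$ column is $\alpha^k$:

```python
import numpy as np
from scipy.special import digamma

def update_qM(alpha0, xi):
    """Posterior Dirichlet parameters; column k parameterizes q_M(M_{:,k})."""
    return alpha0 + xi.sum(axis=0)

def expected_log_M(alpha_post):
    """<log M> under q_M, needed by the q_z update: digamma(a_jk) - digamma(sum_j a_jk)."""
    return digamma(alpha_post) - digamma(alpha_post.sum(axis=0, keepdims=True))
```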
Written on September 14, 2019