WoLF

Outlier-robust Kalman Filtering through Generalised Bayes

G. Duran-Martin (a,c), M. Altamirano (b), A.Y. Shestopaloff (a), L. Sanchez-Betancourt (c,d), J. Knoblauch (b), M. Jones (e), F.X. Briol (b), K. Murphy (f).
(a) Queen Mary University, (b) University College London, (c) Oxford-Man Institute of Quantitative Finance, (d) University of Oxford, (e) University of Colorado Boulder, (f) Google DeepMind.
Unpublished.

TL;DR

The weighted-observation likelihood filter (WoLF) is a provably robust variant of the Kalman filter (KF) in the presence of outliers and misspecified measurement models. We show its capabilities in tracking problems (via the KF), online learning of neural networks (via the extended KF), and data assimilation (via the ensemble KF).

Figure: the dotted blue line shows the KF posterior mean estimate; the solid orange line shows the WoLF posterior mean estimate.

Problem statement (linear setting)

Consider the state-space model (SSM)

$$ \def\R{\mathbb{R}} \def\bm#1{\boldsymbol #1} \begin{aligned} {\bm\theta_t} &= {\bf F}_t\bm\theta_{t-1} + \bm\phi_t,\\ {\bm y}_t &= {\bf H}_t\bm\theta_t + \bm\varphi_t, \end{aligned} $$

with \(\bm\theta_t\in\R^p\) the (latent) state vector, \({\bm y}_t\in\R^d\) the (observed) measurement vector, \({\bf F}_t\in\R^{p\times p}\) the state-transition matrix, \({\bf H}_t\in\R^{d\times p}\) the measurement matrix, \(\bm\phi_t\) a zero-mean Gaussian random vector with known covariance matrix \({\bf Q}_t\), and \(\bm\varphi_t\) any zero-mean random vector representing the measurement noise.

We recursively estimate either \(\mathbb{E}[\bm\theta_t \vert \bm y_{1:t}]\) or \(\mathbb{E}[\bm y_{t+1} \vert \bm y_{1:t}]\) by applying the predict and modified update equations.

$$ \begin{aligned} q(\bm\theta_t \vert \bm y_{1:t-1}) &= \int p(\bm\theta_t \vert \bm\theta_{t-1}) q(\bm\theta_{t-1} \vert \bm y_{1:t-1}) d\bm\theta_{t-1}\\ q(\bm\theta_t \vert \bm y_{1:t}) &\propto \exp(-\ell_t(\bm\theta_t)) q(\bm\theta_t \vert \bm y_{1:t-1}) \end{aligned} $$

with \(\ell_t(\bm\theta_t) = -W\left(\bm y_{t}, \hat{\bm y}_t\right)\,\log p(\bm y_t \vert \bm\theta_t)\), \(W: \mathbb{R}^{d}\times\mathbb{R}^d \to \mathbb{R}\) the weighting function, and \(\hat{\bm y}_t = {\bf H}_t\bm\mu_{t|t-1}\) the one-step-ahead measurement forecast.

Weighted-observation likelihood filter (WoLF)

If the SSM follows linear dynamics, the predict and update equations are

Require \({\bf F}_t, {\bf Q}_t\) // predict step
\(\bm\mu_{t\vert t-1} \gets {\bf F}_t\bm\mu_{t-1}\)
\(\bm\Sigma_{t\vert t-1} \gets {\bf F}_t\bm\Sigma_{t-1}{\bf F}_t^\intercal + {\bf Q}_t\)

Require \(\bm y_t, {\bf H}_t, {\bf R}_t\) // update step
\(\hat{\bm y}_t \gets {\bf H}_t\bm\mu_{t|t-1}\)
\(w_t \gets W\left(\bm y_{t}, \hat{\bm y}_t\right)\)
\(\bm\Sigma_t^{-1} \gets \bm\Sigma_{t|t-1}^{-1} + w_t^2 {\bf H}_t^\intercal{\bf R}_t^{-1}{\bf H}_t\)
\({\bf K}_t \gets w_t^2 \bm\Sigma_t{\bf H}_t^\intercal{\bf R}_t^{-1}\)
\(\bm\mu_t \gets \bm\mu_{t|t-1} + {\bf K}_t(\bm y_t - \hat{\bm y}_t)\)
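
For concreteness, here is a minimal NumPy sketch of one predict/update cycle; `weight_fn` stands in for one of the weighting functions defined below, and all names are illustrative rather than taken from the authors' code.

```python
import numpy as np

def wolf_step(mu, Sigma, y, F, Q, H, R, weight_fn):
    """One WoLF predict/update cycle for a linear SSM."""
    # Predict step.
    mu_pred = F @ mu
    Sigma_pred = F @ Sigma @ F.T + Q

    # Weighted update step.
    y_hat = H @ mu_pred                  # one-step-ahead measurement forecast
    w = weight_fn(y, y_hat)              # scalar weight
    R_inv = np.linalg.inv(R)
    prec = np.linalg.inv(Sigma_pred) + w**2 * H.T @ R_inv @ H
    Sigma_new = np.linalg.inv(prec)
    K = w**2 * Sigma_new @ H.T @ R_inv   # weighted Kalman gain
    mu_new = mu_pred + K @ (y - y_hat)
    return mu_new, Sigma_new
```

Setting \(w_t = 1\) for all \(t\) recovers the standard Kalman filter update in information form.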

Weighting functions

We consider two choices of weighting function:

  1. The inverse multi-quadratic (IMQ) — a compensation-based weighting function, and
  2. The thresholded Mahalanobis distance (TMD) — a detect-and-reject weighting function.

Inverse multi-quadratic weighting function (IMQ) $$ W\left(\bm y_{t}, \hat{\bm y}_t\right) =\left(1+\frac{||\bm y_{t}-\hat{\bm y}_{t}||_2^2}{c^{2}}\right)^{-1/2} $$ with \(c > 0\).

Thresholded Mahalanobis-based weighting function (TMD) $$ W\left(\bm y_{t}, \hat{\bm y}_t\right) = \begin{cases} 1 & \text{if } \|{\bf R}_t^{-1/2}(\bm y_t - \hat{\bm y}_t)\|_2^2 \leq c\\ 0 & \text{otherwise} \end{cases} $$ with \(c > 0\).
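
Both weighting functions are a few lines of NumPy; the default values of \(c\) below are illustrative placeholders, not the tuned values from the experiments.

```python
import numpy as np

def imq_weight(y, y_hat, c=1.0):
    """Inverse multi-quadratic: soft, compensation-based down-weighting."""
    return (1.0 + np.sum((y - y_hat) ** 2) / c**2) ** -0.5

def tmd_weight(y, y_hat, R_chol, c=4.0):
    """Thresholded Mahalanobis distance: hard detect-and-reject rule.

    R_chol is the lower Cholesky factor of the measurement covariance R_t,
    so solving against it applies R_t^{-1/2} to the residual.
    """
    resid = np.linalg.solve(R_chol, y - y_hat)
    return 1.0 if resid @ resid <= c else 0.0
```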

Theoretical results

Definition
The posterior influence function (PIF) is $$ \text{PIF}(\bm y_t^c, \bm y_{1:t-1}) = \text{KL}\left( q(\bm\theta_t \vert \bm y_t^c, \bm y_{1:t-1})\,\|\, q(\bm\theta_t \vert \bm y_t, \bm y_{1:t-1})\right), $$ where \(\bm y_t^c\) is a (possibly contaminated) replacement of the measurement \(\bm y_t\).

Definition: Outlier-robust filter
A filter is outlier-robust if the PIF is bounded, i.e., $$ \sup_{\bm y_t^c\in\R^d}|\text{PIF}(\bm y_t^c, \bm y_{1:t-1})| < \infty. $$

Theorem
If \(\sup_{\bm y_t\in\R^d} W\left(\bm y_t, \hat{\bm y}_t\right) < \infty\) and \(\sup_{\bm y_t\in\R^d} W\left(\bm y_{t}, \hat{\bm y}_t\right)^2\,\|\bm y_t\|^2 < \infty\), then the PIF is bounded.

Remarks
  1. The Kalman filter is not outlier-robust.
  2. Filters with the IMQ and TMD weighting functions are outlier-robust.

Figure: PIF for the 2d-tracking problem. The last measurement \(\bm y_t\) is replaced by \(\bm y_t^c = \bm y_t + \bm\epsilon\), where \(\bm\epsilon \in [-5, 5]^2\).
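
In the linear setting both posteriors in the PIF are Gaussian, so the KL divergence has a closed form. A sketch of how one might evaluate the PIF numerically, reusing the `wolf_step` helper above:

```python
import numpy as np

def gauss_kl(mu0, Sigma0, mu1, Sigma1):
    """KL( N(mu0, Sigma0) || N(mu1, Sigma1) )."""
    d = mu0.shape[0]
    Sigma1_inv = np.linalg.inv(Sigma1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(Sigma0)
    _, logdet1 = np.linalg.slogdet(Sigma1)
    return 0.5 * (np.trace(Sigma1_inv @ Sigma0)
                  + diff @ Sigma1_inv @ diff - d + logdet1 - logdet0)

def pif(mu, Sigma, y, y_c, F, Q, H, R, weight_fn):
    """PIF: KL between the posteriors under y_c (contaminated) and y."""
    mu_c, Sigma_c = wolf_step(mu, Sigma, y_c, F, Q, H, R, weight_fn)
    mu_y, Sigma_y = wolf_step(mu, Sigma, y, F, Q, H, R, weight_fn)
    return gauss_kl(mu_c, Sigma_c, mu_y, Sigma_y)
```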

Computational results

Compared to variational-Bayes (VB) methods, which require multiple inner iterations to converge, WoLF has the same computational cost as the standard Kalman filter.

Method      Time            #HP  Ref
KF          \(O(p^3)\)      0    Kalman1960
KF-B        \(O(I\,p^3)\)   3    Wang2018
KF-IW       \(O(I\,p^3)\)   2    Agamennoni2012
OGD         \(O(I\,p^2)\)   2    Bencomo2023
WoLF-IMQ    \(O(p^3)\)      1    (Ours)
WoLF-TMD    \(O(p^3)\)      1    (Ours)

Here, \(I\) is the number of inner iterations, \(p\) is the dimension of the state vector, and #HP is the number of hyperparameters.

Experiment: Kalman filter (KF)

2d tracking

Linear SSM with \({\bf Q}_t = q\,{\bf I}_4\), \({\bf R}_t = r\,{\bf I}_2\),

$$ \begin{aligned} {\bf F}_t &= \begin{pmatrix} 1 & 0 & \Delta & 0\\ 0 & 1 & 0 & \Delta \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, & {\bf H}_t &= \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 \end{pmatrix}, \end{aligned} $$

\(\Delta = 0.1\) is the sampling period, \(q = 0.10\) is the system noise, \(r = 10\) is the measurement noise, and \({\bf I}_K\) is the \(K\times K\) identity matrix.

We measure the RMSE of the posterior mean estimate of the state components.
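
These matrices are small enough to write down directly; a sketch of the setup, with an illustrative reading of the RMSE metric:

```python
import numpy as np

dt, q, r = 0.1, 0.10, 10.0             # sampling period, system and measurement noise
F = np.array([[1.0, 0.0, dt, 0.0],
              [0.0, 1.0, 0.0, dt],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])   # constant-velocity dynamics
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])   # observe positions only
Q = q * np.eye(4)
R = r * np.eye(2)

def rmse(theta_true, mu_post):
    """RMSE of the posterior mean over all steps and state components."""
    return np.sqrt(np.mean((theta_true - mu_post) ** 2))
```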

Student variant

Measurements are sampled according to

$$ \begin{aligned} p(\bm y_t \vert \bm\theta_t) &= \text{St}({\bm y_t\,\vert\,{\bf H}_t\bm\theta_t,\,{\bf R}_t,\,\nu_t})\\ &= \int \mathcal{N}\left(\bm y_t \,\vert\, {\bf H}_t\bm\theta_t,\, {\bf R}_t/\tau\right) \text{Gam}(\tau \,\vert\, \nu_t / 2,\, \nu_t / 2)\, d\tau. \end{aligned} $$

Mixture variant

Measurements are sampled according to

$$ \begin{aligned} p(\bm y_t \vert \bm\theta_t) &= {\cal N}\left(\bm y_t \vert {\bf m}_t, {\bf R}_t\right),\\ {\bf m}_t &= \begin{cases} {\bf H}_t\,\bm\theta_t & \text{w.p.}\ 1 - p_\epsilon,\\ 2\,{\bf H}_t\,\bm\theta_t & \text{w.p.}\ p_\epsilon, \end{cases} \end{aligned} $$

with \(p_\epsilon = 0.05\).
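
Both corruption models are easy to sample from; a sketch, where the degrees-of-freedom default \(\nu\) is an illustrative placeholder rather than the value used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_student(theta, H, R, nu=2.5):
    """Student-t measurement via its Gaussian scale-mixture representation."""
    tau = rng.gamma(shape=nu / 2, scale=2 / nu)   # Gam(nu/2, rate nu/2)
    return rng.multivariate_normal(H @ theta, R / tau)

def sample_mixture(theta, H, R, p_eps=0.05):
    """Gaussian measurement whose mean doubles with probability p_eps."""
    mean = 2 * H @ theta if rng.random() < p_eps else H @ theta
    return rng.multivariate_normal(mean, R)
```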


KF results

Figure: sample runs.

Mean slowdown rate over KF:

Method            Student  Mixture
KF-B              2.0x     3.7x
KF-IW             1.2x     5.3x
WoLF-IMQ (ours)   1.0x     1.0x
WoLF-TMD (ours)   1.0x     1.0x

Experiment: extended Kalman filter (EKF)

Online learning of UCI datasets

Online training of neural networks on a corrupted version of the tabular UCI datasets. We consider a multilayer perceptron (MLP) with two hidden layers, twenty hidden units, and a real-valued output unit. We evaluate the root median squared error (RMedSE) between the true and predicted output.

The SSM is $$ \begin{aligned} {\bm\theta_t} &= \bm\theta_{t-1} + \bm\phi_t\\ {\bm y}_t &= h_t(\bm\theta_t) + \bm\varphi_t, \end{aligned} $$
with \(h_t(\bm\theta_t) = h(\bm\theta_t, {\bf x}_t) \) the MLP and \({\bf x}_t\in\mathbb{R}^m\) the input vector.

We estimate \( \mathbb{E}[\bm\theta_t \vert \bm y_{1:t}] \) via the extended Kalman filter (EKF) — one measurement, one parameter update. With \({\bf H}_t\) the Jacobian of \(h_t\) evaluated at \(\bm\mu_{t|t-1}\), we modify the measurement mean with the first-order approximation

$$ \mathbb{E}[\bm y_t \vert \bm\theta_t] \approx {\bf H}_t(\bm\theta_t - \bm\mu_{t | t-1}) + h_t(\bm\mu_{t | t-1}) =: \bar{\bm y}_t. $$
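
A sketch of this linearization step, using a finite-difference Jacobian as an illustrative stand-in for automatic differentiation:

```python
import numpy as np

def jacobian_fd(h, mu, eps=1e-6):
    """Finite-difference Jacobian of h at mu."""
    h0 = np.atleast_1d(h(mu))
    J = np.zeros((h0.size, mu.size))
    for j in range(mu.size):
        step = np.zeros_like(mu)
        step[j] = eps
        J[:, j] = (np.atleast_1d(h(mu + step)) - h0) / eps
    return J

def linearized_mean(h, mu_pred, theta):
    """First-order approximation H_t (theta - mu_pred) + h(mu_pred)."""
    H = jacobian_fd(h, mu_pred)
    return H @ (theta - mu_pred) + np.atleast_1d(h(mu_pred))
```

With this \({\bf H}_t\), the weighted update step proceeds as in the linear case.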

EKF results

Figure: RMedSE versus time per step, relative to online gradient descent (OGD), across corrupted UCI datasets.

Experiment: ensemble Kalman filter (EnKF)

Data assimilation

The SSM is $$ \begin{aligned} \dot{\bm\theta}_{s,i} &= \Big(\bm\theta_{s, i+1} - \bm\theta_{s, i-2}\Big) \bm\theta_{s, i-1} - \bm\theta_{s,i} + \bm\phi_{s,i},\\ {\bm y}_{s,i} &= \begin{cases} \bm\theta_{s,i} + \bm\varphi_{s,i} & \text{w.p. } 1 - p_\epsilon,\\ 100 & \text{w.p. } p_\epsilon. \end{cases} \end{aligned} $$

Here, \(\bm\theta_{s,i}\) is the value of state component \(i\) at step \(s\), \(\bm\phi_{s,i} \sim {\cal N}(8, 1)\), \(\bm\varphi_{s,i} \sim {\cal N}(0, 1)\), \(p_\epsilon = 0.001\), \(i = 1, \ldots, d\), \(s = 1, \ldots, S\), with \(S \gg 1\) the number of steps, and periodic boundary conditions \(\bm\theta_{s, d + k} = \bm\theta_{s, k}\), \(\bm\theta_{s, -k} = \bm\theta_{s, d - k}\). We consider the per-step error \(L_s = \sqrt{\frac{1}{d}(\bm\theta_s - \bm\mu_s)^\intercal (\bm\theta_s - \bm\mu_s)}\).
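
These dynamics are the Lorenz 96 system with a noisy forcing term; a sketch of the drift and the corrupted observation model:

```python
import numpy as np

rng = np.random.default_rng(0)

def lorenz96_drift(theta):
    """Time derivative of each component; np.roll encodes the periodic
    boundary conditions, and the N(8, 1) draw is the forcing noise phi."""
    adv = (np.roll(theta, -1) - np.roll(theta, 2)) * np.roll(theta, 1)
    return adv - theta + rng.normal(8.0, 1.0, size=theta.size)

def observe(theta, p_eps=1e-3):
    """Unit-variance Gaussian observation, replaced by 100 w.p. p_eps."""
    y = theta + rng.normal(0.0, 1.0, size=theta.size)
    y[rng.random(theta.size) < p_eps] = 100.0
    return y
```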

EnKF results

Figure: bootstrap estimate of \(L_S\) over 20 runs and 500 samples, as a function of the hyperparameter \(c\).