WoLF

Outlier-robust Kalman Filtering through Generalised Bayes

G. Duran-Martin (a,c), M. Altamirano (b), A.Y. Shestopaloff (a), L. Sanchez-Betancourt (c,d), J. Knoblauch (b), M. Jones (e), F.X. Briol (b), K. Murphy (f).
(a) Queen Mary University, (b) University College London, (c) Oxford-Man Institute of Quantitative Finance, (d) University of Oxford, (e) University of Colorado Boulder, (f) Google DeepMind.
Unpublished.

TL;DR

The weighted-observation likelihood filter (WoLF) is a provably robust variant of the Kalman filter (KF) in the presence of outliers and misspecified measurement models. We show its capabilities in tracking problems (via the KF), online learning of neural networks (via the extended KF), and data assimilation (via the ensemble KF).

Figure: the dotted blue line shows the KF posterior mean estimate; the solid orange line shows the WoLF posterior mean estimate.

Problem statement (linear setting)

Consider the state-space model (SSM)

$$ \def\R{\mathbb{R}} \def\bm#1{\boldsymbol #1} \begin{aligned} {\bm\theta_t} &= {\bf F}_t\bm\theta_{t-1} + \bm\phi_t,\\ {\bm y}_t &= {\bf H}_t\bm\theta_t + \bm\varphi_t, \end{aligned} $$

with \(\bm\theta_t\in\R^p\) the (latent) state vector, \({\bm y}_t\in\R^d\) the (observed) measurement vector, \({\bf F}_t\in\R^{p\times p}\) the state-transition matrix, \({\bf H}_t\in\R^{d\times p}\) the measurement matrix, \(\bm\phi_t\) a zero-mean Gaussian random vector with known covariance matrix \({\bf Q}_t\), and \(\bm\varphi_t\) any zero-mean random vector representing the measurement noise.

We recursively estimate either \(\mathbb{E}[\bm\theta_t \vert \bm y_{1:t}]\) or \(\mathbb{E}[\bm y_{t+1} \vert \bm y_{1:t}]\) by applying the predict and modified update equations.

$$ \begin{aligned} q(\bm\theta_t \vert \bm y_{1:t-1}) &= \int p(\bm\theta_t \vert \bm\theta_{t-1}) q(\bm\theta_{t-1} \vert \bm y_{1:t-1}) d\bm\theta_{t-1}\\ q(\bm\theta_t \vert \bm y_{1:t}) &\propto \exp(-\ell_t(\bm\theta_t)) q(\bm\theta_t \vert \bm y_{1:t-1}) \end{aligned} $$

with \(\ell_t(\bm\theta_t) = -W\left(\bm y_{t}, \hat{\bm y}_t\right)\,\log p(\bm y_t \vert \bm\theta_t)\), \(W: \mathbb{R}^{d}\times\mathbb{R}^d \to \mathbb{R}\) the weighting function, and \(\hat{\bm y}_t = {\bf H}_t\bm\mu_{t|t-1}\) the one-step-ahead measurement forecast.

Weighted-observation likelihood filter (WoLF)

If the SSM follows linear dynamics, the predict and update equations are

Require \({\bf F}_t, {\bf Q}_t\) // predict step
\(\bm\mu_{t\vert t-1} \gets {\bf F}_t\bm\mu_{t-1}\)
\(\bm\Sigma_{t\vert t-1} \gets {\bf F}_t\bm\Sigma_{t-1}{\bf F}_t^\intercal + {\bf Q}_t\)

Require \(\bm y_t, {\bf H}_t, {\bf R}_t\) // update step
\(\hat{\bm y}_t \gets {\bf H}_t\bm\mu_{t|t-1}\)
\(w_t \gets W\left(\bm y_{t}, \hat{\bm y}_t\right)\)
\(\bm\Sigma_t^{-1} \gets \bm\Sigma_{t|t-1}^{-1} + w_t^2 {\bf H}_t^\intercal{\bf R}_t^{-1}{\bf H}_t\)
\({\bf K}_t \gets w_t^2 \bm\Sigma_t{\bf H}_t^\intercal{\bf R}_t^{-1}\)
\(\bm\mu_t \gets \bm\mu_{t|t-1} + {\bf K}_t(\bm y_t - \hat{\bm y}_t)\)
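
For concreteness, here is a minimal NumPy sketch of one predict/update cycle; `weight_fn` stands in for one of the weighting functions defined below, and all names are illustrative rather than taken from the authors' code.

```python
import numpy as np

def wolf_step(mu, Sigma, y, F, Q, H, R, weight_fn):
    """One WoLF predict/update cycle for a linear SSM."""
    # Predict step.
    mu_pred = F @ mu
    Sigma_pred = F @ Sigma @ F.T + Q

    # Weighted update step.
    y_hat = H @ mu_pred                  # one-step-ahead measurement forecast
    w = weight_fn(y, y_hat)              # scalar weight
    R_inv = np.linalg.inv(R)
    prec = np.linalg.inv(Sigma_pred) + w**2 * H.T @ R_inv @ H
    Sigma_new = np.linalg.inv(prec)
    K = w**2 * Sigma_new @ H.T @ R_inv   # weighted Kalman gain
    mu_new = mu_pred + K @ (y - y_hat)
    return mu_new, Sigma_new
```

Setting \(w_t = 1\) for all \(t\) recovers the standard Kalman filter update in information form.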

Weighting functions

We consider two choices of weighting function:

  1. The inverse multi-quadratic (IMQ) — a compensation-based weighting function, and
  2. The thresholded Mahalanobis distance (TMD) — a detect-and-reject weighting function.

Inverse multi-quadratic weighting function (IMQ) $$ W\left(\bm y_{t}, \hat{\bm y}_t\right) =\left(1+\frac{||\bm y_{t}-\hat{\bm y}_{t}||_2^2}{c^{2}}\right)^{-1/2} $$ with \(c > 0\).

Thresholded Mahalanobis-based weighting function (TMD) $$ W\left(\bm y_{t}, \hat{\bm y}_t\right) = \begin{cases} 1 & \text{if } \|{\bf R}_t^{-1/2}(\bm y_t - \hat{\bm y}_t)\|_2^2 \leq c\\ 0 & \text{otherwise} \end{cases} $$ with \(c > 0\).
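
Both weighting functions are a few lines of NumPy; the default values of \(c\) below are illustrative placeholders, not the tuned values from the experiments.

```python
import numpy as np

def imq_weight(y, y_hat, c=1.0):
    """Inverse multi-quadratic: soft, compensation-based down-weighting."""
    return (1.0 + np.sum((y - y_hat) ** 2) / c**2) ** -0.5

def tmd_weight(y, y_hat, R_chol, c=4.0):
    """Thresholded Mahalanobis distance: hard detect-and-reject rule.

    R_chol is the lower Cholesky factor of the measurement covariance R_t,
    so solving against it applies R_t^{-1/2} to the residual.
    """
    resid = np.linalg.solve(R_chol, y - y_hat)
    return 1.0 if resid @ resid <= c else 0.0
```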

Theoretical results

Definition
The posterior influence function (PIF) is $$ \text{PIF}(\bm y_t^c, \bm y_{1:t-1}) = \text{KL}\left( q(\bm\theta_t \vert \bm y_t^c, \bm y_{1:t-1})\,\|\, q(\bm\theta_t \vert \bm y_t, \bm y_{1:t-1})\right), $$ where \(\bm y_t^c\) is a (possibly contaminated) replacement of the measurement \(\bm y_t\).

Definition: Outlier-robust filter
A filter is outlier-robust if the PIF is bounded, i.e., $$ \sup_{\bm y_t^c\in\R^d}|\text{PIF}(\bm y_t^c, \bm y_{1:t-1})| < \infty. $$

Theorem
If \(\sup_{\bm y_t\in\R^d} W\left(\bm y_t, \hat{\bm y}_t\right) < \infty\) and \(\sup_{\bm y_t\in\R^d} W\left(\bm y_{t}, \hat{\bm y}_t\right)^2\,\|\bm y_t\|^2 < \infty\), then the PIF is bounded.

Remarks
  1. The Kalman filter is not outlier-robust.
  2. Filters with the IMQ and TMD weighting functions are outlier-robust.

Figure: PIF for the 2d-tracking problem. The last measurement \(\bm y_t\) is replaced by \(\bm y_t^c = \bm y_t + \bm\epsilon\), where \(\bm\epsilon \in [-5, 5]^2\).
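
In the linear setting both posteriors in the PIF are Gaussian, so the KL divergence has a closed form. A sketch of how one might evaluate the PIF numerically, reusing the `wolf_step` helper above:

```python
import numpy as np

def gauss_kl(mu0, Sigma0, mu1, Sigma1):
    """KL( N(mu0, Sigma0) || N(mu1, Sigma1) )."""
    d = mu0.shape[0]
    Sigma1_inv = np.linalg.inv(Sigma1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(Sigma0)
    _, logdet1 = np.linalg.slogdet(Sigma1)
    return 0.5 * (np.trace(Sigma1_inv @ Sigma0)
                  + diff @ Sigma1_inv @ diff - d + logdet1 - logdet0)

def pif(mu, Sigma, y, y_c, F, Q, H, R, weight_fn):
    """PIF: KL between the posteriors under y_c (contaminated) and y."""
    mu_c, Sigma_c = wolf_step(mu, Sigma, y_c, F, Q, H, R, weight_fn)
    mu_y, Sigma_y = wolf_step(mu, Sigma, y, F, Q, H, R, weight_fn)
    return gauss_kl(mu_c, Sigma_c, mu_y, Sigma_y)
```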

Computational results

Compared to variational-Bayes (VB) methods, which require multiple inner iterations to converge, WoLF has the same computational cost as the standard Kalman filter.

Method      Time            #HP  Ref
KF          \(O(p^3)\)      0    Kalman1960
KF-B        \(O(I\,p^3)\)   3    Wang2018
KF-IW       \(O(I\,p^3)\)   2    Agamennoni2012
OGD         \(O(I\,p^2)\)   2    Bencomo2023
WoLF-IMQ    \(O(p^3)\)      1    (Ours)
WoLF-TMD    \(O(p^3)\)      1    (Ours)

Here, \(I\) is the number of inner iterations, \(p\) is the dimension of the state vector, and #HP is the number of hyperparameters.

Experiment: Kalman filter (KF)

2d tracking

Linear SSM with \({\bf Q}_t = q\,{\bf I}_4\), \({\bf R}_t = r\,{\bf I}_2\),

$$ \begin{aligned} {\bf F}_t &= \begin{pmatrix} 1 & 0 & \Delta & 0\\ 0 & 1 & 0 & \Delta \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, & {\bf H}_t &= \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 \end{pmatrix}, \end{aligned} $$

\(\Delta = 0.1\) is the sampling period, \(q = 0.10\) is the system noise, \(r = 10\) is the measurement noise, and \({\bf I}_K\) is the \(K\times K\) identity matrix.

We measure the RMSE of the posterior mean estimate of the state components.
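
These matrices are small enough to write down directly; a sketch of the setup, with an illustrative reading of the RMSE metric:

```python
import numpy as np

dt, q, r = 0.1, 0.10, 10.0             # sampling period, system and measurement noise
F = np.array([[1.0, 0.0, dt, 0.0],
              [0.0, 1.0, 0.0, dt],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])   # constant-velocity dynamics
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])   # observe positions only
Q = q * np.eye(4)
R = r * np.eye(2)

def rmse(theta_true, mu_post):
    """RMSE of the posterior mean over all steps and state components."""
    return np.sqrt(np.mean((theta_true - mu_post) ** 2))
```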

Student variant

Measurements are sampled according to

$$ \begin{aligned} p(\bm y_t \vert \bm\theta_t) &= \text{St}({\bm y_t\,\vert\,{\bf H}_t\bm\theta_t,\,{\bf R}_t,\,\nu_t})\\ &= \int \mathcal{N}\left(\bm y_t \,\vert\, {\bf H}_t\bm\theta_t,\, {\bf R}_t/\tau\right) \text{Gam}(\tau \,\vert\, \nu_t / 2,\, \nu_t / 2)\, d\tau. \end{aligned} $$

Mixture variant

Measurements are sampled according to

$$ \begin{aligned} p(\bm y_t \vert \bm\theta_t) &= {\cal N}\left(\bm y_t \vert {\bf m}_t, {\bf R}_t\right),\\ {\bf m}_t &= \begin{cases} {\bf H}_t\,\bm\theta_t & \text{w.p.}\ 1 - p_\epsilon,\\ 2\,{\bf H}_t\,\bm\theta_t & \text{w.p.}\ p_\epsilon, \end{cases} \end{aligned} $$

with \(p_\epsilon = 0.05\).
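
Both corruption models are easy to sample from; a sketch, where the degrees-of-freedom default \(\nu\) is an illustrative placeholder rather than the value used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_student(theta, H, R, nu=2.5):
    """Student-t measurement via its Gaussian scale-mixture representation."""
    tau = rng.gamma(shape=nu / 2, scale=2 / nu)   # Gam(nu/2, rate nu/2)
    return rng.multivariate_normal(H @ theta, R / tau)

def sample_mixture(theta, H, R, p_eps=0.05):
    """Gaussian measurement whose mean doubles with probability p_eps."""
    mean = 2 * H @ theta if rng.random() < p_eps else H @ theta
    return rng.multivariate_normal(mean, R)
```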


KF results

Figure: sample runs.

Mean slowdown rate over KF:

Method            Student  Mixture
KF-B              2.0x     3.7x
KF-IW             1.2x     5.3x
WoLF-IMQ (ours)   1.0x     1.0x
WoLF-TMD (ours)   1.0x     1.0x

Experiment: extended Kalman filter (EKF)

Online learning of UCI datasets

Online training of neural networks on a corrupted version of the tabular UCI datasets. We consider a multilayer perceptron (MLP) with two hidden layers, twenty hidden units, and a real-valued output unit. We evaluate the root median squared error (RMedSE) between the true and predicted output.

The SSM is $$ \begin{aligned} {\bm\theta_t} &= \bm\theta_{t-1} + \bm\phi_t\\ {\bm y}_t &= h_t(\bm\theta_t) + \bm\varphi_t, \end{aligned} $$
with \(h_t(\bm\theta_t) = h(\bm\theta_t, {\bf x}_t) \) the MLP and \({\bf x}_t\in\mathbb{R}^m\) the input vector.

We estimate \( \mathbb{E}[\bm\theta_t \vert \bm y_{1:t}] \) via the extended Kalman filter (EKF) — one measurement, one parameter update. With \({\bf H}_t\) the Jacobian of \(h_t\) evaluated at \(\bm\mu_{t|t-1}\), we modify the measurement mean with the first-order approximation

$$ \mathbb{E}[\bm y_t \vert \bm\theta_t] \approx {\bf H}_t(\bm\theta_t - \bm\mu_{t | t-1}) + h_t(\bm\mu_{t | t-1}) =: \bar{\bm y}_t. $$
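
A sketch of this linearization step, using a finite-difference Jacobian as an illustrative stand-in for automatic differentiation:

```python
import numpy as np

def jacobian_fd(h, mu, eps=1e-6):
    """Finite-difference Jacobian of h at mu."""
    h0 = np.atleast_1d(h(mu))
    J = np.zeros((h0.size, mu.size))
    for j in range(mu.size):
        step = np.zeros_like(mu)
        step[j] = eps
        J[:, j] = (np.atleast_1d(h(mu + step)) - h0) / eps
    return J

def linearized_mean(h, mu_pred, theta):
    """First-order approximation H_t (theta - mu_pred) + h(mu_pred)."""
    H = jacobian_fd(h, mu_pred)
    return H @ (theta - mu_pred) + np.atleast_1d(h(mu_pred))
```

With this \({\bf H}_t\), the weighted update step proceeds as in the linear case.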

EKF results

Figure: RMedSE versus time per step, relative to online gradient descent (OGD), across corrupted UCI datasets.

Experiment: ensemble Kalman filter (EnKF)

Data assimilation

The SSM is $$ \begin{aligned} \dot{\bm\theta}_{s,i} &= \Big(\bm\theta_{s, i+1} - \bm\theta_{s, i-2}\Big) \bm\theta_{s, i-1} - \bm\theta_{s,i} + \bm\phi_{s,i},\\ {\bm y}_{s,i} &= \begin{cases} \bm\theta_{s,i} + \bm\varphi_{s,i} & \text{w.p. } 1 - p_\epsilon,\\ 100 & \text{w.p. } p_\epsilon. \end{cases} \end{aligned} $$

Here, \(\bm\theta_{s,i}\) is the value of state component \(i\) at step \(s\), \(\bm\phi_{s,i} \sim {\cal N}(8, 1)\), \(\bm\varphi_{s,i} \sim {\cal N}(0, 1)\), \(p_\epsilon = 0.001\), \(i = 1, \ldots, d\), \(s = 1, \ldots, S\), with \(S \gg 1\) the number of steps, and periodic boundary conditions \(\bm\theta_{s, d + k} = \bm\theta_{s, k}\), \(\bm\theta_{s, -k} = \bm\theta_{s, d - k}\). We consider the per-step error \(L_s = \sqrt{\frac{1}{d}(\bm\theta_s - \bm\mu_s)^\intercal (\bm\theta_s - \bm\mu_s)}\).
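
These dynamics are the Lorenz 96 system with a noisy forcing term; a sketch of the drift and the corrupted observation model:

```python
import numpy as np

rng = np.random.default_rng(0)

def lorenz96_drift(theta):
    """Time derivative of each component; np.roll encodes the periodic
    boundary conditions, and the N(8, 1) draw is the forcing noise phi."""
    adv = (np.roll(theta, -1) - np.roll(theta, 2)) * np.roll(theta, 1)
    return adv - theta + rng.normal(8.0, 1.0, size=theta.size)

def observe(theta, p_eps=1e-3):
    """Unit-variance Gaussian observation, replaced by 100 w.p. p_eps."""
    y = theta + rng.normal(0.0, 1.0, size=theta.size)
    y[rng.random(theta.size) < p_eps] = 100.0
    return y
```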

EnKF results

Figure: bootstrap estimate of \(L_S\) over 20 runs and 500 samples, as a function of the hyperparameter \(c\).