Filtering notes (I): signal plus noise models
From tracking epidemics and predicting financial markets to controlling complex systems like rockets and personalising online experiences,
many real-world problems rely on understanding and processing sequential data.
In machine learning, this includes classes of problems such as
online (continual) learning,
one-step-ahead (prequential) forecasting,
contextual bandits,
and reinforcement learning. Filtering techniques, such as the Kalman Filter, have a long and successful history of processing sequential data
in fields like aerospace engineering and signal processing.
In this series,
we introduce filtering as a general framework for learning from sequential data and
solving sequential decision-making problems in machine learning.
Because we will develop the framework from the ground up, the only prerequisites are basic concepts in linear algebra, calculus, and statistics. In this first part, we explore the foundational mathematical tools needed to tackle sequential problems from a filtering perspective.
We start by examining signal-plus-noise models and finding their best linear unbiased predictors (BLUPs) — a statistical estimate of a signal given noisy measurements.
Then, we introduce the concept of innovations, a key idea that simplifies BLUP computation.
Building on this, we show how fundamental operations — such as filtering, prediction, smoothing, and fixed-lag smoothing —
arise naturally when applying BLUPs across different timeframes.
Finally, we show how to carry out the computations discussed in this post when the signal-plus-noise process
is sampled from a noisy Lotka-Volterra model. The most fundamental assumption in all the techniques and models that we will study
is the signal plus noise (SPN) assumption.
The SPN assumption models a random process of univariate real-valued observations (or measurements)
$Y_{1:T} = (Y_1, \ldots, Y_T)^\intercal$
as the sum of two components:
a signal process
$F_{1:T} \in \mathbb{R}^T$
and a noise process
${E}_{1:T} \in \mathbb{R}^T$.
For example, if $Y_t$ is the return of a stock at time $t$,
$F_t$ is the signal, alpha, or trend of the stock, and
$E_t$ is market noise.
We write a model with SPN assumption as
$$
\tag{S.1}
\underbrace{Y_{1:T}}_\text{measurement} =
\underbrace{F_{1:T}}_{\text{signal}} +
\underbrace{{E}_{1:T}}_{\text{noise}}.
$$
For an SPN process, we assume
$$
\begin{aligned}
\cov(F_t, E_j) = 0 & \quad \forall \quad t,j,\\
\cov(E_t, E_j) = 0 &\quad \forall\quad t \neq j,
\end{aligned}
$$
meaning that the noise is uncorrelated with the signal and that distinct noise terms are uncorrelated with each other.
Here, ${\cal T} = \{1, \ldots, T\}$.
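As a toy illustration (our own; the random-walk signal and the noise scale are arbitrary choices), a sample of an SPN process can be generated as follows:

import numpy as np

rng = np.random.default_rng(0)
T = 200
f = np.cumsum(rng.normal(0.0, 0.1, size=T))  # signal process F_{1:T} (zero-mean increments)
e = rng.normal(0.0, 0.5, size=T)             # noise process E_{1:T}, drawn independently of the signal
y = f + e                                    # measurement process Y_{1:T}, cf. (S.1)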
Next, for simplicity, we assume
$\mathbb{E}[F_t] = 0$ and
$\mathbb{E}[{E}_t] = 0$. Given an SPN process, our goal is to determine the value of the signal $F_t$
given a sequence of observations $Y_{1:j}$. Let $F_t \in \mathbb{R}$ be the target signal and
let $Y_{1:j} = (Y_1, \ldots, Y_j)^\intercal \in \mathbb{R}^{j}$ be a subset of the observations;
here $(j,t)\in{\cal T}^2$.
Consider a linear predictor parametrised by a vector $\va \in \mathbb{R}^j$
that estimates the signal $F_t$ given the observations $Y_{1:j}$:
$$
\va^\intercal\,Y_{1:j}.
$$ For mathematical convenience,
our choice of $\va\in{\mathbb R}^{j}$ is determined
by minimising the expected squared error between the target signal $F_t$ and the linear prediction
$\va^\intercal\,Y_{1:j}$ with respect to $\va$, that is,
$$
\va^* = \argmin_\va\mathbb{E}\left[\left(F_t - \va^\intercal\,Y_{1:j}\right)^2\right].
$$ Having found $\va^*$, the transformation
$F_{t|j} := (\va^*)^\intercal\,Y_{1:j}$
is called
the best linear unbiased predictor (BLUP) of the signal $F_t$ given measurements $Y_{1:j}$ —
we formalise the idea of the BLUP in the proposition below. Let $(j,t)\in{\cal T}^2$.
The vector $\va^*$ that minimises the expected squared error between the signal random variable $F_{t}$
and the linear predictor $\va^\intercal\,Y_{1:j}$ takes the form
$$
\va^*
= \argmin_\va\mathbb{E}\left[\left(F_t - \va^\intercal\,Y_{1:j}\right)^2\right]
= \var(Y_{1:j})^{-1}\,\cov(Y_{1:j}, F_t).
$$
Then, the best linear unbiased predictor (BLUP) for the signal $F_t$ given $Y_{1:j}$ is
$$
\begin{aligned}
\tag{BLUP.1}
F_{t|j}
&= (\va^*)^\intercal\,Y_{1:j} \\
&= \cov(F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j}.
\end{aligned}
$$ Furthermore, the error variance-covariance matrix of the BLUP is
$$
\tag{EVC.1}
\var(F_t - F_{t|j}) =
\var(F_t) - \va^{*\intercal}\,\var(Y_{1:j})\,\va^*.
$$ See the Appendix for a proof. If $\cov(F_t, Y_{1:j})$ and $\var(Y_{1:j})$ are known,
and we are given a sample of the measurement process $Y_{1:j}$,
then ${\rm (BLUP.1)}$ is the optimal way to extract the signal
from the signal-plus-noise process.
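As a sketch of how $(\text{BLUP.1})$ and $(\text{EVC.1})$ translate into code — assuming the quantities $\cov(F_t, Y_{1:j})$, $\var(Y_{1:j})$, and $\var(F_t)$ are available as numpy arrays (the function and argument names below are ours):

import numpy as np

def blup_direct(cov_f_y, var_y, var_f, y):
    """Estimate F_{t|j} via (BLUP.1) and its error variance via (EVC.1).

    cov_f_y: (j,) array, cov(F_t, Y_{1:j})
    var_y:   (j, j) array, var(Y_{1:j})
    var_f:   scalar, var(F_t)
    y:       (j,) observed sample of Y_{1:j}
    """
    a_star = np.linalg.solve(var_y, cov_f_y)    # a* = var(Y_{1:j})^{-1} cov(Y_{1:j}, F_t)
    f_blup = a_star @ y                         # F_{t|j} = (a*)^T y_{1:j}
    err_var = var_f - a_star @ var_y @ a_star   # var(F_t - F_{t|j}), cf. (EVC.1)
    return f_blup, err_var

Note that each call factorises a $j \times j$ matrix — the cost we discuss next.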
Whenever $j=t$ (and we let $t$ vary), we call this procedure filtering.
We come back to this point below. In practice, however, estimating ${\rm (BLUP.1)}$ becomes computationally prohibitive because of the term
$\var(Y_{1:j})$, which we have to invert at each timeframe $j$ of interest.
This imposes a computational cost of $O(j^3)$ for every $j$ of interest,
rendering the use of ${\rm (BLUP.1)}$ impractical for large $j$. To overcome this computational bottleneck,
we develop a process derived from the measurements that decorrelates the observations.
This process, referred to as the innovation process,
provides an alternative formulation of $(\text{BLUP.1})$ with a computational complexity that scales linearly with time. The innovation random variable
${\cal E}_t$,
derived from the measurement random variable $Y_{t}$ and the innovation process
${\cal E}_{1:t-1}$,
is defined by
$$
\tag{I.1}
{\cal E}_t =
\begin{cases}
Y_1 & \text{for } t = 1,\\
Y_t - \sum_{k=1}^{t-1} \frac{\cov(Y_t, {\cal E}_k)}{\var({\cal E}_k)}\,{\cal E}_k & \text{for } t \geq 2.
\end{cases}
$$
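A naive numpy sketch of this recursion (our own illustration; it assumes the covariance matrix $\var(Y_{1:T})$ is known and operates on a single observed sample $y_{1:T}$) is:

import numpy as np

def innovations(y, var_y):
    """Sequentially decorrelate a sample y of Y_{1:T} following (I.1).

    y:     (T,) observed sample of Y_{1:T}
    var_y: (T, T) array, var(Y_{1:T})
    Returns the innovations e, their variances S, and the weights
    w[t, k] = cov(Y_t, E_k) / var(E_k) used in the recursion.
    """
    T = len(y)
    e = np.zeros(T)              # innovations E_1, ..., E_T
    S = np.zeros(T)              # innovation variances var(E_k)
    w = np.zeros((T, T))         # cov(Y_t, E_k) / var(E_k), lower triangular
    cov_y_e = np.zeros((T, T))   # cov(Y_t, E_k)
    for t in range(T):
        for k in range(t):
            # cov(Y_t, E_k) = cov(Y_t, Y_k) - sum_{m<k} w[k, m] cov(Y_t, E_m)
            cov_y_e[t, k] = var_y[t, k] - w[k, :k] @ cov_y_e[t, :k]
            w[t, k] = cov_y_e[t, k] / S[k]
        e[t] = y[t] - w[t, :t] @ e[:t]                  # (I.1)
        S[t] = var_y[t, t] - w[t, :t] @ cov_y_e[t, :t]  # var(E_t)
    return e, S, w

Building the weights costs $O(T^3)$ overall, but it is paid once upfront rather than at every timeframe of interest.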
As we see, the innovation process
${\cal E}_{1:t}$
is built sequentially. Innovations have various properties that make working
with $(\text{BLUP.1})$ much more tractable;
we provide some of these properties in the proposition below. Let $Y_{1:T}$ be an SPN random process and
let ${\cal E}_{1:T}$ be the innovation process derived from $Y_{1:T}$.
The following properties hold:

1. Innovations are uncorrelated:
$$
\tag{I.2}
\cov({\cal E}_t, {\cal E}_j)= \var({\cal E}_t)\,\mathbb{1}(t = j).
$$
2. Measurements are linear functions of innovations: the measurement process and the innovation process satisfy the relationship
$$
\tag{I.3}
Y_{1:T} = \vL\,{\cal E}_{1:T},
$$
with $\vL$ a $(T\times T)$ lower-triangular matrix with elements
$$
\tag{I.4}
\vL_{t,j} =
\begin{cases}
\frac{\cov(Y_t, {\cal E}_j)}{\var({\cal E}_j)} & \text{if } j < t, \\
1 & \text{if } j = t, \\
0 & \text{if } j > t.
\end{cases}
$$
3. Cholesky decomposition of measurements: the variance of the measurement process satisfies
$$
\tag{I.5}
\var(Y_{1:T}) = \vL\,\vS\,\vL^\intercal,
$$
with
$\vS = {\rm diag}(\var({\cal E}_1), \ldots, \var({\cal E}_{T}))$.
This corresponds to a Cholesky decomposition of the variance matrix of the measurements.

For a proof, see the Appendix. Take $(t,j) \in {\cal T}^2$ and
let
${\cal E}_{1:j}$
be the innovation process
derived from the measurement process $Y_{1:j}$.
Then, the BLUP of the signal $F_t$ given the measurement process
$Y_{1:j}$
is written
as a sum of the first $j$ innovations weighted by gain values:
$$
\tag{BLUP.2}
\boxed{
\begin{aligned}
F_{t|j} &= \sum_{k=1}^j \hat{K}_{t,k}\,{\cal E}_k,\\
\hat{K}_{t,k} &= \frac{\cov(F_t, {\cal E}_k)}{\var({\cal E}_k)}.
\end{aligned}
}
$$
Here, $\hat{K}_{t,k}$
is the gain value of the signal $F_t$ given the innovation ${\cal E}_k$ —
a signal-to-noise ratio that measures the covariance between the signal at time $t$ and the innovation at time $k$,
relative to the variance of the innovation at time $k$. Furthermore, the error variance-covariance matrix between the signal and the BLUP takes the form
$$
\begin{aligned}
\tag{EVC.2}
\var(F_t - F_{t|j})
% = \var(F_t) - \sum_{k=1}^j \hat{K}_{t,k}\,\vS_k\,\hat{K}_{t,k}^\intercal.
= \var(F_t) - \sum_{k=1}^j \hat{K}_{t,k}^2\,\var({\cal E}_k).
\end{aligned}
$$ For a proof, see the Appendix. Equation $(\text{BLUP.2})$ highlights a crucial property of working with innovations:
the number of computations required to estimate $F_{t|j}$ becomes linear in time —
computing $(\text{BLUP.2})$ only requires $O(j)$ operations. It is important to note that the cubic complexity has not vanished;
rather, it has been shifted from the computation
of the BLUP to the computation of the innovations, which is achieved through the Cholesky decomposition.
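To fix ideas, here is a minimal sketch (ours) of $(\text{BLUP.2})$ as a function of the target time $t$ and the frame of reference $j$, assuming the gains K[t, k] and the innovations e[k] have already been computed (for instance with the recursion sketched above):

import numpy as np

def blup(K, e, t, j):
    """F_{t|j} as in (BLUP.2): the first j innovations weighted by the gains.

    K: (T, T) array of gains, K[t, k] = cov(F_t, E_k) / var(E_k)
    e: (T,) innovations
    t, j: 0-indexed target time and frame of reference
    """
    return K[t, : j + 1] @ e[: j + 1]

Filtering, prediction, smoothing, and fixed-lag smoothing, discussed next, correspond to particular choices of $t$ and $j$ in this helper.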
As we will later see, introducing additional assumptions about the signal process further reduces
the computational burden to estimate the BLUP. Having a target signal $F_t$,
the BLUP estimate $F_{t|j}$ of the form
$({\rm BLUP.2})$
computes a sum of
innovations weighted by their corresponding signal-to-noise ratios (gain values).
In this formulation,
the only degree of freedom is the value of $j$ — the “frame of reference”.
This value determines how much information gets incorporated to estimate the signal at time $t$.
As a result, different choices of $j$ lead to different BLUP estimates,
reflecting the varying amounts of information used to estimate the signal. Broadly speaking, the choices for the frame of reference $j$ can be categorised as filtering ($j = t$), prediction ($j < t$), smoothing ($j = T$), and fixed-lag smoothing ($j = t + i$ for a fixed $i \geq 1$). We describe these quantities in more detail below. The term filtering refers to the action of filtering out the noise $E_t$ to estimate the signal $F_t$.
Filtering is defined as
$$
\tag{F.1}
\boxed{
F_{t | t} = \sum_{k=1}^t \hat{K}_{t,k}\,{\cal E}_k.
}
$$
This quantity is one of the most important in signal-processing theory. We exemplify filtering in the table below.
The top row shows the point in time in which the BLUP is estimated;
the middle row shows in $\color{#00B7EB}\text{cyan}$ the unknown target signal; and
the bottom row shows in $\color{orange}\text{orange}$ the measurements that we use to estimate the BLUP. $$
\begin{array}{c|ccccc}
\text{BLUP} & & & & F_{t|t} & & & \\
\text{signal} & F_{t-3} & F_{t-2} & F_{t-1} & \color{#00B7EB}{F_{t}} & F_{t+1} & F_{t+2} & F_{t+3}\\
\text{measurement} & \color{orange}{Y_{t-3}} & \color{orange}{Y_{t-2}} & \color{orange}{Y_{t-1}} & \color{orange}{Y_{t}} & Y_{t+1} & Y_{t+2} & Y_{t+3}\\
\hline
\text{time} & t-3 & t-2 & t-1 & t & t+1 & t+2 & t+3
\end{array}
$$
Filtering considers all measurements up to time $t$ to make an estimate of the signal at time $t$.
Because we only use measurements up to time $t$ to estimate the signal at time $t$, filtering is said to be online. This quantity estimates the future signal $F_{t+i}$,
given the measurements $Y_{1:t}$.
Here, $i \geq 1$.
We define an $i$-th step ahead prediction as
$$
\tag{F.2}
\boxed{
% f_{t + i |t} = \sum_{k=1}^{t} \cov(f_{t+1}, \varepsilon_k)\,S_k^{-1}\,\varepsilon_k.
F_{t + i | t} = \sum_{k=1}^t \hat{K}_{t+i,k}\,{\cal E}_k.
}
$$ We exemplify a two-step-ahead prediction in the table below
$$
\begin{array}{c|ccccc}
\text{BLUP} & & & & F_{t+2|t} & & & \\
\text{signal} & F_{t-3} & F_{t-2} & F_{t-1} & F_{t} & F_{t+1} & \color{#00B7EB}{F_{t+2}} & F_{t+3}\\
\text{measurement} & \color{orange}{Y_{t-3}} & \color{orange}{Y_{t-2}} & \color{orange}{Y_{t-1}} & \color{orange}{Y_{t}} & Y_{t+1} & Y_{t+2} & Y_{t+3}\\
\hline
\text{time} & t-3 & t-2 & t-1 & t & t+1 & t+2 & t+3
\end{array}
$$
As shown in the table above, a two-step-ahead prediction considers all measurements up to time $t$ to
make an estimate of the signal at time $t+2$. This quantity estimates the signal $F_t$ having seen all measurements $Y_{1:T}$.
We define smoothing as
$$
\tag{F.3}
\boxed{
% f_{t | T} = \sum_{k=1}^T \cov(f_t, \varepsilon_k)\,S_k^{-1}\,\varepsilon_k.
F_{t | T} = \sum_{k=1}^T \hat{K}_{t,k}\,{\cal E}_k.
}
$$ We exemplify smoothing at time $t$ in the table below
$$
\begin{array}{c|ccccc}
\text{BLUP} & & & & & & & F_{t|T} \\
\text{signal} & F_{t-3} & F_{t-2} & F_{t-1} & \color{#00B7EB}{F_{t}} & F_{t+1} & \ldots & F_T\\
\text{measurement} & \color{orange}{Y_{t-3}} & \color{orange}{Y_{t-2}} & \color{orange}{Y_{t-1}}
& \color{orange}{Y_{t}} & \color{orange}{Y_{t+1}} & \ldots & \color{orange}{Y_{T}}\\
\hline
\text{time} & t-3 & t-2 & t-1 & t & t+1 & \ldots & T
\end{array}
$$
Contrary to the filtering equation $(\text{F.1})$, which is online, the smoothing operation
requires all information up to time $T$ to make an estimate of the signal at time $t$.
In this sense, smoothing is offline. This quantity is a middle ground between filtering, which is online, and smoothing, which is offline.
The idea behind $i$-step fixed-lag smoothing is to estimate the signal $F_t$ given $Y_{1:t+i}$.
That is, we must wait $i$ steps before making an estimate of the signal $F_t$.
$$
\tag{F.4}
\boxed{
% f_{t | t + i} = \sum_{k=1}^{t+i} \cov(f_t, \varepsilon_k)\,S_k^{-1}\,\varepsilon_k.
F_{t | t + i} = \sum_{k=1}^{t+i} \hat{K}_{t,k}\,{\cal E}_k.
}
$$ We exemplify a two-step fixed-lag smoothing below
$$
\begin{array}{c|ccccc}
\text{BLUP} & & & & & & F_{t|t+2} & \\
\text{signal} & F_{t-3} & F_{t-2} & F_{t-1} & \color{#00B7EB}{F_{t}} & F_{t+1} & F_{t+2} & F_{t+3}\\
\text{measurement} & \color{orange}{Y_{t-3}} & \color{orange}{Y_{t-2}} & \color{orange}{Y_{t-1}} & \color{orange}{Y_{t}}
& \color{orange}{Y_{t+1}} & \color{orange}{Y_{t+2}} & Y_{t+3}\\
\hline
\text{time} & t-3 & t-2 & t-1 & t & t+1 & t+2 & t+3
\end{array}
$$
We observe that we require information up to time $t+2$ to estimate the signal at time $t$. We generalise the ideas of the previous section by allowing the observations to
be $D$-dimensional, i.e.,
$Y_t \in \mathbb{R}^D$ with $D > 1$.
The multivariate SPN model is composed of
a random observation matrix $Y_{1:T}$,
a random signal matrix $F_{1:T}$, and
a noise matrix $E_{1:T}$, all living in $\mathbb{R}^{T\times D}$. In the multivariate setting, computing the BLUP amounts to simply
adding an extra index over the dimensions (components). We begin by writing the BLUP in the multivariate case.
In this scenario, we seek a matrix
$\vA^* \in \mathbb{R}^{D\times j}$ such that
$$
\begin{aligned}
\vA^*
&= \argmin_\vA
\mathbb{E}\left[\|F_{t} - \text{diag}(\vA\,Y_{1:j})\|^2\right]
\\
&= \argmin_{\va_1, \ldots, \va_D}
\sum_{d=1}^D
\mathbb{E}\left[(F_{t,d} - (\va_d)^\intercal\,Y_{1:j,d})^2\right],
\end{aligned}
$$
where
$\va_d \in \mathbb{R}^{j}$ is the $d$-th row of $\vA$, written as a column vector. Optimising for the $d$-th component, we obtain
$$
\va^*_d
= \argmin_{\bf a_d}\mathbb{E}\left[\|F_{t,d} - (\va_{d})^\intercal\,Y_{1:j,d}\|^2\right]
= \var(Y_{1:j,d})^{-1}\,\cov(Y_{1:j,d}, F_{t,d}).
$$ As a consequence, the BLUP for the $d$-th component takes the form
$$
F_{t|j,d} = (\va_d^*)^\intercal\,Y_{1:j,d}.
$$ Because the BLUP $F_{t|j,d}$ only depends on observations in the $d$-th component,
we generalise the idea of an innovation to multivariate random processes by simply computing the
Cholesky decomposition of the variance of the observations at the $d$-th component.
For the $d$-th component of the observation process, we have
$$
\tag{I.6}
\var(Y_{1:j,d}) = \vL_d\,\vS_{d}\,\vL_d^\intercal.
$$ So that
$$
\tag{I.7}
Y_{1:j,d} = \vL_{d}\,{\cal E}_{1:j,d} \iff {\cal E}_{1:j,d} = \vL_d^{-1}\,Y_{1:j,d}.
$$ Next, we generalise the gain values to the multidimensional case.
Let
$$
\tag{G.2}
\hat{K}_{t,k,d} = \frac{\cov(F_{t,d}, {\cal E}_{k,d})}{\var({\cal E}_{k,d})}
$$
be the gain of the $d$-th component for the signal at time $t$ and innovation at time $k$. As before, the BLUP of the signal $F_{t,d}$ given the measurement process
$Y_{1:j,d}$
is written
as a sum of innovations weighted by gain values:
$$
\tag{BLUP.3}
F_{t|j,d} = \sum_{k=1}^j \hat{K}_{t,k,d}\,{\cal E}_{k,d}.
$$ For the error variance-covariance matrix at time $t$ using the BLUP at time $j$, we have
$$
\tag{EVC.3}
\begin{aligned}
\var(F_{t,d} - F_{t|j,d})
% = \var(F_{t,d}) - \sum_{k=1}^j \hat{K}_{t,k,d}\,\vS_{k,d}\,\hat{K}_{t,k,d}^\intercal
= \var(F_{t,d}) - \sum_{k=1}^j \hat{K}_{t,k,d}^2\,\vS_{k,d}.
\end{aligned}
$$ Finally, the multivariate BLUP is found by stacking each unidimensional BLUP row-wise:
$$
F_{t|j} =
\begin{bmatrix}
{F}_{t|j,1}\\
\vdots \\
{F}_{t|j,D}\\
\end{bmatrix} =
\begin{bmatrix}
(\va_{1}^*)^\intercal\,Y_{1:j, 1} \\
\vdots\\
(\va_{D}^*)^\intercal\,Y_{1:j, D} \\
\end{bmatrix}.
$$ In previous sections, we showed that the observations $Y_{1:T}$ in a signal-plus-noise process
can be transformed into a decorrelated innovation process
${\cal E}_{1:T}$.
The innovation process allows us to compute the best linear unbiased predictor (BLUP) for the signal at time $t$
based on information available up to time $j$ — denoted $F_{t|j}$ — in a way that is linear with respect to $j$. Now, let’s assume we have a training phase where we observe samples for both the signal and the noise processes.
With these samples, we estimate the quantities
$\vL$ and
$\hat{K}_{t,k,d}$ for
$(t,k)\in{\cal T}^2$. Furthermore, assume we have a test phase, where we are only given a sample of the observation process, which we denote $y_{1:T}$.
Because in the test phase we do not have direct access to either the signal or the noise process,
we must rely on estimates derived from the BLUP formula $(\text{BLUP.2})$ using the measurement $y_{1:j}$.
We provide a detailed example below. In this example, we consider a data-driven approach to estimate the BLUP over multiple frames of reference.
We divide this experiment into a train phase where the quantities
$\cov(F_t, {\cal E}_k)$,
$\cov(Y_t, {\cal E}_k)$,
$\vS_k = \var({\cal E}_k)$, and
$\vL$
are estimated;
and a test phase where the BLUP estimates $(\text{F.1 – F.4})$
are obtained from a sampled measurement $y_{1:T}$ not seen in the train phase. For this experiment, we make use of the numpy library, which includes the einsum operator, and the einops library. For a review of the np.einsum function (which we use extensively throughout this experiment) see the post einsums in the wild. We define the discretised noisy Lotka-Volterra model as the following SPN process
$$
\tag{LV.1}
\begin{aligned}
\frac{f_{t + \Delta, 1} - f_{t, 1}}{\Delta} &= \alpha\,f_{t,1} - \beta\,f_{t,1}\,f_{t,2} + \phi_{t,1},\\
\frac{f_{t + \Delta, 2} - f_{t, 2}}{\Delta} &= \delta\,f_{t,1}\,f_{t,2} - \gamma\,f_{t,2} + \phi_{t, 2},\\
y_{t + \Delta, 1} &= f_{t + \Delta, 1} + \varphi_{t,1},\\
y_{t + \Delta, 2} &= f_{t + \Delta, 2} + \varphi_{t,2},
\end{aligned}
$$ with
$\phi_{t,i} \sim {\cal N}(0, \sigma_f^2 / \Delta)$,
$\varphi_{t,j} \sim {\cal N}(0, \sigma_y^2)$,
$(i,j) \in \{1,2\}^2$,
$\sigma_f^2, \sigma_y^2 > 0$,
$\Delta \in (0, 1)$, and
$\alpha, \beta, \gamma, \delta$ positive constants. Consider samples of the process $(\text{LV.1})$ with the following parameters:
$\alpha = 2/3$,
$\beta = 4/3$,
$\gamma = 0.8$,
$\delta = 1.0$,
$\Delta = 0.01$,
$\sigma_f^2 = 0.02^2$, and
$\sigma_y^2 = 0.1^2$.
We integrate the system for $T=1500$ steps, each starting at $(f_{0,1}, f_{0,2}) = (1.0, 1.0) + (u_1, u_2)$,
with $u_i \sim {\cal U}[-0.2, 0.2]$. The following plot shows multiple samples of this process.
The black line shows the signal $(f_{t,1}, f_{t,2})$ and the coloured line shows the measurements $(y_{t,1}, y_{t,2})$
for $t=1, \ldots, T.$
Our task is to estimate the black lines given the coloured lines. In the train phase, we consider $N=2000$ samples of $(\text{LV.1})$ following the configuration outlined above.
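As a sketch (ours, not the post's original simulation code), such samples can be generated with a forward-Euler discretisation of $(\text{LV.1})$; the names T, D, n_sims, f_sims, and y_sims are our assumptions, chosen to match the snippets that follow.

import numpy as np
import einops  # used by the snippets below

rng = np.random.default_rng(314)
T, D, n_sims = 1500, 2, 2000            # steps, dimensions, number of samples
alpha, beta, gamma, delta = 2 / 3, 4 / 3, 0.8, 1.0
dt, sigma_f, sigma_y = 0.01, 0.02, 0.1

f_sims = np.zeros((T, D, n_sims))       # signal samples f_{t,d}^{(s)}
y_sims = np.zeros((T, D, n_sims))       # measurement samples y_{t,d}^{(s)}
f = 1.0 + rng.uniform(-0.2, 0.2, size=(D, n_sims))  # random initial conditions
for t in range(T):
    phi = rng.normal(0.0, sigma_f / np.sqrt(dt), size=(D, n_sims))
    drift = np.stack([
        alpha * f[0] - beta * f[0] * f[1],   # prey equation
        delta * f[0] * f[1] - gamma * f[1],  # predator equation
    ])
    f = f + dt * (drift + phi)
    f_sims[t] = f
    y_sims[t] = f + rng.normal(0.0, sigma_y, size=(D, n_sims))

# de-mean across samples so that E[F_t] = 0 holds empirically
f_sims -= f_sims.mean(axis=-1, keepdims=True)
y_sims -= y_sims.mean(axis=-1, keepdims=True)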
To enforce the constraint $\mathbb{E}[F_t] = 0$, we de-mean the samples. Programmatically, assume we have a numpy array of measurements y_sims and a numpy array of signals f_sims with f_sims.shape == y_sims.shape == (1500, 2, 2000), corresponding to the 1500 steps of the process, two dimensions, and 2000 samples. Mathematically, we denote by $y_{t,d}^{(s)}$ the $d$-th dimension of the $s$-th training observed sample at time $t$. We begin by estimating the matrices $\vL_d$. Recall from $(\text{I.6})$ that
$\var(Y_{1:T,d}) = \vL_d\,\vS_d\,\vL_d^\intercal$.
Then, to estimate $\vL_d$ and $\vS_d$, we first approximate the variance of the measurement process
as follows:
$$
\begin{aligned}
\var(Y_{1:T,d})
&= \mathbb{E}[(Y_{1:T,d} - \bar{Y}_{1:T,d})(Y_{1:T,d} - \bar{Y}_{1:T,d})^\intercal]\\
&\approx \frac{1}{N}\sum_{s=1}^N \left(y_{1:T,d}^{(s)} - \bar{y}_{1:T,d}\right)\,\left(y_{1:T,d}^{(s)} - \bar{y}_{1:T,d}\right)^\intercal\\
&= \frac{1}{N}\sum_{s=1}^N \left(y_{1:T,d}^{(s)}\right)\,\left(y_{1:T,d}^{(s)}\right)^\intercal\\
\end{aligned}
$$
where we make use of the fact that the sample mean is zero by construction, i.e.,
$\bar{y}_{1:T,d} = \frac{1}{N}\sum_s y_{1:T,d}^{(s)} = 0$.
In code, the variance of the measurements is estimated by

V = np.einsum("tds,kds->dtk", y_sims, y_sims) / (n_sims - 1)

Next, the terms $\vL_d$ and $\vS_d$ for all $d=1,\ldots,D$ are estimated through a Cholesky decomposition of the sample variance.
This is done as follows

L = np.linalg.cholesky(V)               # Cholesky decomposition of the variance
S = np.einsum("dtt->dt", L)             # S-terms are diagonal of the output
L = np.einsum("dtk,dk->dtk", L, 1 / S)  # Make diagonal be 1-valued
S = S ** 2                              # Make variance

Finally, we estimate the innovations derived from the samples y_sims and the matrix L following $({\rm I.7})$.

ve_sims = einops.rearrange(y_sims, "t d s -> d t s")   # Place in dims to solve system
ve_sims = np.linalg.solve(L, ve_sims)                  # Solve system, obtain innovations
ve_sims = einops.rearrange(ve_sims, "d t s -> t d s")  # Match dimension ordering of measurements

Having estimated the innovations, the next step is to compute the gain values.
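Before computing the gains, a quick sanity check — this snippet is ours, not part of the original pipeline — verifies empirically that the innovations are (approximately) uncorrelated across time, as stated in $(\text{I.2})$:

C = np.einsum("tds,kds->dtk", ve_sims, ve_sims) / (n_sims - 1)  # empirical covariance of innovations
off_diag = ~np.eye(T, dtype=bool)
print(np.abs(C[:, off_diag]).max(), np.abs(C).max())  # off-diagonal entries should be much smaller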
Following
$({\rm G.2})$,
we have
$$
\begin{aligned}
\hat{K}_{t,k,d}
&= \cov(F_{t,d}, {\cal E}_{k,d})\,\var({\cal E}_{k,d})^{-1}\\
&\approx\left(\frac{1}{N}\sum_{s=1}^N\left(f_{t,d}^{(s)} - \bar{f}_{t,d}\right)\,\left(\varepsilon_{k,d}^{(s)} - \bar{\varepsilon}_{k,d}\right)\right)\,\vS_{k,d}^{-1}\\
&\approx\frac{1}{N}\sum_{s=1}^N\frac{f_{t,d}^{(s)} \, \varepsilon_{k,d}^{(s)}}{\vS_{k,d}},
\end{aligned}
$$
where
$\bar{f}_{t,d} = \bar{\varepsilon}_{k,d} = 0$
for all $(t,k) \in {\cal T}^2$.
In code, this is done as follows

K = np.einsum("tds,kds,dk->tkd", f_sims, ve_sims, 1 / S) / (n_sims - 1)

The next Figure shows the gain matrices $\vK_{t,k,d}$ for $d=1,2$, corresponding to the $x$ and $y$ coordinates
of the process.
We observe that, in each panel, the entries of $K_{t,k,d}$ below the diagonal are non-zero,
whereas the entries above the diagonal are close to zero.
This suggests that, for any two points in time $t$ and $k$, with $t$ being the target index and $k$ the
frame of reference, the information that $k$ brings to inform the timestep $t$
is non-zero if $t \geq k$.
Conversely, because our SPN process runs forward in time, for $t < k$,
how much $k$ is used to inform $t$ is close to zero. For the error variance-covariance matrix at time $t$ using the BLUP at time $j$, we have
$$
\begin{aligned}
\var(F_{t,d} - F_{t|j,d})
= \var(F_{t,d}) - \sum_{k=1}^j \hat{K}_{t,k,d}\,\vS_{k,d}\,\hat{K}_{t,k,d}^\intercal
\end{aligned}
$$
where
$$
\begin{aligned}
\var(F_{1:T,d})
&= \mathbb{E}[(F_{1:T,d} - \bar{F}_{1:T,d})(F_{1:T,d} - \bar{F}_{1:T,d})^\intercal]\\
&\approx \frac{1}{N}\sum_{s=1}^N \left(f_{1:T,d}^{(s)} - \bar{f}_{1:T,d}\right)\,\left(f_{1:T,d}^{(s)} - \bar{f}_{1:T,d}\right)^\intercal\\
&= \frac{1}{N}\sum_{s=1}^N \left(f_{1:T,d}^{(s)}\right)\,\left(f_{1:T,d}^{(s)}\right)^\intercal.
\end{aligned}
$$ In code, this is estimated as

var_signal = np.einsum("tds,kds->dtk", f_sims, f_sims) / n_sims

Having var_signal, we compute the correlation matrix of the signal process using

std_var = np.sqrt(var_signal)
cov_signal = np.einsum("dij,dii,djj->dij", var_signal, 1 / std_var, 1 / std_var)

The next Figure shows the correlation matrix of the signal process.
Having computed K, L, and var_signal, we now evaluate the BLUP estimates $(\text{F.1 – F.4})$
on an unseen run ve_test_sim with ve_test_sim.shape == (1500, 2).
Furthermore, we also evaluate error-bounds around our estimate of $f$.
We consider the run below. Here, we compute the BLUP corresponding to the filter estimate, i.e., $k=t$.
Because we assume that we have access to $y_{1:T}$, we compute the filter estimate by
use of masking, i.e.,
$$
\begin{aligned}
f_{t|t,d}
&= \sum_{k=1}^t\hat{K}_{t,k,d}\,\varepsilon_{k,d}\\
&= \sum_{k=1}^T\hat{K}_{t,k,d}\,\varepsilon_{k,d}\,\vone(k \leq t).
\end{aligned}
$$ This takes the form

tmask = np.tril(np.ones((T, T)), k=0)
latent_filter = np.einsum("tkd,kd,tk->td", K, ve_test_sim, tmask)

Next, the standard error of the BLUP takes the form

std_filter = (
    np.einsum("dtt->td", var_signal) -
    np.einsum("tkd,dk,tkd,tk->td", K, S, K, tmask)  # sum_k K_{t,k,d}^2 S_{k,d}
)
std_filter = np.sqrt(std_filter)

The image below shows the filtering of the true (latent) signal (in black)
and the filtered signal (in colour).
Around the estimated signal, we plot a one standard deviation error in the estimation of the
true position of the signal. This corresponds to latent_filter ± std_filter. Next, we consider the problem of smoothing.
This is written as a straightforward modification of the code-block above,
in which we remove the masking element. We obtain

latent_smooth = np.einsum("tkd,kd->td", K, ve_test_sim)
We observe that, relative to the filtering exercise, the recovered signal has much less variance. Here, we consider the problem of five-step-ahead prediction.
In code, prediction is written as

tmask = np.tril(np.ones((n_steps, n_steps)), k=-5)
latent_pred = np.einsum("tkd,kd,tk->td", K, ve_test_sim, tmask)

Here, we plot the cumulative RMSE as a function of time $t$ and as a
function of lag $k$.
We define the cumulative RMSE at time $t$, under lag $k$, as
$$
E_k(t) = \left(\frac{1}{t}\sum_{\tau=1}^t\|f_\tau - f_{\tau|\tau+k}\|^2\right)^{1/2}.
$$
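As a sketch (ours) of how $E_k(t)$ can be evaluated on the test run — assuming the true latent signal of that run is available as f_test_sim (a hypothetical name) with shape (T, 2):

lags = np.arange(-20, 21)                 # k < 0: prediction, k > 0: fixed-lag smoothing
E = np.zeros((len(lags), T))
for i, lag in enumerate(lags):
    tmask = np.tril(np.ones((T, T)), k=lag)                       # use innovations up to t + k
    latent_lag = np.einsum("tkd,kd,tk->td", K, ve_test_sim, tmask)
    sq_err = np.sum((f_test_sim - latent_lag) ** 2, axis=-1)      # ||f_tau - f_{tau|tau+k}||^2
    E[i] = np.sqrt(np.cumsum(sq_err) / np.arange(1, T + 1))       # E_k(t) for t = 1, ..., T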
The plot below shows $E_k(t)$ for varying $k$ and $t$.
The black line ($k=0$) corresponds to filtering.
We observe that lags $k< 0$ have higher RMSE than $k=0$ — these correspond to prediction.
On the other hand, lags $k > 0$ have lower RMSE than $k = 0$ — these correspond to fixed-lag smoothing. Here, we repeat the above experiment for multiple samples.
We run multiple simulations of $(\text{LV.1})$, compute $E_k(T)$ for each of the samples,
and plot the final average RMSE across samples.
As expected, prediction ($k < 0$) incurs a higher RMSE than fixed-lag smoothing ($k > 0$). In this post,
(i) we introduced signal plus noise models and their best linear unbiased predictions (BLUP),
(ii) we introduced the concept of an innovation to decorrelate the measurements, and
(iii) we made use of the innovations to arrive at linear-in-time formulas to compute the BLUP.
We showed that an important property of the BLUP is its frame of reference.
Depending on the frame of reference, we arrive at filtering, smoothing, prediction, and fixed-lag smoothing. Here, we provide a proof of Proposition 1. Let
$$
{\cal L}(\va) = \mathbb{E}\left[\|F_t - \va^\intercal\,Y_{1:j}\|^2\right].
$$
Then,
$$
\begin{aligned}
\nabla_{\va^\intercal}\,{\cal L}(\va)
&= -2\,\mathbb{E}\left[(F_t - \va^\intercal\,Y_{1:j})\,Y_{1:j}^\intercal\right]\\
&= -2\,\left( \mathbb{E}\left[F_t\,Y_{1:j}^\intercal\right]-\mathbb{E}[\va^\intercal\,Y_{1:j}\,Y_{1:j}^\intercal]\right)\\
&= -2\,\left( \mathbb{E}\left[F_t\,Y_{1:j}^\intercal\right]-\va^\intercal\,\mathbb{E}[Y_{1:j}\,Y_{1:j}^\intercal]\right)\\
&= -2\,\left( \cov (F_t, Y_{1:j}) - \va^\intercal\,\var(Y_{1:j})\right).
\end{aligned}
$$
Setting the last term to zero and solving for $\va$ yields
$(\va^*)^\intercal = \cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}$,
as required. Next, for the error variance-covariance matrix, we have
$$
\begin{aligned}
&\var(F_t - F_{t|j})\\
&=\mathbb{E}\left[(F_t - F_{t|j})^2\right]\\
&=\mathbb{E}\left[F_t^2 - 2\,F_{t|j}\,F_t + F_{t|j}^2\right]\\
&=\mathbb{E}\left[F_t^2 - 2\,(\cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j})\,F_t + (\cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j})^2\right]\\
&=\mathbb{E}\left[F_t^2\right]
- \mathbb{E}\left[2\,\cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j}\,F_t\right]
+ \mathbb{E}\left[(\cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j})^2\right]\\
&=\mathbb{E}\left[F_t^2\right]
- \mathbb{E}\left[2\,\cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j}\,F_t\right]
+ \mathbb{E}\left[\cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j}\,Y_{1:j}\,\var(Y_{1:j})^{-1}\cov(Y_{1:j}, F_t)\right]\\
&= \var(F_t) - 2\,(\va^*)^\intercal\,\cov(Y_{1:j}, F_t) + (\va^*)^\intercal\,\var(Y_{1:j})\,\va^*\\
&= \var(F_t) - 2\,(\va^*)^\intercal\,\var(Y_{1:j})\,\var(Y_{1:j})^{-1}\,\cov(Y_{1:j}, F_t) +
(\va^*)^\intercal\,\var(Y_{1:j})\,\va^*\\
&= \var(F_t) - 2\,(\va^*)^\intercal\,\var(Y_{1:j})\,\va^* +
(\va^*)^\intercal\,\var(Y_{1:j})\,\va^*\\
&= \var(F_t) - \,(\va^*)^\intercal\,\var(Y_{1:j})\,\va^*.
\end{aligned}
$$ Here, we provide a proof of Proposition 2. To show $(\text{I.2})$, first note that the diagonal terms
$\cov({\cal E}_t, {\cal E}_t) = \var({\cal E}_t) = S_t$
for all $t \in {\cal T}$. Next, we show that the off-diagonal terms are zero.
Observe that
$$
\begin{aligned}
&\cov({\cal E}_1, {\cal E}_2)\\
&= \cov\Big(Y_1,\,Y_2 - \cov(Y_2,\,{\cal E}_1)S_1^{-1}{\cal E}_1\Big)\\
&= \cov(Y_1,\,Y_2) - \cov\Big(Y_1,\,\cov(Y_2,\,{\cal E}_1)S_1^{-1}{\cal E}_1\Big)\\
&= \cov(Y_1,\,Y_2) - \cov\Big(Y_1,\,\cov(Y_2,\,Y_1)S_1^{-1}Y_1\Big)\\
&= \cov(Y_1,\,Y_2) - \cov(Y_1, Y_1)S_1^{-1}\cov(Y_1,\,Y_2)\\
&= \cov(Y_1,\,Y_2) - S_1\,S_1^{-1}\cov(Y_1,\,Y_2)\\
&= 0.
\end{aligned}
$$
By symmetry of the covariance matrix, we obtain
$\cov({\cal E}_2,\,{\cal E}_1) = (\cov({\cal E}_1,\,{\cal E}_2))^\intercal = 0$.
A similar procedure shows that
$\cov({\cal E}_1, {\cal E}_3) = \cov({\cal E}_3, {\cal E}_1) = 0$.
The general case holds by induction: Suppose that
$$
\cov({\cal E}_i, {\cal E}_k) = 0 \ \text{for } i\geq 2,\,k=i+1,\ldots,j-1,
$$
i.e., an upper-triangular (off-diagonal) assumption.
We show
$\cov({\cal E}_i, {\cal E}_j) = 0$. By definition,
$$
\begin{aligned}
&\cov({\cal E}_i,\,{\cal E}_j)\\
&= \cov\left({\cal E}_i,\,Y_j - \sum_{k=1}^{j-1}\cov(Y_j, {\cal E}_k)S_k^{-1}{\cal E}_k\right)\\
&= \cov\left({\cal E}_i,\,Y_j -
\sum_{\substack{k\neq i \\ {1 \leq k\leq j-1}}}\cov(Y_j,\,{\cal E}_k)S_k^{-1}{\cal E}_k
- \cov(Y_j,\,{\cal E}_i)S_i^{-1}{\cal E}_i
\right)\\
&=\cov({\cal E}_i,\, Y_j)
- \sum_{\substack{k\neq i \\ {1 \leq k\leq j-1}}}\cov({\cal E}_i, {\cal E}_k)S_k^{-1}\cov({\cal E}_k\,Y_j)
- \cov({\cal E}_i, {\cal E}_i)S_i^{-1}\,\cov({\cal E}_i, Y_j)\\
&= - \sum_{k=1}^{i-1}\cov({\cal E}_i, {\cal E}_k)S_k^{-1}\cov({\cal E}_k\,Y_j)
- \sum_{k=i+1}^{j-1}\cov({\cal E}_i, {\cal E}_k)S_k^{-1}\cov({\cal E}_k\,Y_j)\\
&= - \sum_{k=1}^{i-1}(\cov({\cal E}_k, {\cal E}_i))^\intercal S_k^{-1}\cov({\cal E}_k\,Y_j)
- \sum_{k=i+1}^{j-1}\cov({\cal E}_i, {\cal E}_k)S_k^{-1}\cov({\cal E}_k\,Y_j)\\
&= 0.
\end{aligned}
$$
where the last equality follows from the assumption that
$\cov({\cal E}_i, {\cal E}_k) = 0$ for
$k = i + 1, \ldots, j - 1$. Next, we show $(\text{I.3})$.
By definition, the innovation at time $t$ is
$$
{\cal E}_t = Y_t - \sum_{k=1}^{t-1}\cov(Y_t, {\cal E}_k)\,\vS_k^{-1}\,{\cal E}_k.
$$
Define the lower triangular matrix $\vL$ as in $(\text{I.4})$.
Then
$$
\begin{aligned}
{\cal E}_t
&= Y_t - \sum_{k=1}^{t-1}\vL_{t,k}\,{\cal E}_k\\
&= Y_t + \vL_{t,t}\,{\cal E}_t - \vL_{t,t}\,{\cal E}_t - \sum_{k=1}^{t-1}\vL_{t,k}\,{\cal E}_k\\
&= Y_t + \vL_{t,t}\,{\cal E}_t - \sum_{k=1}^{t}\vL_{t,k}\,{\cal E}_k\\
&= Y_t + {\cal E}_t - \sum_{k=1}^{t}\vL_{t,k}\,{\cal E}_k.
\end{aligned}
$$
The sum in the last equality is the $t$-th entry of the vector $\vL\,{\cal E}_{1:T}$, i.e.,
$(\vL\,{\cal E}_{1:T})_t$.
So that
${\cal E}_t = Y_t + {\cal E}_t - (\vL\,{\cal E}_{1:T})_t$. Finally, write the innovation vector as
$$
\begin{aligned}
&{\cal E}_{1:T} = Y_{1:T} + {\cal E}_{1:T} - \vL\,{\cal E}_{1:T}\\
\iff & 0 = Y_{1:T} - \vL\,{\cal E}_{1:T}\\
\iff & Y_{1:T} = \vL\,{\cal E}_{1:T}.
\end{aligned}
$$ Following $(\text{I.3})$, we have
$$
\begin{aligned}
\var(Y_{1:T})
&= \var(\vL\,{\cal E}_{1:T})\\
&= \cov(\vL\,{\cal E}_{1:T}, \vL\,{\cal E}_{1:T})\\
&= \vL\,\cov({\cal E}_{1:T}, {\cal E}_{1:T})\vL^\intercal\\
&= \vL\,\var({\cal E}_{1:T})\vL^\intercal\\
&= \vL\,\vS\,\vL^\intercal.
\end{aligned}
$$
Because $\vL$ is a lower-triangular matrix with unit diagonal and $\vS$ is diagonal, the last equality corresponds to the Cholesky
decomposition of $\var(Y_{1:T})$.
$$ \ \tag*{$\blacksquare$} $$ Here, we provide a detailed proof of Proposition 3. Let
$S_k = \var({\cal E}_k)$.
Using $(\text{BLUP.1})$, $(\text{I.2})$, and $(\text{I.3})$, we see that
$$
\begin{aligned}
(\va^*)^\intercal
&= \cov(F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\\
&= \cov(F_t, \vL\,{\cal E}_{1:j})\,\var(\vL\,{\cal E}_{1:j})^{-1}\\
&= \cov(F_t, {\cal E}_{1:j})\,\vL^\intercal\,\{\vL\var({\cal E}_{1:j})\vL^\intercal\}^{-1}\\
&= \cov(F_t, {\cal E}_{1:j})\,\vL^\intercal\,\vL^{-\intercal}\, \var({\cal E}_{1:j})^{-1}\vL^{-1}\\
&= \cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\vL^{-1}\\
&= \cov(F_t, {\cal E}_{1:j})\, {\rm diag}(\var({\cal E}_1), \ldots, \var({\cal E}_j))^{-1}\vL^{-1}\\
&= \cov(F_t, {\cal E}_{1:j})\, {\rm diag}(S_1, \ldots, S_j)^{-1}\vL^{-1}.
\end{aligned}
$$
Then, the BLUP of the signal $F_t$ given $Y_{1:j}$ is
$$
\begin{aligned}
F_{t|j}
&= (\va^*)^\intercal\,Y_{1:j}\\
&= \cov(F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j}\\
&= \cov(F_t, {\cal E}_{1:j})\, {\rm diag}(S_1, \ldots, S_j)^{-1}\vL^{-1}\,\vL\,{\cal E}_{1:j}\\
&= \cov(F_t, {\cal E}_{1:j})\, {\rm diag}(S_1, \ldots, S_j)^{-1}\,{\cal E}_{1:j}\\
&= \cov(F_t, {\cal E}_{1:j})\, {\rm diag}(S_1^{-1}, \ldots, S_j^{-1})\,{\cal E}_{1:j}\\
&= \sum_{k=1}^j \cov(F_t, {\cal E}_k)\,S_k^{-1}\,{\cal E}_k\\
&= \sum_{k=1}^j \vK_{t,k}\,{\cal E}_k.
\end{aligned}
$$ Furthermore, the error variance-covariance matrix of the BLUP takes the form
$$
\begin{aligned}
&\var(F_t - F_{t|j})\\
&= \var(F_t) - (\va^*)^\intercal\,\var\,(Y_{1:j})\,\va^*\\
&= \var(F_t) -
\cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\vL^{-1}
\{\vL\var({\cal E}_{1:j})\vL^\intercal\}
\left(\cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\vL^{-1}\right)^\intercal\\
&= \var(F_t) -
\cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\vL^{-1}
\vL\var({\cal E}_{1:j})\vL^\intercal
\left(\cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\vL^{-1}\right)^\intercal\\
&= \var(F_t) -
\cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\vL^{-1}
\vL\var({\cal E}_{1:j})\vL^\intercal\,
\vL^{-\intercal}\,\var({\cal E}_{1:j})^{-1}\,\cov({\cal E}_{1:j}, F_t)\\
&= \var(F_t) -
\cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\var({\cal E}_{1:j})\,
\var({\cal E}_{1:j})^{-1}\,\cov({\cal E}_{1:j}, F_t)\\
&= \var(F_t) -
\cov(F_t, {\cal E}_{1:j})\, {\rm diag}(S_1, \ldots, S_j)^{-1}
{\rm diag}(S_1, \ldots, S_j)\,
{\rm diag}(S_1, \ldots, S_j)^{-1}\,\cov({\cal E}_{1:j}, F_t)\\
&= \var(F_t) - \sum_{k=1}^j\cov(F_t, {\cal E}_k)\,S_k^{-1}S_k\,S_k^{-1}\,\cov({\cal E}_k, F_t)\\
&= \var(F_t) - \sum_{k=1}^j \left(\cov(F_t, {\cal E}_k)\,S_k^{-1}\right)S_k\,\left(\cov(F_t, {\cal E}_k)\,S_k^{-1}\right)^\intercal\\
&= \var(F_t) - \sum_{k=1}^j {K}_{t,k}\,S_k\,{K}_{t,k}^\intercal\\
&= \var(F_t) - \sum_{k=1}^j {K}_{t,k}^2\,S_k.
\end{aligned}
$$
$$ \ \tag*{$\blacksquare$} $$
Introduction
Signal plus noise models — the univariate case
The best linear unbiased predictor (BLUP)
A linear predictor
The best linear predictor
Proposition 1 — best linear unbiased predictor (BLUP)
The innovation process
Definition: innovation
Proposition 2 — properties of innovations
The BLUP under innovations
Proposition 3 — additive BLUP
Filtering, prediction, smoothing, and fixed-lag smoothing
Filtering
Prediction
Smoothing
Fixed-lag smoothing
Signal plus noise models – the multivariate case*
The multivariate BLUP
The multivariate innovations
The multivariate BLUP under innovations
Stacking
A machine learning approach to estimate the BLUP — from random variables to samples
Example: a data-driven BLUP
Discretised noisy Lotka-Volterra model
The setup
Train phase
Computation of innovations
Computation of the gain values
Computation of the error variance-covariance matrix
Test phase
Filter
Smoothing
Prediction
Varying lag
Multiple simulations
Conclusion
A summary — BLUP for a unidimensional signal plus noise model
Appendix
Proof of proposition 1
Proof of proposition 2
Proof of (I.2)
Proof of (I.3)
Proof of (I.5)
Proof of proposition 3
References