Filtering notes (I): signal plus noise models
From tracking epidemics and predicting financial markets to controlling complex systems like rockets and personalising online experiences,
many real-world problems rely on understanding and processing sequential data.
In machine learning, this includes classes of problems such as
online (continual) learning,
one-step-ahead (prequential) forecasting,
contextual bandits,
and reinforcement learning. Filtering techniques, such as the Kalman Filter, have a long and successful history of processing sequential data
in fields like aerospace engineering and signal processing.
In this series,
we introduce filtering as a general framework for learning from sequential data and
solving sequential decision-making problems in machine learning.
Because we will develop the framework from the ground up, the only prerequisites are basic concepts in linear algebra, calculus, and statistics. In this first part, we explore the foundational mathematical tools needed to tackle sequential problems from a filtering perspective.
We start by examining signal-plus-noise models and finding their best linear unbiased predictors (BLUPs) — a statistical estimate of a signal given noisy measurements.
Then, we introduce the concept of innovations, a key idea that simplifies BLUP computation.
Building on this, we show how fundamental operations — such as filtering, prediction, smoothing, and fixed-lag smoothing —
arise naturally when applying BLUPs across different timeframes.
Finally, we show how to carry out the computations discussed in this post when the signal-plus-noise process
is sampled from a noisy Lotka-Volterra model. The most fundamental assumption in all the techniques and models that we will study
is the signal plus noise (SPN) assumption.
The SPN assumption models a random process of univariate real-valued observations (or measurements)
$Y_{1:T} = (Y_1, \ldots, Y_T)^\intercal$
as the sum of two components:
a signal process
$F_{1:T} \in \mathbb{R}^T$
and a noise process
${E}_{1:T} \in \mathbb{R}^T$.
For example, if $Y_t$ is the return of a stock at time $t$,
$F_t$ is the signal, alpha, or trend of the stock, and
$E_t$ is market noise.
We write a model with SPN assumption as
$$
\tag{S.1}
\underbrace{Y_{1:T}}_\text{measurement} =
\underbrace{F_{1:T}}_{\text{signal}} +
\underbrace{{E}_{1:T}}_{\text{noise}}.
$$
For an SPN process, we assume
$$
\begin{aligned}
\cov(F_t, E_j) = 0 & \quad \forall \quad t,j,\\
\cov(E_t, E_j) = 0 &\quad \forall\quad t \neq j,
\end{aligned}
$$
meaning that the noise is uncorrelated with the signal and that distinct noise terms are uncorrelated with each other.
Here, ${\cal T} = \{1, \ldots, T\}$.
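As a toy illustration (our own; the random-walk signal and the noise scale are arbitrary choices), a sample of an SPN process can be generated as follows:

import numpy as np

rng = np.random.default_rng(0)
T = 200
f = np.cumsum(rng.normal(0.0, 0.1, size=T))  # signal process F_{1:T} (zero-mean increments)
e = rng.normal(0.0, 0.5, size=T)             # noise process E_{1:T}, drawn independently of the signal
y = f + e                                    # measurement process Y_{1:T}, cf. (S.1)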
Next, for simplicity, we assume
$\mathbb{E}[F_t] = 0$ and
$\mathbb{E}[{E}_t] = 0$. Given an SPN process, our goal is to determine the value of the signal $F_t$
given a sequence of observations $Y_{1:j}$. Let $F_t \in \mathbb{R}$ be the target signal and
let $Y_{1:j} = (Y_1, \ldots, Y_j)^\intercal \in \mathbb{R}^{j}$ be a subset of the observations;
here $(j,t)\in{\cal T}^2$.
Consider a linear predictor parametrised by a vector $\va \in \mathbb{R}^j$
that estimates the signal $F_t$ given the observations $Y_{1:j}$:
$$
\va^\intercal\,Y_{1:j}.
$$ For mathematical convenience,
our choice of $\va\in{\mathbb R}^{j}$ is determined
by minimising the expected squared error between the target signal $F_t$ and the linear prediction
$\va^\intercal\,Y_{1:j}$ with respect to $\va$, that is,
$$
\va^* = \argmin_\va\mathbb{E}\left[\left(F_t - \va^\intercal\,Y_{1:j}\right)^2\right].
$$ Having found $\va^*$, the transformation
$F_{t|j} := (\va^*)^\intercal\,Y_{1:j}$
is called
the best linear unbiased predictor (BLUP) of the signal $F_t$ given measurements $Y_{1:j}$ —
we formalise the idea of the BLUP in the proposition below. Let $(j,t)\in{\cal T}^2$.
The vector $\va^*$ that minimises the expected squared error between the signal random variable $F_{t}$
and the linear predictor $\va^\intercal\,Y_{1:j}$ takes the form
$$
\va^*
= \argmin_\va\mathbb{E}\left[\left(F_t - \va^\intercal\,Y_{1:j}\right)^2\right]
= \var(Y_{1:j})^{-1}\,\cov(Y_{1:j}, F_t).
$$
Then, the best linear unbiased predictor (BLUP) for the signal $F_t$ given $Y_{1:j}$ is
$$
\begin{aligned}
\tag{BLUP.1}
F_{t|j}
&= (\va^*)^\intercal\,Y_{1:j} \\
&= \cov(F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j}.
\end{aligned}
$$ Furthermore, the error variance-covariance matrix of the BLUP is
$$
\tag{EVC.1}
\var(F_t - F_{t|j}) =
\var(F_t) - \va^{*\intercal}\,\var(Y_{1:j})\,\va^*.
$$ See the Appendix for a proof. If $\cov(F_t, Y_{1:j})$ and $\var(Y_{1:j})$ are known,
and we are given a sample of the measurement process $Y_{1:j}$,
then ${\rm (BLUP.1)}$ is the optimal way to extract the signal
from the signal-plus-noise process.
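As a sketch of how $(\text{BLUP.1})$ and $(\text{EVC.1})$ translate into code — assuming the quantities $\cov(F_t, Y_{1:j})$, $\var(Y_{1:j})$, and $\var(F_t)$ are available as numpy arrays (the function and argument names below are ours):

import numpy as np

def blup_direct(cov_f_y, var_y, var_f, y):
    """Estimate F_{t|j} via (BLUP.1) and its error variance via (EVC.1).

    cov_f_y: (j,) array, cov(F_t, Y_{1:j})
    var_y:   (j, j) array, var(Y_{1:j})
    var_f:   scalar, var(F_t)
    y:       (j,) observed sample of Y_{1:j}
    """
    a_star = np.linalg.solve(var_y, cov_f_y)    # a* = var(Y_{1:j})^{-1} cov(Y_{1:j}, F_t)
    f_blup = a_star @ y                         # F_{t|j} = (a*)^T y_{1:j}
    err_var = var_f - a_star @ var_y @ a_star   # var(F_t - F_{t|j}), cf. (EVC.1)
    return f_blup, err_var

Note that each call factorises a $j \times j$ matrix — the cost we discuss next.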
Whenever $j=t$ (and we let $t$ vary), we call this procedure filtering.
We come back to this point below. In practice, however, estimating ${\rm (BLUP.1)}$ becomes computationally prohibitive because of the term
$\var(Y_{1:j})$, which we have to invert at each timeframe $j$ of interest.
This imposes a computational cost of $O(j^3)$ for every $j$ of interest,
rendering the use of ${\rm (BLUP.1)}$ impractical for large $j$. To overcome this computational bottleneck,
we develop a process derived from the measurements that decorrelates the observations.
This process, referred to as the innovation process,
provides an alternative formulation of $(\text{BLUP.1})$ with a computational complexity that scales linearly with time. The innovation random variable
${\cal E}_t$,
derived from the measurement random variable $Y_{t}$ and the innovation process
${\cal E}_{1:t-1}$,
is defined by
$$
\tag{I.1}
{\cal E}_t =
\begin{cases}
Y_1 & \text{for } t = 1,\\
Y_t - \sum_{k=1}^{t-1} \frac{\cov(Y_t, {\cal E}_k)}{\var({\cal E}_k)}\,{\cal E}_k & \text{for } t \geq 2.
\end{cases}
$$
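A naive numpy sketch of this recursion (our own illustration; it assumes the covariance matrix $\var(Y_{1:T})$ is known and operates on a single observed sample $y_{1:T}$) is:

import numpy as np

def innovations(y, var_y):
    """Sequentially decorrelate a sample y of Y_{1:T} following (I.1).

    y:     (T,) observed sample of Y_{1:T}
    var_y: (T, T) array, var(Y_{1:T})
    Returns the innovations e, their variances S, and the weights
    w[t, k] = cov(Y_t, E_k) / var(E_k) used in the recursion.
    """
    T = len(y)
    e = np.zeros(T)              # innovations E_1, ..., E_T
    S = np.zeros(T)              # innovation variances var(E_k)
    w = np.zeros((T, T))         # cov(Y_t, E_k) / var(E_k), lower triangular
    cov_y_e = np.zeros((T, T))   # cov(Y_t, E_k)
    for t in range(T):
        for k in range(t):
            # cov(Y_t, E_k) = cov(Y_t, Y_k) - sum_{m<k} w[k, m] cov(Y_t, E_m)
            cov_y_e[t, k] = var_y[t, k] - w[k, :k] @ cov_y_e[t, :k]
            w[t, k] = cov_y_e[t, k] / S[k]
        e[t] = y[t] - w[t, :t] @ e[:t]                  # (I.1)
        S[t] = var_y[t, t] - w[t, :t] @ cov_y_e[t, :t]  # var(E_t)
    return e, S, w

Building the weights costs $O(T^3)$ overall, but it is paid once upfront rather than at every timeframe of interest.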
As we see, the innovation process
${\cal E}_{1:t}$
is built sequentially. Innovations have various properties that make working
with $(\text{BLUP.1})$ much more tractable;
we provide some of these properties in the proposition below. Let $Y_{1:T}$ be an SPN random process and
let ${\cal E}_{1:T}$ be the innovation process derived from $Y_{1:T}$.
The following properties hold:

1. Innovations are uncorrelated:
$$
\tag{I.2}
\cov({\cal E}_t, {\cal E}_j)= \var({\cal E}_t)\,\mathbb{1}(t = j).
$$
2. Measurements are linear functions of innovations: the measurement process and the innovation process satisfy the relationship
$$
\tag{I.3}
Y_{1:T} = \vL\,{\cal E}_{1:T},
$$
with $\vL$ a $(T\times T)$ lower-triangular matrix with elements
$$
\tag{I.4}
\vL_{t,j} =
\begin{cases}
\frac{\cov(Y_t, {\cal E}_j)}{\var({\cal E}_j)} & \text{if } j < t, \\
1 & \text{if } j = t, \\
0 & \text{if } j > t.
\end{cases}
$$
3. Cholesky decomposition of measurements: the variance of the measurement process satisfies
$$
\tag{I.5}
\var(Y_{1:T}) = \vL\,\vS\,\vL^\intercal,
$$
with
$\vS = {\rm diag}(\var({\cal E}_1), \ldots, \var({\cal E}_{T}))$.
This corresponds to a Cholesky decomposition of the variance matrix of the measurements.

For a proof, see the Appendix. Take $(t,j) \in {\cal T}^2$ and
let
${\cal E}_{1:j}$
be the innovation process
derived from the measurement process $Y_{1:j}$.
Then, the BLUP of the signal $F_t$ given the measurement process
$Y_{1:j}$
is written
as a sum of the first $j$ innovations weighted by gain values:
$$
\tag{BLUP.2}
\boxed{
\begin{aligned}
F_{t|j} &= \sum_{k=1}^j \hat{K}_{t,k}\,{\cal E}_k,\\
\hat{K}_{t,k} &= \frac{\cov(F_t, {\cal E}_k)}{\var({\cal E}_k)}.
\end{aligned}
}
$$
Here, $\hat{K}_{t,k}$
is the gain value of the signal $F_t$ given the innovation ${\cal E}_k$ —
a signal-to-noise ratio that measures the covariance between the signal at time $t$ and the innovation at time $k$,
relative to the variance of the innovation at time $k$. Furthermore, the error variance-covariance matrix between the signal and the BLUP takes the form
$$
\begin{aligned}
\tag{EVC.2}
\var(F_t - F_{t|j})
% = \var(F_t) - \sum_{k=1}^j \hat{K}_{t,k}\,\vS_k\,\hat{K}_{t,k}^\intercal.
= \var(F_t) - \sum_{k=1}^j \hat{K}_{t,k}^2\,\var({\cal E}_k).
\end{aligned}
$$ For a proof, see the Appendix. Equation $(\text{BLUP.2})$ highlights a crucial property of working with innovations:
the number of computations required to estimate $F_{t|j}$ becomes linear in time —
computing $(\text{BLUP.2})$ only requires $O(j)$ operations. It is important to note that the cubic complexity has not vanished;
rather, it has been shifted from the computation
of the BLUP to the computation of the innovations, which is achieved through the Cholesky decomposition.
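To fix ideas, here is a minimal sketch (ours) of $(\text{BLUP.2})$ as a function of the target time $t$ and the frame of reference $j$, assuming the gains K[t, k] and the innovations e[k] have already been computed (for instance with the recursion sketched above):

import numpy as np

def blup(K, e, t, j):
    """F_{t|j} as in (BLUP.2): the first j innovations weighted by the gains.

    K: (T, T) array of gains, K[t, k] = cov(F_t, E_k) / var(E_k)
    e: (T,) innovations
    t, j: 0-indexed target time and frame of reference
    """
    return K[t, : j + 1] @ e[: j + 1]

Filtering, prediction, smoothing, and fixed-lag smoothing, discussed next, correspond to particular choices of $t$ and $j$ in this helper.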
As we will later see, introducing additional assumptions about the signal process further reduces
the computational burden to estimate the BLUP. Having a target signal $F_t$,
the BLUP estimate $F_{t|j}$ of the form
$({\rm BLUP.2})$
computes a sum of
innovations weighted by their corresponding signal-to-noise ratios (gain values).
In this formulation,
the only degree of freedom is the value of $j$ — the “frame of reference”.
This value determines how much information gets incorporated to estimate the signal at time $t$.
As a result, different choices of $j$ lead to different BLUP estimates,
reflecting the varying amounts of information used to estimate the signal. Broadly speaking, the choices for the frame of reference $j$ can be categorised as filtering ($j = t$), prediction ($j < t$), smoothing ($j = T$), and fixed-lag smoothing ($j = t + i$ for a fixed $i \geq 1$). We describe these quantities in more detail below. The term filtering refers to the action of filtering out the noise $E_t$ to estimate the signal $F_t$.
Filtering is defined as
$$
\tag{F.1}
\boxed{
F_{t | t} = \sum_{k=1}^t \hat{K}_{t,k}\,{\cal E}_k.
}
$$
This quantity is one of the most important in signal-processing theory. We exemplify filtering in the table below.
The top row shows the point in time in which the BLUP is estimated;
the middle row shows in $\color{#00B7EB}\text{cyan}$ the unknown target signal; and
the bottom row shows in $\color{orange}\text{orange}$ the measurements that we use to estimate the BLUP. $$
\begin{array}{c|ccccc}
\text{BLUP} & & & & F_{t|t} & & & \\
\text{signal} & F_{t-3} & F_{t-2} & F_{t-1} & \color{#00B7EB}{F_{t}} & F_{t+1} & F_{t+2} & F_{t+3}\\
\text{measurement} & \color{orange}{Y_{t-3}} & \color{orange}{Y_{t-2}} & \color{orange}{Y_{t-1}} & \color{orange}{Y_{t}} & Y_{t+1} & Y_{t+2} & Y_{t+3}\\
\hline
\text{time} & t-3 & t-2 & t-1 & t & t+1 & t+2 & t+3
\end{array}
$$
Filtering considers all measurements up to time $t$ to make an estimate of the signal at time $t$.
Because we only use measurements up to time $t$ to estimate the signal at time $t$, filtering is said to be online. This quantity estimates the future signal $F_{t+i}$,
given the measurements $Y_{1:t}$.
Here, $i \geq 1$.
We define an $i$-th step ahead prediction as
$$
\tag{F.2}
\boxed{
% f_{t + i |t} = \sum_{k=1}^{t} \cov(f_{t+1}, \varepsilon_k)\,S_k^{-1}\,\varepsilon_k.
F_{t + i | t} = \sum_{k=1}^t \hat{K}_{t+i,k}\,{\cal E}_k.
}
$$ We exemplify a two-step-ahead prediction in the table below
$$
\begin{array}{c|ccccc}
\text{BLUP} & & & & F_{t+2|t} & & & \\
\text{signal} & F_{t-3} & F_{t-2} & F_{t-1} & F_{t} & F_{t+1} & \color{#00B7EB}{F_{t+2}} & F_{t+3}\\
\text{measurement} & \color{orange}{Y_{t-3}} & \color{orange}{Y_{t-2}} & \color{orange}{Y_{t-1}} & \color{orange}{Y_{t}} & Y_{t+1} & Y_{t+2} & Y_{t+3}\\
\hline
\text{time} & t-3 & t-2 & t-1 & t & t+1 & t+2 & t+3
\end{array}
$$
As shown in the table above, a two-step-ahead prediction considers all measurements up to time $t$ to
make an estimate of the signal at time $t+2$. This quantity estimates the signal $F_t$ having seen all measurements $Y_{1:T}$.
We define smoothing as
$$
\tag{F.3}
\boxed{
% f_{t | T} = \sum_{k=1}^T \cov(f_t, \varepsilon_k)\,S_k^{-1}\,\varepsilon_k.
F_{t | T} = \sum_{k=1}^T \hat{K}_{t,k}\,{\cal E}_k.
}
$$ We exemplify smoothing at time $t$ in the table below
$$
\begin{array}{c|ccccc}
\text{BLUP} & & & & & & & F_{t|T} \\
\text{signal} & F_{t-3} & F_{t-2} & F_{t-1} & \color{#00B7EB}{F_{t}} & F_{t+1} & \ldots & F_T\\
\text{measurement} & \color{orange}{Y_{t-3}} & \color{orange}{Y_{t-2}} & \color{orange}{Y_{t-1}}
& \color{orange}{Y_{t}} & \color{orange}{Y_{t+1}} & \ldots & \color{orange}{Y_{T}}\\
\hline
\text{time} & t-3 & t-2 & t-1 & t & t+1 & \ldots & T
\end{array}
$$
Contrary to the filtering equation $(\text{F.1})$, which is online, the smoothing operation
requires all information up to time $T$ to make an estimate of the signal at time $t$.
In this sense, smoothing is offline. This quantity is a middle ground between filtering, which is online, and smoothing, which is offline.
The idea behind $i$-step fixed-lag smoothing is to estimate the signal $F_t$ given $Y_{1:t+i}$.
That is, we must wait $i$ steps before making an estimate of the signal $F_t$.
$$
\tag{F.4}
\boxed{
% f_{t | t + i} = \sum_{k=1}^{t+i} \cov(f_t, \varepsilon_k)\,S_k^{-1}\,\varepsilon_k.
F_{t | t + i} = \sum_{k=1}^{t+i} \hat{K}_{t,k}\,{\cal E}_k.
}
$$ We exemplify a two-step fixed-lag smoothing below
$$
\begin{array}{c|ccccc}
\text{BLUP} & & & & & & F_{t|t+2} & \\
\text{signal} & F_{t-3} & F_{t-2} & F_{t-1} & \color{#00B7EB}{F_{t}} & F_{t+1} & F_{t+2} & F_{t+3}\\
\text{measurement} & \color{orange}{Y_{t-3}} & \color{orange}{Y_{t-2}} & \color{orange}{Y_{t-1}} & \color{orange}{Y_{t}}
& \color{orange}{Y_{t+1}} & \color{orange}{Y_{t+2}} & Y_{t+3}\\
\hline
\text{time} & t-3 & t-2 & t-1 & t & t+1 & t+2 & t+3
\end{array}
$$
We observe that we require information up to time $t+2$ to estimate the signal at time $t$. We generalise the ideas of the previous section by allowing the observations to
be $D$-dimensional, i.e.,
$Y_t \in \mathbb{R}^D$ with $D > 1$.
The multivariate SPN model is composed of
a random observation matrix $Y_{1:T}$,
a random signal matrix $F_{1:T}$, and
a noise matrix $E_{1:T}$, all living in $\mathbb{R}^{T\times D}$. In the multivariate setting, computing the BLUP amounts to simply
adding an extra index over the dimensions (components). We begin by writing the BLUP in the multivariate case.
In this scenario, we seek a matrix
$\vA^* \in \mathbb{R}^{D\times j}$ such that
$$
\begin{aligned}
\vA^*
&= \argmin_\vA
\mathbb{E}\left[\|F_{t} - \text{diag}(\vA\,Y_{1:j})\|^2\right]
\\
&= \argmin_{\va_1, \ldots, \va_D}
\sum_{d=1}^D
\mathbb{E}\left[(F_{t,d} - (\va_d)^\intercal\,Y_{1:j,d})^2\right],
\end{aligned}
$$
where
$\va_d \in \mathbb{R}^{j}$ is the $d$-th row of $\vA$, written as a column vector. Optimising for the $d$-th component, we obtain
$$
\va^*_d
= \argmin_{\bf a_d}\mathbb{E}\left[\|F_{t,d} - (\va_{d})^\intercal\,Y_{1:j,d}\|^2\right]
= \var(Y_{1:j,d})^{-1}\,\cov(Y_{1:j,d}, F_{t,d}).
$$ As a consequence, the BLUP for the $d$-th component takes the form
$$
F_{t|j,d} = (\va_d^*)^\intercal\,Y_{1:j,d}.
$$ Because the BLUP $F_{t|j,d}$ only depends on observations in the $d$-th component,
we generalise the idea of an innovation to multivariate random processes by simply computing the
Cholesky decomposition of the variance of the observations at the $d$-th component.
For the $d$-th component of the observation process, we have
$$
\tag{I.6}
\var(Y_{1:j,d}) = \vL_d\,\vS_{d}\,\vL_d^\intercal.
$$ So that
$$
\tag{I.7}
Y_{1:j,d} = \vL_{d}\,{\cal E}_{1:j,d} \iff {\cal E}_{1:j,d} = \vL_d^{-1}\,Y_{1:j,d}.
$$ Next, we generalise the gain values to the multidimensional case.
Let
$$
\tag{G.2}
\hat{K}_{t,k,d} = \frac{\cov(F_{t,d}, {\cal E}_{k,d})}{\var({\cal E}_{k,d})}
$$
be the gain of the $d$-th component for the signal at time $t$ and innovation at time $k$. As before, the BLUP of the signal $F_{t,d}$ given the measurement process
$Y_{1:j,d}$
is written
as a sum of innovations weighted by gain values:
$$
\tag{BLUP.3}
F_{t|j,d} = \sum_{k=1}^j \hat{K}_{t,k,d}\,{\cal E}_{k,d}.
$$ For the error variance-covariance matrix at time $t$ using the BLUP at time $j$, we have
$$
\tag{EVC.3}
\begin{aligned}
\var(F_{t,d} - F_{t|j,d})
% = \var(F_{t,d}) - \sum_{k=1}^j \hat{K}_{t,k,d}\,\vS_{k,d}\,\hat{K}_{t,k,d}^\intercal
= \var(F_{t,d}) - \sum_{k=1}^j \hat{K}_{t,k,d}^2\,\vS_{k,d}.
\end{aligned}
$$ Finally, the multivariate BLUP is found by stacking each unidimensional BLUP row-wise:
$$
F_{t|j} =
\begin{bmatrix}
{F}_{t|j,1}\\
\vdots \\
{F}_{t|j,D}\\
\end{bmatrix} =
\begin{bmatrix}
(\va_{1}^*)^\intercal\,Y_{1:j, 1} \\
\vdots\\
(\va_{D}^*)^\intercal\,Y_{1:j, D} \\
\end{bmatrix}.
$$ In previous sections, we showed that the observations $Y_{1:T}$ in a signal-plus-noise process
can be transformed into a decorrelated innovation process
${\cal E}_{1:T}$.
The innovation process allows us to compute the best linear unbiased predictor (BLUP) for the signal at time $t$
based on information available up to time $j$ — denoted $F_{t|j}$ — in a way that is linear with respect to $j$. Now, let’s assume we have a training phase where we observe samples for both the signal and the noise processes.
With these samples, we estimate the quantities
$\vL$ and
$\hat{K}_{t,k,d}$ for
$(t,k)\in{\cal T}^2$. Furthermore, assume we have a test phase, where we are only given a sample of the observation process, which we denote $y_{1:T}$.
Because in the test phase we do not have direct access to either the signal or the noise process,
we must rely on estimates derived from the BLUP formula $(\text{BLUP.2})$ using the measurement $y_{1:j}$.
We provide a detailed example below. In this example, we consider a data-driven approach to estimate the BLUP over multiple frames of reference.
We divide this experiment into a train phase where the quantities
$\cov(F_t, {\cal E}_k)$,
$\cov(Y_t, {\cal E}_k)$,
$\vS_k = \var({\cal E}_k)$, and
$\vL$
are estimated;
and a test phase where the BLUP estimates $(\text{F.1 – F.4})$
are obtained from a sampled measurement $y_{1:T}$ not seen in the train phase. For this experiment, we make use of the numpy library, which includes the einsum operator, and the einops library. For a review of the np.einsum function (which we use extensively throughout this experiment) see the post einsums in the wild. We define the discretised noisy Lotka-Volterra model as the following SPN process
$$
\tag{LV.1}
\begin{aligned}
\frac{f_{t + \Delta, 1} - f_{t, 1}}{\Delta} &= \alpha\,f_{t,1} - \beta\,f_{t,1}\,f_{t,2} + \phi_{t,1},\\
\frac{f_{t + \Delta, 2} - f_{t, 2}}{\Delta} &= \delta\,f_{t,1}\,f_{t,2} - \gamma\,f_{t,2} + \phi_{t, 2},\\
y_{t + \Delta, 1} &= f_{t + \Delta, 1} + \varphi_{t,1},\\
y_{t + \Delta, 2} &= f_{t + \Delta, 2} + \varphi_{t,2},
\end{aligned}
$$ with
$\phi_{t,i} \sim {\cal N}(0, \sigma_f^2 / \Delta)$,
$\varphi_{t,j} \sim {\cal N}(0, \sigma_y^2)$,
$(i,j) \in \{1,2\}^2$,
$\sigma_f^2, \sigma_y^2 > 0$,
$\Delta \in (0, 1)$, and
$\alpha, \beta, \gamma, \delta$ positive constants. Consider samples of the process $(\text{LV.1})$ with the following parameters:
$\alpha = 2/3$,
$\beta = 4/3$,
$\gamma = 0.8$,
$\delta = 1.0$,
$\Delta = 0.01$,
$\sigma_f^2 = 0.02^2$, and
$\sigma_y^2 = 0.1^2$.
We integrate the system for $T=1500$ steps, each starting at $(f_{0,1}, f_{0,2}) = (1.0, 1.0) + (u_1, u_2)$,
with $u_i \sim {\cal U}[-0.2, 0.2]$. The following plot shows multiple samples of this process.
The black line shows the signal $(f_{t,1}, f_{t,2})$ and the coloured line shows the measurements $(y_{t,1}, y_{t,2})$
for $t=1, \ldots, T.$
Our task is to estimate the black lines given the coloured lines. In the train phase, we consider $N=2000$ samples of $(\text{LV.1})$ following the configuration outlined above.
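As a sketch (ours, not the post's original simulation code), such samples can be generated with a forward-Euler discretisation of $(\text{LV.1})$; the names T, D, n_sims, f_sims, and y_sims are our assumptions, chosen to match the snippets that follow.

import numpy as np
import einops  # used by the snippets below

rng = np.random.default_rng(314)
T, D, n_sims = 1500, 2, 2000            # steps, dimensions, number of samples
alpha, beta, gamma, delta = 2 / 3, 4 / 3, 0.8, 1.0
dt, sigma_f, sigma_y = 0.01, 0.02, 0.1

f_sims = np.zeros((T, D, n_sims))       # signal samples f_{t,d}^{(s)}
y_sims = np.zeros((T, D, n_sims))       # measurement samples y_{t,d}^{(s)}
f = 1.0 + rng.uniform(-0.2, 0.2, size=(D, n_sims))  # random initial conditions
for t in range(T):
    phi = rng.normal(0.0, sigma_f / np.sqrt(dt), size=(D, n_sims))
    drift = np.stack([
        alpha * f[0] - beta * f[0] * f[1],   # prey equation
        delta * f[0] * f[1] - gamma * f[1],  # predator equation
    ])
    f = f + dt * (drift + phi)
    f_sims[t] = f
    y_sims[t] = f + rng.normal(0.0, sigma_y, size=(D, n_sims))

# de-mean across samples so that E[F_t] = 0 holds empirically
f_sims -= f_sims.mean(axis=-1, keepdims=True)
y_sims -= y_sims.mean(axis=-1, keepdims=True)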
To enforce the constraint $\mathbb{E}[F_t] = 0$, we de-mean the samples. Programmatically, assume we have a numpy array of measurements y_sims and a numpy array of signals f_sims with f_sims.shape == y_sims.shape == (1500, 2, 2000), corresponding to the 1500 steps of the process, two dimensions, and 2000 samples. Mathematically, we denote by $y_{t,d}^{(s)}$ the $d$-th dimension of the $s$-th training observed sample at time $t$. We begin by estimating the matrices $\vL_d$. Recall from $(\text{I.6})$ that
$\var(Y_{1:T,d}) = \vL_d\,\vS_d\,\vL_d^\intercal$.
Then, to estimate $\vL_d$ and $\vS_d$, we first approximate the variance of the measurement process
as follows:
$$
\begin{aligned}
\var(Y_{1:T,d})
&= \mathbb{E}[(Y_{1:T,d} - \bar{Y}_{1:T,d})(Y_{1:T,d} - \bar{Y}_{1:T,d})^\intercal]\\
&\approx \frac{1}{N}\sum_{s=1}^N \left(y_{1:T,d}^{(s)} - \bar{y}_{1:T,d}\right)\,\left(y_{1:T,d}^{(s)} - \bar{y}_{1:T,d}\right)^\intercal\\
&= \frac{1}{N}\sum_{s=1}^N \left(y_{1:T,d}^{(s)}\right)\,\left(y_{1:T,d}^{(s)}\right)^\intercal\\
\end{aligned}
$$
where we make use of the fact that the sample mean is zero by construction, i.e.,
$\bar{y}_{1:T,d} = \frac{1}{N}\sum_s y_{1:T,d}^{(s)} = 0$.
In code, the variance of the measurements is estimated by

V = np.einsum("tds,kds->dtk", y_sims, y_sims) / (n_sims - 1)

Next, the terms $\vL_d$ and $\vS_d$ for all $d=1,\ldots,D$ are estimated through a Cholesky decomposition of the sample variance.
This is done as follows

L = np.linalg.cholesky(V)               # Cholesky decomposition of the variance
S = np.einsum("dtt->dt", L)             # S-terms are diagonal of the output
L = np.einsum("dtk,dk->dtk", L, 1 / S)  # Make diagonal be 1-valued
S = S ** 2                              # Make variance

Finally, we estimate the innovations derived from the samples y_sims and the matrix L following $({\rm I.7})$.

ve_sims = einops.rearrange(y_sims, "t d s -> d t s")   # Place in dims to solve system
ve_sims = np.linalg.solve(L, ve_sims)                  # Solve system, obtain innovations
ve_sims = einops.rearrange(ve_sims, "d t s -> t d s")  # Match dimension ordering of measurements

Having estimated the innovations, the next step is to compute the gain values.
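Before computing the gains, a quick sanity check — this snippet is ours, not part of the original pipeline — verifies empirically that the innovations are (approximately) uncorrelated across time, as stated in $(\text{I.2})$:

C = np.einsum("tds,kds->dtk", ve_sims, ve_sims) / (n_sims - 1)  # empirical covariance of innovations
off_diag = ~np.eye(T, dtype=bool)
print(np.abs(C[:, off_diag]).max(), np.abs(C).max())  # off-diagonal entries should be much smaller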
Following
$({\rm G.2})$,
we have
$$
\begin{aligned}
\hat{K}_{t,k,d}
&= \cov(F_{t,d}, {\cal E}_{k,d})\,\var({\cal E}_{k,d})^{-1}\\
&\approx\left(\frac{1}{N}\sum_{s=1}^N\left(f_{t,d}^{(s)} - \bar{f}_{t,d}\right)\,\left(\varepsilon_{k,d}^{(s)} - \bar{\varepsilon}_{k,d}\right)\right)\,\vS_{k,d}^{-1}\\
&\approx\frac{1}{N}\sum_{s=1}^N\frac{f_{t,d}^{(s)} \, \varepsilon_{k,d}^{(s)}}{\vS_{k,d}},
\end{aligned}
$$
where
$\bar{f}_{t,d} = \bar{\varepsilon}_{k,d} = 0$
for all $(t,k) \in {\cal T}^2$.
In code, this is done as follows

K = np.einsum("tds,kds,dk->tkd", f_sims, ve_sims, 1 / S) / (n_sims - 1)

The next Figure shows the gain matrices $\vK_{t,k,d}$ for $d=1,2$, corresponding to the $x$ and $y$ coordinates
of the process.
We observe that, in each panel, the entries of $K_{t,k,d}$ below the diagonal are non-zero,
whereas the entries above the diagonal are close to zero.
This suggests that, for any two points in time $t$ and $k$, with $t$ being the target index and $k$ the
frame of reference, the information that $k$ brings to inform the timestep $t$
is non-zero if $t \geq k$.
Conversely, because our SPN process runs forward in time, for $t < k$,
how much $k$ is used to inform $t$ is close to zero. For the error variance-covariance matrix at time $t$ using the BLUP at time $j$, we have
$$
\begin{aligned}
\var(F_{t,d} - F_{t|j,d})
= \var(F_{t,d}) - \sum_{k=1}^j \hat{K}_{t,k,d}\,\vS_{k,d}\,\hat{K}_{t,k,d}^\intercal
\end{aligned}
$$
where
$$
\begin{aligned}
\var(F_{1:T,d})
&= \mathbb{E}[(F_{1:T,d} - \bar{F}_{1:T,d})(F_{1:T,d} - \bar{F}_{1:T,d})^\intercal]\\
&\approx \frac{1}{N}\sum_{s=1}^N \left(f_{1:T,d}^{(s)} - \bar{f}_{1:T,d}\right)\,\left(f_{1:T,d}^{(s)} - \bar{f}_{1:T,d}\right)^\intercal\\
&= \frac{1}{N}\sum_{s=1}^N \left(f_{1:T,d}^{(s)}\right)\,\left(f_{1:T,d}^{(s)}\right)^\intercal.
\end{aligned}
$$ In code, this is estimated as

var_signal = np.einsum("tds,kds->dtk", f_sims, f_sims) / n_sims

Having var_signal, we compute the correlation matrix of the signal process using

std_var = np.sqrt(var_signal)
cov_signal = np.einsum("dij,dii,djj->dij", var_signal, 1 / std_var, 1 / std_var)

The next Figure shows the correlation matrix of the signal process.
Having computed K, L, and var_signal, we now evaluate the BLUP estimates $(\text{F.1 – F.4})$
on an unseen run ve_test_sim with ve_test_sim.shape == (1500, 2).
Furthermore, we also evaluate error-bounds around our estimate of $f$.
We consider the run below. Here, we compute the BLUP corresponding to the filter estimate, i.e., $k=t$.
Because we assume that we have access to $y_{1:T}$, we compute the filter estimate by
use of masking, i.e.,
$$
\begin{aligned}
f_{t|t,d}
&= \sum_{k=1}^t\hat{K}_{t,k,d}\,\varepsilon_{k,d}\\
&= \sum_{k=1}^T\hat{K}_{t,k,d}\,\varepsilon_{k,d}\,\vone(k \leq t).
\end{aligned}
$$ This takes the form

tmask = np.tril(np.ones((T, T)), k=0)
latent_filter = np.einsum("tkd,kd,tk->td", K, ve_test_sim, tmask)

Next, the standard error of the BLUP takes the form

std_filter = (
    np.einsum("dtt->td", var_signal) -
    np.einsum("tkd,dk,tkd,tk->td", K, S, K, tmask)  # sum_k K_{t,k,d}^2 S_{k,d}
)
std_filter = np.sqrt(std_filter)

The image below shows the filtering of the true (latent) signal (in black)
and the filtered signal (in colour).
Around the estimated signal, we plot a one standard deviation error in the estimation of the
true position of the signal. This corresponds to latent_filter ± std_filter. Next, we consider the problem of smoothing.
This is written as a straightforward modification of the code-block above,
in which we remove the masking element. We obtain

latent_smooth = np.einsum("tkd,kd->td", K, ve_test_sim)
We observe that, relative to the filtering exercise, the recovered signal has much less variance. Here, we consider the problem of five-step-ahead prediction.
In code, prediction is written as

tmask = np.tril(np.ones((n_steps, n_steps)), k=-5)
latent_pred = np.einsum("tkd,kd,tk->td", K, ve_test_sim, tmask)

Here, we plot the cumulative RMSE as a function of time $t$ and as a
function of lag $k$.
We define the cumulative RMSE at time $t$, under lag $k$, as
$$
E_k(t) = \left(\frac{1}{t}\sum_{\tau=1}^t\|f_\tau - f_{\tau|\tau+k}\|^2\right)^{1/2}.
$$
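As a sketch (ours) of how $E_k(t)$ can be evaluated on the test run — assuming the true latent signal of that run is available as f_test_sim (a hypothetical name) with shape (T, 2):

lags = np.arange(-20, 21)                 # k < 0: prediction, k > 0: fixed-lag smoothing
E = np.zeros((len(lags), T))
for i, lag in enumerate(lags):
    tmask = np.tril(np.ones((T, T)), k=lag)                       # use innovations up to t + k
    latent_lag = np.einsum("tkd,kd,tk->td", K, ve_test_sim, tmask)
    sq_err = np.sum((f_test_sim - latent_lag) ** 2, axis=-1)      # ||f_tau - f_{tau|tau+k}||^2
    E[i] = np.sqrt(np.cumsum(sq_err) / np.arange(1, T + 1))       # E_k(t) for t = 1, ..., T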
The plot below shows $E_k(t)$ for varying $k$ and $t$.
The black line ($k=0$) corresponds to filtering.
We observe that lags $k< 0$ have higher RMSE than $k=0$ — these correspond to prediction.
On the other hand, lags $k > 0$ have lower RMSE than $k = 0$ — these correspond to fixed-lag smoothing. Here, we repeat the above experiment for multiple samples.
We run multiple simulations of $(\text{LV.1})$, compute $E_k(T)$ for each of the samples,
and plot the final average RMSE across samples.
As expected, prediction ($k < 0$) incurs a higher RMSE than fixed-lag smoothing ($k > 0$). In this post,
(i) we introduced signal plus noise models and their best linear unbiased predictions (BLUP),
(ii) we introduced the concept of an innovation to decorrelate the measurements, and
(iii) we made use of the innovations to arrive at linear-in-time formulas to compute the BLUP.
We showed that an important property of the BLUP is its frame of reference.
Depending on the frame of reference, we arrive at filtering, smoothing, prediction, and fixed-lag smoothing. Here, we provide a proof of Proposition 1. Let
$$
{\cal L}(\va) = \mathbb{E}\left[\|F_t - \va^\intercal\,Y_{1:j}\|^2\right].
$$
Then,
$$
\begin{aligned}
\nabla_{\va^\intercal}\,{\cal L}(\va)
&= -2\,\mathbb{E}\left[(F_t - \va^\intercal\,Y_{1:j})\,Y_{1:j}^\intercal\right]\\
&= -2\,\left( \mathbb{E}\left[F_t\,Y_{1:j}^\intercal\right]-\mathbb{E}[\va^\intercal\,Y_{1:j}\,Y_{1:j}^\intercal]\right)\\
&= -2\,\left( \mathbb{E}\left[F_t\,Y_{1:j}^\intercal\right]-\va^\intercal\,\mathbb{E}[Y_{1:j}\,Y_{1:j}^\intercal]\right)\\
&= -2\,\left( \cov (F_t, Y_{1:j}) - \va^\intercal\,\var(Y_{1:j})\right).
\end{aligned}
$$
Setting the last term to zero and solving for $\va$ yields
$(\va^*)^\intercal = \cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}$,
as required. Next, for the error variance-covariance matrix, we have
$$
\begin{aligned}
&\var(F_t - F_{t|j})\\
&=\mathbb{E}\left[(F_t - F_{t|j})^2\right]\\
&=\mathbb{E}\left[F_t^2 - 2\,F_{t|j}\,F_t + F_{t|j}^2\right]\\
&=\mathbb{E}\left[F_t^2 - 2\,(\cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j})\,F_t + (\cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j})^2\right]\\
&=\mathbb{E}\left[F_t^2\right]
- \mathbb{E}\left[2\,\cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j}\,F_t\right]
+ \mathbb{E}\left[(\cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j})^2\right]\\
&=\mathbb{E}\left[F_t^2\right]
- \mathbb{E}\left[2\,\cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j}\,F_t\right]
+ \mathbb{E}\left[\cov (F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j}\,Y_{1:j}\,\var(Y_{1:j})^{-1}\cov(Y_{1:j}, F_t)\right]\\
&= \var(F_t) - 2\,(\va^*)^\intercal\,\cov(Y_{1:j}, F_t) + (\va^*)^\intercal\,\var(Y_{1:j})\,\va^*\\
&= \var(F_t) - 2\,(\va^*)^\intercal\,\var(Y_{1:j})\,\var(Y_{1:j})^{-1}\,\cov(Y_{1:j}, F_t) +
(\va^*)^\intercal\,\var(Y_{1:j})\,\va^*\\
&= \var(F_t) - 2\,(\va^*)^\intercal\,\var(Y_{1:j})\,\va^* +
(\va^*)^\intercal\,\var(Y_{1:j})\,\va^*\\
&= \var(F_t) - \,(\va^*)^\intercal\,\var(Y_{1:j})\,\va^*.
\end{aligned}
$$ Here, we provide a proof of Proposition 2. To show $(\text{I.2})$, first note that the diagonal terms
$\cov({\cal E}_t, {\cal E}_t) = \var({\cal E}_t) = S_t$
for all $t \in {\cal T}$. Next, we show that the off-diagonal terms are zero.
Observe that
$$
\begin{aligned}
&\cov({\cal E}_1, {\cal E}_2)\\
&= \cov\Big(Y_1,\,Y_2 - \cov(Y_2,\,{\cal E}_1)S_1^{-1}{\cal E}_1\Big)\\
&= \cov(Y_1,\,Y_2) - \cov\Big(Y_1,\,\cov(Y_2,\,{\cal E}_1)S_1^{-1}{\cal E}_1\Big)\\
&= \cov(Y_1,\,Y_2) - \cov\Big(Y_1,\,\cov(Y_2,\,Y_1)S_1^{-1}Y_1\Big)\\
&= \cov(Y_1,\,Y_2) - \cov(Y_1, Y_1)S_1^{-1}\cov(Y_1,\,Y_2)\\
&= \cov(Y_1,\,Y_2) - S_1\,S_1^{-1}\cov(Y_1,\,Y_2)\\
&= 0.
\end{aligned}
$$
By symmetry of the covariance matrix, we obtain
$\cov({\cal E}_2,\,{\cal E}_1) = (\cov({\cal E}_1,\,{\cal E}_2))^\intercal = 0$.
A similar procedure shows that
$\cov({\cal E}_1, {\cal E}_3) = \cov({\cal E}_3, {\cal E}_1) = 0$.
The general case holds by induction: Suppose that
$$
\cov({\cal E}_i, {\cal E}_k) = 0 \ \text{for } i\geq 2,\,k=i+1,\ldots,j-1,
$$
i.e., an upper-triangular (off-diagonal) assumption.
We show
$\cov({\cal E}_i, {\cal E}_j) = 0$. By definition,
$$
\begin{aligned}
&\cov({\cal E}_i,\,{\cal E}_j)\\
&= \cov\left({\cal E}_i,\,Y_j - \sum_{k=1}^{j-1}\cov(Y_j, {\cal E}_k)S_k^{-1}{\cal E}_k\right)\\
&= \cov\left({\cal E}_i,\,Y_j -
\sum_{\substack{k\neq i \\ {1 \leq k\leq j-1}}}\cov(Y_j,\,{\cal E}_k)S_k^{-1}{\cal E}_k
- \cov(Y_j,\,{\cal E}_i)S_i^{-1}{\cal E}_i
\right)\\
&=\cov({\cal E}_i,\, Y_j)
- \sum_{\substack{k\neq i \\ {1 \leq k\leq j-1}}}\cov({\cal E}_i, {\cal E}_k)S_k^{-1}\cov({\cal E}_k\,Y_j)
- \cov({\cal E}_i, {\cal E}_i)S_i^{-1}\,\cov({\cal E}_i, Y_j)\\
&= - \sum_{k=1}^{i-1}\cov({\cal E}_i, {\cal E}_k)S_k^{-1}\cov({\cal E}_k\,Y_j)
- \sum_{k=i+1}^{j-1}\cov({\cal E}_i, {\cal E}_k)S_k^{-1}\cov({\cal E}_k\,Y_j)\\
&= - \sum_{k=1}^{i-1}(\cov({\cal E}_k, {\cal E}_i))^\intercal S_k^{-1}\cov({\cal E}_k\,Y_j)
- \sum_{k=i+1}^{j-1}\cov({\cal E}_i, {\cal E}_k)S_k^{-1}\cov({\cal E}_k\,Y_j)\\
&= 0.
\end{aligned}
$$
where the last equality follows from the assumption that
$\cov({\cal E}_i, {\cal E}_k) = 0$ for
$k = i + 1, \ldots, j - 1$. Next, we show $(\text{I.3})$.
By definition, the innovation at time $t$ is
$$
{\cal E}_t = Y_t - \sum_{k=1}^{t-1}\cov(Y_t, {\cal E}_k)\,\vS_k^{-1}\,{\cal E}_k.
$$
Define the lower triangular matrix $\vL$ as in $(\text{I.4})$.
Then
$$
\begin{aligned}
{\cal E}_t
&= Y_t - \sum_{k=1}^{t-1}\vL_{t,k}\,{\cal E}_k\\
&= Y_t + \vL_{t,t}\,{\cal E}_t - \vL_{t,t}\,{\cal E}_t - \sum_{k=1}^{t-1}\vL_{t,k}\,{\cal E}_k\\
&= Y_t + \vL_{t,t}\,{\cal E}_t - \sum_{k=1}^{t}\vL_{t,k}\,{\cal E}_k\\
&= Y_t + {\cal E}_t - \sum_{k=1}^{t}\vL_{t,k}\,{\cal E}_k.
\end{aligned}
$$
The sum in the last equality is the $t$-th entry of the vector $\vL\,{\cal E}_{1:T}$, i.e.,
$(\vL\,{\cal E}_{1:T})_t$.
So that
${\cal E}_t = Y_t + {\cal E}_t - (\vL\,{\cal E}_{1:T})_t$. Finally, write the innovation vector as
$$
\begin{aligned}
&{\cal E}_{1:T} = Y_{1:T} + {\cal E}_{1:T} - \vL\,{\cal E}_{1:T}\\
\iff & 0 = Y_{1:T} - \vL\,{\cal E}_{1:T}\\
\iff & Y_{1:T} = \vL\,{\cal E}_{1:T}.
\end{aligned}
$$ Following $(\text{I.3})$, we have
$$
\begin{aligned}
\var(Y_{1:T})
&= \var(\vL\,{\cal E}_{1:T})\\
&= \cov(\vL\,{\cal E}_{1:T}, \vL\,{\cal E}_{1:T})\\
&= \vL\,\cov({\cal E}_{1:T}, {\cal E}_{1:T})\vL^\intercal\\
&= \vL\,\var({\cal E}_{1:T})\vL^\intercal\\
&= \vL\,\vS\,\vL^\intercal.
\end{aligned}
$$
Because $\vL$ is a lower-triangular matrix with unit diagonal and $\vS$ is diagonal, the last equality corresponds to the Cholesky
decomposition of $\var(Y_{1:T})$.
$$ \ \tag*{$\blacksquare$} $$ Here, we provide a detailed proof of Proposition 3. Let
$S_k = \var({\cal E}_k)$.
Using $(\text{BLUP.1})$, $(\text{I.2})$, and $(\text{I.3})$, we see that
$$
\begin{aligned}
(\va^*)^\intercal
&= \cov(F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\\
&= \cov(F_t, \vL\,{\cal E}_{1:j})\,\var(\vL\,{\cal E}_{1:j})^{-1}\\
&= \cov(F_t, {\cal E}_{1:j})\,\vL^\intercal\,\{\vL\var({\cal E}_{1:j})\vL^\intercal\}^{-1}\\
&= \cov(F_t, {\cal E}_{1:j})\,\vL^\intercal\,\vL^{-\intercal}\, \var({\cal E}_{1:j})^{-1}\vL^{-1}\\
&= \cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\vL^{-1}\\
&= \cov(F_t, {\cal E}_{1:j})\, {\rm diag}(\var({\cal E}_1), \ldots, \var({\cal E}_j))^{-1}\vL^{-1}\\
&= \cov(F_t, {\cal E}_{1:j})\, {\rm diag}(S_1, \ldots, S_j)^{-1}\vL^{-1}.
\end{aligned}
$$
Then, the BLUP of the signal $F_t$ given $Y_{1:j}$ is
$$
\begin{aligned}
F_{t|j}
&= (\va^*)^\intercal\,Y_{1:j}\\
&= \cov(F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j}\\
&= \cov(F_t, {\cal E}_{1:j})\, {\rm diag}(S_1, \ldots, S_j)^{-1}\vL^{-1}\,\vL\,{\cal E}_{1:j}\\
&= \cov(F_t, {\cal E}_{1:j})\, {\rm diag}(S_1, \ldots, S_j)^{-1}\,{\cal E}_{1:j}\\
&= \cov(F_t, {\cal E}_{1:j})\, {\rm diag}(S_1^{-1}, \ldots, S_j^{-1})\,{\cal E}_{1:j}\\
&= \sum_{k=1}^j \cov(F_t, {\cal E}_k)\,S_k^{-1}\,{\cal E}_k\\
&= \sum_{k=1}^j \vK_{t,k}\,{\cal E}_k.
\end{aligned}
$$ Furthermore, the error variance-covariance matrix of the BLUP takes the form
$$
\begin{aligned}
&\var(F_t - F_{t|j})\\
&= \var(F_t) - (\va^*)^\intercal\,\var\,(Y_{1:j})\,\va^*\\
&= \var(F_t) -
\cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\vL^{-1}
\{\vL\var({\cal E}_{1:j})\vL^\intercal\}
\left(\cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\vL^{-1}\right)^\intercal\\
&= \var(F_t) -
\cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\vL^{-1}
\vL\var({\cal E}_{1:j})\vL^\intercal
\left(\cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\vL^{-1}\right)^\intercal\\
&= \var(F_t) -
\cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\vL^{-1}
\vL\var({\cal E}_{1:j})\vL^\intercal\,
\vL^{-\intercal}\,\var({\cal E}_{1:j})^{-1}\,\cov({\cal E}_{1:j}, F_t)\\
&= \var(F_t) -
\cov(F_t, {\cal E}_{1:j})\, \var({\cal E}_{1:j})^{-1}\var({\cal E}_{1:j})\,
\var({\cal E}_{1:j})^{-1}\,\cov({\cal E}_{1:j}, F_t)\\
&= \var(F_t) -
\cov(F_t, {\cal E}_{1:j})\, {\rm diag}(S_1, \ldots, S_j)^{-1}
{\rm diag}(S_1, \ldots, S_j)\,
{\rm diag}(S_1, \ldots, S_j)^{-1}\,\cov({\cal E}_{1:j}, F_t)\\
&= \var(F_t) - \sum_{k=1}^j\cov(F_t, {\cal E}_k)\,S_k^{-1}S_k\,S_k^{-1}\,\cov({\cal E}_k, F_t)\\
&= \var(F_t) - \sum_{k=1}^j \left(\cov(F_t, {\cal E}_k)\,S_k^{-1}\right)S_k\,\left(\cov(F_t, {\cal E}_k)\,S_k^{-1}\right)^\intercal\\
&= \var(F_t) - \sum_{k=1}^j {K}_{t,k}\,S_k\,{K}_{t,k}^\intercal\\
&= \var(F_t) - \sum_{k=1}^j {K}_{t,k}^2\,S_k.
\end{aligned}
$$
$$ \ \tag*{$\blacksquare$} $$
Introduction
Signal plus noise models — the univariate case
The best linear unbiased predictor (BLUP)
A linear predictor
The best linear predictor
Proposition 1 — best linear unbiased predictor (BLUP)
The innovation process
Definition: innovation
Proposition 2 — properties of innovations
The BLUP under innovations
Proposition 3 — additive BLUP
Filtering, prediction, smoothing, and fixed-lag smoothing
Filtering
Prediction
Smoothing
Fixed-lag smoothing
Signal plus noise models – the multivariate case*
The multivariate BLUP
The multivariate innovations
The multivariate BLUP under innovations
Stacking
A machine learning approach to estimate the BLUP — from random variables to samples
Example: a data-driven BLUP
Discretised noisy Lotka-Volterra model
The setup
Train phase
Computation of innovations
Computation of the gain values
Computation of the error variance-covariance matrix
Test phase
Filter
Smoothing
Prediction
Varying lag
Multiple simulations
Conclusion
A summary — BLUP for a unidimensional signal plus noise model
Appendix
Proof of proposition 1
Proof of proposition 2
Proof of (I.2)
Proof of (I.3)
Proof of (I.5)
Proof of proposition 3
References