[slides] Bayesian online learning in non-stationary environments
Framework that encompasses many Bayesian models that tackle online learning in non-stationary environments. Observe sequence of features $x_t$ and observations $y_t$:
$$
{\cal D}_{1:t} = \{(x_1, y_1), \ldots, (x_t, y_t)\}.
$$ Given $x_{t+1}$, make a prediction
$$
\hat{y}_{t+1} = \omega(x_{t+1}, {\cal D}_{1:t}).
$$ Variables $x_t$ and $y_t$ can live in different time frames. Estimating unobservable quantities of interest from the data. Predicting a measurable outcome $y_t$ Inductive bias on the observations given parameters $\theta$ and features $x_t$.
In the linear setting: $$
h(\theta, x) = \sigma(\phi_\theta(x)),
$$
with $\phi_\theta: \reals^M \to \reals$ a neural network with real-valued output unit.
Then, $p(y_t \vert \theta, x_t) = {\rm Bern}(y_t \vert h(\theta, x_t))$. Assume that observations $y_{1:t}$ are conditionally independent given $\theta$. In most cases, we cannot compute the posterior. We resort to either sample-based methods or approximations. Suppose $p(\theta) = {\cal N}(\theta \vert \mu_0, \Sigma_0)$
$$
\mu_t, \Sigma_t = {\bf D}_\text{KL}\left(
{\cal N}(\theta\,|\,\mu, \Sigma) \|
{\cal N}(\theta \vert \mu_{0}, \Sigma_{0})\,\prod_{k=1}^t p(y_k | \theta, x_k).
\right)
$$ Suppose
$p(\theta \vert {\cal D}_{1:t-1}) = {\cal N}(\theta \vert \mu_{t-1}, \Sigma_{t-1})$:
$$
\mu_t, \Sigma_t = {\bf D}_\text{KL}
\left(
{\cal N}(\theta\,|\,\mu, \Sigma) \|
{\cal N}(\theta \vert \mu_{t-1}, \Sigma_{t-1})\,p(y_t | \theta, x_t)
\right).
$$ First-order approximation of the choice of modelling function (M.1) around the previous mean $\mu_{t-1}$:
$$
h(\theta, x) \approx \bar{h}_t(\theta, x) = H_t\,(\theta_t - \mu_{t-1}) + h(\mu_{t-1}, x_t).
$$ $$
p(y_t \vert \theta, x_t) \approx
{\cal N}(y_t\,\vert\,\hat{y}_t, R_t),
$$
where
$$
\begin{aligned}
\hat{y}_t &= \mathbb{E}[\bar{h}(\theta_t, x_t) \vert {\cal D}_{1:t-1}] = h(\mu_{t-1}, x_t)\\
\end{aligned}
$$
and $R_t$ is the moment-matched variance of the observation process. When more data does not lead to better performance. Could be due to: In many situations, (M.1) and (A.1) are not enough to adapt to regime changes. where $\mu_{(r_{t-1})}, \Sigma_{(r_{t-1})}$ denotes the posterior belief using observations
from indices $t - r_t$ to $t - 1$.
$\mu_0$ and $\Sigma_0$ are pre-defined prior mean and covariance. $$
\begin{aligned}
g_t(s_{1:t}, {\cal D}_{1:t-1}) &= \mu_{(s_{1:t})},\\
G_t(s_{1:t}, {\cal D}_{1:t-1}) &= \Sigma_{(s_{1:t})},\\
\end{aligned}
$$
where $\mu_{(s_{1:t})}, \Sigma_{(s_{1:t})}$ denote the posterior beliefs using observations that are 1-valued. (A.2) An algorithm to weight over choices of $\psi_t \in \Psi_t$. For some $\psi_t$ and $p(\psi_t \vert \psi_{t-1})$, this method yields recursive update methods. Suppose
$\psi_t = \alpha_t \in \{1, \ldots, K\}$, take
$$
\nu_t(\alpha_t) = 1 - \frac{\ell(y_{t+1}, \hat{y}_{t+1}^{(\alpha_t)})}
{\sum_{k=1}^K \ell(y_{t+1}, \hat{y}_{t+1}^{(k)})},
$$
with $\ell$ a loss function (lower is better). For $\psi_t = \upsilon_t \in [0,1]$, $$
\upsilon_t^* = \argmax_{\upsilon \in [0,1]} p(y_t \vert x_t, \upsilon, {\cal D}_{1:t-1}).
$$
Then,
$$
\nu(\upsilon_t) = \delta(\upsilon_t = \upsilon_t^*).
$$ Consider three combinations algorithms Our choice of measurement model is a single hidden layer neural network
with linearised moment-matched Gaussian. In-sample hyperparameter optimisation. $$
\nu_t(r_t^{(1)}) =
\frac{p(y_t \vert r_t^{(1)}, x_t, {\cal D}_{1:t-1})\,(1 - \pi)}
{p(y_t \vert r_t^{(0)}, x_t, {\cal D}_{1:t-1})\,\pi + p(y_t \vert r_t^{(1)}, x_t, {\cal D}_{1:t-1})\,(1-\pi)}.
$$
Here, $r_{t}^{(1)} = r_{t-1} + 1$ and $r_{t}^{(0)} = 0$. Seven features: One target variable: Features are lagged by one-hour. See paper for details. We introduce a framework for Bayesian online learning in non-stationary environments (BONE).
BONE methods are written as instances of:(B)ayesian (O)nline learning in (N)on-stationary (E)nvironments
What is BONE?
Online learning and prediction
(Unsupervised) online sequential learning
(Supervised) online sequential learning (and prediction)
Bayes for sequential online learning
$$
\hat{y}_{t+1}
= \mathbb{E}[h(\theta_t; x_{t+1}) \vert {\cal D}_{1:t}]
= \int
\underbrace{h(\theta_t; x_{t+1})}_\text{measurement fn.}
\overbrace{p(\theta_t; {\cal D}_{1:t})}^\text{posterior density}
\,{\rm d}\theta_t.
$$A running example: sequential classification
Choice of measurement model (M.1)
Non-linear classification classification
Posterior over model parameters (A.1) — Bayes
Variational Bayes methods
Posterior over model parameters (A.1) — sequential Bayes
$$
p(\theta | {\cal D}_{1:t}) \propto p(\theta | {\cal D}_{1:t-1}) p(y_t \vert \theta, x_t).
$$(Recursive) variational Bayes methods
Moment-matched linear Gaussian (LG)
Example: moment-matched linear Gaussian for classification
$$
\begin{aligned}
H_t &= \nabla_\theta h(\mu_{t-1}, x_t) & \text{(Jacobian)}\\
\hat{y}_t & = h(\mu_{t-1}, x_t) & \text{ (one-step-ahead prediction)} \\
R_t &= \hat{y}_t\,(1 - \hat{y}_t) & \text{ (moment-matched variance)}\\
\Sigma_t^{-1} &= \Sigma_{t-1}^{-1} + H_t^\intercal\,R_t^{-1}\,H_t & \text{(posterior precision)}\\
{\bf K}_t &= \Sigma_{t}\,H_t\,{R}_t^{-1} & \text{(Gain matrix)}\\
\hline
\mu_t &\gets \mu_{t-1} + {\bf K}_t\,(y_t - \hat{y}_t) & \text{(update)}
\end{aligned}
$$Sequential Online classification
Online learning — (M.1) and (A.1)
Changes in the data-generating process
Non-stationary moons dataset
The full dataset (without knowledge of the task boundaries)
Constant moment-matched LG updates — non-stationary moons
Recap: tools for sequential learning
Tackling non-stationarity
Tracking regimes through an auxiliary variable (M.2)
Runlenght (RL)
Runlenght with changepoint count (RLCC)
Changepoint location (CPL)
Changepoint location (CPL) alt.
Changepoint probability (CPP)
Mixture of experts (ME)
Constant (C)
Summary of auxiliary variables
(M.3) Modifying beliefs based on regime
$$
\underbrace{\lambda(\theta; \psi_t, {\cal D}_{1:t})}_{\text{posterior}}
\propto
\underbrace{\tau(\theta\,\vert\,\psi_t, {\cal D}_{1:t-1})}_\text{parametrised by $\psi_t$}\,
\overbrace{p(y_t\,\vert\,\theta, x_t)}^\text{parametrised by $\theta$}
$$Choice of conditional prior (M.3) — the Gaussian case
$$
\tau(\theta_t \vert \psi_t,\, {\cal D}_{1:t-1}) =
{\cal N}\big(\theta_t\vert g_{t}(\psi_t, {\cal D}_{1:t-1}), G_{t}(\psi_t, {\cal D}_{1:t-1})\big),
$$C-Static
— constant update with static auxvarRL-PR
— runlength with prior reset
$$
\begin{aligned}
g_t(r_t, {\cal D}_{1:t-1}) &= \mu_0\,\mathbb{1}(r_t = 0) + \mu_{(r_{t-1})}\mathbb{1}(r_t > 0),\\
G_t(r_t, {\cal D}_{1:t-1}) &= \Sigma_0\,\mathbb{1}(r_t = 0) + \Sigma_{(r_{t-1})}\mathbb{1}(r_t > 0),\\
\end{aligned}
$$CPP-OU
— changepoint probability with Ornstein-Uhlenbeck processCPL-sub
— changepoint location with subset of dataC-ACI
— constant with additive covariance inflationTools for sequential learning (conditioned on regime)
A prediction conditioned on $\psi_t$
$$
\hat{y}_{t+1}^{(\psi_t)}
= \mathbb{E}_{\lambda_t}[h(\theta_t;\, x_{t+1}) \vert \psi_t]
:= \int h(\theta_t;\, x_{t+1})\,\lambda(\theta_t;\,\psi_t, {\cal D}_{1:t})d \theta_t\,.
$$Aggregate predictions
$$
\begin{aligned}
\hat{y}_{t+1}
&= \int\left(\int h(\theta_t;\, x_{t+1})\,\lambda_t( \theta_t;\,\psi_t, {\cal D}_{1:t})\, d \theta_t\right)\,\nu_t(\psi_t)\,d\psi_t\\
&= \int \hat{y}_{t+1}^{(\psi_t)}\,\nu_t(\psi_t)\,d\psi_t\\
&=: \mathbb{E}_{\nu_t}\Big[\, \mathbb{E}_{\lambda_t}[h(\theta_t;\, x_{t+1}) \vert \psi_t ] \,\Big]\,.
\end{aligned}
$$(A.2) Weighting function for regimes
(A.2) The recursive Bayesian choice
$$
\begin{aligned}
\nu_t(\psi_t)
&= p(\psi_t \vert {\cal D}_{1:t})\\
&=
p(y_t \vert x_t, \psi_t, {\cal D}_{1:t-1})\,
\int_{\psi_{t-1} \in \Psi_{t-1}}
p(\psi_{t-1} \vert {\cal D}_{1:t-1})\,
p(\psi_t \vert \psi_{t-1}, {\cal D}_{1:t-1}) d \psi_{t-1},
\end{aligned}
$$(A.2) A loss-based approach
(A.2) An empirical-Bayes (point-estimate approach)
BONE — Bayesian online learning in non-stationary environments
Back to the non-stationary moons example
RL[1]-PR
.CPP-OU
.C-ACI
.RL[1]-PR
CPP-OU
C-ACI
Sequential classification — comparison
Unified view of examples in the literature
Creating a new method
Choice of (M.2)
RL[1]
) as choice auxiliary variable.RL[1]-OUPR — choice of (M.2) and (M.3)
$$
g_t(r_t, {\cal D}_{1:t-1}) =
\begin{cases}
\mu_0\,(1 - \nu_t(r_t)) + \mu_{(r_t)}\,\nu_t(r_t) & \nu_t(r_t) > \varepsilon,\\
\mu_0 & \nu_t(r_t) \leq \varepsilon,
\end{cases}
$$
$$
G_t(r_t, {\cal D}_{1:t-1}) =
\begin{cases}
\Sigma_0\,(1 - \nu_t(r_t)^2) + \Sigma_{(r_t)}\,\nu_t(r_t)^2 & \nu_t(r_t) > \varepsilon,\\
\Sigma_0 & \nu_t(r_t) \leq \varepsilon.
\end{cases}
$$RL[1]-OUPR — choice of (A.2)
Experiment: hourly electricity load
Experiment: hourly electricity load
Electricity forecasting during normal times
Electricity forecasting before and after Covid shock
Electricity forecasting
Electricity forecasting results (2018-2020)
Experiment — heavy-tailed linear regression
Heavy tailed linear regression (sample run)
Even more experiments!
Conclusion
Three modelling choices
Two algorithmic choices