Introduction to GP Regression

Definition of Gaussian Process

A Gaussian Process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution. It is completely specified by its mean function $\mu(x)$ and covariance function $k(x, x')$, also known as the kernel function.

Where:

  • $\mu(x) = E[f(x)]$ is the mean function
  • $k(x, x') = E[(f(x) - \mu(x))(f(x') - \mu(x'))]$ is the covariance function

For a random process, we usually introduce a time index to interpret the meaning of the different states (e.g., a Poisson process is a counting process indexed by time). A more formal way to understand a process, however, is to define it over an index set: discrete time steps live in $\mathbb{Z}$, while a GP is a type of process indexed by $\mathbb{R}^d$.

Every point in $\mathbb{R}^d$ indexes a state corresponding to a Normally distributed random variable. The key rule for this space is that when we combine any finite collection of these states, their joint distribution forms a Multivariate Normal Distribution, with covariances given by a kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$.
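
Concretely, for any finite collection of inputs $x_1, \dots, x_n \in \mathbb{R}^d$, the definition says:

$$
\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \mu(x_1) \\ \vdots \\ \mu(x_n) \end{bmatrix},\ K\right), \qquad K_{ij} = k(x_i, x_j).
$$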

Background

Unlike a typical machine learning problem, once we have a dataset with input vectors $\mathbf{x}$ and target variable $y$, we do not need to split the dataset: the model assumption itself distinguishes between training data and prediction data through the following joint distribution:
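
$$
\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\ \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right),
$$

where $X$ denotes the training inputs, $X_*$ the prediction inputs, $\mathbf{f}_*$ the function values at the prediction inputs, and $\sigma_n^2$ the observation-noise variance (this is the standard zero-mean formulation; the notation is introduced here and reused in the derivation below).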


Prior and Posterior Distributions

For convenience of illustration, the prior distribution is commonly set up as:
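
$$
f \sim \mathcal{GP}(\mathbf{0}, k), \qquad y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2),
$$

i.e., a zero-mean GP prior over $f$ with i.i.d. Gaussian observation noise (a common default choice, consistent with the zero-mean assumption used later in the Optimization section).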

The next step is to find the posterior distribution $p(y_* \mid \mathbf{x}, \mathbf{y}, \mathbf{x}_*)$. We consider the general form:
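
$$
p(\mathbf{b} \mid \mathbf{a}) = \frac{p(\mathbf{a}, \mathbf{b})}{p(\mathbf{a})}, \qquad \begin{bmatrix}\mathbf{a}\\ \mathbf{b}\end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix}\boldsymbol{\mu}_a\\ \boldsymbol{\mu}_b\end{bmatrix},\ \Sigma\right), \quad \Sigma = \begin{bmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}\end{bmatrix},
$$

where $\mathbf{a}$ stands for the observed block and $\mathbf{b}$ for the block we want to predict; this generic split of the joint Gaussian is used in the next few steps.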

Since we know that this ratio of Normal densities is again Normal, and the denominator is a constant because the data are given, we analyse $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ directly.

Let $\delta \mathbf{a} := \mathbf{a} - \boldsymbol{\mu}_a,\quad \delta \mathbf{b} := \mathbf{b} - \boldsymbol{\mu}_b$ and we have:
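
$$
p(\mathbf{a}, \mathbf{b}) \propto \exp\!\left(-\frac{1}{2}\begin{bmatrix}\delta\mathbf{a}\\ \delta\mathbf{b}\end{bmatrix}^{\top}\begin{bmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}\end{bmatrix}^{-1}\begin{bmatrix}\delta\mathbf{a}\\ \delta\mathbf{b}\end{bmatrix}\right).
$$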

Given that, for a block matrix, we have:
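
$$
\begin{bmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}\end{bmatrix}^{-1} = \begin{bmatrix}\Sigma_{aa}^{-1} + \Sigma_{aa}^{-1}\Sigma_{ab} S^{-1} \Sigma_{ba}\Sigma_{aa}^{-1} & -\Sigma_{aa}^{-1}\Sigma_{ab} S^{-1}\\[2pt] -S^{-1}\Sigma_{ba}\Sigma_{aa}^{-1} & S^{-1}\end{bmatrix},
$$

where $S := \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab}$ is the Schur complement of $\Sigma_{aa}$.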

We define:
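
$$
\mathbf{m} := \Sigma_{ba}\Sigma_{aa}^{-1}\,\delta\mathbf{a},
$$

(the shorthand $\mathbf{m}$ is introduced here just to keep the next step compact),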

and complete the square:
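
$$
\begin{bmatrix}\delta\mathbf{a}\\ \delta\mathbf{b}\end{bmatrix}^{\top}\Sigma^{-1}\begin{bmatrix}\delta\mathbf{a}\\ \delta\mathbf{b}\end{bmatrix} = \delta\mathbf{a}^{\top}\Sigma_{aa}^{-1}\,\delta\mathbf{a} + (\delta\mathbf{b} - \mathbf{m})^{\top} S^{-1} (\delta\mathbf{b} - \mathbf{m}),
$$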

which means:
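
$$
\mathbf{b} \mid \mathbf{a} \sim \mathcal{N}\!\big(\boldsymbol{\mu}_b + \Sigma_{ba}\Sigma_{aa}^{-1}(\mathbf{a} - \boldsymbol{\mu}_a),\ \ \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab}\big).
$$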

Plugging in the data, we now obtain the posterior distribution at the prediction points.
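
Concretely, identifying $\mathbf{a}$ with the noisy training targets $\mathbf{y}$ and $\mathbf{b}$ with the test values $\mathbf{f}_*$ in the joint distribution from the Background section, the conditional above becomes the standard GP predictive distribution:

$$
\mathbf{f}_* \mid X, \mathbf{y}, X_* \sim \mathcal{N}\!\big(K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\mathbf{y},\ \ K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} K(X, X_*)\big).
$$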

Optimization

Our key task is to find the correct kernel expression. To do so, we can introduce the marginal likelihood function and try to find its maximum value. We start from the definition:
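
$$
p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X)\, d\mathbf{f}.
$$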

We consider the $\mathbf{f}$ appearing in the integral to be an arbitrary vector of latent function values in $\mathbb{R}^n$ (one value per training input). The term $p(\mathbf{y}\mid\mathbf{f})$ can then be interpreted as: given these latent values, how likely are the true observations; by the model assumption this is a Normal distribution. The term $p(\mathbf{f}\mid X)$ describes the prior behaviour of $\mathbf{f}$, depending on how we define the prior distribution; in this case it is $\mathcal{N}(\boldsymbol{\mu}, K)$ with $\boldsymbol{\mu} = \mathbf{0}$.

Since both distributions are Normal and the convolution of two Normal distributions is still Normal, we have:
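
$$
p(\mathbf{y} \mid X) = \mathcal{N}\!\big(\mathbf{y} \mid \mathbf{0},\ K + \sigma_n^2 I\big).
$$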

We use the Negative Log Marginal Likelihood (NLML) as the objective function to optimize model hyperparameters because it provides a principled and probabilistic way to learn the best settings for the kernel (e.g., length-scales, variances) and noise level.

In its standard form (writing $n$ for the number of training points and $\theta$ for the kernel hyperparameters), this objective is:
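
$$
-\log p(\mathbf{y} \mid X, \theta) = \frac{1}{2}\mathbf{y}^{\top}\mathbf{K}_y^{-1}\mathbf{y} + \frac{1}{2}\log\lvert\mathbf{K}_y\rvert + \frac{n}{2}\log 2\pi,
$$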

with $\mathbf{K}_y = \mathbf{K}_\theta + \sigma_n^2 \mathbf{I}$. Notice that the NLML is differentiable with respect to the hyperparameters, so we can use gradient descent, Adam, and other common gradient-based methods to search for the minimum.

Workflow

  1. Data processing and initialization:
    1. Clean and normalize the data
    2. Define the kernel function and initialize the parameters to be optimized
    3. Define the NLML
  2. Training loop over a given number of epochs:
    1. Compute the NLML under the current parameters
    2. Apply a gradient-based method with a learning rate
    3. Update the parameters
  3. After training (a code sketch of the full workflow follows this list):
    1. Compute the posterior distribution and sample from it
    2. Present the results and validate them
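
As a minimal sketch of this workflow, here is an illustration in Python/NumPy. It assumes a toy one-dimensional dataset, an RBF kernel, and plain gradient descent on the NLML using finite-difference gradients (to keep the example dependency-free); names such as `rbf_kernel`, `nlml`, and the learning-rate value are illustrative choices rather than a fixed API.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale, variance):
    """Squared-exponential kernel: variance * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    sqdist = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def nlml(params, X, y):
    """NLML: 0.5*y^T Ky^-1 y + 0.5*log|Ky| + (n/2)*log(2*pi)."""
    lengthscale, variance, noise = np.exp(params)        # log-parameterization keeps values positive
    n = len(y)
    Ky = rbf_kernel(X, X, lengthscale, variance) + noise * np.eye(n)
    L = np.linalg.cholesky(Ky)                           # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = Ky^-1 y
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2 * np.pi)

# --- 1. Data processing: toy 1-D dataset with normalized targets ---
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, (30, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
y = (y - y.mean()) / y.std()

# --- 2. Training loop: gradient descent on the NLML (finite-difference gradients) ---
params = np.log(np.array([1.0, 1.0, 0.1]))   # log(lengthscale), log(variance), log(noise)
lr, eps = 0.01, 1e-5
for epoch in range(100):
    grad = np.zeros_like(params)
    for i in range(len(params)):             # numerical gradient of the NLML
        d = np.zeros_like(params); d[i] = eps
        grad[i] = (nlml(params + d, X, y) - nlml(params - d, X, y)) / (2 * eps)
    params -= lr * grad                      # gradient-descent update

# --- 3. After training: posterior mean and covariance at the prediction points ---
lengthscale, variance, noise = np.exp(params)
Xs = np.linspace(-3, 3, 100)[:, None]
Ky = rbf_kernel(X, X, lengthscale, variance) + noise * np.eye(len(y))
Ks = rbf_kernel(X, Xs, lengthscale, variance)
Kss = rbf_kernel(Xs, Xs, lengthscale, variance)
post_mean = Ks.T @ np.linalg.solve(Ky, y)
post_cov = Kss - Ks.T @ np.linalg.solve(Ky, Ks)
print("learned hyperparameters:", np.exp(params))
print("posterior mean at first 5 test points:", post_mean[:5])
```

In practice the finite-difference gradients would be replaced by analytic gradients or automatic differentiation (e.g., via an optimizer such as Adam), but the overall structure follows the three stages of the workflow above.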