Consider a regression problem with a set of data

\[ \set{D} = \setbuilder{(\vec{x}_i, y_i) | i = 1, \dots, n}, \]

which is composed of \(n\) pairs of inputs, \(\vec{x}_i\), and outputs, \(y_i\). The inputs are vectors which describe the location of each datum in the parameter space. The outputs may be noisy; in this work I will only consider situations where the noise is additive and Gaussian, so

\[ y_i = f(\vec{x}_i) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2), \]

where \(\sigma\) is the standard deviation of the noise, and \(f\) is the (latent) generating function of the data.
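This noise model can be sketched in a few lines; the latent function, noise level, and input range below are illustrative choices, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Illustrative latent generating function (an assumption for this sketch)
    return np.sin(3 * x)

n = 50
sigma = 0.1                                # standard deviation of the noise
x = rng.uniform(0, 2, size=n)              # inputs x_i
y = f(x) + rng.normal(0, sigma, size=n)    # noisy outputs y_i = f(x_i) + eps
```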

This regression problem can be addressed using *Gaussian processes*:

A gls:gaussian-process is a collection of random variables, any finite number of which have a joint Gaussian distribution cite:gpr.book.rw.

Where it is more conventional to consider a prior over a set of, for example, real values, such as a normal distribution, the Gaussian process forms a prior over the functions, \(f\) from equation ref:eq:gp:additive-noise, which might form the regression fit to any observed data. This assumes that the values of the function \(f\) behave as

\[ \vec{f} \sim \mathcal{N}(\vec{0}, \mat{K}), \]

where \(\mat{K}\) is the covariance matrix of the inputs \(\vec{x}_1, \dots, \vec{x}_n\), which can be calculated with reference to some *covariance function*, \(k\), such that \(K_{ij} = k(\vec{x}_i, \vec{x}_j)\). Note that I have assumed that the abbr:gp is a *zero-mean* process; this assumption is common in the literature. While this prior is initially untrained, it still contains information about our preconceptions of the data through the form of the covariance function: for example, whether or not we expect the fit to be smooth, or periodic. Covariance functions will be discussed in greater detail in section ref:sec:gp:covariance.

By providing training data we can use Bayes' theorem to update the Gaussian process, in the same way that the posterior distribution is updated by the addition of new data in a standard Bayesian context; this produces a posterior over the set of all possible functions which fit the data. Thus, for a vector of test values of the generating function, \(\vec{f}_*\), the joint posterior \(p(\vec{f}, \vec{f}_* | \vec{y})\), given the observed outputs, \(\vec{y}\), can be found by updating the abbr:gp prior on the training and test function values, \(p(\vec{f}, \vec{f}_*)\), with the likelihood \(p(\vec{y}|\vec{f})\):

\[ p(\vec{f}, \vec{f}_* | \vec{y}) = \frac{p(\vec{f}, \vec{f}_*) p(\vec{y} | \vec{f})}{p(\vec{y})}. \]

Finally, the (latent) training-set function values, \(\vec{f}\), can be marginalised out:

\[ p(\vec{f}_* | \vec{y}) = \int p(\vec{f}, \vec{f}_* | \vec{y}) \, \mathrm{d}\vec{f}. \]

We can take the mean of this posterior in place of the "best fit line" which other techniques produce, and then use the variance to produce an estimate of the uncertainty of the prediction.

Both the prior \(p(\vec{f}, \vec{f}_*)\) and the likelihood \(p(\vec{y}|\vec{f})\) are Gaussian:

\[ p(\vec{f}, \vec{f}_*) = \mathcal{N}\left( \vec{0}, \begin{bmatrix} \mat{K} & \mat{K}_* \\ \mat{K}_*^{\transpose} & \mat{K}_{**} \end{bmatrix} \right), \]

with

\[ p(\vec{y} | \vec{f}) = \mathcal{N}(\vec{f}, \sigma^2 \mat{I}), \]

where \(\mat{K}_*\) is the matrix of covariances between the training and test inputs, \(\mat{K}_{**}\) the matrix of covariances between the test inputs, and \(\mat{I}\) the identity matrix.

This leaves the form of the marginalised posterior being analytical:

\[ p(\vec{f}_* | \vec{y}) = \mathcal{N}\left( \mat{K}_*^{\transpose} [\mat{K} + \sigma^2 \mat{I}]^{-1} \vec{y},\ \mat{K}_{**} - \mat{K}_*^{\transpose} [\mat{K} + \sigma^2 \mat{I}]^{-1} \mat{K}_* \right). \]

Figures ref:fig:gp:training-data to ref:fig:gp:posterior-best show visually how a one-dimensional regressor can be created using an abbr:gp method, starting from a abbr:gp prior and (noisy) data.

The mean and variance of this posterior distribution can be used to form a regressor for the data, \(\set{D}\), with the mean taking the role of a "line-of-best-fit" in conventional regression techniques, while the variance describes the goodness of that fit.
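A minimal sketch of such a regressor, assuming a squared-exponential covariance function and illustrative hyperparameter values and data (none of these choices are prescribed by the text):

```python
import numpy as np

def k_se(a, b, l=1.0):
    # Squared-exponential covariance between two 1-d input arrays
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / l ** 2)

def gp_posterior(x_train, y_train, x_test, l=1.0, sigma=0.1):
    """Return the posterior mean and covariance of f_* given noisy y."""
    K = k_se(x_train, x_train, l) + sigma ** 2 * np.eye(len(x_train))
    K_s = k_se(x_train, x_test, l)           # train-test covariances
    K_ss = k_se(x_test, x_test, l)           # test-test covariances
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# usage: regress noisy samples of an illustrative latent function
x_train = np.array([0.0, 0.5, 1.0])
y_train = np.sin(x_train)
mean, cov = gp_posterior(x_train, y_train, np.array([0.25, 0.75]))
```

The diagonal of `cov` gives the predictive variance at each test point, which plays the role of the goodness-of-fit estimate described above.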

A graphical model of a abbr:gp is shown in figure ref:fig:gp:chain-diagram, which illustrates an important property of the abpl:gp model: the addition (or removal) of any input point to the abbr:gp does not change the distribution of the other variables. This property allows outputs to be generated at arbitrary locations throughout the parameter space.

Gaussian processes trained with \(N\) training data require the ability to both store and invert an \(N\times N\) matrix of covariances between observations; this can be a considerable computational challenge.
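In practice this inversion is usually handled through a Cholesky factorisation rather than an explicit inverse; a sketch, assuming an illustrative squared-exponential covariance with a noise term on the diagonal:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
N = 200
x = np.sort(rng.uniform(0, 10, N))
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # SE covariances, l = 1
K += 1e-2 * np.eye(N)                               # noise term on the diagonal

y = rng.normal(size=N)
c = cho_factor(K)          # O(N^3) factorisation, performed once
alpha = cho_solve(c, y)    # each subsequent solve is O(N^2)
```

Storing \(\mat{K}\) costs \(O(N^2)\) memory, which is why large training sets quickly become challenging.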

Gaussian processes can be extended from predicting a single-dimensional output from a single-dimensional input to predicting a multi-dimensional output from a multi-dimensional input cite:2011arXiv1106.6251A,Alvarez2011a,Bonilla2007.

The covariance function defines the similarity of a pair of data points, according to some relationship with suitable properties. The similarity of the input data is assumed to be related to the similarity of the output, and therefore the more similar two inputs are, the more similar their outputs are likely to be.

As such, the form of the covariance function represents prior knowledge about the data, and can encode understanding of effects such as periodicity within the data.

A stationary covariance function is a function of \(\vec{x} - \vec{x}'\), and is thus invariant to translations in the input space.

If a covariance function is a function of the form \(f(|\vec{x} - \vec{x}'|)\) then it is isotropic, and invariant under all rigid motions.

A covariance function which is both stationary and isotropic can be
expressed as a function of a single variable,
\(r = | \vec{x} - \vec{x}' |\), and is known as a abbr:rbf. Functions of
the form \(k : (\vec{x}, \vec{x}') \to \mathbb{C}\), for two vectors
\(\vec{x}, \vec{x}' \in \mathcal{X}\), are often known as *kernels*,
and I will frequently refer interchangeably to covariance functions and
kernels where the covariance function has this form.

For a set of points \(\setbuilder{ \vec{x}_{i} | i = 1, \dots, n }\),
a kernel, \(k\), can be used to construct the Gram matrix,
\(K_{ij} = k(\vec{x}_{i}, \vec{x}_{j})\). If the kernel is also a covariance
function then \(\mat{K}\) is known as a *covariance matrix*.

For a kernel to be a valid covariance function for a abbr:gp it must produce a positive semidefinite covariance matrix \(\mat{K}\). Such a matrix, \(\mat{K} \in \mathbb{R}^{n \times n}\) must satisfy \(\vec{x}^{\transpose} \mat{K} \vec{x} \geq 0\) for all \(\vec{x} \in \mathbb{R}^{n}\).
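This property can be checked numerically for a candidate kernel by examining the eigenvalues of a Gram matrix it produces; a sketch with an illustrative squared-exponential kernel and input grid:

```python
import numpy as np

def k_se(a, b, l=1.0):
    # Illustrative squared-exponential kernel on scalar inputs
    return np.exp(-0.5 * (a - b) ** 2 / l ** 2)

def gram(k, xs):
    # Gram matrix K_ij = k(x_i, x_j)
    return np.array([[k(xi, xj) for xj in xs] for xi in xs])

xs = np.linspace(0, 5, 20)
K = gram(k_se, xs)

# A symmetric matrix is positive semidefinite iff every eigenvalue is >= 0;
# eigvalsh is appropriate because K is symmetric by construction
min_eig = np.linalg.eigvalsh(K).min()
```

A negative `min_eig` (beyond floating-point error) would indicate the kernel is not a valid covariance function.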

One of the most frequently encountered covariance functions in the
literature is the abbr:se covariance function cite:gpr.book.rw. Perhaps
as a result of its near-ubiquity, this kernel is known under a number of
similar, but confusing, names (which are often inaccurate). These include
the *exponential quadratic*, *quadratic exponential*, *squared
exponential*, and even *Gaussian* covariance function.

The reason for this is its form, which closely resembles that of the Gaussian function:

\[ k_{\mathrm{SE}}(r) = \exp\left( - \frac{r^{2}}{2 l^{2}} \right), \]

for \(r\) the Euclidean distance between two data, and \(l\) a scale factor associated with the axis along which the data are defined.

The squared exponential function imposes strong smoothness constraints on the model, as it is infinitely differentiable.

The scale factor, \(l\), in ref:eq:gp:kernels:se, also known as its
*scale-length*, defines the size of the effects within the process. This
characteristic length-scale can be understood cite:adler1976,gpr.book.rw
in terms of the number of times the abbr:gp should cross some given
level (for example, zero). Indeed, for a abbr:gp with a covariance
function \(k\) which has well-defined first and second derivatives,
the expected number of times, \(N_{u}\), the process will cross a
value \(u\) per unit length is

\[ \mathbb{E}[N_{u}] = \frac{1}{2 \pi} \sqrt{\frac{-k''(0)}{k(0)}} \exp\left( - \frac{u^{2}}{2 k(0)} \right). \]

A zero-mean abbr:gp which has an abbr:se covariance structure will then cross zero \(1/(2 \pi l)\) times per unit length, on average.

Examples of the squared exponential covariance function, and of draws from a Gaussian process prior which uses this covariance function are plotted in figure ref:fig:gp:covariance:overviews:se for a variety of different scale lengths.
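Draws like these can be produced by sampling \(\vec{f} \sim \mathcal{N}(\vec{0}, \mat{K})\) through a Cholesky factor of the covariance matrix; a sketch, with illustrative grid, seed, and jitter choices:

```python
import numpy as np

def k_se(r, l):
    # Squared-exponential covariance as a function of distance r
    return np.exp(-r ** 2 / (2 * l ** 2))

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 200)
r = np.abs(x[:, None] - x[None, :])      # matrix of pairwise distances

draws = {}
for l in (0.5, 1.0, 2.0):
    K = k_se(r, l) + 1e-6 * np.eye(len(x))   # jitter for numerical stability
    # f ~ N(0, K): multiply a Cholesky factor of K by standard normal draws
    draws[l] = np.linalg.cholesky(K) @ rng.normal(size=len(x))
```

The draw with the shortest scale length varies most rapidly, as the zero-crossing argument above suggests.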

For data which are not generated by a smooth function, a suitable covariance function may be the exponential covariance function, \(k_{\mathrm{EX}}\), which is defined

\[ k_{\mathrm{EX}}(r) = \exp\left( - \frac{r}{l} \right), \]

where \(r\) is the pairwise distance between data and \(l\) is a length scale, as in equation ref:eq:gp:kernels:se.

Examples of the exponential covariance function, and of draws from a Gaussian process prior which uses this covariance function are plotted in figure ref:fig:gp:covariance:overviews:ex for a variety of different scale lengths.
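The same sampling approach applies with the exponential covariance, whose draws are continuous but visibly rougher; the parameters here are illustrative:

```python
import numpy as np

def k_ex(r, l):
    # Exponential covariance: continuous but non-smooth sample paths
    return np.exp(-r / l)

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 300)
r = np.abs(x[:, None] - x[None, :])

K = k_ex(r, l=1.0) + 1e-8 * np.eye(len(x))               # jittered covariance
draw = np.linalg.cholesky(K) @ rng.normal(size=len(x))   # one prior draw
```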

For data generated by functions which are smooth, but not necessarily infinitely differentiable, we may turn to the Matérn family of covariance functions, which take the form

\[ k_{\mathrm{M}}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2 \nu} r}{l} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2 \nu} r}{l} \right), \]

for \(K_{\nu}\) the modified Bessel function of the second kind, and \(\Gamma\) the gamma function. As with the previous two covariance functions, \(l\) is a scale length parameter, and \(r\) the distance between two data. A abbr:gp which has a Matérn covariance function will be \((\lceil \nu \rceil - 1)\)-times differentiable.

While determining an appropriate value of \(\nu\) during the
training of the abbr:gp is possible, it is common to select a value *a
priori* for this quantity. \(\nu=3/2\) and \(\nu=5/2\) are
common choices as \(K_{\nu}\) can be determined simply, and the
covariance functions are analytic.

The case with \(\nu=3/2\), commonly referred to as a Matérn-\(3/2\) kernel then becomes

\[ k_{\mathrm{M32}}(r) = \left( 1 + \frac{\sqrt{3} r}{l} \right) \exp\left( - \frac{\sqrt{3} r}{l} \right). \]

Examples of this covariance function, and example draws from a abbr:gp using it as a covariance function are plotted in figure ref:fig:gp:kernels:m32.
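A sketch of this kernel as a function of distance (the evaluation grid is an illustrative choice):

```python
import numpy as np

def k_m32(r, l=1.0):
    # Matern-3/2 covariance: once-differentiable sample paths
    a = np.sqrt(3) * r / l
    return (1 + a) * np.exp(-a)

# The covariance is 1 at zero separation and decays monotonically with r
rs = np.linspace(0.0, 5.0, 50)
vals = k_m32(rs)
```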

Similarly, the Matérn-\(5/2\) is the case where \(\nu = 5/2\), taking the form

\[ k_{\mathrm{M52}}(r) = \left( 1 + \frac{\sqrt{5} r}{l} + \frac{5 r^{2}}{3 l^{2}} \right) \exp\left( - \frac{\sqrt{5} r}{l} \right). \]

Again, examples of this covariance function, and example draws from a abbr:gp using it as a covariance function are plotted in figure ref:fig:gp:kernels:m52.

Data may also be generated from functions with variation on multiple
scales. One approach to modelling such data is to use a abbr:gp with
**rational quadratic** covariance. This covariance function represents a
scale mixture of abbr:rbf covariance functions, each with a different
characteristic length scale. The rational quadratic covariance function
is defined as

\[ k_{\mathrm{RQ}}(r) = \left( 1 + \frac{r^{2}}{2 \alpha l^{2}} \right)^{-\alpha}, \]

where \(\alpha\) is a parameter which controls the weighting of small-scale compared to large-scale variations, and \(l\) and \(r\) are the overall length scale of the covariance and the distance between two data respectively. Examples of this function, at a variety of different length scales and \(\alpha\) values, and draws from abpl:gp which use these functions are plotted in figure ref:fig:gp:kernels:rq.
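The scale-mixture interpretation implies that as \(\alpha \to \infty\) the rational quadratic covariance collapses onto a single squared-exponential; this limit can be checked numerically (the \(\alpha\) value and grid are illustrative choices):

```python
import numpy as np

def k_rq(r, l, alpha):
    # Rational quadratic: a scale mixture of SE kernels
    return (1 + r ** 2 / (2 * alpha * l ** 2)) ** (-alpha)

def k_se(r, l):
    return np.exp(-r ** 2 / (2 * l ** 2))

r = np.linspace(0, 3, 50)
# For very large alpha the RQ kernel is numerically indistinguishable from SE
gap = np.max(np.abs(k_rq(r, 1.0, 1e6) - k_se(r, 1.0)))
```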

This summary of potential covariance functions for use with a abbr:gp is far from complete (see cite:gpr.book.rw for a more detailed list). However, these four can be used or combined to produce highly flexible regression models, since sums and products of valid covariance functions are themselves valid covariance functions.
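That closure under addition and multiplication can be sketched as follows; the periodic kernel here is an illustrative extra ingredient, not one of the four described above:

```python
import numpy as np

def k_se(r, l=3.0):
    return np.exp(-r ** 2 / (2 * l ** 2))

def k_per(r, l=1.0, p=2.0):
    # Illustrative periodic kernel (an assumption, not covered in the text)
    return np.exp(-2 * np.sin(np.pi * r / p) ** 2 / l ** 2)

x = np.linspace(0, 10, 100)
r = np.abs(x[:, None] - x[None, :])

K_sum = k_se(r) + k_per(r)    # additive combination of independent effects
K_prod = k_se(r) * k_per(r)   # product: a locally-periodic covariance

# Both combinations remain positive semidefinite covariance matrices
min_eigs = [np.linalg.eigvalsh(K).min() for K in (K_sum, K_prod)]
```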