runlmc.models.interpolated_llgp module

class runlmc.models.interpolated_llgp.InterpolatedLLGP(Xs, Ys, normalize=True, lo=None, hi=None, m=None, name='lmc', metrics=False, prediction='on-the-fly', max_procs=None, trace_iterations=15, tolerance=0.0001, functional_kernel=None)

Bases: runlmc.models.multigp.MultiGP

The main class of this package, InterpolatedLLGP, implements linearithmic Gaussian Process learning in the multi-output case. See the paper on arXiv.

Upon construction, this class assumes ownership of its parameters and does not account for changes in their values.

For a dataset of inputs Xs across multiple outputs Ys, let \(X\) refer to the concatenation of Xs. According to the functional specification of the LMC kernel by functional_kernel (see documentation in runlmc.lmc.functional_kernel.FunctionalKernel), we can create the covariance matrix for a multi-output GP model applied to all pairs of \(X\), resulting in \(K_{X,X}\).

The point of this class is to vary hyperparameters of \(K\), the FunctionalKernel given by functional_kernel, until the model log likelihood is as large as possible.

To do this efficiently, this class uses the SKI approximation, which shares a single grid \(U\) as the input array for all the outputs. Then \(K_{X,X}\) is approximated by the kernel \(K_{\text{SKI}}\), which is interpolated from the grid kernel using sparse interpolation matrices \(W\), as directed in Thoughts on Massively Scalable Gaussian Processes by Wilson, Dann, and Nickisch.

\[K_{X,X}\approx K_{\text{SKI}} = W K_{U,U} W^\top + \boldsymbol\epsilon I\]

Above, \(K_{U,U}\) is a structured kernel over a grid \(U\). This grid is specified by lo, hi, and m.
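The approximation above can be illustrated with a minimal NumPy sketch. It is not the library's implementation: it assumes a single output, a one-dimensional RBF kernel, a scalar noise term standing in for \(\boldsymbol\epsilon\), and linear interpolation weights (a simplification chosen for brevity). The helper names `rbf` and `linear_interp_matrix` are hypothetical, and \(W\) is stored densely here only for readability.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Exact squared-exponential kernel between 1-D point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def linear_interp_matrix(x, grid):
    """n-by-m matrix W with two nonzeros per row (linear interpolation).

    Assumes `grid` is evenly spaced and covers every point in `x`.
    """
    step = grid[1] - grid[0]
    # Index of the grid point immediately to the left of each x.
    left = np.clip(np.floor((x - grid[0]) / step).astype(int), 0, len(grid) - 2)
    frac = (x - grid[left]) / step  # position within the grid cell
    W = np.zeros((len(x), len(grid)))
    rows = np.arange(len(x))
    W[rows, left] = 1.0 - frac
    W[rows, left + 1] = frac
    return W

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, size=50))  # training inputs X
grid = np.linspace(-0.5, 10.5, 200)           # grid U, slightly wider than the data
noise = 1e-2                                  # scalar stand-in for epsilon

W = linear_interp_matrix(x, grid)
K_ski = W @ rbf(grid, grid) @ W.T + noise * np.eye(len(x))
K_exact = rbf(x, x) + noise * np.eye(len(x))

# Largest entrywise gap between the interpolated and exact kernels.
max_abs_error = np.abs(K_ski - K_exact).max()
```

Refining the grid shrinks `max_abs_error`, at the cost of a larger \(K_{U,U}\).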

The functionality for the various prediction modes is summarized below.

‘on-the-fly’ - Use matrix-free inversion to compute the covariance for the entire set of points on which we are predicting. This means that variance prediction takes \(O(n \log n)\) time per test point, where Xs has \(n\) datapoints total. This should be preferred for small test sets.
‘precompute’ - Compute an auxiliary predictive variance matrix for the grid points, then cheaply re-use that work for prediction. This is an up-front \(O(n^2 \log n)\) payment for \(O(1)\) predictive variance per test point afterwards. This is not available if using split kernels (i.e., different active dimensions for different kernels).
‘exact’ - Use the exact Cholesky-based algorithm (not matrix-free), with \(O(n^3)\) runtime up-front and then \(O(n^2)\) per query.

Note that ‘on-the-fly’ and ‘precompute’ can be parallelized over the number of test points and training points, respectively.
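For intuition about the cost split in the ‘exact’ mode, here is a hedged NumPy sketch of Cholesky-based GP prediction. It assumes a single output and an RBF kernel with unit prior variance; the names `rbf` and `predict` are hypothetical and do not come from this package.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between 1-D point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0.0, 5.0, size=40))
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(40)
noise = 0.1 ** 2

# Up-front O(n^3) work: factor the training covariance once.
K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y

def predict(x_star):
    """Posterior mean and variance of the latent function at one test point."""
    k_star = rbf(x_train, np.array([x_star]))[:, 0]
    mean = k_star @ alpha
    # np.linalg.solve does not exploit the triangular structure of L;
    # a triangular solver such as scipy.linalg.solve_triangular would
    # make this per-query step O(n^2).
    v = np.linalg.solve(L, k_star)
    var = 1.0 - v @ v  # the prior variance k(x*, x*) is 1 for this kernel
    return mean, var

mean, var = predict(2.5)
```

The factorization is shared by every query, which is why the per-query cost drops once it has been paid.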

Parameters:
Xs – input observations; a list of numpy arrays, where the \(i\)-th array is the design matrix for the inputs to output \(i\). If the \(i\)-th input has \(n_i\) data points, this array can have shape \(n_i\) or \(n_i\times P\) for input dimension \(P\), with the former interpreted as \(P=1\).
Ys – output observations; a list of one-dimensional numpy arrays whose lengths match the number of rows in the corresponding entries of Xs.
normalize – optional normalization for outputs Ys. Predictions will be un-normalized.
lo – lexicographically smallest point in the inducing point grid (by default, a bit less than the minimum of the input). For multidimensional inputs this should be a vector.
hi – lexicographically largest point in the inducing point grid (by default, a bit more than the maximum of the input). For multidimensional inputs this should be a vector.
m – number of inducing points to use. For multidimensional inputs this should be a vector indicating how many grid points there should be along each dimension. The total number of points used is then np.prod(m). By default, m is a constant array of dimension \(P\), the input dimension, of size equal to the average input sequence length.
name (str) –
metrics – whether to record optimization metrics during optimization (runs the exact solution alongside this one, so it may be slow).
prediction – one of ‘matrix-free’, ‘on-the-fly’, ‘precompute’, ‘exact’, ‘sample’.
max_procs – maximum number of processes to use for parallelism; defaults to the CPU count.
functional_kernel – a `runlmc.lmc.functional_kernel.FunctionalKernel` determining \(K\).
trace_iterations – number of iterations to be used in the approximate trace algorithm.
Raises:	`ValueError` if Xs and Ys lengths do not match.
Raises:	`ValueError` if normalizing and any Ys have no variance, or if values in Xs have multiple identical values.
Variables:	metrics – the `runlmc.lmc.metrics.Metrics` instance associated with the model
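The expected data layout for Xs, Ys, lo, hi, and m can be sketched with NumPy alone. The helper `check_shapes` is hypothetical and only mirrors the documented requirements; the way the grid `U` is enumerated below is likewise an assumption made for illustration, not a statement about how the class builds its grid internally.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two outputs observed at different inputs, with P = 2 input dimensions.
Xs = [rng.uniform(0.0, 1.0, size=(30, 2)),  # n_0 = 30
      rng.uniform(0.0, 1.0, size=(50, 2))]  # n_1 = 50
Ys = [rng.standard_normal(30), rng.standard_normal(50)]

def check_shapes(Xs, Ys):
    """Hypothetical helper mirroring the documented shape requirements."""
    if len(Xs) != len(Ys):
        raise ValueError("Xs and Ys lengths do not match")
    for X, Y in zip(Xs, Ys):
        if Y.ndim != 1 or X.shape[0] != Y.shape[0]:
            raise ValueError("each Y must be 1-D with one entry per row of X")

check_shapes(Xs, Ys)

# A per-dimension grid: lo and hi are vectors, m counts points per dimension.
lo = np.array([-0.1, -0.1])  # a bit less than the minimum input
hi = np.array([1.1, 1.1])    # a bit more than the maximum input
m = np.array([20, 25])
total_grid_points = int(np.prod(m))

# Enumerate the grid in lexicographic order (illustrative assumption).
axes = [np.linspace(l, h, k) for l, h, k in zip(lo, hi, m)]
U = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, len(m))
```

With this ordering the first row of `U` equals `lo` and the last row equals `hi`, matching their description as the lexicographically smallest and largest grid points.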

EVAL_NORM = inf

K()

Warning

This generates the entire kernel, a quadratic operation in memory and time.

Returns:	\(K_{\text{SKI}}\), the approximation of the exact kernel.

log_det_K()

Returns:	an upper bound of the approximate log determinant. This uses \(K_\text{SKI}\) to find an approximate upper bound for \(\log\det K_{\text{exact}}\).

log_likelihood()

Returns:	the log marginal likelihood of the model, \(\log p(\mathbf{y})\), which is the objective function being optimized.
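For reference, the log marginal likelihood of a zero-mean Gaussian with covariance \(K\) combines a quadratic term and a log determinant. The sketch below evaluates that standard formula densely with a Cholesky factorization. It is an illustration under the zero-mean assumption, and the function name is hypothetical; it does not reproduce the approximate, matrix-free computation this class performs.

```python
import numpy as np

def gaussian_log_marginal_likelihood(K, y):
    """log N(y | 0, K) for a symmetric positive-definite covariance K."""
    n = len(y)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    normal_quadratic = y @ alpha                          # y^T K^{-1} y
    log_det = 2.0 * np.log(np.diag(L)).sum()              # log det K
    return -0.5 * (normal_quadratic + log_det + n * np.log(2.0 * np.pi))

# Small check case: K = 2 I and y = (1, 1, 1).
value = gaussian_log_marginal_likelihood(2.0 * np.eye(3), np.ones(3))
```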

normal_quadratic()

If the flattened (stacked) outputs are written as \(\textbf{y}\), this returns \(\textbf{y}^\top K_{\text{SKI}}^{-1}\textbf{y}\).

Returns:	the normal quadratic term for the current outputs Ys.
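One matrix-free way to evaluate a quadratic of this form is to solve \(K_{\text{SKI}}\mathbf{x}=\mathbf{y}\) with conjugate gradients, using only matrix-vector products. The sketch below is an illustration of that idea and not necessarily the solver this method uses; `conjugate_gradient` and `ski_matvec` are hypothetical names, and `W` and `K_UU` are random stand-ins rather than real interpolation and grid-kernel matrices.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A given only v -> A v."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        step = rs_old / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        # Stop once the residual is small relative to the right-hand side.
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

rng = np.random.default_rng(3)
n, m = 60, 15
W = rng.uniform(size=(n, m))   # stand-in for the interpolation matrix
B = rng.standard_normal((m, m))
K_UU = B @ B.T                 # stand-in for the grid kernel (positive semidefinite)
noise = 0.5
y = rng.standard_normal(n)

def ski_matvec(v):
    # Never forms the n-by-n matrix: applies W^T, K_UU, and W in turn.
    return W @ (K_UU @ (W.T @ v)) + noise * v

normal_quadratic = y @ conjugate_gradient(ski_matvec, y)
```

Because each product only touches `W` and `K_UU`, the cost per iteration depends on how cheaply those factors can be applied rather than on the size of the full kernel.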

optimize(**kwargs)

Optimize the model using log_likelihood() with a gradient descent method that involves the priors.

kwargs are passed to the optimizer. See parameters for handled keywords.

Parameters:	optimizer – A `paramz.optimization.Optimizer`. Pre-built ones available in `runlmc.models.optimization`.

parameters_changed()

This method is called automatically when linked parameters change, which they may during the optimization process.

Classes should update their posterior information, log likelihood, and gradients when this happens, such that _raw_predict(), log_likelihood(), and gradient() are consistent with the new parameters.