This post is mainly some notes about linear algebra, the Cholesky decomposition, and a way of parametrising the multivariate normal which might be more efficient in some cases. In general it is best to use existing implementations of things like this - this post is just a learning exercise.

The multivariate normal covariance matrix \(\Sigma\) is symmetric positive semi-definite, which means that it can be written as:

\begin{equation*} \Sigma = L L^T \end{equation*}

where \(L\) is lower triangular. This is known as the Cholesky decomposition, and it is available in any half-decent linear algebra library, for example numpy.linalg.cholesky in Python or chol in R.

That means that one easy way to create a random positive semi-definite matrix is to start with \(L\):

import numpy as np

np.random.seed(42)

# random lower triangular matrix, used to build Sigma = L L^T
L = np.tril(np.random.normal(0.0, 1.0, [4, 4]))
Sigma = L @ L.T
# recover a Cholesky factor of Sigma
chol = np.linalg.cholesky(Sigma)

print("L:\n{}".format(L))
print("chol:\n{}".format(chol))
L:
[[ 0.49671415  0.          0.          0.        ]
 [-0.23415337 -0.23413696  0.          0.        ]
 [-0.46947439  0.54256004 -0.46341769  0.        ]
 [ 0.24196227 -1.91328024 -1.72491783 -0.56228753]]
chol:
[[ 0.49671415  0.          0.          0.        ]
 [-0.23415337  0.23413696  0.          0.        ]
 [-0.46947439 -0.54256004  0.46341769  0.        ]
 [ 0.24196227  1.91328024  1.72491783  0.56228753]]

But we see here that L and chol are not the same - some of the signs have switched. To ensure that this doesn't happen (i.e. so that the decomposition is unique), we need to ensure that the diagonal of \(L\) is positive (see the covariance matrices section of the Stan manual):

np.random.seed(42)

d = 4
L = np.tril(np.random.normal(0.0, 1.0, [d, d]))
# force the diagonal to be positive so the Cholesky factor is unique
np.fill_diagonal(L, np.abs(np.diag(L)))
Sigma = L @ L.T
chol = np.linalg.cholesky(Sigma)

print("L:\n{}".format(L))
print("chol:\n{}".format(chol))
L:
[[ 0.49671415  0.          0.          0.        ]
 [-0.23415337  0.23413696  0.          0.        ]
 [-0.46947439  0.54256004  0.46341769  0.        ]
 [ 0.24196227 -1.91328024 -1.72491783  0.56228753]]
chol:
[[ 0.49671415  0.          0.          0.        ]
 [-0.23415337  0.23413696  0.          0.        ]
 [-0.46947439  0.54256004  0.46341769  0.        ]
 [ 0.24196227 -1.91328024 -1.72491783  0.56228753]]

In addition to allowing us to easily create random covariance matrices, the Cholesky parametrisation of the multivariate normal PDF is much more efficient to evaluate. The typical PDF you see is:

\begin{equation*} p(y | \mu, \Sigma) = \frac{1}{(2 \pi)^{d / 2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)} \end{equation*}

where \(d\) is the dimension of the random vector. Implementing this directly gives something like:

import scipy.stats as stats

def multivariate_normal_pdf(y, mu, Sigma):
    # direct implementation: computes both the determinant and the inverse of Sigma
    d = np.shape(mu)[0]
    sqrt_det = np.sqrt(np.linalg.det(Sigma))
    two_pi_power = np.power((2.0 * np.pi), d / 2.0)
    y_minus_mu = y - mu
    exp = np.exp(-0.5 * y_minus_mu.T @ np.linalg.inv(Sigma) @ y_minus_mu)
    return 1.0 / (two_pi_power * sqrt_det) * exp

y = np.random.normal(0.0, 0.1, [d, 1])
mu = np.random.normal(0.0, 0.1, [d, 1])

me = multivariate_normal_pdf(y, mu, Sigma)[0, 0]
scipy = stats.multivariate_normal.pdf(np.squeeze(y), np.squeeze(mu), Sigma)

print("me:    {:0.8f}".format(me))
print("scipy: {:0.8f}".format(scipy))
me:    0.10220544
scipy: 0.10220544

However, if we work with the Cholesky parametrisation (i.e. use \(L\)) we firstly have:

\begin{equation*} |\Sigma| = |L L^T| = |L| |L^T| = |L|^2 = (\prod_{i=1}^{d} L_{i, i})^2 \end{equation*}

since for square matrices \(A\) and \(B\):

  1. \(|A B| = |A| |B|\)
  2. \(|A^T| = |A|\)
  3. for a triangular matrix \(L\), the determinant is the product of the diagonal entries

see properties of the determinant. Additionally, the PDF only involves \(|\Sigma|^{1/2}\), so we need only calculate \(|\Sigma|^{1/2} = \prod_{i=1}^{d} L_{i, i}\) rather than the determinant itself. In code the square root of the determinant is:

sqrt_Sigma_det = np.sqrt(np.linalg.det(Sigma))
sqrt_L_det = np.prod(np.diag(L))

print("|Sigma|^0.5 = {:0.8f}".format(sqrt_Sigma_det))
print("|L|         = {:0.8f}".format(sqrt_L_det))
|Sigma|^0.5 = 0.03030453
|L|         = 0.03030453

We secondly have:

\begin{equation*} \Sigma^{-1} = (L L^T)^{-1} = L^{-T} L^{-1} \end{equation*}

since for invertible square matrices \(A\) and \(B\) we have \((A B)^{-1} = B^{-1} A^{-1}\) (see properties of invertible matrix). Thus:

\begin{equation*} (y - \mu)^T \Sigma^{-1} (y - \mu) = (y - \mu)^T L^{-T} L^{-1} (y - \mu) = (L^{-1}(y - \mu))^T L^{-1} (y - \mu) \end{equation*}

since \((AB)^T = B^T A^T\) and \((A^T)^{-1} = (A^{-1})^T = A^{-T}\) - see properties of matrix transpose. This term is the square of the Mahalanobis distance, which I learned from looking at the source code of the PyTorch implementation.
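As a quick sanity check of the inverse identity, here is an illustrative snippet reusing the \(\Sigma\) and \(L\) from above (note it forms the inverses explicitly, which is exactly what the Cholesky parametrisation lets us avoid):

Sigma_inv = np.linalg.inv(Sigma)
L_inv = np.linalg.inv(L)

# the two sides of the identity should agree up to floating point error (expect True)
print(np.allclose(Sigma_inv, L_inv.T @ L_inv))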

To calculate the \(L^{-1} (y - \mu)\) term, we can use a LAPACK solver which exploits the fact that \(L\) is triangular. The LAPACK routine dtrtrs finds \(x\) in \(A x = b\) for triangular \(A\) (see e.g. the Intel docs and this Netlib table), so if we set \(A = L\) and \(b = y - \mu\) then it should find the solution \(x = L^{-1} (y - \mu)\) since:

\begin{equation*} L x = (y - \mu) \implies x = L^{-1} (y - \mu) \end{equation*}

Considering just this part in code gives:

import scipy.linalg.lapack as lapack

y_minus_mu = y - mu

# solve the triangular system L x = (y - mu); the solution is x = L^{-1} (y - mu)
L_inv_y_minus_mu, _ = lapack.dtrtrs(L, y_minus_mu, lower=True)
L_mahalanobis_squared = L_inv_y_minus_mu.T @ L_inv_y_minus_mu

Sigma_mahalanobis_squared = y_minus_mu.T @ np.linalg.inv(Sigma) @ y_minus_mu

print("Sigma mahalanobis^2 = {:0.8f}".format(Sigma_mahalanobis_squared[0, 0]))
print("L mahalanobis^2     = {:0.8f}".format(L_mahalanobis_squared[0, 0]))
Sigma mahalanobis^2 = 4.20294853
L mahalanobis^2     = 4.20294853

Putting it all together, a Cholesky-parametrised implementation might look something like:

def multivariate_normal_cholesky_pdf(y, mu, L):
    d = np.shape(mu)[0]
    # |Sigma|^0.5 is just the product of the diagonal of L
    sqrt_det = np.prod(np.diag(L))
    two_pi_power = np.power((2.0 * np.pi), d / 2.0)
    # triangular solve for L^{-1} (y - mu) instead of inverting Sigma
    L_inv_y_minus_mu, _ = lapack.dtrtrs(L, y - mu, lower=True)
    mahalanobis_squared = L_inv_y_minus_mu.T @ L_inv_y_minus_mu
    return 1.0 / (two_pi_power * sqrt_det) * np.exp(-0.5 * mahalanobis_squared)

me_again = multivariate_normal_cholesky_pdf(y, mu, L)[0, 0]

print("me again:    {:0.8f}".format(me_again))
print("scipy again: {:0.8f}".format(scipy))
me again:    0.10220544
scipy again: 0.10220544
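The efficiency claim can also be checked informally with timeit. The snippet below is only a rough sketch - the dimension, the shift added to the diagonal, and the repetition count are arbitrary choices of mine, and the timings will depend on your machine - but it reuses the two implementations above on a larger random covariance matrix:

import timeit

d_big = 50
L_big = np.tril(np.random.normal(0.0, 1.0, [d_big, d_big]))
# keep the diagonal positive and away from zero so Sigma_big is comfortably positive definite
np.fill_diagonal(L_big, np.abs(np.diag(L_big)) + 0.5)
Sigma_big = L_big @ L_big.T

y_big = np.random.normal(0.0, 0.1, [d_big, 1])
mu_big = np.random.normal(0.0, 0.1, [d_big, 1])

t_full = timeit.timeit(lambda: multivariate_normal_pdf(y_big, mu_big, Sigma_big), number=2000)
t_chol = timeit.timeit(lambda: multivariate_normal_cholesky_pdf(y_big, mu_big, L_big), number=2000)

print("full Sigma version: {:0.3f}s".format(t_full))
print("cholesky version:   {:0.3f}s".format(t_chol))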

So there we go, we even had a wee look at LAPACK in the end!