This post is mainly some notes about linear algebra, the Cholesky decomposition, and a way of parametrising the multivariate normal which might be more efficient in some cases. In general it is best to use existing implementations of this sort of thing - this post is just a learning exercise.
The multivariate normal covariance matrix \(\Sigma\) is symmetric and positive semi-definite, which means that it can be written as:
\begin{equation*} \Sigma = L L^T \end{equation*}where \(L\) is lower triangular. This is known as the Cholesky decomposition and is available in any half-decent linear algebra library, for example numpy.linalg.cholesky in Python or chol in R.
That means that one easy way to create a positive semi-definite matrix is to start with \(L\):
import numpy as np

np.random.seed(42)
L = np.tril(np.random.normal(0.0, 1.0, [4, 4]))
Sigma = L @ L.T
chol = np.linalg.cholesky(Sigma)
print("L:\n{}".format(L))
print("chol:\n{}".format(chol))
L:
[[ 0.49671415  0.          0.          0.        ]
 [-0.23415337 -0.23413696  0.          0.        ]
 [-0.46947439  0.54256004 -0.46341769  0.        ]
 [ 0.24196227 -1.91328024 -1.72491783 -0.56228753]]
chol:
[[ 0.49671415  0.          0.          0.        ]
 [-0.23415337  0.23413696  0.          0.        ]
 [-0.46947439 -0.54256004  0.46341769  0.        ]
 [ 0.24196227  1.91328024  1.72491783  0.56228753]]
But we see here that L and chol are not the same - some of the signs have switched. To ensure that this doesn't happen (i.e. that the decomposition is unique), we need the diagonal of \(L\) to be positive (see the covariance matrices section of the Stan manual):
np.random.seed(42)
d = 4
L = np.tril(np.random.normal(0.0, 1.0, [d, d]))
np.fill_diagonal(L, np.abs(np.diag(L)))
Sigma = L @ L.T
chol = np.linalg.cholesky(Sigma)
print("L:\n{}".format(L))
print("chol:\n{}".format(chol))
L:
[[ 0.49671415  0.          0.          0.        ]
 [-0.23415337  0.23413696  0.          0.        ]
 [-0.46947439  0.54256004  0.46341769  0.        ]
 [ 0.24196227 -1.91328024 -1.72491783  0.56228753]]
chol:
[[ 0.49671415  0.          0.          0.        ]
 [-0.23415337  0.23413696  0.          0.        ]
 [-0.46947439  0.54256004  0.46341769  0.        ]
 [ 0.24196227 -1.91328024 -1.72491783  0.56228753]]
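As a quick sanity check (just reusing the L and chol from the snippet above), the two factors should now agree up to floating point precision:

# L and the recovered Cholesky factor should now be identical.
print(np.allclose(L, chol))  # expect True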
In addition to allowing us to easily create random covariance matrices, the Cholesky parametrisation of the multivariate normal PDF is much more efficient to evaluate than the standard form. The typical PDF you see is:
\begin{equation*} p(y | \mu, \Sigma) = \frac{1}{(2 \pi)^{d / 2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)} \end{equation*}where \(d\) is the dimension of the random vector. Implementing this directly gives something like:
import scipy.stats as stats


def multivariate_normal_pdf(y, mu, Sigma):
    d = np.shape(mu)[0]
    sqrt_det = np.sqrt(np.linalg.det(Sigma))
    two_pi_power = np.power((2.0 * np.pi), d / 2.0)
    y_minus_mu = y - mu
    exp = np.exp(-0.5 * y_minus_mu.T @ np.linalg.inv(Sigma) @ y_minus_mu)
    return 1.0 / (two_pi_power * sqrt_det) * exp


y = np.random.normal(0.0, 0.1, [d, 1])
mu = np.random.normal(0.0, 0.1, [d, 1])
me = multivariate_normal_pdf(y, mu, Sigma)[0, 0]
scipy = stats.multivariate_normal.pdf(np.squeeze(y), np.squeeze(mu), Sigma)
print("me: {:0.8f}".format(me))
print("scipy: {:0.8f}".format(scipy))
me: 0.10220544
scipy: 0.10220544
However, if we work with the Cholesky parametrisation (i.e. use \(L\) directly), we firstly have:
\begin{equation*} |\Sigma| = |L L^T| = |L| |L^T| = |L|^2 = (\prod_{i=1}^{d} L_{i, i})^2 \end{equation*}since for square matrices \(A\) and \(B\):
- \(|A B| = |A| |B|\)
- \(|A^T| = |A|\)
- for a triangular matrix \(L\), the determinant is the product of the diagonal entries
see properties of the determinant. Additionally, since the term appearing in the PDF is \(|\Sigma|^{1/2}\), we only need to calculate \(|\Sigma|^{1/2} = \prod_{i=1}^{d} L_{i, i}\). In code, the square root of the determinant is:
sqrt_Sigma_det = np.sqrt(np.linalg.det(Sigma))
sqrt_L_det = np.prod(np.diag(L))
print("|Sigma|^0.5 = {:0.8f}".format(sqrt_Sigma_det))
print("|L| = {:0.8f}".format(sqrt_L_det))
|Sigma|^0.5 = 0.03030453
|L| = 0.03030453
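If you want to convince yourself of the individual determinant properties listed above, a minimal numerical check (reusing the L and Sigma from earlier) might look like:

# |A B| = |A| |B|
print(np.isclose(np.linalg.det(Sigma), np.linalg.det(L) * np.linalg.det(L.T)))
# |A^T| = |A|
print(np.isclose(np.linalg.det(L.T), np.linalg.det(L)))
# for a triangular matrix the determinant is the product of the diagonal
print(np.isclose(np.linalg.det(L), np.prod(np.diag(L))))

All three should print True.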
We secondly have:
\begin{equation*} \Sigma^{-1} = (L L^T)^{-1} = L^{-T} L^{-1} \end{equation*}since for invertible square matrices \(A\) and \(B\), \((A B)^{-1} = B^{-1} A^{-1}\) (see properties of invertible matrices). Thus:
\begin{equation*} (y - \mu)^T \Sigma^{-1} (y - \mu) = (y - \mu)^T L^{-T} L^{-1} (y - \mu) = (L^{-1}(y - \mu))^T L^{-1} (y - \mu) \end{equation*}since \((AB)^T = B^T A^T\) and \((A^T)^{-1} = (A^{-1})^T = A^{-T}\) (see properties of the matrix transpose). This term is the square of the Mahalanobis distance, which I learned from looking at the source code of the PyTorch implementation.
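As a quick numerical check of the inverse identity (a minimal sketch, again reusing the L and Sigma defined earlier):

# Check that Sigma^{-1} = L^{-T} L^{-1}.
L_inv = np.linalg.inv(L)
print(np.allclose(np.linalg.inv(Sigma), L_inv.T @ L_inv))  # expect True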
To calculate the \(L^{-1} (y - \mu)\) term, we can use a LAPACK solver which exploits the fact that \(L\) is triangular. The LAPACK routine dtrtrs finds \(x\) in \(A x = b\), exploiting that \(A\) is triangular (see e.g. the Intel docs and this netlib table), so if we set \(A = L\) and \(b = y - \mu\) it finds the solution \(x = L^{-1} (y - \mu)\).
Considering just this part in code gives:
import scipy.linalg.lapack as lapack

y_minus_mu = y - mu
L_inv_y_minus_mu, _ = lapack.dtrtrs(L, y_minus_mu, lower=True)
L_mahalanobis_squared = L_inv_y_minus_mu.T @ L_inv_y_minus_mu
Sigma_mahalanobis_squared = y_minus_mu.T @ np.linalg.inv(Sigma) @ y_minus_mu
print("Sigma mahalanobis^2 = {:0.8f}".format(Sigma_mahalanobis_squared[0, 0]))
print("L mahalanobis^2 = {:0.8f}".format(L_mahalanobis_squared[0, 0]))
Sigma mahalanobis^2 = 4.20294853
L mahalanobis^2 = 4.20294853
Putting it all together, a Cholesky-parametrised implementation might look something like:
def multivariate_normal_cholesky_pdf(y, mu, L):
    d = np.shape(mu)[0]
    sqrt_det = np.prod(np.diag(L))
    two_pi_power = np.power((2.0 * np.pi), d / 2.0)
    L_inv_y_minus_mu, _ = lapack.dtrtrs(L, y - mu, lower=True)
    mahalanobis_squared = L_inv_y_minus_mu.T @ L_inv_y_minus_mu
    return 1.0 / (two_pi_power * sqrt_det) * np.exp(-0.5 * mahalanobis_squared)


me_again = multivariate_normal_cholesky_pdf(y, mu, L)[0, 0]
print("me again: {:0.8f}".format(me_again))
print("scipy again: {:0.8f}".format(scipy))
me again: 0.10220544
scipy again: 0.10220544
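As a rough check of the efficiency claim (a sketch only - the exact numbers will depend on your machine and on the dimension, and for small d the difference can be modest), timeit can compare the two implementations:

import timeit

# Time the full-covariance and Cholesky-parametrised implementations,
# reusing y, mu, Sigma and L from the snippets above.
n = 10000
t_sigma = timeit.timeit(lambda: multivariate_normal_pdf(y, mu, Sigma), number=n)
t_chol = timeit.timeit(lambda: multivariate_normal_cholesky_pdf(y, mu, L), number=n)
print("Sigma parametrisation:    {:0.4f}s".format(t_sigma))
print("Cholesky parametrisation: {:0.4f}s".format(t_chol))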
So there we go, we even had a wee look at LAPACK in the end!