As far as I can think right now (not very far), there are two things you typically want to do once you have a model and have done some inference:
- Model comparison
- Model checking
The former lets you evaluate the relative strengths of models and answer questions like "should I use this model or that model?". The latter lets you look for problems in your model and helps answer "is my model any good?".
In this post I am going to focus on the latter, and consider a basic posterior predictive check (PPC).
I've used a simple model (disclaimer: I already know plenty of things that are wrong with it) concerned with predicting the number of goals scored by competing soccer teams in the English Premier League over the 2010/2011 to 2018/2019 seasons. The model assumes that the number of goals scored by each team follows an independent Poisson distribution, with rates dependent on the strength of the teams and a home advantage. The data is downloaded in CSV format from www.football-data.co.uk.
I will not discuss the model in any more detail beyond saying that there is an "attack" and a "defence" parameter for each team, and that the scoring rates are defined by:
\begin{align*}
\log(\lambda_h) &= \alpha + \gamma + att_h - def_a \\
\log(\lambda_a) &= \alpha + att_a - def_h
\end{align*}

for the home (\(h\)) and away (\(a\)) teams respectively, where \(\alpha\) is an intercept term, \(\gamma\) is the home advantage term, \(att_k\) is the attacking strength of team \(k\), and \(def_k\) is the defensive strength of team \(k\).
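To make the parameterisation concrete, here is a tiny numeric sketch of how the log-linear rates turn into expected goal counts. The parameter values are completely made up for illustration and do not come from the fitted model:

```r
# Made-up parameter values, purely for illustration.
alpha <- 0.1    # intercept
gamma <- 0.3    # home advantage
att_h <- 0.4    # home team's attack
def_h <- 0.2    # home team's defence
att_a <- 0.1    # away team's attack
def_a <- -0.1   # away team's defence

# The Poisson rates are the exponentiated linear predictors.
lambda_h <- exp(alpha + gamma + att_h - def_a)  # expected home goals, about 2.46
lambda_a <- exp(alpha + att_a - def_h)          # expected away goals, exactly 1
```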
This type of model has been discussed loads; for anyone interested, please read Dixon and Coles (1997), which I can't believe is now over 20 years old! Incidentally, Stuart Coles (one of the authors of the linked paper) now runs a blog - Smartodds Loves Statistics - which is worth a read.
Anyway, here is the model coded up in Stan, plus some prior stuff (nothing particularly interesting or informative) that I haven't bothered to discuss since it is not the point of this post:
```stan
data {
  int<lower = 0> n;
  int<lower = 0> n_teams;
  int<lower = 1, upper = n_teams> home_team[n];
  int<lower = 1, upper = n_teams> away_team[n];
  int<lower = 0> home_goals[n];
  int<lower = 0> away_goals[n];
}

parameters {
  real alpha;
  real gamma;
  vector[n_teams] att;
  vector[n_teams] def;
  real<lower = 0.0> sigma_teams;
}

transformed parameters {
  vector[n] home_log_rate = alpha + gamma + att[home_team] - def[away_team];
  vector[n] away_log_rate = alpha + att[away_team] - def[home_team];
}

model {
  alpha ~ normal(0.0, 1.0);
  gamma ~ normal(0.0, 1.0);
  att ~ normal(0.0, sigma_teams);
  def ~ normal(0.0, sigma_teams);
  sigma_teams ~ normal(0.0, 1.0);

  home_goals ~ poisson_log(home_log_rate);
  away_goals ~ poisson_log(away_log_rate);
}

generated quantities {
  int rep_home_goals[n] = poisson_log_rng(home_log_rate);
  int rep_away_goals[n] = poisson_log_rng(away_log_rate);
}
```
Let's also have a brief look at the data:
```r
library(readr)
library(dplyr)

soccer_data <- lapply(10:18, function(x) {
  season <- paste0(x, x + 1)
  url <- file.path("http://www.football-data.co.uk/mmz4281", season, "E0.csv")
  read_csv(url, col_types = cols())
}) %>%
  bind_rows()

soccer_teams <- sort(unique(c(soccer_data$HomeTeam, soccer_data$AwayTeam)))

soccer_data <- soccer_data %>%
  mutate(home_team = factor(HomeTeam, soccer_teams),
         away_team = factor(AwayTeam, soccer_teams),
         home_goals = as.integer(FTHG),
         away_goals = as.integer(FTAG)) %>%
  select(matches("^home_|^away_")) %>%
  filter(complete.cases(.))

print(soccer_data %>% head())
```
| home_team | away_team | home_goals | away_goals |
|-----------|-----------|------------|------------|
| Aston Villa | West Ham | 3 | 0 |
| Blackburn | Everton | 1 | 0 |
| Bolton | Fulham | 0 | 0 |
| Chelsea | West Brom | 6 | 0 |
| Sunderland | Birmingham | 2 | 2 |
| Tottenham | Man City | 0 | 0 |
All looks good… Now we can perform inference:
```r
library(rstan)

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

NUM_CHAINS <- 3
NUM_SAMPLES <- 1000
NUM_WARMUP <- 1000

# stan_file holds the path to the Stan program listed above.
model <- stan_model(stan_file)
fit <- sampling(model,
                seed = 666,
                chains = NUM_CHAINS,
                warmup = NUM_WARMUP,
                iter = NUM_WARMUP + NUM_SAMPLES,
                # Drop the per-match log rates from the output to save memory.
                pars = c("home_log_rate", "away_log_rate"),
                include = FALSE,
                data = list(n = nrow(soccer_data),
                            n_teams = length(soccer_teams),
                            home_team = unclass(soccer_data$home_team),
                            away_team = unclass(soccer_data$away_team),
                            home_goals = soccer_data$home_goals,
                            away_goals = soccer_data$away_goals))
```
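Once sampling is done, a cheap first sanity check is to print just a few scalar parameters (a sketch, so we don't drown in output):

```r
# Print only the non team-specific parameters, along with their
# Rhat and effective sample size diagnostics.
print(fit, pars = c("alpha", "gamma", "sigma_teams"))
```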
It's a bit daunting to look at all of that data in a table (i.e. a full `print(fit)`), and it also doesn't show the team names. Luckily, it is very easy to make some nice plots using bayesplot if we transform the `fit` object to an array and rename some of the dimensions. Here are the non team-specific parameter estimates:
```r
library(bayesplot)
library(ggplot2)

posterior <- as.array(fit)
nmes <- dimnames(posterior)$parameters
dimnames(posterior)$parameters[grep("^att\\[", nmes)] <- paste0("att_", soccer_teams)
dimnames(posterior)$parameters[grep("^def\\[", nmes)] <- paste0("def_", soccer_teams)

mcmc_pairs(posterior, pars = c("alpha", "gamma", "sigma_teams"))
```
The team attack parameters:
```r
mcmc_intervals(posterior, regex_pars = "^att_") + xlab("Attack")
```
The team defence parameters:
```r
mcmc_intervals(posterior, regex_pars = "^def_") + xlab("Defence")
```
So this all looks fairly sensible. Great. I also had a look at Diagnostic plots for the No-U-Turn-Sampler and some other things, but I don't want to clutter this post with them.
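For anyone who wants to reproduce that step, here's a minimal sketch of the kind of NUTS diagnostics I mean, using rstan's built-in checks plus the bayesplot helpers (assuming the `fit` object from above):

```r
# Quick pass/fail summary of divergences, tree depth and E-BFMI.
rstan::check_hmc_diagnostics(fit)

# Sampler diagnostics in the form the bayesplot NUTS plots expect.
np <- nuts_params(fit)
lp <- log_posterior(fit)

mcmc_nuts_divergence(np, lp)  # divergent transitions against the log-posterior
mcmc_nuts_energy(np)          # energy plot (E-BFMI)
```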
The point of this post was to consider a PPC, so here it is. The idea is to generate replicated datasets drawn from the posterior predictive distribution and see if they are consistent with our observed data. The observed data/response is typically denoted \(y\) and the simulated replications \(y_{rep}\). This is why, at the bottom of the Stan code above, we use the `generated quantities` block to simulate home and away goals.
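As an aside, the replications don't have to come through the posterior array used below; they can be pulled straight out of the fit with `rstan::extract`. A quick sketch:

```r
# Each element is a (NUM_CHAINS * NUM_SAMPLES) x nrow(soccer_data) matrix:
# one simulated dataset per posterior draw.
reps <- rstan::extract(fit, pars = c("rep_home_goals", "rep_away_goals"))
dim(reps$rep_home_goals)
```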
I'm only going to consider (home team) supremacy here, which is the number of goals the home team wins by (e.g. a supremacy of 0 is a draw, and a supremacy of -1 means the away team won by a goal). It will be apparent why afterwards (I hope).
Unfortunately, in doing this I stumbled upon what has now been filed as a bug in the bayesplot package. Many thanks to Jonah Gabry for getting back to me about it so quickly! Nevertheless, I hacked the package (deleted a few lines) and now we can look at observed and simulated supremacy:
```r
MAX_SUPREMACY <- 5

clamp <- function(x) {
  pmin(pmax(x, -MAX_SUPREMACY), MAX_SUPREMACY)
}

observed_supremacy <- soccer_data$home_goals - soccer_data$away_goals

rep_home_goals <- posterior[ , , grep("^rep_home_goals\\[", nmes)]
rep_away_goals <- posterior[ , , grep("^rep_away_goals\\[", nmes)]
rep_supremacy <- matrix(rep_home_goals - rep_away_goals,
                        NUM_CHAINS * NUM_SAMPLES,
                        nrow(soccer_data))

ppc_bars(clamp(observed_supremacy), clamp(rep_supremacy), freq = FALSE) +
  xlab("Supremacy")
```
So what does that mean? Well, it looks like (as I was hoping for) the model underestimates the probability of a supremacy of 0 (a draw). This has been noted in lots of papers and it makes sense: the model assumes the home and away goals are independent, or in other words, that the teams just keep playing the same way throughout the match regardless of the score.
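One crude way to put a number on that, reusing the objects from the snippet above:

```r
# Observed proportion of draws versus the proportion across all replications.
mean(observed_supremacy == 0)
mean(rep_supremacy == 0)
```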
There are lots of other things to look at too: for example, you could consider the total number of goals, the number of home/away goals, or test statistics like the mean or maximum number of goals (a couple of these are sketched below). It may also be worth grouping the data by season or league (if I had bothered to use multiple leagues).
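As a taster, here's a quick sketch of two of those test-statistic checks on the total number of goals, reusing the replication arrays from above and bayesplot's `ppc_stat`:

```r
# Total goals per match, observed and replicated.
observed_total <- soccer_data$home_goals + soccer_data$away_goals
rep_total <- matrix(rep_home_goals + rep_away_goals,
                    NUM_CHAINS * NUM_SAMPLES,
                    nrow(soccer_data))

# Compare the observed mean and maximum with their posterior
# predictive distributions.
ppc_stat(observed_total, rep_total, stat = "mean")
ppc_stat(observed_total, rep_total, stat = "max")
```

If the model is doing its job, the observed statistic should sit comfortably within the distribution of replicated statistics in each plot.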
In conclusion, I think PPCs are a really powerful and intuitive way to explore models and I hope to see and use more of them where applicable.