Modeling the outcome of soccer games

predictive-modeling

Author

René Valenzuela

Published

July 18, 2023

Modified

March 28, 2025

Introduction

Our goal is to create a model to predict the outcome of La Liga games. The outcome can be one of three possible values: \(H\) (home team wins), \(A\) (away team wins) and \(D\) (draw).

We will use the balanced accuracy score as the metric to rank the different models. Balanced accuracy is the macro average of recall scores per class or equivalently, raw accuracy where each sample is weighted according to the prevalence of its true class. We choose it because it is somewhat robust to class size variations as it avergaes the performance over all classes rather than favoring the larger ones.

Dataset

The original dataset consists of the results of each of the games played during the 2021-2022 and 2022-2023 seasons. A sample is shown in Table 1,

Table 1: Input dataset. Results of the 21-22 La Liga season

	season	date	time	ht	at	fthg	ftag	ftr
0	s_21_22	2021-08-13	20:00	Valencia	Getafe	1	0	H
1	s_21_22	2021-08-14	18:30	Cadiz	Levante	1	1	D
2	s_21_22	2021-08-14	18:30	Mallorca	Betis	1	1	D
3	s_21_22	2021-08-14	21:00	Alaves	Real Madrid	1	4	A
4	s_21_22	2021-08-14	21:00	Osasuna	Espanol	0	0	D

where fthg stands for full-time home team goals, ftag stands for full-time away team goals and ftr stands for full-time result (outcome of the game).

Model 1: Independent Poisson of number of goals scored by team

High level approach

We let the two teams participating in a game be Team A and Team B.
We let the outcome of the game be W if the predicted result is that Team A wins, T if the game is predicted to end in a tie and L if Team A is predicted to lose the game.
For each team we model the number of goals scored in a game as a Poisson distribution and we fit one based on historical data.
We assume the number of goals scored by a team in a game is independent of the number of goals scored by the adversary and create a joint probability funcion as the product of the marginals
We evaluate the joint distribution on a grid of possible scores
We assign the game outcome as the one corresponding to the maximum probability of the game final score. That is, if the final score with highest probability is Team A: 3 and Team B: 2 then the outcome of the game will be W

Hashing out the details

Fitting the Poisson model

The probability mass function of a Poisson distributed random variable \(X\) is: \[ \begin{equation*} p_{X}(k) = \mathbb{P}\text{r}\left(X = k\right) = e^{-\lambda}\frac{\lambda^{k}}{k!} \end{equation*} \]

Given a sample of \(n\) realizations of the random variable the MLE of \(\lambda\) is simply the sample mean, i.e., \[ \tilde{\lambda}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} x_{i} \]

To fit the Poisson r.v. we must define a sample from which the \(\lambda\) parameter will be estimated. The simplest possible approach is to use as a sample the results of the last \(N\) games where \(N\) is a fixed number. As a starting point we will choose \(N = 10\). Since some teams play more than one La Liga game a week 10 games roughly corresponds to a 2 month historic period.

We will fit a Poisson model for each team and for each week (i.e. after a game we will update the model). The implementation is done using a moving average rolling window of the 10 last games of each team.

Implementation

Starting from the original dataset we construct the input dataset for model fitting. In this dataset we keep for each team the number of goals they scored and make sure we order them in the order of occurrence. A sample is shown in Table 2,

Table 2: Results for the first twelve games of Alaves during the 2021-2022 season

	season	date	time	team	goals
0	s_21_22	2021-08-14	21:00	Alaves	1
1	s_21_22	2021-08-21	16:00	Alaves	0
2	s_21_22	2021-08-27	21:15	Alaves	0
3	s_21_22	2021-09-18	20:00	Alaves	0
4	s_21_22	2021-09-22	18:30	Alaves	0
5	s_21_22	2021-09-25	13:00	Alaves	1
6	s_21_22	2021-10-01	20:00	Alaves	0
7	s_21_22	2021-10-18	18:00	Alaves	0
8	s_21_22	2021-10-23	15:15	Alaves	2
9	s_21_22	2021-10-26	18:00	Alaves	1
10	s_21_22	2021-10-30	20:00	Alaves	1
11	s_21_22	2021-11-06	17:30	Alaves	2

As a check, we verify the number of games each team has played during the course of two seasons. In La Liga, 20 teams play 38 games each during the course of the season. At the end of the season 3 teams are relegated to the second division and 3 are promoted. Hence, over the course of two seasons there should be 17 teams that played both seasons and 6 teams that only played one season. The result of this count is shown in Table 3

Table 3: Count of total games played by teams over two seasons

	n_games	team
0	38	6
1	76	17

Model fitting consists in calculating for each team the rolling mean over a period of previous 10 games. We have to ensure that the result of the current game is not used rolling mean calculation (if we didn’t we would have data leakage). The first estimate for the rolling mean will be for the 11th game. A sample of the model fitting results is shown in Table 4

Table 4: Model fitting for the first twelve games of Alaves during the 2021-2022 season

	season	date	time	team	goals	ma10	games_played	season_game
0	s_21_22	2021-08-14	21:00	Alaves	1	NaN	0	1
1	s_21_22	2021-08-21	16:00	Alaves	0	NaN	1	2
2	s_21_22	2021-08-27	21:15	Alaves	0	NaN	2	3
3	s_21_22	2021-09-18	20:00	Alaves	0	NaN	3	4
4	s_21_22	2021-09-22	18:30	Alaves	0	NaN	4	5
5	s_21_22	2021-09-25	13:00	Alaves	1	NaN	5	6
6	s_21_22	2021-10-01	20:00	Alaves	0	NaN	6	7
7	s_21_22	2021-10-18	18:00	Alaves	0	NaN	7	8
8	s_21_22	2021-10-23	15:15	Alaves	2	NaN	8	9
9	s_21_22	2021-10-26	18:00	Alaves	1	NaN	9	10
10	s_21_22	2021-10-30	20:00	Alaves	1	0.5	10	11
11	s_21_22	2021-11-06	17:30	Alaves	2	0.5	11	12

The ovarall results of model fitting can be visualized in Figure 1, a heatmap of the rate parameter for each team and for each week.

Figure 1: Heatmap of rate parameter by team and season game

The teams are ordered by rows with the team that finished in the first place in the first row and so on. Each square on the grid is colored according to the value of the fitted rate parameter and the columns represent the number of games played, i.e. the column labeled 11 represents the eleventh game of each team. This will almost always correspond to the week of the season. Also, note that the first game for which we have a parameter is the eleventh game of each team since we chose an 10 games window for teh moving average

Teams in the top rows have a higher rate parameter and also some teams that went on a scoring streak during the season. Amongst them we have Villareal at the start of the second leg and Levante at the end of the season.

Generating predictions

To generate predictions we first attach the estimated rate parameters of the home team and of the away team for each game played in the season:

	season	ht	at	ftr	hma10	ama10
0	s_21_22	Valencia	Getafe	H	NaN	NaN
1	s_21_22	Cadiz	Levante	D	NaN	NaN
2	s_21_22	Mallorca	Betis	D	NaN	NaN
3	s_21_22	Alaves	Real Madrid	A	NaN	NaN
4	s_21_22	Osasuna	Espanol	D	NaN	NaN

We then generate (for each game) all possible results starting from 0-0 and up to 5-5, i.e. 36 possible scores for each game and assign the corresponding probability using the Poisson mass probability function. Figure 2 shows a stemplot of the probabilities assigned to each of the 36 scores for one of the games in the league. In this case, since the score with higher probability is 1-1 we would predict the outcome of the game as a draw D.

Figure 2: Stemplot of a single game probability

The accuracy of the model is 0.3711 and the balanced accuracy is 0.3952