Why Club Elo?
We tested seven prediction models against two full Premier League seasons. The simplest one won.
Our Season Forecast predicts where each Premier League team will finish by simulating thousands of possible seasons. The engine uses Club Elo ratings to estimate each team's chances in every remaining match. But is raw Elo enough? Could we improve predictions by factoring in home advantage or recent form?
We tested seven configurations of our model against two complete Premier League seasons (2023/24 and 2024/25). At every gameweek from GW10 to GW37 -- 56 cutoffs in total -- we ran 50,000 Monte Carlo simulations per model, then measured how closely predictions matched reality using two metrics: season-level RMSE and match-level log-loss.
The experiment
The seven configurations test three ideas:
- Elo Only -- Club Elo ratings with a flat 65-point home advantage. No within-season adjustments. This is our baseline.
- + Venue Strength -- adjusts expected goals based on each team's actual home/away points-per-game ratio. Tested at four clamp widths (±5%, ±10%, ±20%, and ±50%) to see whether gentle or aggressive adjustments work better.
- + Venue + Form -- adds a momentum "tilt" based on each team's last six results relative to its season average. Tested at two strengths (±5% and ±10%), each paired with a matching venue clamp.
Wider clamps allow the model to make larger adjustments to Elo-derived predictions. Tighter clamps keep adjustments small, barely nudging the baseline.
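As a concrete sketch of what a clamp does (the function and the example ratio are illustrative, not the production code):

```python
def clamp(value, low, high):
    """Bound a multiplier so it can never drift outside [low, high]."""
    return max(low, min(high, value))

# Hypothetical venue multiplier: home points-per-game relative to overall.
raw_multiplier = 2.1 / 1.6  # 1.3125, a strong home record

# A tight ±5% clamp barely nudges the baseline...
print(clamp(raw_multiplier, 0.95, 1.05))  # 1.05

# ...while a wide ±50% clamp lets the full adjustment through.
print(clamp(raw_multiplier, 0.50, 1.50))  # 1.3125
```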
Prediction accuracy by gameweek
Points RMSE (lower is better) for three representative configurations: the baseline, the worst performer, and the strongest challenger. Full results in the parameter sweep below.
[Charts: Points RMSE by gameweek, one panel per season -- 2023/24 and 2024/25]
The result
Elo Only won 30 of 56 cutoffs on season-level accuracy (RMSE) and 39 of 56 on match-level calibration (log-loss). No other configuration came close on both metrics.
Parameter sweep
All seven configurations ranked by average Points RMSE across both seasons. Log-loss measures match-level calibration (lower is better).
| # | Configuration | Avg RMSE | Avg Log-Loss |
|---|---|---|---|
| 1 | Elo Only (baseline) | 5.75 | 0.95 |
| 2 | Venue ±5% + Form ±5% | 5.75 | 0.96 |
| 3 | Venue ±10% + Form ±10% | 5.76 | 0.96 |
| 4 | Venue ±5% | 5.77 | 0.96 |
| 5 | Venue ±10% | 5.79 | 0.96 |
| 6 | Venue ±20% | 5.84 | 0.96 |
| 7 | Venue ±50% | 6.01 | 0.98 |
Match calibration
Points RMSE tells us how well a model predicts final league standings. But a model could stumble into accurate season totals while being poorly calibrated at the match level -- like predicting every game is a coin flip but getting lucky on the aggregate.
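For reference, Points RMSE is just the root mean square gap between predicted and actual final points. A minimal sketch over a toy four-team table (the real evaluation spans all 20 teams; the numbers here are made up):

```python
import math

def points_rmse(predicted, actual):
    """Root mean square error between predicted and actual final points."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

predicted = [88, 71, 65, 40]
actual    = [91, 66, 68, 38]
print(round(points_rmse(predicted, actual), 2))  # 3.43
```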
Log-loss measures something different: for each individual match, how much probability did the model assign to the outcome that actually happened? A model that says "60% home win" and the home team wins scores better than one that said "40% home win." It penalises overconfident wrong predictions harshly.
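That asymmetry is easy to see numerically. A minimal sketch of per-match log-loss (illustrative only; the engine's metric also clamps probabilities, which this omits):

```python
import math

def log_loss(prob_of_actual_outcome):
    """Negative log of the probability the model gave the outcome that happened."""
    return -math.log(prob_of_actual_outcome)

# Home team wins: the 60% model beats the 40% model.
print(round(log_loss(0.60), 3))  # 0.511
print(round(log_loss(0.40), 3))  # 0.916

# Overconfidence is punished harshly: if only 5% was left for
# what actually happened, the cost explodes.
print(round(log_loss(0.05), 3))  # 2.996
```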
The best blended configurations (Venue ±5% + Form ±5%, and Venue ±10% + Form ±10%) manage to tie Elo Only on season RMSE. But on match-level log-loss, Elo Only is significantly better -- winning 39 of 56 cutoffs. The adjustments add noise to individual match predictions even when the errors roughly cancel out at the season level.
Statistical significance
We used the Wilcoxon signed-rank test (a non-parametric paired test) to compare each challenger against Elo Only across all 56 cutoffs.
- All four venue-only adjustments are significantly worse than Elo Only on both RMSE and log-loss (p < 0.001).
- The two blended models (Venue + Form) tie Elo Only on RMSE (p > 0.37, not significant) but are significantly worse on match-level log-loss (p < 0.003).
- Wider clamps consistently perform worse. The relationship is monotonic: the more freedom you give the adjustments, the worse the predictions get.
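For readers who want to reproduce this kind of comparison, here is a self-contained exact Wilcoxon signed-rank test for small samples. It is a sketch: the difference values are made up, a real run would cover all 56 cutoffs, and in practice you would reach for a statistics library such as SciPy rather than enumerating the null distribution by hand.

```python
from itertools import product

def wilcoxon_exact_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value for small samples.
    Assumes no zero differences and no tied absolute values."""
    n = len(diffs)
    ranks = {d: r for r, d in enumerate(sorted(diffs, key=abs), start=1)}
    w_plus = sum(ranks[d] for d in diffs if d > 0)
    w = min(w_plus, n * (n + 1) // 2 - w_plus)
    # Enumerate every way the n signed ranks could fall under the null.
    null = [sum(r for r, keep in zip(range(1, n + 1), signs) if keep)
            for signs in product([False, True], repeat=n)]
    return min(1.0, 2 * sum(t <= w for t in null) / len(null))

# Hypothetical per-cutoff RMSE differences (challenger minus baseline).
diffs = [0.21, 0.15, 0.18, 0.22, 0.12, 0.19, 0.14, 0.20]
print(wilcoxon_exact_p(diffs))  # 0.0078125 -- significant at p < 0.05
```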
Why does adding more data make predictions worse?
It seems counterintuitive, but there are good reasons the simplest model wins.
Club Elo ratings already encode home and away performance. They are built from decades of match results, with venue baked into every data point. When we layer a within-season venue multiplier derived from just 5-10 home matches, we replace a robust long-term signal with a noisy short-term one. A team that happens to have faced weaker opponents at home early on gets an inflated multiplier that actively misleads the model.
The form tilt has a similar problem. Our simulation already uses "hot" Elo updates -- after every simulated match, both teams' ratings shift based on the result. Momentum is modelled inside each simulation. Adding a separate form adjustment double-counts the effect, amplifying noise rather than signal.
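To make the double-counting concrete, here is a minimal sketch of a hot Elo update. The 65-point home bonus comes from the baseline described above; the K-factor of 20 is an illustrative assumption, and the real engine also weights updates by goal margin, which this omits.

```python
def expected_score(elo_home, elo_away, home_adv=65):
    """Elo win expectancy for the home side, with a flat home bonus."""
    return 1 / (1 + 10 ** ((elo_away - elo_home - home_adv) / 400))

def hot_update(elo_home, elo_away, result, k=20, home_adv=65):
    """Shift both ratings after one simulated result
    (1 = home win, 0.5 = draw, 0 = away win). K-factor is illustrative."""
    delta = k * (result - expected_score(elo_home, elo_away, home_adv))
    return elo_home + delta, elo_away - delta

# Three straight simulated home wins: the rating drift itself is the
# momentum signal, so a separate form tilt counts it twice.
home, away = 1800.0, 1750.0
for result in (1, 1, 1):
    home, away = hot_update(home, away, result)
print(round(home, 1), round(away, 1))
```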
The parameter sweep confirms this: wider adjustments hurt more. The ±50% venue clamp (which lets the model make large corrections) has the worst RMSE of all seven configurations. The tightest adjustments (±5%) barely move the needle -- they survive only because they barely change anything.
In short: when your base signal is strong, layering noisy corrections on top makes things worse, not better.
Methodology
- 50,000 Monte Carlo simulations per gameweek cutoff, per model configuration.
- Seven configurations tested: one baseline (Elo Only), four venue-only variants (±5%, ±10%, ±20%, ±50% clamps), and two blended variants (venue + form at ±5% and ±10%).
- Cutoff range: GW10 to GW37 (28 cutoffs per season, 56 total across both seasons).
- Primary metric: Points RMSE -- root mean square error between predicted and actual end-of-season points across all 20 teams.
- Secondary metric: match-level log-loss -- negative log of the probability assigned to the actual match outcome (home win, draw, or away win). Lower is better. Probabilities clamped to [0.001, 0.999].
- Statistical test: Wilcoxon signed-rank test (non-parametric, paired) across all 56 cutoffs. Significance threshold: p < 0.05.
- Elo ratings from Club Elo. Fixtures and standings from SportMonks.
- Match scores sampled from a Poisson distribution. Expected goals derived from Elo difference. Elo updates weighted by goal margin (proportional to the square root of margin).
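The sampling step above can be sketched as follows. The mapping from Elo gap to expected goals is a stand-in (the engine's actual formula is not published here), the 1.35 league-average figure is an assumption, and the goal-margin weighting of Elo updates is omitted:

```python
import math
import random

def expected_goals(elo_gap):
    """Hypothetical mapping from an Elo gap to expected goals;
    1.35 is an assumed league-average goals per team."""
    return 1.35 * 10 ** (elo_gap / 800)

def sample_poisson(lam, rng):
    """Knuth's method: multiply uniforms until the product falls below e^-lam."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

def sample_score(elo_home, elo_away, home_adv=65, rng=random):
    """Draw one scoreline from two independent Poissons."""
    gap = elo_home + home_adv - elo_away
    return sample_poisson(expected_goals(gap), rng), sample_poisson(expected_goals(-gap), rng)

rng = random.Random(7)
print(sample_score(1800, 1700, rng=rng))  # one simulated (home, away) scoreline
```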
See also
- Season Forecast → Where will each team finish? 50,000 simulated seasons.
- Run-In Tracker → Predict results for every remaining fixture.
Track your club's forecast in the myTeam app