
Probabilistic ML · Sports Forecasting · ECE 6514

Probabilistic EPL
Match Forecasting

Calibrated ensemble models for 3-class outcome forecasting across 3,800 Premier League matches (2014–2023), evaluated with the Ignorance Score across 3 validation strategies. The best models beat climatology overall in 2 of 3 strategies and consistently win on the Home Win and Draw outcomes.

Probabilistic ML · Proper Scoring Rules · Calibrated Ensemble · XGBoost · Logistic Regression · Random Forest · Ignorance Score · scikit-learn

Overview

This project presents a unified analysis of probabilistic forecasting for English Premier League match outcomes. Match outcomes are modeled as three mutually exclusive classes — Home Win (H), Draw (D), and Away Win (A) — and evaluated against outcome-specific climatology baselines using proper scoring rules, primarily the Ignorance Score.

Multiple validation strategies are employed to assess robustness under distribution shift. The results show that calibrated and ensemble-based ML models provide consistent improvements over climatology for Away and Draw outcomes, while Home Wins remain the most challenging to forecast reliably.

3,800
Total Matches
2014–23
Seasons Covered
3
Validation Strategies
2/3
Overall Model Wins

Outcome Distribution

The three outcomes are imbalanced — Home Win is the dominant class, making Away Win and Draw inherently harder to forecast. This motivates outcome-specific evaluation rather than aggregated accuracy.

Home Win (H)
~50%
Home Team Wins
Dominant class. Home advantage makes this the most frequent but also most important outcome to model correctly.
Draw (D)
~24%
Equal Goals
Rarest outcome. Both teams score equal goals. Climatology is a strong baseline here because the draw frequency varies little across seasons.
Away Win (A)
~26%
Away Team Wins
Most variable across seasons. Model beats climatology under temporal validation but struggles in cross-season splits.

Evaluation Framework

Ignorance Score (IGN) is the primary metric — it measures the information content of a forecast in bits: IGN = −log₂(p_true). Lower IGN = better. It strongly penalizes confident but wrong predictions and rewards well-calibrated probability estimates. Climatology — the empirical base rate of each outcome in training data — is used as the baseline. Any model that cannot beat climatology cannot be considered probabilistically informative.

Accuracy is included only as a descriptive metric. It ignores probability magnitudes, treats confident and uncertain predictions equally, and fails to penalize overconfident errors — so identical accuracy can hide very different probabilistic quality.
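A minimal sketch of this evaluation logic, using hypothetical probabilities rather than the project's actual forecasts: the Ignorance Score is the mean of −log₂ of the probability assigned to the realized outcome, and climatology simply issues the training base rates for every match.

```python
import numpy as np

def ignorance_score(probs, y_true):
    """Mean Ignorance Score in bits: -log2 of the probability
    assigned to the outcome that actually occurred (lower is better)."""
    probs = np.asarray(probs, dtype=float)
    return float(np.mean(-np.log2(probs[np.arange(len(y_true)), y_true])))

# Outcomes encoded 0=H, 1=D, 2=A; class frequencies roughly match the text.
y_true = np.array([0, 0, 2, 1, 0])

# Climatology baseline: every match receives the training base rates.
climatology = np.tile([0.50, 0.24, 0.26], (len(y_true), 1))

# A hypothetical model that shifts probability toward the true class.
model = np.array([[0.60, 0.20, 0.20],
                  [0.55, 0.25, 0.20],
                  [0.30, 0.25, 0.45],
                  [0.30, 0.35, 0.35],
                  [0.65, 0.15, 0.20]])

print(ignorance_score(climatology, y_true))  # baseline IGN in bits
print(ignorance_score(model, y_true))        # lower = more informative
```

Note that both forecasters can have identical argmax accuracy while the IGN values differ, which is exactly why accuracy alone is treated as descriptive here.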

01
Feature Engineering
Home/away goals, shots, full-time result, team identity from Football-Data.co.uk
02
7 Base Models
Logistic Regression, Calibrated RF, Calibrated Ridge, XGBoost, Naive Bayes, Gradient Boost, Isotonic LogReg
03
2 Ensembles
Best Ensemble + Calibrated Ensemble combining predictions via probability averaging
04
Probability Calibration
CalibratedClassifierCV with isotonic and sigmoid methods for reliable confidence estimates
05
3 Validation Strategies
Temporal (2014–22 → 2023), Even→Odd, Odd→Even — each revealing different robustness aspects
06
IGN Evaluation
Outcome-specific Ignorance Score comparison vs climatology baseline per strategy

Validation Strategies

Strategy 1
Temporal
Train 2014–22 → Test 2023
Hardest strategy — exposes model to distribution shift. 2023 had different outcome frequencies than training years, hurting both model and climatology. Strong performance here = genuine predictive skill.
Strategy 2
Even → Odd
1,900 train · 1,900 test
Train on even-numbered seasons, test on odd. Adjacent seasons share similar competitive structure, reducing long-term shift. Large test set yields stable IGN estimates.
Strategy 3
Odd → Even
1,900 train · 1,900 test
Mirror of Even→Odd. Consistent results across both interleaved strategies confirm the improvements are not a partition artifact.
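A rough sketch of the interleaved split, assuming a match table with a season start-year column (the column names and toy values here are illustrative, not the project's actual schema):

```python
import pandas as pd

# Hypothetical match table keyed by season start year.
matches = pd.DataFrame({
    "season": [2014, 2015, 2016, 2017, 2018, 2019] * 3,
    "result": list("HDAHHD") * 3,
})

# Strategy 2: train on even-numbered seasons, test on odd
# (Strategy 3 simply swaps the two masks).
train = matches[matches["season"] % 2 == 0]
test = matches[matches["season"] % 2 == 1]
assert set(train["season"]).isdisjoint(test["season"])  # no leakage
```

Because adjacent seasons land on opposite sides of the split, long-term drift is mostly neutralized, isolating partition effects from genuine distribution shift.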

Results by Validation Strategy

TRAIN: 3,420 matches (2014–2022)  ·  TEST: 380 matches (2023)

Distribution Shift: The 2023 test year had a significantly different outcome distribution — Home Win 46.8% vs 50.9% in training (−4.1 percentage points). This makes temporal validation the hardest strategy.

Overall 3-Class Results

Model               | Accuracy | IGN    | vs Climatology
Climatology         | 46.8%    | 1.5338 | BASELINE
Calibrated Ridge    | 46.8%    | 1.5422 | −0.0085
Calibrated RF       | 46.8%    | 1.5446 | −0.0108
Logistic Regression | 46.8%    | 1.5456 | −0.0119
Best Ensemble       | 46.8%    | 1.5476 | −0.0139
Calibrated Ensemble | 46.8%    | 1.5490 | −0.0153

Away Win — MODEL WINS ✓

Model               | Model IGN | Clim IGN | Δ IGN   | Winner
Climatology         | 0.8398    | 0.8398   |         | BASE
Logistic Regression | 0.8312    | 0.8398   | +0.0086 | Model
Calibrated Ensemble | 0.8330    | 0.8398   | +0.0068 | Model
Best Ensemble       | 0.8353    | 0.8398   | +0.0046 | Model
Calibrated RF       | 0.8388    | 0.8398   | +0.0010 | Model
Temporal Finding: Only Away Win shows model improvement (+0.0086 bits at best). Home Win and Draw both lose to climatology — the distribution shift between the 2014–2022 training window and the 2023 test year hurts overall performance significantly.

TRAIN: 1,900 matches (even seasons)  ·  TEST: 1,900 matches (odd seasons)

Overall 3-Class Results

Model            | Accuracy | IGN    | vs Climatology
Climatology      | 51.3%    | 1.4869 | BASELINE
Calibrated RF    | 51.3%    | 1.4853 | +0.0016
Calibrated Ridge | 51.3%    | 1.4853 | +0.0017
Best Ensemble    | 51.2%    | 1.4860 | +0.0010

Home Win — MODEL WINS ✓

Model               | Model IGN | Clim IGN | Δ IGN   | Winner
Calibrated RF       | 0.9977    | 1.0002   | +0.0025 | Model
Calibrated Ensemble | 0.9978    | 1.0002   | +0.0024 | Model
Best Ensemble       | 0.9986    | 1.0002   | +0.0016 | Model

Draw — MODEL WINS ✓

Model               | Model IGN | Clim IGN | Δ IGN   | Winner
Best Ensemble       | 0.7803    | 0.7815   | +0.0012 | Model
Calibrated RF       | 0.7807    | 0.7815   | +0.0008 | Model
Logistic Regression | 0.7810    | 0.7815   | +0.0005 | Model
Even→Odd Finding: Model beats climatology overall and wins on both Home Win and Draw. Away Win remains elusive — Climatology wins on Away in this strategy. Overall IGN improvement of +0.0017 bits.

TRAIN: 1,900 matches (odd seasons)  ·  TEST: 1,900 matches (even seasons)

Overall 3-Class Results

Model               | Accuracy | IGN    | vs Climatology
Climatology         | 49.7%    | 1.5033 | BASELINE
Best Ensemble       | 49.9%    | 1.5018 | +0.0015
Calibrated RF       | 49.7%    | 1.5036 | −0.0003
Logistic Regression | 49.6%    | 1.5054 | −0.0021

Home Win — MODEL WINS ✓

Model           | Model IGN | Clim IGN | Δ IGN   | Winner
Isotonic LogReg | 0.9975    | 1.0007   | +0.0032 | Model
Best Ensemble   | 0.9978    | 1.0007   | +0.0028 | Model

Draw — MODEL WINS ✓

Model               | Model IGN | Clim IGN | Δ IGN   | Winner
Logistic Regression | 0.8006    | 0.8044   | +0.0038 | Model
Best Ensemble       | 0.8022    | 0.8044   | +0.0022 | Model
Odd→Even Finding: Mirrors Even→Odd — the model wins on Home Win (+0.0032 bits at best) and Draw (+0.0038 bits at best). Away Win is again lost to climatology. Consistent results across both interleaved strategies confirm robustness.

Final Summary — Model vs Climatology Wins

Strategy   | Home Win | Draw | Away Win | Overall
Temporal   | Clim     | Clim | Model    | Clim
Even → Odd | Model    | Model| Clim     | Model
Odd → Even | Model    | Model| Clim     | Model
Total Wins | 2/3      | 2/3  | 1/3      | 2/3
Home Win
2/3 wins
Draw
2/3 wins
Away Win
1/3 wins
Overall
2/3 wins
Key Insight: Away Win is the hardest outcome — model loses to climatology in 2/3 strategies, suggesting away performance is inherently more variable and sensitive to distribution shift. Home Win and Draw show consistent model improvement in both interleaved strategies.

Key Findings

  • Away Win is hardest — Model loses to climatology in 2/3 validation strategies. Away performance is inherently variable and difficult to predict robustly across seasons.
  • Home Win & Draw — Model consistently beats climatology in both interleaved strategies (Even→Odd and Odd→Even), demonstrating genuine probabilistic skill beyond historical averages.
  • Temporal validation is hardest — Distribution shift between 2014–2022 training and 2023 test hurts performance. 2023 Home Win rate (46.8%) differed significantly from training (50.9%).
  • Best improvement — +0.0086 bits on Away Win (Temporal, Logistic Regression) and +0.0032 bits on Home Win (Odd→Even, Isotonic LogReg).
  • Accuracy is misleading — Models achieve accuracy nearly identical to climatology (46.8–51.3%) yet differ substantially in IGN. Proper scoring rules are essential for probabilistic evaluation.
  • Consistent results across interleaved splits — Even→Odd and Odd→Even both show model wins on Home Win and Draw, confirming results are not partition-dependent.

Conclusion

This project demonstrated a comprehensive framework for evaluating probabilistic football forecasts across three match outcomes using proper scoring rules. Models sometimes outperformed climatology for Home Win and Draw, indicating meaningful probabilistic skill beyond historical averages.

Away Wins proved most challenging — models lost to climatology in 2/3 strategies, suggesting away performance is inherently more variable. These findings emphasize the importance of proper scoring rules: accuracy alone completely fails to capture calibration and information content differences between models.

Further improvements likely require larger and richer feature sets (player-level data, form indices, injuries) and potentially Poisson-based goal models rather than direct outcome classifiers. Multi-task training on multiple related outcomes may also reduce the specialization-generalization tension observed here.
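One possible direction for the Poisson-based approach: an independent-Poisson goal model, where per-team scoring rates (the rates below are illustrative assumptions, not fitted values) are converted to H/D/A probabilities by summing over a scoreline grid.

```python
import numpy as np
from scipy.stats import poisson

def outcome_probs(lam_home, lam_away, max_goals=10):
    """H/D/A probabilities from independent Poisson goal rates."""
    goals = np.arange(max_goals + 1)
    p_home = poisson.pmf(goals, lam_home)
    p_away = poisson.pmf(goals, lam_away)
    joint = np.outer(p_home, p_away)  # joint[i, j] = P(home=i, away=j)
    home = np.tril(joint, -1).sum()   # scorelines with home > away
    draw = np.trace(joint)            # equal-goal scorelines
    away = np.triu(joint, 1).sum()    # scorelines with away > home
    return home, draw, away

h, d, a = outcome_probs(1.6, 1.1)  # hypothetical scoring rates
```

Such a model produces a full probability vector per match, so it can be scored with IGN against climatology in exactly the same framework as the classifiers above.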
