
Probabilistic ML · Sports Forecasting · ECE 6514

Probabilistic EPL
Match Forecasting

Calibrated ensemble models for 3-class outcome forecasting across 3,800 Premier League matches (2014–2023), evaluated with the Ignorance Score across 3 validation strategies. The best models beat climatology overall in 2 of 3 strategies and consistently win on the Home Win and Draw outcomes.

Probabilistic ML · Proper Scoring Rules · Calibrated Ensemble · XGBoost · Logistic Regression · Random Forest · Ignorance Score · scikit-learn

Overview

This project presents a unified analysis of probabilistic forecasting for English Premier League match outcomes. Match outcomes are modeled as three mutually exclusive classes — Home Win (H), Draw (D), and Away Win (A) — and evaluated against outcome-specific climatology baselines using proper scoring rules, primarily the Ignorance Score.

Multiple validation strategies are employed to assess robustness under distribution shift. The results show that calibrated and ensemble-based ML models provide consistent improvements over climatology for Away and Draw outcomes, while Home Wins remain the most challenging to forecast reliably.

3,800
Total Matches
2014–23
Seasons Covered
3
Validation Strategies
2/3
Overall Model Wins

Outcome Distribution

The three outcomes are imbalanced — Home Win is the dominant class, making Away Win and Draw inherently harder to forecast. This motivates outcome-specific evaluation rather than aggregated accuracy.

Home Win (H)
~50%
Home Team Wins
Dominant class. Home advantage makes this the most frequent but also most important outcome to model correctly.
Draw (D)
~24%
Equal Goals
Rarest outcome. Both teams score equal goals. Climatology is a strong baseline here because the draw frequency varies little across seasons.
Away Win (A)
~26%
Away Team Wins
Most variable across seasons. Model beats climatology under temporal validation but struggles in cross-season splits.

Evaluation Framework

Ignorance Score (IGN) is the primary metric — it measures the information content of a forecast in bits: IGN = −log₂(p_true). Lower IGN = better. It strongly penalizes confident but wrong predictions and rewards well-calibrated probability estimates. Climatology — the empirical base rate of each outcome in training data — is used as the baseline. Any model that cannot beat climatology cannot be considered probabilistically informative.

Accuracy is included only as a descriptive metric. It ignores probability magnitudes, treats confident and uncertain predictions equally, and fails to penalize overconfident errors — so identical accuracy can hide very different probabilistic quality.
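A minimal sketch of this evaluation logic, using hypothetical probabilities rather than the project's actual forecasts: the Ignorance Score is the mean of −log₂ of the probability assigned to the realized outcome, and climatology simply issues the training base rates for every match.

```python
import numpy as np

def ignorance_score(probs, y_true):
    """Mean Ignorance Score in bits: -log2 of the probability
    assigned to the outcome that actually occurred (lower is better)."""
    probs = np.asarray(probs, dtype=float)
    return float(np.mean(-np.log2(probs[np.arange(len(y_true)), y_true])))

# Outcomes encoded 0=H, 1=D, 2=A; class frequencies roughly match the text.
y_true = np.array([0, 0, 2, 1, 0])

# Climatology baseline: every match receives the training base rates.
climatology = np.tile([0.50, 0.24, 0.26], (len(y_true), 1))

# A hypothetical model that shifts probability toward the true class.
model = np.array([[0.60, 0.20, 0.20],
                  [0.55, 0.25, 0.20],
                  [0.30, 0.25, 0.45],
                  [0.30, 0.35, 0.35],
                  [0.65, 0.15, 0.20]])

print(ignorance_score(climatology, y_true))  # baseline IGN in bits
print(ignorance_score(model, y_true))        # lower = more informative
```

Note that both forecasters can have identical argmax accuracy while the IGN values differ, which is exactly why accuracy alone is treated as descriptive here.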

01
Feature Engineering
Home/away goals, shots, full-time result, team identity from Football-Data.co.uk
02
7 Base Models
Logistic Regression, Calibrated RF, Calibrated Ridge, XGBoost, Naive Bayes, Gradient Boost, Isotonic LogReg
03
2 Ensembles
Best Ensemble + Calibrated Ensemble combining predictions via probability averaging
04
Probability Calibration
CalibratedClassifierCV with isotonic and sigmoid methods for reliable confidence estimates
05
3 Validation Strategies
Temporal (2014–22 → 2023), Even→Odd, Odd→Even — each revealing different robustness aspects
06
IGN Evaluation
Outcome-specific Ignorance Score comparison vs climatology baseline per strategy

Validation Strategies

Strategy 1
Temporal
Train 2014–22 → Test 2023
Hardest strategy — exposes model to distribution shift. 2023 had different outcome frequencies than training years, hurting both model and climatology. Strong performance here = genuine predictive skill.
Strategy 2
Even → Odd
1,900 train · 1,900 test
Train on even-numbered seasons, test on odd. Adjacent seasons share similar competitive structure, reducing long-term shift. Large test set yields stable IGN estimates.
Strategy 3
Odd → Even
1,900 train · 1,900 test
Mirror of Even→Odd. Consistent results across both interleaved strategies confirm the improvements are not a partition artifact.
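A rough sketch of the interleaved split, assuming a match table with a season start-year column (the column names and toy values here are illustrative, not the project's actual schema):

```python
import pandas as pd

# Hypothetical match table keyed by season start year.
matches = pd.DataFrame({
    "season": [2014, 2015, 2016, 2017, 2018, 2019] * 3,
    "result": list("HDAHHD") * 3,
})

# Strategy 2: train on even-numbered seasons, test on odd
# (Strategy 3 simply swaps the two masks).
train = matches[matches["season"] % 2 == 0]
test = matches[matches["season"] % 2 == 1]
assert set(train["season"]).isdisjoint(test["season"])  # no leakage
```

Because adjacent seasons land on opposite sides of the split, long-term drift is mostly neutralized, isolating partition effects from genuine distribution shift.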

Results by Validation Strategy

TRAIN: 3,420 matches (2014–2022)  ·  TEST: 380 matches (2023)

Distribution Shift: The 2023 test year had a significantly different outcome distribution — Home Win 46.8% vs 50.9% in training (−4.1 percentage points). This makes temporal validation the hardest strategy.

Overall 3-Class Results

Model               | Accuracy | IGN    | vs Climatology
Climatology         | 46.8%    | 1.5338 | BASELINE
Calibrated Ridge    | 46.8%    | 1.5422 | −0.0085
Calibrated RF       | 46.8%    | 1.5446 | −0.0108
Logistic Regression | 46.8%    | 1.5456 | −0.0119
Best Ensemble       | 46.8%    | 1.5476 | −0.0139
Calibrated Ensemble | 46.8%    | 1.5490 | −0.0153

Away Win — MODEL WINS ✓

Model               | Model IGN | Clim IGN | Δ IGN   | Winner
Climatology         | 0.8398    | 0.8398   |         | BASE
Logistic Regression | 0.8312    | 0.8398   | +0.0086 | Model
Calibrated Ensemble | 0.8330    | 0.8398   | +0.0068 | Model
Best Ensemble       | 0.8353    | 0.8398   | +0.0046 | Model
Calibrated RF       | 0.8388    | 0.8398   | +0.0010 | Model
Temporal Finding: Only Away Win shows model improvement (+0.0086 bits at best). Home Win and Draw both lose to climatology — the distribution shift between the 2014–2022 training window and the 2023 test year hurts overall performance significantly.

TRAIN: 1,900 matches (even seasons)  ·  TEST: 1,900 matches (odd seasons)

Overall 3-Class Results

Model            | Accuracy | IGN    | vs Climatology
Climatology      | 51.3%    | 1.4869 | BASELINE
Calibrated RF    | 51.3%    | 1.4853 | +0.0016
Calibrated Ridge | 51.3%    | 1.4853 | +0.0017
Best Ensemble    | 51.2%    | 1.4860 | +0.0010

Home Win — MODEL WINS ✓

Model               | Model IGN | Clim IGN | Δ IGN   | Winner
Calibrated RF       | 0.9977    | 1.0002   | +0.0025 | Model
Calibrated Ensemble | 0.9978    | 1.0002   | +0.0024 | Model
Best Ensemble       | 0.9986    | 1.0002   | +0.0016 | Model

Draw — MODEL WINS ✓

Model               | Model IGN | Clim IGN | Δ IGN   | Winner
Best Ensemble       | 0.7803    | 0.7815   | +0.0012 | Model
Calibrated RF       | 0.7807    | 0.7815   | +0.0008 | Model
Logistic Regression | 0.7810    | 0.7815   | +0.0005 | Model
Even→Odd Finding: Model beats climatology overall and wins on both Home Win and Draw. Away Win remains elusive — Climatology wins on Away in this strategy. Overall IGN improvement of +0.0017 bits.

TRAIN: 1,900 matches (odd seasons)  ·  TEST: 1,900 matches (even seasons)

Overall 3-Class Results

Model               | Accuracy | IGN    | vs Climatology
Climatology         | 49.7%    | 1.5033 | BASELINE
Best Ensemble       | 49.9%    | 1.5018 | +0.0015
Calibrated RF       | 49.7%    | 1.5036 | −0.0003
Logistic Regression | 49.6%    | 1.5054 | −0.0021

Home Win — MODEL WINS ✓

Model           | Model IGN | Clim IGN | Δ IGN   | Winner
Isotonic LogReg | 0.9975    | 1.0007   | +0.0032 | Model
Best Ensemble   | 0.9978    | 1.0007   | +0.0028 | Model

Draw — MODEL WINS ✓

Model               | Model IGN | Clim IGN | Δ IGN   | Winner
Logistic Regression | 0.8006    | 0.8044   | +0.0038 | Model
Best Ensemble       | 0.8022    | 0.8044   | +0.0022 | Model
Odd→Even Finding: Mirrors Even→Odd — the model wins on Home Win (+0.0032 bits at best) and Draw (+0.0038 bits at best). Away Win is again lost to climatology. Consistent results across both interleaved strategies confirm robustness.

Final Summary — Model vs Climatology Wins

Strategy   | Home Win | Draw | Away Win | Overall
Temporal   | Clim     | Clim | Model    | Clim
Even → Odd | Model    | Model| Clim     | Model
Odd → Even | Model    | Model| Clim     | Model
Total Wins | 2/3      | 2/3  | 1/3      | 2/3
Home Win
2/3 wins
Draw
2/3 wins
Away Win
1/3 wins
Overall
2/3 wins
Key Insight: Away Win is the hardest outcome — model loses to climatology in 2/3 strategies, suggesting away performance is inherently more variable and sensitive to distribution shift. Home Win and Draw show consistent model improvement in both interleaved strategies.

Key Findings

  • Away Win is hardest — Model loses to climatology in 2/3 validation strategies. Away performance is inherently variable and difficult to predict robustly across seasons.
  • Home Win & Draw — Model consistently beats climatology in both interleaved strategies (Even→Odd and Odd→Even), demonstrating genuine probabilistic skill beyond historical averages.
  • Temporal validation is hardest — Distribution shift between 2014–2022 training and 2023 test hurts performance. 2023 Home Win rate (46.8%) differed significantly from training (50.9%).
  • Best improvement — +0.0086 bits on Away Win (Temporal, Logistic Regression) and +0.0032 bits on Home Win (Odd→Even, Isotonic LogReg).
  • Accuracy is misleading — Models achieve accuracy nearly identical to climatology (46.8–51.3%) yet differ substantially in IGN. Proper scoring rules are essential for probabilistic evaluation.
  • Consistent results across interleaved splits — Even→Odd and Odd→Even both show model wins on Home Win and Draw, confirming results are not partition-dependent.

Conclusion

This project demonstrated a comprehensive framework for evaluating probabilistic football forecasts across three match outcomes using proper scoring rules. Models sometimes outperformed climatology for Home Win and Draw, indicating meaningful probabilistic skill beyond historical averages.

Away Wins proved most challenging — models lost to climatology in 2/3 strategies, suggesting away performance is inherently more variable. These findings emphasize the importance of proper scoring rules: accuracy alone completely fails to capture calibration and information content differences between models.

Further improvements likely require larger and richer feature sets (player-level data, form indices, injuries) and potentially Poisson-based goal models rather than direct outcome classifiers. Multi-task training on multiple related outcomes may also reduce the specialization-generalization tension observed here.
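One possible direction for the Poisson-based approach: an independent-Poisson goal model, where per-team scoring rates (the rates below are illustrative assumptions, not fitted values) are converted to H/D/A probabilities by summing over a scoreline grid.

```python
import numpy as np
from scipy.stats import poisson

def outcome_probs(lam_home, lam_away, max_goals=10):
    """H/D/A probabilities from independent Poisson goal rates."""
    goals = np.arange(max_goals + 1)
    p_home = poisson.pmf(goals, lam_home)
    p_away = poisson.pmf(goals, lam_away)
    joint = np.outer(p_home, p_away)  # joint[i, j] = P(home=i, away=j)
    home = np.tril(joint, -1).sum()   # scorelines with home > away
    draw = np.trace(joint)            # equal-goal scorelines
    away = np.triu(joint, 1).sum()    # scorelines with away > home
    return home, draw, away

h, d, a = outcome_probs(1.6, 1.1)  # hypothetical scoring rates
```

Such a model produces a full probability vector per match, so it can be scored with IGN against climatology in exactly the same framework as the classifiers above.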
