Probabilistic ML · Sports Forecasting · ECE 6514
Calibrated ensemble models for 3-class outcome forecasting across 3,800 Premier League matches (2014–2023). Evaluated with the Ignorance Score across 3 validation strategies. Models beat climatology overall in 2 of 3 strategies, win on Home Win and Draw in 2 of 3 strategies each, and win on Away Win in only 1 of 3.
This project presents a unified analysis of probabilistic forecasting for English Premier League match outcomes. Match outcomes are modeled as three mutually exclusive classes — Home Win (H), Draw (D), and Away Win (A) — and evaluated against outcome-specific climatology baselines using proper scoring rules, primarily the Ignorance Score.
Multiple validation strategies are employed to assess robustness under distribution shift. The results show that calibrated and ensemble-based ML models improve on climatology for Home Win and Draw outcomes in the season-split strategies, while Away Wins prove the most difficult to forecast reliably, with models beating climatology in only one of three strategies.
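The exact model pipeline is not specified in this report; assuming a scikit-learn stack (a plausible reading given model names like "Calibrated RF" and "Isotonic LogReg"), a minimal sketch of one such calibrated ensemble on placeholder data might look like:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))       # placeholder match features, NOT the real data
y = rng.integers(0, 3, size=600)    # 0 = Home Win, 1 = Draw, 2 = Away Win

# Isotonic calibration of a random forest via cross-validated calibration folds.
cal_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="isotonic", cv=3,
).fit(X, y)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

# "Ensemble" here is taken as a simple average of the two probability forecasts
# (an assumption; the report does not state how its ensembles are combined).
p_ens = 0.5 * cal_rf.predict_proba(X) + 0.5 * logreg.predict_proba(X)
```

Averaging calibrated probability vectors keeps each row a valid distribution over the three outcomes, which is what the Ignorance Score evaluation below requires.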
The three outcomes are imbalanced — Home Win is the dominant class, making Away Win and Draw inherently harder to forecast. This motivates outcome-specific evaluation rather than aggregated accuracy.
Ignorance Score (IGN) is the primary metric — it measures the information content of a forecast in bits: IGN = −log₂(p_true). Lower IGN = better. It strongly penalizes confident but wrong predictions and rewards well-calibrated probability estimates. Climatology — the empirical base rate of each outcome in training data — is used as the baseline. Any model that cannot beat climatology cannot be considered probabilistically informative.
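As a concrete illustration (with toy labels, not the project's data), the Ignorance Score and the climatology baseline it is measured against can be computed as:

```python
import numpy as np

def ignorance_score(probs, y_true):
    """Mean Ignorance Score in bits: IGN = -log2(probability assigned to the true class)."""
    p_true = probs[np.arange(len(y_true)), y_true]
    return float(np.mean(-np.log2(p_true)))

# Toy training labels: 0 = Home Win, 1 = Draw, 2 = Away Win.
y_train = np.array([0, 0, 0, 1, 2, 0, 2, 1, 0, 2])

# Climatology: empirical base rate of each outcome in the training data.
base_rates = np.bincount(y_train, minlength=3) / len(y_train)

# Climatology issues the same forecast for every test match.
y_test = np.array([0, 2, 1, 0])
clim_probs = np.tile(base_rates, (len(y_test), 1))
ign_clim = ignorance_score(clim_probs, y_test)
```

A forecast of 0.5 on the realized outcome costs exactly 1 bit; any model must achieve a lower mean IGN than `ign_clim` to be probabilistically informative.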
Accuracy is included only as a descriptive metric. It ignores probability magnitudes, treats confident and uncertain predictions equally, and fails to penalize overconfident errors — so identical accuracy can hide very different probabilistic quality.
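A small worked example (toy probabilities, not the project's forecasts) shows how two forecasters with identical accuracy can differ sharply in IGN when one is overconfident:

```python
import numpy as np

y_true = np.array([0, 1, 2, 0])  # 0 = Home Win, 1 = Draw, 2 = Away Win

# Both forecasters pick the same argmax class for every match, so their
# accuracy is identical; both are wrong on the last match.
sharp = np.array([[0.96, 0.02, 0.02],
                  [0.02, 0.96, 0.02],
                  [0.02, 0.02, 0.96],
                  [0.02, 0.96, 0.02]])   # confident everywhere
hedged = np.array([[0.50, 0.25, 0.25],
                   [0.25, 0.50, 0.25],
                   [0.25, 0.25, 0.50],
                   [0.25, 0.50, 0.25]])  # same picks, moderate confidence

def ign(p, y):
    return float(np.mean(-np.log2(p[np.arange(len(y)), y])))

acc_sharp = float(np.mean(sharp.argmax(1) == y_true))
acc_hedged = float(np.mean(hedged.argmax(1) == y_true))
ign_sharp, ign_hedged = ign(sharp, y_true), ign(hedged, y_true)
```

Both score 75% accuracy, but the overconfident forecaster pays over 5 bits on its single miss and ends up with the worse mean IGN, which is exactly the distinction accuracy cannot see.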
Validation Strategy 1: Temporal Split
TRAIN: 3,420 matches (2014–2022) · TEST: 380 matches (2023)
Overall 3-Class Results
| Model | Accuracy | IGN | Δ IGN vs Climatology |
|---|---|---|---|
| Climatology | 46.8% | 1.5338 | BASELINE |
| Calibrated Ridge | 46.8% | 1.5422 | −0.0085 |
| Calibrated RF | 46.8% | 1.5446 | −0.0108 |
| Logistic Regression | 46.8% | 1.5456 | −0.0119 |
| Best Ensemble | 46.8% | 1.5476 | −0.0139 |
| Calibrated Ensemble | 46.8% | 1.5490 | −0.0153 |
Away Win — MODEL WINS ✓
| Model | Model IGN | Clim IGN | Δ IGN | Winner |
|---|---|---|---|---|
| Climatology | 0.8398 | 0.8398 | BASE | — |
| Logistic Regression | 0.8312 | 0.8398 | +0.0086 | Model |
| Calibrated Ensemble | 0.8330 | 0.8398 | +0.0068 | Model |
| Best Ensemble | 0.8353 | 0.8398 | +0.0046 | Model |
| Calibrated RF | 0.8388 | 0.8398 | +0.0010 | Model |
Validation Strategy 2: Even → Odd Seasons
TRAIN: 1,900 matches (even seasons) · TEST: 1,900 matches (odd seasons)
Overall 3-Class Results
| Model | Accuracy | IGN | Δ IGN vs Climatology |
|---|---|---|---|
| Climatology | 51.3% | 1.4869 | BASELINE |
| Calibrated RF | 51.3% | 1.4853 | +0.0016 |
| Calibrated Ridge | 51.3% | 1.4853 | +0.0017 |
| Best Ensemble | 51.2% | 1.4860 | +0.0010 |
Home Win — MODEL WINS ✓
| Model | Model IGN | Clim IGN | Δ IGN | Winner |
|---|---|---|---|---|
| Calibrated RF | 0.9977 | 1.0002 | +0.0025 | Model |
| Calibrated Ensemble | 0.9978 | 1.0002 | +0.0024 | Model |
| Best Ensemble | 0.9986 | 1.0002 | +0.0016 | Model |
Draw — MODEL WINS ✓
| Model | Model IGN | Clim IGN | Δ IGN | Winner |
|---|---|---|---|---|
| Best Ensemble | 0.7803 | 0.7815 | +0.0012 | Model |
| Calibrated RF | 0.7807 | 0.7815 | +0.0008 | Model |
| Logistic Regression | 0.7810 | 0.7815 | +0.0005 | Model |
Validation Strategy 3: Odd → Even Seasons
TRAIN: 1,900 matches (odd seasons) · TEST: 1,900 matches (even seasons)
Overall 3-Class Results
| Model | Accuracy | IGN | Δ IGN vs Climatology |
|---|---|---|---|
| Climatology | 49.7% | 1.5033 | BASELINE |
| Best Ensemble | 49.9% | 1.5018 | +0.0015 |
| Calibrated RF | 49.7% | 1.5036 | −0.0003 |
| Logistic Regression | 49.6% | 1.5054 | −0.0021 |
Home Win — MODEL WINS ✓
| Model | Model IGN | Clim IGN | Δ IGN | Winner |
|---|---|---|---|---|
| Isotonic LogReg | 0.9975 | 1.0007 | +0.0032 | Model |
| Best Ensemble | 0.9978 | 1.0007 | +0.0028 | Model |
Draw — MODEL WINS ✓
| Model | Model IGN | Clim IGN | Δ IGN | Winner |
|---|---|---|---|---|
| Logistic Regression | 0.8006 | 0.8044 | +0.0038 | Model |
| Best Ensemble | 0.8022 | 0.8044 | +0.0022 | Model |
Final Summary — Model vs Climatology Wins
| Strategy | Home Win | Draw | Away Win | Overall |
|---|---|---|---|---|
| Temporal | Clim | Clim | Model | Clim |
| Even → Odd | Model | Model | Clim | Model |
| Odd → Even | Model | Model | Clim | Model |
| Total Wins | 2/3 | 2/3 | 1/3 | 2/3 |
This project demonstrated a comprehensive framework for evaluating probabilistic football forecasts across three match outcomes using proper scoring rules. Models outperformed climatology for Home Win and Draw in two of three validation strategies, indicating meaningful probabilistic skill beyond historical averages.
Away Wins proved most challenging — models lost to climatology in 2/3 strategies, suggesting away performance is inherently more variable. These findings emphasize the importance of proper scoring rules: accuracy alone completely fails to capture calibration and information content differences between models.
Further improvements likely require larger and richer feature sets (player-level data, form indices, injuries) and potentially Poisson-based goal models rather than direct outcome classifiers. Multi-task training on multiple related outcomes may also reduce the specialization-generalization tension observed here.
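To make the Poisson alternative concrete: one common simplification (an assumption here, not this project's implementation) models home and away goals as independent Poisson counts and sums over the joint scoreline grid to recover H/D/A probabilities:

```python
import math

def outcome_probs(lam_home, lam_away, max_goals=10):
    """H/D/A probabilities from independent Poisson goal rates.

    Real goal models typically add team attack/defence strengths and a
    low-score dependence correction; this is the bare-bones version.
    """
    pois = lambda lam, k: math.exp(-lam) * lam**k / math.factorial(k)
    p_home = [pois(lam_home, k) for k in range(max_goals + 1)]
    p_away = [pois(lam_away, k) for k in range(max_goals + 1)]
    p_h = p_d = p_a = 0.0
    for h, ph in enumerate(p_home):
        for a, pa in enumerate(p_away):
            joint = ph * pa          # independence assumption
            if h > a:
                p_h += joint
            elif h == a:
                p_d += joint
            else:
                p_a += joint
    return p_h, p_d, p_a

pH, pD, pA = outcome_probs(1.5, 1.1)  # illustrative goal rates, not fitted values
```

Because the three probabilities come from a shared scoreline distribution, this structure couples the H/D/A forecasts in a way that independent per-outcome classifiers cannot, which is one reason goal models are a natural next step.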