Transparency

How good is the model, honestly?

Every probability on this site comes from one Elo + Poisson model. Rather than ask you to trust it, here is how it scores on a leakage-free walk-forward backtest over 5,143 historical internationals — measured with the same proper scoring rules the forecasting field uses. The constants are fitted to minimise log-loss, not hand-tuned.
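To make "leakage-free walk-forward" concrete, here is a minimal sketch of the idea: matches are processed in date order, each forecast is made and scored before the result is used, and only then is the model state updated. The `Match` class, the `predict` stand-in (smoothed base rates, not the site's Elo + Poisson model) and the three toy results are all hypothetical, chosen only to show the ordering.

```python
import math
from dataclasses import dataclass

@dataclass
class Match:
    date: str    # ISO date, so plain string order is chronological
    result: int  # 0 = home win, 1 = draw, 2 = away win

def predict(counts):
    """Stand-in model: smoothed base rates from results seen so far."""
    total = sum(counts) + 3
    return [(c + 1) / total for c in counts]

def walk_forward(matches):
    counts = [0, 0, 0]
    losses = []
    for m in sorted(matches, key=lambda m: m.date):
        probs = predict(counts)                    # forecast before the result is known
        losses.append(-math.log(probs[m.result]))  # score that forecast (log-loss)
        counts[m.result] += 1                      # only now learn from the result
    return losses

# Made-up results, deliberately out of date order.
toy = [Match("2021-03-30", 1), Match("2020-09-05", 0), Match("2022-06-14", 0)]
losses = walk_forward(toy)
print(sum(losses) / len(losses))  # mean log-loss over the toy backtest
```

Because the update happens after scoring, no match can influence its own forecast, which is what makes the backtest leakage-free.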

Log-loss
0.922
vs 1.052 for the base-rate forecast (✓ the model beats it)

Punishes confident wrong calls hardest. Lower is better.
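As a rough illustration of why confident misses are so costly, the sketch below computes per-match log-loss for a three-way forecast. The function name and the example probabilities are illustrative assumptions, not taken from the site's code. For reference, a uniform one-third forecast scores ln 3 ≈ 1.099 on every match.

```python
import math

def log_loss(probs, outcome):
    """Negative log of the probability given to what actually happened.

    probs   -- [P(home), P(draw), P(away)]
    outcome -- 0 = home win, 1 = draw, 2 = away win
    """
    return -math.log(probs[outcome])

uniform = [1 / 3, 1 / 3, 1 / 3]
confident = [0.90, 0.07, 0.03]

loss_uniform = log_loss(uniform, 0)    # no information either way
loss_right = log_loss(confident, 0)    # confident and correct: small loss
loss_wrong = log_loss(confident, 2)    # confident and wrong: very large loss
print(loss_uniform, loss_right, loss_wrong)
```

The reported figure is the mean of this per-match value over the whole backtest.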

Ranked probability score (RPS)
0.185
ordered 1X2 (lower = better)

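Like the Brier score, but it gives partial credit for near misses on the ordered home→draw→away scale: weight placed on a draw when the home side wins costs less than the same weight placed on an away win. Lower is better.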
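A small sketch of the difference, assuming the standard definitions (RPS as the mean squared gap between cumulative forecast and cumulative outcome, Brier as the summed squared error over the three outcomes). The two forecasts are invented for illustration: both give the home win 50%, but one puts the rest mostly on the draw and the other mostly on the away win.

```python
def rps(probs, outcome):
    """Ranked probability score for an ordered forecast (home, draw, away)."""
    cum_p = cum_o = total = 0.0
    for i in range(len(probs) - 1):
        cum_p += probs[i]
        cum_o += 1.0 if i == outcome else 0.0  # cumulative one-hot outcome
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)

def brier(probs, outcome):
    """Multi-class Brier score: summed squared error over all outcomes."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))

HOME_WIN = 0
near_miss = [0.5, 0.4, 0.1]  # remaining weight mostly on the draw
far_miss = [0.5, 0.1, 0.4]   # remaining weight mostly on the away win

print(brier(near_miss, HOME_WIN), brier(far_miss, HOME_WIN))  # identical
print(rps(near_miss, HOME_WIN), rps(far_miss, HOME_WIN))      # near miss scores lower
```

Brier cannot tell the two forecasts apart, while RPS rewards the one whose error sits next to the true outcome.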

Accuracy
57.1%
most-likely outcome hit

How often the model's top pick was the actual result. Higher is better.

Calibration error
0.7%
ECE (expected calibration error); lower = better calibrated

Average gap between the stated probability and the observed frequency, taken across probability buckets and weighted by the number of predictions in each. 0% = perfectly calibrated; lower is better.

Reliability curve

predicted vs observed

Each dot is a probability bucket: x = what the model said, y = what actually happened. A perfectly calibrated model sits on the diagonal. Dot size = number of predictions in the bucket.

[Reliability plot: model predicted probability (x-axis) against observed frequency (y-axis), one dot per probability bucket. The plotted values are listed in the table below.]
Reliability bins: for each band of predicted probability, the mean predicted probability, the observed frequency (how often the predicted outcome actually occurred), and the number of predictions in that band. A perfectly calibrated model has predicted equal to observed in every band.
Predicted probability band   Mean predicted   Observed frequency   Predictions
0–10%                        5.1%             6.5%                 1,454
10–20%                       15.4%            15.5%                2,205
20–30%                       25.9%            25.8%                4,539
30–40%                       33.3%            31.7%                2,948
40–50%                       44.9%            45.5%                1,319
50–60%                       54.8%            54.9%                1,077
60–70%                       64.7%            65.2%                811
70–80%                       74.8%            77.2%                593
80–90%                       84.5%            84.5%                328
90–100%                      93.3%            94.8%                155
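The headline calibration error can be roughly reproduced from the table. The sketch below assumes the usual ECE definition, a count-weighted mean of |mean predicted − observed| per band; because the table values are rounded, the result is only approximate. The prediction counts add up to 15,429, which is 3 × 5,143, so each match appears to contribute one prediction per outcome (home, draw, away).

```python
# (mean predicted, observed frequency, number of predictions) per band,
# copied from the reliability table above.
bins = [
    (0.051, 0.065, 1454),
    (0.154, 0.155, 2205),
    (0.259, 0.258, 4539),
    (0.333, 0.317, 2948),
    (0.449, 0.455, 1319),
    (0.548, 0.549, 1077),
    (0.647, 0.652, 811),
    (0.748, 0.772, 593),
    (0.845, 0.845, 328),
    (0.933, 0.948, 155),
]

total = sum(n for _, _, n in bins)
# Count-weighted average of the absolute calibration gap in each band.
ece = sum(n * abs(pred - obs) for pred, obs, n in bins) / total

print(total)          # 15429
print(f"{ece:.2%}")   # about 0.67%, consistent with the 0.7% shown above
```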

Fitted vs hand-set

held-out validation

The constants were fitted by minimising log-loss on the training data; the scores below are measured on held-out data the fit never saw. The fitted constants improve every metric on that held-out set, which is evidence that the gain is not overfitting.

Metric (validation)   Hand-set   Fitted
Log-loss              0.877      0.865
Brier                 0.516      0.508
RPS                   0.173      0.169
Accuracy              59.8%      60.4%
ECE                   2.5%       1.7%

Fitted constants

Constant                Hand-set   Fitted
K (rating volatility)   32         100.94
Home advantage (Elo)    65         97.25
Elo → goal-diff         0.004      0.0023
Goals baseline          2.55       2.90
Dixon-Coles ρ           -0.05      -0.15
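One plausible way constants like these could combine into 1X2 probabilities is sketched below. This is an assumption, not the site's actual code: it treats "Goals baseline" as expected total goals per match, splits it between the teams using the Elo-implied goal difference, and applies the standard Dixon-Coles low-score adjustment to independent Poisson scores. The function names, the 10-goal cap and the 0.05 floor are all hypothetical.

```python
import math

# Fitted values from the table above; how they are wired together is assumed.
HOME_ADV = 97.25     # Elo points added to the home side
ELO_TO_GD = 0.0023   # expected goal difference per Elo point
GOALS_BASE = 2.90    # assumed to mean expected total goals per match
RHO = -0.15          # Dixon-Coles dependence parameter
MAX_GOALS = 10       # hypothetical cap on the score grid

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def dc_tau(x, y, lam, mu, rho):
    """Dixon-Coles adjustment; only the four lowest scorelines change."""
    if x == 0 and y == 0:
        return 1 - lam * mu * rho
    if x == 0 and y == 1:
        return 1 + lam * rho
    if x == 1 and y == 0:
        return 1 + mu * rho
    if x == 1 and y == 1:
        return 1 - rho
    return 1.0

def outcome_probs(elo_home, elo_away):
    gd = ELO_TO_GD * (elo_home + HOME_ADV - elo_away)
    lam = max((GOALS_BASE + gd) / 2, 0.05)  # home scoring rate, floored
    mu = max((GOALS_BASE - gd) / 2, 0.05)   # away scoring rate, floored
    home = draw = away = 0.0
    for x in range(MAX_GOALS + 1):
        for y in range(MAX_GOALS + 1):
            p = dc_tau(x, y, lam, mu, RHO) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    total = home + draw + away  # renormalise after truncating the grid
    return home / total, draw / total, away / total

print(outcome_probs(1800, 1800))  # equal ratings: home edge comes from HOME_ADV
```

With a negative ρ the adjustment shifts probability towards 0-0 and 1-1 and away from 1-0 and 0-1, so draws become somewhat more likely than under independent Poisson scores.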

Backtest over 5,143 matches since 2020-01-01 · fit method: Nelder-Mead (scipy.optimize.minimize) · generated 2026-06-05. "Beats the market" is a different, harder bar: see closing line value (CLV) on the bet log.

Tournament 2026 — live

model vs actual results

As 2026 matches finish, the model's pre-kickoff predictions are scored here against the actual results. This is an ongoing check that the model stays calibrated on the tournament itself, not just on the historical backtest above. Nothing has been graded yet; the first matches kick off on 11 June.