Transparency

How good is the model, honestly?

Every probability on this site comes from one Elo + Poisson model. Rather than ask you to trust it, here is how it scores on a leakage-free walk-forward backtest over 5,143 historical international matches, in which each match is predicted using only results from before its kickoff. It is scored with the proper scoring rules standard in forecasting (log-loss and ranked probability score), alongside accuracy and calibration. The constants are fitted to minimise log-loss, not hand-tuned.
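The walk-forward idea can be sketched as a predict-then-update loop: each match is scored with a forecast made from earlier results only, and its result is fed to the model only after scoring. This is a minimal illustration, not the site's code. `walk_forward`, `base_rate_predict` and `base_rate_update` are hypothetical names, and the toy predictor (running outcome frequencies) stands in for the real Elo + Poisson model.

```python
import math
from collections import Counter


def walk_forward(matches, predict, update, state):
    """Mean log-loss where every match is scored before its result is used.

    matches: (match, outcome) pairs in kickoff order, outcome in {"H", "D", "A"}.
    predict(state, match) -> {outcome: probability}
    update(state, match, outcome) -> new state
    """
    losses = []
    for match, outcome in matches:
        probs = predict(state, match)            # sees only earlier results
        losses.append(-math.log(probs[outcome]))
        state = update(state, match, outcome)    # result revealed only after scoring
    return sum(losses) / len(losses)


# Hypothetical toy predictor: running outcome frequencies, one pseudo-count each.
def base_rate_predict(counts, _match):
    total = sum(counts.values())
    return {o: counts[o] / total for o in "HDA"}


def base_rate_update(counts, _match, outcome):
    new_counts = Counter(counts)   # copy, so earlier states stay untouched
    new_counts[outcome] += 1
    return new_counts


history = [(None, o) for o in "HHDAHDHA"]   # made-up results, oldest first
prior = Counter({"H": 1, "D": 1, "A": 1})
print(round(walk_forward(history, base_rate_predict, base_rate_update, prior), 3))
```

Keeping `predict` and `update` separate makes leakage structurally hard: the only way a result can influence a forecast is through `state`, which changes after that match has been scored.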

Log-loss

0.922

vs 1.052 for a base-rate forecast (the model beats it)

Punishes confident wrong calls hardest. Lower is better.
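Log-loss is the mean of −ln(probability given to the result that actually happened). A minimal sketch, with hypothetical forecasts, shows why confident misses dominate the score. For reference, a uniform one-third guess on every match scores ln 3 ≈ 1.099, so the 1.052 base-rate figure above already improves slightly on pure guessing.

```python
import math


def log_loss(forecasts, outcomes):
    """Mean of -ln(probability assigned to the actual outcome); lower is better."""
    total = 0.0
    for probs, outcome in zip(forecasts, outcomes):
        total += -math.log(probs[outcome])
    return total / len(outcomes)


uniform = {"H": 1 / 3, "D": 1 / 3, "A": 1 / 3}
print(round(log_loss([uniform], ["H"]), 3))          # 1.099, i.e. ln 3

# A confident miss costs far more than a cautious one (hypothetical forecasts).
confident_miss = {"H": 0.90, "D": 0.05, "A": 0.05}
cautious_miss = {"H": 0.50, "D": 0.25, "A": 0.25}
print(round(log_loss([confident_miss], ["A"]), 2))   # 3.0
print(round(log_loss([cautious_miss], ["A"]), 2))    # 1.39
```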

Ranked PS

0.185

ordered 1X2 (lower = better)

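Like the Brier score, but gives partial credit for probability placed near the actual result in the home→draw→away order: weight on a draw when the home side wins costs less than the same weight on an away win. Lower is better.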
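A small worked example makes the difference from Brier concrete. Two hypothetical forecasts give the home win the same 50% and differ only in where the remaining probability goes; Brier scores them identically, while RPS prefers the one whose spare probability sits on the neighbouring result. Two conventions here are assumptions: RPS is divided by (number of outcomes − 1), and Brier is the multi-class form summed over all three outcomes (range 0 to 2), which is consistent with the magnitude of the Brier values in the validation table.

```python
def rps(probs, outcome):
    """Ranked probability score over ordered outcomes (home, draw, away).

    probs: probabilities in that order; outcome: index of what happened (0, 1 or 2).
    """
    r = len(probs)
    cum_p = cum_o = total = 0.0
    for i in range(r - 1):   # the final cumulative gap is always 1 - 1 = 0
        cum_p += probs[i]
        cum_o += 1.0 if i == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (r - 1)


def brier(probs, outcome):
    """Multi-class Brier score: squared error summed over all outcomes."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2 for i, p in enumerate(probs))


HOME_WIN = 0
near = (0.5, 0.4, 0.1)   # spare probability on the neighbouring result (draw)
far = (0.5, 0.1, 0.4)    # spare probability on the opposite result (away win)
print(round(brier(near, HOME_WIN), 3), round(brier(far, HOME_WIN), 3))  # 0.42 0.42
print(round(rps(near, HOME_WIN), 3), round(rps(far, HOME_WIN), 3))      # 0.13 0.205
```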

Accuracy

57.1%

most-likely outcome hit

How often the model's top pick was the actual result. Higher is better.

Calibration error

0.7%

expected calibration error (ECE); lower = better calibrated

Average gap between the stated probability and how often the outcome actually happened, weighted by the number of predictions in each probability band. 0% = perfectly calibrated; lower is better.

Reliability curve

predicted vs observed

Each dot is a probability bucket: x = what the model said, y = what actually happened. A perfectly calibrated model sits on the diagonal. Dot size = number of predictions in the bucket.

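Reliability bins: for each band of predicted probability, the mean predicted probability, the observed frequency (how often the predicted outcome actually occurred) and the number of predictions in that band. The counts sum to 15,429, three times the 5,143 backtest matches, consistent with each match contributing its home, draw and away probabilities as three separate predictions. A perfectly calibrated model has mean predicted equal to observed frequency in every band.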
Predicted probability band	Mean predicted	Observed frequency	Predictions
0–10%	5.1%	6.5%	1,454
10–20%	15.4%	15.5%	2,205
20–30%	25.9%	25.8%	4,539
30–40%	33.3%	31.7%	2,948
40–50%	44.9%	45.5%	1,319
50–60%	54.8%	54.9%	1,077
60–70%	64.7%	65.2%	811
70–80%	74.8%	77.2%	593
80–90%	84.5%	84.5%	328
90–100%	93.3%	94.8%	155
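The headline ECE can be recomputed from this table: weight each band's gap between mean predicted and observed frequency by its count, then divide by the total count. From the rounded table values this comes to about 0.67 percentage points, which rounds to the 0.7% shown above. The `reliability_bins` helper is a hypothetical sketch of how such a table could be built from raw (probability, hit) pairs; ten equal-width bands are an assumption that matches the table layout.

```python
def reliability_bins(pairs, n_bins=10):
    """Group (predicted probability, hit) pairs into equal-width bands.

    hit is 1 if that outcome happened, else 0. Returns (mean predicted,
    observed frequency, count) for each non-empty band, lowest band first.
    """
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, hit in pairs:
        b = min(int(p * n_bins), n_bins - 1)   # assumed: p = 1.0 joins the top band
        sums[b][0] += p
        sums[b][1] += hit
        sums[b][2] += 1
    return [(sp / n, sh / n, n) for sp, sh, n in sums if n]


def ece(bins):
    """Count-weighted mean of |mean predicted - observed frequency|."""
    total = sum(n for _, _, n in bins)
    return sum(n * abs(pred - obs) for pred, obs, n in bins) / total


# The published bands from the table above, in percent.
PUBLISHED_BINS = [
    (5.1, 6.5, 1454), (15.4, 15.5, 2205), (25.9, 25.8, 4539),
    (33.3, 31.7, 2948), (44.9, 45.5, 1319), (54.8, 54.9, 1077),
    (64.7, 65.2, 811), (74.8, 77.2, 593), (84.5, 84.5, 328),
    (93.3, 94.8, 155),
]
print(sum(n for _, _, n in PUBLISHED_BINS))   # 15429, i.e. 3 x 5,143
print(round(ece(PUBLISHED_BINS), 2))          # 0.67 (percentage points)
```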

Fitted vs hand-set

held-out validation

The constants were fitted by minimising log-loss on training data; these are the scores on held-out data the fit never saw. Fitting improves every metric on this unseen data, which is evidence that the gain is not just overfitting to the training set.

Metric (validation)	Hand-set	Fitted
Log-loss	0.877	0.865
Brier	0.516	0.508
RPS	0.173	0.169
Accuracy	59.8%	60.4%
ECE	2.5%	1.7%
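The fit-then-validate procedure can be sketched with SciPy's `scipy.optimize.minimize(..., method="Nelder-Mead")`, the optimiser the footnote names. To keep the example short, this swaps in a simpler objective than the site's: it fits a single constant (a goals baseline) by minimising the Poisson negative log-likelihood of hypothetical per-match goal totals, rather than 1X2 log-loss over all five constants. The data and the `poisson_nll` helper are illustrative assumptions.

```python
import math

from scipy.optimize import minimize


def poisson_nll(params, totals):
    """Mean Poisson negative log-likelihood of goal totals for rate params[0]."""
    lam = params[0]
    if lam <= 0:
        return math.inf   # keep the simplex away from invalid rates
    return sum(lam - k * math.log(lam) + math.lgamma(k + 1) for k in totals) / len(totals)


train = [2, 3, 4, 1, 3, 5, 2, 3]   # hypothetical goals per match, training period
held_out = [3, 2, 4, 3]            # hypothetical goals per match, validation period

hand_set = 2.55
res = minimize(poisson_nll, x0=[hand_set], args=(train,), method="Nelder-Mead")
fitted = res.x[0]   # close to 2.875, the training mean

print(f"fitted baseline: {fitted:.3f}")
print(f"held-out NLL, hand-set: {poisson_nll([hand_set], held_out):.4f}")
print(f"held-out NLL, fitted:   {poisson_nll([fitted], held_out):.4f}")
```

Nelder-Mead needs no gradients, which suits objectives built from a walk-forward loop that is awkward to differentiate; the trade-off is slower convergence as the number of constants grows.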

Fitted constants

Constant	Hand-set	Fitted
K (rating volatility)	32	100.94
Home advantage (Elo)	65	97.25
Elo → goal-diff	0.004	0.0023
Goals baseline	2.55	2.90
Dixon-Coles ρ	-0.05	-0.15
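The page does not spell out how these constants combine, so the following is only one common construction, assumed here: the Elo gap (plus home advantage) sets an expected goal difference, the goals baseline is split around it into two Poisson rates, a Dixon-Coles factor with ρ adjusts the four low scores, and the score grid is summed into home/draw/away. Reading "Elo → goal-diff" as goals per Elo point and "Goals baseline" as total goals per match are assumptions, as are all function names and the plain Elo update (any goal-margin multiplier the real model may use is not shown).

```python
import math

# Fitted constants copied from the table above.
K = 100.94          # rating volatility
HOME_ADV = 97.25    # Elo points credited to the home side
ELO_TO_GD = 0.0023  # ASSUMED: expected goal difference per Elo point
GOALS_BASE = 2.90   # ASSUMED: expected total goals per match
RHO = -0.15         # Dixon-Coles low-score dependence


def elo_gap(elo_home, elo_away, neutral=False):
    """Rating difference including home advantage (none at a neutral venue)."""
    return elo_home - elo_away + (0.0 if neutral else HOME_ADV)


def expected_goals(elo_home, elo_away, neutral=False):
    """Hypothetical mapping: split GOALS_BASE around an Elo-implied goal difference."""
    gd = ELO_TO_GD * elo_gap(elo_home, elo_away, neutral)
    lam = max((GOALS_BASE + gd) / 2, 0.05)   # home goal rate, floored to stay positive
    mu = max((GOALS_BASE - gd) / 2, 0.05)    # away goal rate
    return lam, mu


def poisson_pmf(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)


def dc_tau(x, y, lam, mu, rho):
    """Dixon-Coles adjustment; only the 0-0, 0-1, 1-0 and 1-1 scores change."""
    if x == 0 and y == 0:
        return 1 - lam * mu * rho
    if x == 0 and y == 1:
        return 1 + lam * rho
    if x == 1 and y == 0:
        return 1 + mu * rho
    if x == 1 and y == 1:
        return 1 - rho
    return 1.0


def probs_1x2(elo_home, elo_away, neutral=False, max_goals=10):
    """Home/draw/away probabilities from a truncated, DC-adjusted Poisson score grid."""
    lam, mu = expected_goals(elo_home, elo_away, neutral)
    home = draw = away = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = dc_tau(x, y, lam, mu, RHO) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    total = home + draw + away   # renormalise the mass lost by truncating at max_goals
    return home / total, draw / total, away / total


def elo_update(elo_home, elo_away, result, neutral=False):
    """Plain Elo step; result is 1 (home win), 0.5 (draw) or 0 (away win)."""
    expected = 1 / (1 + 10 ** (-elo_gap(elo_home, elo_away, neutral) / 400))
    delta = K * (result - expected)
    return elo_home + delta, elo_away - delta


print(probs_1x2(1800, 1800))                 # equal ratings, home side at home
print(probs_1x2(1800, 1800, neutral=True))   # equal ratings, neutral venue
print(elo_update(1800, 1800, 1.0))           # home win from equal ratings
```

With ρ negative, the 0-0 and 1-1 factors exceed 1, so this construction shifts probability towards low-scoring draws compared with independent Poissons; the four adjustments cancel in total, so the grid still sums to 1 before truncation.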

Backtest over 5,143 matches since 2020-01-01 · fit method scipy.Nelder-Mead · generated 2026-06-05. "Beats the market" is a different, harder bar: see closing-line value (CLV) on the bet log.

Tournament 2026 — live

model vs actual results

As 2026 matches finish, the model's pre-kickoff predictions will be scored against the actual results here: an ongoing check that it stays calibrated on the tournament itself, not just on the historical backtest above. No matches have been graded yet; the first kick off on 11 June.