Benchmark / Leaderboard
Leaderboard
The headline signal is target adaptation gain — skill in bits over the own-routine baseline R2 — with calibration and target specificity (the permutation gate) as pass/fail context. The illustrative "overall" column is a pilot roll-up, not a canonical composite.
⚠️
Illustrative pilot data. Every row below is a mock baseline on benchmark version v0.1-pilot. No official submissions or empirical results exist yet. Higher is better for every metric except calibration error and long-horizon degradation (lower is better).
Loading leaderboard…
Metric definitions
- Overall score — illustrative pilot roll-up for display only.
- Target adaptation gain — Skill in bits over the own-routine baseline R2 (headline).
- Calibration error — top-label ECE; lower is better.
- Evidence attribution — whether cited evidence supports the forecast.
- Counterfactual consistency — coherence of action-conditioned forecasts.
- Temporal-order sensitivity / wrong-target penalty — skill lost under shuffled history / the permutation gate (higher = more target-specific).
- Long-horizon degradation — skill decay at long horizon; lower is better.
- Modality contribution — marginal skill from added evidence streams.