Benchmark / Leaderboard

Leaderboard

The headline signal is target adaptation gain — skill in bits over the own-routine baseline R2 — with calibration and target specificity (the permutation gate) as pass/fail context. The illustrative "overall" column is a pilot roll-up, not a canonical composite.

⚠️

Illustrative pilot data. Every row below is a mock baseline on benchmark version v0.1-pilot. No official submissions or empirical results exist yet. Higher is better for every metric except calibration error and long-horizon degradation (lower is better).

Loading leaderboard…

Metric definitions
  • Overall score — illustrative pilot roll-up for display only.
  • Target adaptation gain — Skill in bits over the own-routine baseline R2 (headline).
  • Calibration error — top-label ECE; lower is better.
  • Evidence attribution — whether cited evidence supports the forecast.
  • Counterfactual consistency — coherence of action-conditioned forecasts.
  • Temporal-order sensitivity / wrong-target penalty — skill lost under shuffled history / the permutation gate (higher = more target-specific).
  • Long-horizon degradation — skill decay at long horizon; lower is better.
  • Modality contribution — marginal skill from added evidence streams.