Benchmark / Leaderboard

Leaderboard

The headline signal is target adaptation gain — skill in bits over the own-routine baseline R2 — with calibration and target specificity (the permutation gate) as pass/fail context. The illustrative "overall" column is a pilot roll-up, not a canonical composite.

⚠️

Illustrative pilot data. Every row below is a mock baseline on benchmark version v0.1-pilot. No official submissions or empirical results exist yet. Higher is better for every metric except calibration error and long-horizon degradation (lower is better).

Track

Split

Submission type

Version

Organization

Loading leaderboard…

Metric definitions

Overall score — illustrative pilot roll-up for display only.
Target adaptation gain — Skill in bits over the own-routine baseline R2 (headline).
Calibration error — top-label ECE; lower is better.
Evidence attribution — whether cited evidence supports the forecast.
Counterfactual consistency — coherence of action-conditioned forecasts.
Temporal-order sensitivity / wrong-target penalty — skill lost under shuffled history / the permutation gate (higher = more target-specific).
Long-horizon degradation — skill decay at long horizon; lower is better.
Modality contribution — marginal skill from added evidence streams.