Tracks & evaluation splits

TargetSpace is not one dataset or one domain. It is a shared evaluation apparatus with multiple application tracks. Every track shares the same scoring spine (sealed forecasts, R1/R2 baselines, calibration, permutation specificity, evidence ablation, and deterministic outcome validation) and differs in target, evidence bands, horizon, validator, and readiness.

Honest readiness. TS-Personal is the flagship track and the only one currently instantiated, as a synthetic pre-pilot. The other tracks are planned or research-stage: no empirical validation exists for them yet, and TargetSpace does not claim to solve health, energy, robotics, or enterprise forecasting.

Domain tracks

For the apparatus, see paper Section 9, Table 8. Choose a track to see its target object, evidence bands, horizon, example target states, validator, and readiness.
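As a rough illustration of how a track's per-track fields sit alongside the shared scoring spine, here is a minimal Python sketch. It is not TargetSpace's actual schema: the `Track` class, the `Readiness` enum, the `runnable` property, and all field names are hypothetical and derived only from the attribute list on this page. Field values the page does not state are left as explicit placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Readiness(Enum):
    # Hypothetical labels paraphrasing the readiness note on this page.
    SYNTHETIC_PRE_PILOT = "synthetic pre-pilot"
    PLANNED = "planned"
    RESEARCH_STAGE = "research-stage"


# Components every track shares, as listed in the overview.
SCORING_SPINE = (
    "sealed forecasts",
    "R1/R2 baselines",
    "calibration",
    "permutation specificity",
    "evidence ablation",
    "deterministic outcome validation",
)


@dataclass(frozen=True)
class Track:
    """Hypothetical record of the fields in which tracks differ."""
    name: str
    target_object: str
    evidence_bands: Tuple[str, ...]
    horizon: str
    example_target_states: Tuple[str, ...]
    validator: str
    readiness: Readiness

    @property
    def runnable(self) -> bool:
        # Assumption: only an instantiated (synthetic pre-pilot) track can be run.
        return self.readiness is Readiness.SYNTHETIC_PRE_PILOT


# Placeholder values: the page does not specify these fields for TS-Personal.
UNSPECIFIED = "<not stated on this page>"
ts_personal = Track(
    name="TS-Personal",
    target_object=UNSPECIFIED,
    evidence_bands=(),
    horizon=UNSPECIFIED,
    example_target_states=(),
    validator=UNSPECIFIED,
    readiness=Readiness.SYNTHETIC_PRE_PILOT,
)
```

Under these assumptions, `ts_personal.runnable` is true, while a track marked `Readiness.PLANNED` or `Readiness.RESEARCH_STAGE` would report false, mirroring the readiness note above.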

Evaluation splits

Task regimes, organized as SWE-bench-style splits, are currently instantiated within the flagship TS-Personal track. Each split has its own input/output, metrics, baselines, example task, and submission requirements.

Evaluate on a track

The synthetic TS-Personal harness is runnable today. Submit results or request a private evaluation.
