What TargetSpace-Bench measures
TargetSpace-Bench evaluates target-conditioned longitudinal world modeling: the ability to build, update, and use a predictive model of a specific target from passive, multimodal, non-IID (not independent and identically distributed) observation. That ability is measured through calibrated, sealed forecasting.
Definition & scope
A target
A persistent, evolving entity tracked over time: a person, agent, system, organization, environment, or process. The flagship track targets a consenting individual; other tracks target patients, energy assets, embodied agents, and projects.
A target state
The latent configuration a target acts to reach or maintain — a commitment, priority, constraint, or regime — evaluated only through its externally observable consequences. The object to forecast is the transition between target states.
Passive longitudinal observation
Evidence accrued over time without acting on the target: metadata, text, audio, passive multimodal streams, location, physiology. The benchmark measures whether richer, correctly ordered observation improves target-specific forecasts.
The capability under test
Can a model become a predictive model of a particular target, rather than merely a model of generic scenes or generic events? A high score indicates calibrated, prospective, target-specific skill; it implies nothing about a target's inner life.
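To make "calibrated skill" concrete, here is a minimal sketch using the Brier score, a standard proper scoring rule for probabilistic forecasts. It assumes binary outcomes (a target-state transition either occurred or did not). The function names and the choice of Brier score are illustrative assumptions, not the benchmark's published metric.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; 0.0 is a perfect forecast.
    """
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def brier_skill(
    probs: Sequence[float],
    outcomes: Sequence[int],
    baseline_probs: Sequence[float],
) -> float:
    """Skill relative to a baseline forecaster (hypothetical helper).

    Positive means the model beats the baseline; 0 means no improvement.
    """
    reference = brier_score(baseline_probs, outcomes)
    if reference == 0.0:
        raise ValueError("baseline is already perfect; skill is undefined")
    return 1.0 - brier_score(probs, outcomes) / reference


# Usage: a model's forecasts versus an uninformative 50/50 baseline.
outcomes = [1, 0, 1, 1]
model = [0.8, 0.2, 0.7, 0.9]
baseline = [0.5, 0.5, 0.5, 0.5]
skill = brier_skill(model, outcomes, baseline)
```

A skill score above zero says only that the forecasts were better than the baseline on observed consequences, which matches the scope stated above.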
Different from generic dynamics
TargetSpace is not a video-realism, robotics-manipulation, or intuitive-physics benchmark. Those evaluate generic dynamics, realism, and control. TargetSpace evaluates target-specific dynamics: whether prediction improves when a model is given the correct target's history in the correct temporal order. A system can render plausible scenes, generate fluent continuations, or execute dexterous control and still be unable to track this target and forecast where it turns next.
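The ordering requirement can be probed with a permutation control: score a forecaster on the true history, then on shuffled copies of it, and compare losses. The sketch below is illustrative only. The `Forecaster` type, the squared-error loss, and the toy forecasters are assumptions, not the benchmark's protocol.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical interface: a forecaster maps an observation history
# to a probability that the next transition occurs.
Forecaster = Callable[[Sequence[float]], float]


def squared_error(prob: float, outcome: int) -> float:
    return (prob - outcome) ** 2


def ordering_gap(
    forecast: Forecaster,
    history: Sequence[float],
    outcome: int,
    n_perm: int = 200,
    seed: int = 0,
) -> float:
    """Mean loss under shuffled order minus loss under the true order.

    A clearly positive gap suggests the forecaster uses temporal order;
    a gap near zero suggests it ignores order.
    """
    rng = random.Random(seed)
    true_loss = squared_error(forecast(history), outcome)
    shuffled_losses: List[float] = []
    for _ in range(n_perm):
        permuted = list(history)
        rng.shuffle(permuted)
        shuffled_losses.append(squared_error(forecast(permuted), outcome))
    return sum(shuffled_losses) / n_perm - true_loss


# Toy forecasters: one reads the most recent observation, one averages.
def last_observation(history: Sequence[float]) -> float:
    return history[-1]


def history_mean(history: Sequence[float]) -> float:
    return sum(history) / len(history)


history = [0.1, 0.2, 0.9]  # made-up signal rising toward a transition
gap_ordered = ordering_gap(last_observation, history, outcome=1)
gap_unordered = ordering_gap(history_mean, history, outcome=1)
```

The same comparison could be run with another target's history in place of the shuffled one to probe target specificity rather than ordering.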
How it complements other benchmark families
The families are complementary, not competing. Each is the right instrument for a different question; TargetSpace is compared with them only on shared axes.
| Benchmark family | Primary object of evaluation | Typical input | Typical output | What it misses | How TargetSpace complements it |
|---|---|---|---|---|---|
| Physical reasoning / intuitive physics | physical plausibility, object permanence, causality, spatial continuity | short scenes / clips | plausibility or violation judgment | a persistent target; longitudinal adaptation; calibration over time | adds target-specific dynamics over generic physical law |
| Video generation / world simulation | visual realism, temporal consistency, plausible scene evolution | context frames | generated continuation | target identity; sealed prospective scoring; proper calibration | scores latent target-state transitions, not surface reconstruction |
| Embodied robotics | action utility, policy evaluation, manipulation / control success | proprioception, sensors, actions | actions; achieved configuration | passive longitudinal inference; calibration; an own-routine baseline | scores passive consequence forecasts, not control |
| Symbolic / event forecasting | probabilistic prediction of public events | question + context | calibrated probability | a tracked individual target; own-routine R2; permutation specificity | adds the target as the unit, with R2 + permutation controls |
| Agent memory / personalization | recall, preference modeling, retrieval QA | history / profile + query | held-out response / preference | prospective sealing; calibration; transition forecasting | scores why an episode matters and forecasts the next transition, sealed |
| TargetSpace-Bench (this work) | target-conditioned longitudinal world modeling | passive multimodal observation up to sealed T | calibrated forecast over target-state transitions | by design: physical realism, control, generation fidelity | is the complementary layer the other families omit |
Built like the benchmarks researchers trust
We borrow the structure and seriousness of established efforts — not their branding.
Challenge & leaderboard (ARC-style)
A clear mission, a public leaderboard, and explicit benchmark versions — with contamination-resistant, prospective evaluation rather than a static answer key.
Submissions & splits (SWE-bench-style)
Defined task splits, a submission pipeline, and a verification path so leaderboard credibility rests on reproducibility, not self-report.
Transparent evaluation (HELM-style)
An explicit evaluation philosophy: scenarios (tracks/splits), multiple metrics reported side by side, and calibration treated as first-class.
Governance & verification (MLCommons-style)
Versioned rules, a clear distinction between official and unofficial submissions (public, verified, and private-eval), and organizer-run private evaluation for high-stakes claims.
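One common way to support sealed, prospective evaluation is a hash commitment: a submitter publishes a digest of the forecast before the outcome is known, then reveals the forecast and nonce for verification afterwards. The sketch below assumes this commit-and-reveal pattern purely for illustration; the source does not describe the benchmark's actual sealing mechanism, and `seal` and `verify` are hypothetical names.

```python
import hashlib
import hmac
import json
import secrets
from typing import Tuple


def _canonical(forecast: dict) -> str:
    # Sorted keys and fixed separators make the serialization deterministic.
    return json.dumps(forecast, sort_keys=True, separators=(",", ":"))


def seal(forecast: dict) -> Tuple[str, str]:
    """Return (digest, nonce). Publish the digest now; keep the nonce private."""
    nonce = secrets.token_hex(16)
    digest = hashlib.sha256((nonce + _canonical(forecast)).encode("utf-8")).hexdigest()
    return digest, nonce


def verify(forecast: dict, nonce: str, digest: str) -> bool:
    """Check that a revealed forecast matches the digest published earlier."""
    expected = hashlib.sha256((nonce + _canonical(forecast)).encode("utf-8")).hexdigest()
    return hmac.compare_digest(expected, digest)


# Usage: seal before the outcome is known, verify after it is revealed.
forecast = {"target": "example-target", "p_transition": 0.72}  # made-up values
digest, nonce = seal(forecast)
ok = verify(forecast, nonce, digest)
tampered = verify({"target": "example-target", "p_transition": 0.99}, nonce, digest)
```

The random nonce keeps a low-entropy forecast from being guessed from its digest, and the published digest prevents the forecast from being changed after the outcome is observed.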