
What TargetSpace-Bench measures

TargetSpace-Bench evaluates target-conditioned longitudinal world modeling: the ability to build, update, and use a predictive model of a specific target from passive, multimodal, non-IID observation — measured through calibrated, sealed forecasting.

1 | capability: become a predictive model of a particular target
R2 | headline baseline: the target's own routine
5+6 | domain tracks & evaluation splits
bits | skill, by strictly proper scoring

Definition & scope

A target

A persistent, evolving entity tracked over time: a person, agent, system, organization, environment, or process. The flagship track targets a consenting individual; other tracks target patients, energy assets, embodied agents, and projects.

A target state

The latent configuration a target acts to reach or maintain — a commitment, priority, constraint, or regime — evaluated only through its externally observable consequences. The object to forecast is the transition between target states.

Passive longitudinal observation

Evidence accrued over time without acting on the target: metadata, text, audio, passive multimodal streams, location, physiology. The benchmark measures whether richer, correctly ordered observation improves target-specific forecasts.

The capability under test

Can a model become a predictive model of a particular target, rather than merely a model of generic scenes or generic events? A high score means calibrated, prospective, target-specific skill; it says nothing about a target's inner life.
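
As a rough illustration of what skill in bits could look like, here is a minimal sketch. It assumes skill is the mean log2 probability a model assigned to the outcomes that occurred, minus the same quantity for the own-routine baseline (R2). The benchmark states only that skill is reported in bits under strictly proper scoring; the formula, the function names, and the sample probabilities are illustrative assumptions, not the official scorer.

```python
import math
from typing import Sequence


def mean_log2_score(probs: Sequence[float], eps: float = 1e-12) -> float:
    """Mean log2 probability assigned to the outcomes that actually occurred.

    The logarithmic score is strictly proper: a forecaster maximizes its
    expected value only by reporting its true beliefs.
    """
    if not probs:
        raise ValueError("need at least one forecast")
    # Clamp at eps so a forecast of exactly 0 gives a large finite penalty.
    return sum(math.log2(max(p, eps)) for p in probs) / len(probs)


def skill_bits(model_probs: Sequence[float],
               baseline_probs: Sequence[float]) -> float:
    """Bits gained per forecast over the baseline (assumed definition)."""
    if len(model_probs) != len(baseline_probs):
        raise ValueError("model and baseline must cover the same forecasts")
    return mean_log2_score(model_probs) - mean_log2_score(baseline_probs)


# Hypothetical probabilities each forecaster gave to what actually happened.
model_probs = [0.5, 0.5, 0.25, 1.0]
routine_probs = [0.25, 0.25, 0.25, 0.5]  # stand-in for the R2 baseline

print(skill_bits(model_probs, routine_probs))  # 0.75 bits per forecast
```

A positive value means the model beat the target's own routine on average; zero means it added nothing beyond that routine.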

Different from generic dynamics

TargetSpace is not a video-realism, robotics-manipulation, or intuitive-physics benchmark. Those evaluate generic dynamics, realism, and control. TargetSpace evaluates target-specific dynamics: whether prediction improves when a model is given the correct target's history in the correct temporal order. A system can render plausible scenes, generate fluent continuations, or execute dexterous control and still be unable to track this target and forecast where it turns next.
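
The "correct target, correct order" test can be sketched as a pair of permutation controls. This is a minimal illustration under stated assumptions: the `permutation_gaps` function, the toy `repeat_last` forecaster, and the two-target data are all hypothetical, and the benchmark's actual control procedure may differ. The idea is to score the same forecaster three ways: with the correct history, with another target's history, and with the correct history in shuffled order.

```python
import math
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence

# A forecaster maps (history, realized outcome) to the probability it
# assigned to that outcome. This interface is hypothetical.
Forecaster = Callable[[Sequence[str], str], float]


def log2_score(p: float, eps: float = 1e-12) -> float:
    return math.log2(max(p, eps))


def permutation_gaps(forecast: Forecaster,
                     histories: Dict[str, List[str]],
                     outcomes: Dict[str, str],
                     n_shuffles: int = 200,
                     seed: int = 0) -> Dict[str, float]:
    """Score gain (in bits) of the correct history over two controls.

    Requires at least two targets so that a swap is possible.
    """
    rng = random.Random(seed)
    targets = list(histories)

    correct = mean([log2_score(forecast(histories[t], outcomes[t]))
                    for t in targets])

    # Control 1: forecast each target's outcome from another target's history.
    swapped = mean([log2_score(forecast(histories[u], outcomes[t]))
                    for t in targets for u in targets if u != t])

    # Control 2: keep the right target but destroy the temporal order.
    shuffled_scores = []
    for t in targets:
        for _ in range(n_shuffles):
            h = list(histories[t])
            rng.shuffle(h)
            shuffled_scores.append(log2_score(forecast(h, outcomes[t])))

    return {
        "vs_swapped_target": correct - swapped,
        "vs_shuffled_order": correct - mean(shuffled_scores),
    }


def repeat_last(history: Sequence[str], outcome: str) -> float:
    """Toy forecaster: puts 0.8 on 'the next state equals the last one seen'."""
    return 0.8 if history[-1] == outcome else 0.2


histories = {"A": ["work", "work", "home"], "B": ["home", "home", "work"]}
outcomes = {"A": "home", "B": "work"}
gaps = permutation_gaps(repeat_last, histories, outcomes)
print(gaps)
```

Both gaps should be clearly positive for a model that is tracking the target. A gap near zero suggests the forecasts do not depend on whose history was supplied, or on its order.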

How it complements other benchmark families

The families are complementary, not competing. Each is the right instrument for a different question; TargetSpace is compared with them only on shared axes.

| Benchmark family | Primary object of evaluation | Typical input | Typical output | What it misses | How TargetSpace complements it |
|---|---|---|---|---|---|
| Physical reasoning / intuitive physics | physical plausibility, object permanence, causality, spatial continuity | short scenes / clips | plausibility or violation judgment | a persistent target; longitudinal adaptation; calibration over time | adds target-specific dynamics over generic physical law |
| Video generation / world simulation | visual realism, temporal consistency, plausible scene evolution | context frames | generated continuation | target identity; sealed prospective scoring; proper calibration | scores latent target-state transitions, not surface reconstruction |
| Embodied robotics | action utility, policy evaluation, manipulation / control success | proprioception, sensors, actions | actions; achieved configuration | passive longitudinal inference; calibration; an own-routine baseline | scores passive consequence forecasts, not control |
| Symbolic / event forecasting | probabilistic prediction of public events | question + context | calibrated probability | a tracked individual target; own-routine R2; permutation specificity | adds the target as the unit, with R2 + permutation controls |
| Agent memory / personalization | recall, preference modeling, retrieval QA | history / profile + query | held-out response / preference | prospective sealing; calibration; transition forecasting | scores why an episode matters and forecasts the next transition, sealed |
| TargetSpace-Bench (this work) | target-conditioned longitudinal world modeling | passive multimodal observation up to sealed T | calibrated forecast over target-state transitions | by design: physical realism, control, generation fidelity | is the complementary layer the other families omit |

Built like the benchmarks researchers trust

We borrow the structure and seriousness of established efforts — not their branding.

Challenge & leaderboard (ARC-style)

A clear mission, a public leaderboard, and explicit benchmark versions — with contamination-resistant, prospective evaluation rather than a static answer key.

Submissions & splits (SWE-bench-style)

Defined task splits, a submission pipeline, and a verification path so leaderboard credibility rests on reproducibility, not self-report.

Transparent evaluation (HELM-style)

An explicit evaluation philosophy: scenarios (tracks/splits), multiple metrics reported side by side, and calibration treated as first-class.
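
To make calibration concrete, here is a minimal sketch of a binned reliability check for binary forecasts. The bin count, function names, and sample data are illustrative assumptions; the benchmark's own calibration metric is not specified here. Note that expected calibration error is a diagnostic rather than a strictly proper score, which is one reason to report it beside a proper score instead of in place of one.

```python
from typing import List, Sequence, Tuple


def reliability_table(probs: Sequence[float],
                      outcomes: Sequence[int],
                      n_bins: int = 5) -> List[Tuple[float, float, int]]:
    """Return (mean forecast, observed frequency, count) per non-empty bin."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Equal-width bins on [0, 1]; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            rows.append((mean_p, freq, len(members)))
    return rows


def expected_calibration_error(probs: Sequence[float],
                               outcomes: Sequence[int],
                               n_bins: int = 5) -> float:
    """Count-weighted mean gap between forecast and observed frequency."""
    n = len(probs)
    return sum(count / n * abs(mean_p - freq)
               for mean_p, freq, count in reliability_table(probs, outcomes, n_bins))


# Hypothetical forecasts and what happened (1 = event occurred).
probs = [0.1, 0.1, 0.9, 0.9]
outcomes = [0, 0, 1, 1]
ece = expected_calibration_error(probs, outcomes)
print(reliability_table(probs, outcomes))
print(ece)
```

In this toy case the forecaster is slightly underconfident: events forecast at 0.9 always occurred and events forecast at 0.1 never did, giving a gap of about 0.1 in each bin.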

Governance & verification (MLCommons-style)

Versioned rules, official vs unofficial (public/verified/private-eval) submissions, and organizer-run private evaluation for high-stakes claims.