What TargetSpace-Bench measures

TargetSpace-Bench evaluates target-conditioned longitudinal world modeling: the ability to build, update, and use a predictive model of a specific target from passive, multimodal, non-IID observation — measured through calibrated, sealed forecasting.

1 capability: become a predictive model of a particular target

R2 headline baseline: the target's own routine

5+6 domain tracks & evaluation splits

bits of skill, by strictly proper scoring

Definition & scope

A target

A persistent, evolving entity tracked over time: a person, agent, system, organization, environment, or process. The flagship track targets a consenting individual; other tracks target patients, energy assets, embodied agents, and projects.

A target state

The latent configuration a target acts to reach or maintain — a commitment, priority, constraint, or regime — evaluated only through its externally observable consequences. The object to forecast is the transition between target states.

Passive longitudinal observation

Evidence accrued over time without acting on the target: metadata, text, audio, passive multimodal streams, location, physiology. The benchmark measures whether richer, correctly-ordered observation improves target-specific forecasts.

The capability under test

Can a model become a predictive model of a particular target, rather than merely a model of generic scenes or generic events? A high score indicates calibrated, prospective, target-specific skill; it implies nothing about a target's inner life.
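As a rough illustration of what skill in bits under a strictly proper scoring rule can look like, the sketch below compares a model's forecasts with a baseline's using the logarithmic score. It assumes binary outcomes and the log score; the benchmark's actual scoring rule, baseline construction, and aggregation are not specified in this overview, and the names `log_score_bits` and `skill_bits` are hypothetical.

```python
import math


def log_score_bits(p, outcome):
    """Log score in bits for a binary forecast, where p = P(outcome == 1).

    Assumption: binary outcomes. Probabilities are clipped away from
    0 and 1 so a confident miss yields a large finite penalty.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return math.log2(p if outcome == 1 else 1.0 - p)


def skill_bits(model_probs, baseline_probs, outcomes):
    """Mean log-score advantage of the model over a baseline, in bits per forecast.

    Positive values mean the model beat the baseline; zero means no gain.
    """
    n = len(outcomes)
    if n == 0 or len(model_probs) != n or len(baseline_probs) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for m, b, y in zip(model_probs, baseline_probs, outcomes):
        total += log_score_bits(m, y) - log_score_bits(b, y)
    return total / n


if __name__ == "__main__":
    # Toy data: the baseline is uninformative (0.5); the model leans the right way.
    print(skill_bits([0.8, 0.2], [0.5, 0.5], [1, 0]))
```

On this toy data the model gains about 0.68 bits per forecast over the uninformative baseline. Because the log score is strictly proper, the expected score is maximized only by reporting one's true probabilities, so overconfident forecasts are penalized rather than rewarded.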

Different from generic dynamics

TargetSpace is not a video-realism, robotics-manipulation, or intuitive-physics benchmark. Those evaluate generic dynamics, realism, and control. TargetSpace evaluates target-specific dynamics: whether prediction improves when a model is given the correct target's history in the correct temporal order. A system can render plausible scenes, generate fluent continuations, or execute dexterous control and still be unable to track this target and forecast where it turns next.
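The idea that prediction should improve only with the correct target's history in the correct order can be sketched as two simple controls: shuffle the history's order, and swap in a different target's history. This is a minimal sketch under stated assumptions, not the benchmark's protocol; `specificity_gaps`, `score_fn`, and the toy forecaster are hypothetical names introduced for illustration.

```python
import random


def specificity_gaps(score_fn, own_history, other_history, n_shuffles=200, seed=0):
    """Compare a score on the true history against two controls.

    score_fn(history) -> float, higher is better (hypothetical interface).
    order_gap:  true score minus the mean score over shuffled orderings.
    target_gap: true score minus the score on another target's history.
    """
    rng = random.Random(seed)
    true_score = score_fn(own_history)

    shuffled_scores = []
    for _ in range(n_shuffles):
        perm = list(own_history)
        rng.shuffle(perm)  # destroys temporal order, keeps the same content
        shuffled_scores.append(score_fn(perm))

    return {
        "order_gap": true_score - sum(shuffled_scores) / n_shuffles,
        "target_gap": true_score - score_fn(other_history),
    }


def toy_score(history, actual_next=6):
    """Toy forecaster: extrapolate the last step linearly; score is negative absolute error."""
    prediction = history[-1] + (history[-1] - history[-2])
    return -abs(prediction - actual_next)


if __name__ == "__main__":
    gaps = specificity_gaps(toy_score, [1, 2, 3, 4, 5], [10, 10, 10, 10, 10])
    print(gaps)
```

If both gaps are near zero, the model is not using target-specific or order-specific information; clearly positive gaps suggest that it is.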

How it complements other benchmark families

The families are complementary, not competing. Each is the right instrument for a different question; TargetSpace is compared with them only on shared axes.

Benchmark family	Primary object of evaluation	Typical input	Typical output	What it misses	How TargetSpace complements it
Physical reasoning / intuitive physics	physical plausibility, object permanence, causality, spatial continuity	short scenes / clips	plausibility or violation judgment	a persistent target; longitudinal adaptation; calibration over time	adds target-specific dynamics over generic physical law
Video generation / world simulation	visual realism, temporal consistency, plausible scene evolution	context frames	generated continuation	target identity; sealed prospective scoring; proper calibration	scores latent target-state transitions, not surface reconstruction
Embodied robotics	action utility, policy evaluation, manipulation / control success	proprioception, sensors, actions	actions; achieved configuration	passive longitudinal inference; calibration; an own-routine baseline	scores passive consequence forecasts, not control
Symbolic / event forecasting	probabilistic prediction of public events	question + context	calibrated probability	a tracked individual target; own-routine R2; permutation specificity	adds the target as the unit, with R2 + permutation controls
Agent memory / personalization	recall, preference modeling, retrieval QA	history / profile + query	held-out response / preference	prospective sealing; calibration; transition forecasting	scores why an episode matters and forecasts the next transition, sealed
TargetSpace-Bench (this work)	target-conditioned longitudinal world modeling	passive multimodal observation up to sealed T	calibrated forecast over target-state transitions	by design: physical realism, control, generation fidelity	is the complementary layer the other families omit

Built like the benchmarks researchers trust

We borrow the structure and seriousness of established efforts — not their branding.

Challenge & leaderboard (ARC-style)

A clear mission, a public leaderboard, and explicit benchmark versions — with contamination-resistant, prospective evaluation rather than a static answer key.

Submissions & splits (SWE-bench-style)

Defined task splits, a submission pipeline, and a verification path so leaderboard credibility rests on reproducibility, not self-report.

Transparent evaluation (HELM-style)

An explicit evaluation philosophy: scenarios (tracks/splits), multiple metrics reported side by side, and calibration treated as first-class.

Governance & verification (MLCommons-style)

Versioned rules, official vs unofficial (public/verified/private-eval) submissions, and organizer-run private evaluation for high-stakes claims.
