
Baselines & controls

TargetSpace is an evaluation protocol, not just a dataset. A result is meaningful only when reported across a battery of conditions, not as a single number.


The central signal is not merely whether a model predicts the future, but whether prediction improves when the model is given the correct target's history in the correct temporal order.

The battery

Baseline / control	What it is	What it isolates	Expected signal
Zero-history (R1)	Population prior; current prompt/observation only.	the crowd base rate	entry condition: beat it to play
Short-history	A limited recent window.	where added history first pays	locates the onset of longitudinal value
Longitudinal-history	Extended target history.	the core regime	the condition the benchmark rewards
Own-routine (R2)	The target's recency-weighted routine.	routine vs. genuine adaptation	skill over R2, in bits, is the headline
Shuffled-history control	Correct target, history in scrambled order.	temporal order	skill should drop if order matters
Wrong-target control	History from a different target (permutation gate).	target identity	skill should collapse
Ablated-modality	Remove audio / video / location / text / metadata.	each modality's marginal value	the evidence-tier ablation
Retrieval-only	Surfaces what happened; no target model.	recall vs. understanding	little skill over R2 under the permutation gate
Human baseline	The target predicting itself, or an expert predicting the target.	the achievable range	anchor, not ranked
Oracle / context upper bound	Curated relevant context.	ceiling performance	approximate upper bound
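
The table reports skill over R2 in bits but does not spell out the formula. A minimal sketch, assuming skill is the difference in mean log2 probability assigned to the realized outcome (a log-score difference); the function names and the example probabilities are hypothetical, not part of the protocol:

```python
import math

def mean_log2_score(probs_for_outcome):
    """Average log2 probability a forecaster gave to what actually happened.

    Assumes every probability is > 0; math.log2(0) raises ValueError.
    """
    return sum(math.log2(p) for p in probs_for_outcome) / len(probs_for_outcome)

def skill_bits(model_probs, baseline_probs):
    """Skill over a baseline in bits per prediction (positive = better than baseline)."""
    if len(model_probs) != len(baseline_probs):
        raise ValueError("model and baseline must be scored on the same events")
    return mean_log2_score(model_probs) - mean_log2_score(baseline_probs)

# Hypothetical probabilities assigned to the realized outcome on four events.
r2_probs = [0.5, 0.5, 0.25, 0.25]     # own-routine baseline (R2)
model_probs = [0.5, 1.0, 0.5, 0.25]   # system under test

print(skill_bits(model_probs, r2_probs))  # 0.5 bits per prediction over R2
```

Under this assumed definition, a system that merely reproduces the target's routine scores zero bits over R2, which is why R2 rather than the population prior serves as the reference for the headline number.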

Why ablations matter

If the controls don't move

If target history doesn't help, the result is not measuring the model's use of that history.
If temporal order doesn't matter, the model is not using dynamics.
If the wrong target scores the same, the result is not target-specific.
If modality removal is free, the evidence wasn't used.

Then the result should be read as null.
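
These checks can be sketched as a comparison of per-condition skill against the longitudinal condition. This is an illustration only: the dictionary keys, the `tolerance` threshold, and the example values are hypothetical, and the protocol itself does not state how large a change counts as "moving".

```python
def null_reasons(skill, tolerance=0.01):
    """List the controls that failed to move, given skill in bits per condition.

    `skill` uses hypothetical keys: "longitudinal", "zero_history",
    "shuffled_history", "wrong_target", and "ablated" (modality -> skill).
    """
    full = skill["longitudinal"]
    reasons = []
    if full - skill["zero_history"] <= tolerance:
        reasons.append("target history does not help")
    if full - skill["shuffled_history"] <= tolerance:
        reasons.append("temporal order does not matter")
    if full - skill["wrong_target"] <= tolerance:
        reasons.append("wrong target scores the same")
    for modality, ablated_skill in skill["ablated"].items():
        if full - ablated_skill <= tolerance:
            reasons.append(f"removing {modality} is free")
    return reasons

# Hypothetical run: shuffling barely hurts, and audio was never used.
example = {
    "longitudinal": 0.40,
    "zero_history": 0.05,
    "shuffled_history": 0.395,
    "wrong_target": 0.10,
    "ablated": {"audio": 0.40, "text": 0.20},
}
print(null_reasons(example))
```

An empty list means every control moved; any entry names the specific claim the result cannot support.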

What a positive result must show

Skill improves under longitudinal, target-specific conditions.
Skill degrades under shuffled-history and wrong-target controls.
Calibration holds; confidence tracks uncertainty.
The system is using target-specific dynamics, not generic priors.

This is the inference R2 and the permutation gate are designed to license.
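
One way to picture the permutation gate is as a permutation test over history assignments. The sketch below assumes a hypothetical `skill_fn(target, history_owner)` that returns skill in bits when `target` is predicted from `history_owner`'s history; the function names, the number of permutations, and the add-one p-value are illustrative choices, not the benchmark's specified procedure.

```python
import random

def derangement(ids, rng):
    """Random reassignment in which no target keeps its own history."""
    ids = list(ids)
    if len(ids) < 2:
        raise ValueError("need at least two targets")
    while True:
        perm = ids[:]
        rng.shuffle(perm)
        if all(a != b for a, b in zip(ids, perm)):
            return dict(zip(ids, perm))

def permutation_gate(skill_fn, target_ids, n_perm=200, seed=0):
    """Compare mean skill with correct histories against reassigned histories."""
    rng = random.Random(seed)
    ids = list(target_ids)
    observed = sum(skill_fn(t, t) for t in ids) / len(ids)
    null = []
    for _ in range(n_perm):
        assign = derangement(ids, rng)
        null.append(sum(skill_fn(t, assign[t]) for t in ids) / len(ids))
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + sum(s >= observed for s in null)) / (1 + n_perm)
    return observed, p_value

# Toy scorers (hypothetical): one is target-specific, one ignores identity.
def specific(target, owner):
    return 0.5 if target == owner else 0.0

def generic(target, owner):
    return 0.3

targets = ["t1", "t2", "t3", "t4"]
print(permutation_gate(specific, targets))  # skill collapses under reassignment
print(permutation_gate(generic, targets))   # skill unchanged, so the gate fails
```

In this toy setup the target-specific scorer yields a small p-value because no reassignment matches its observed skill, while the generic scorer yields a p-value of 1, which is the pattern the wrong-target control is meant to expose.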

See the battery on the leaderboard

Each mock baseline row reports adaptation gain, calibration, and specificity context.
