Baselines & controls
TargetSpace is an evaluation protocol, not just a dataset. A result is meaningful only as a battery of conditions, not as a single number.
The central signal is not merely whether a model predicts the future, but whether prediction improves when the model is given the correct target's history in the correct temporal order.
The battery
| Baseline / control | What it is | What it isolates | Expected signal |
|---|---|---|---|
| Zero-history (R1) | Population prior; current prompt/observation only. | the crowd base rate | entry condition: beat it to play |
| Short-history | A limited recent window. | where added history first pays | locates the onset of longitudinal value |
| Longitudinal-history | Extended target history. | the core regime | the condition the benchmark rewards |
| Own-routine (R2) | The target's recency-weighted routine. | routine vs. genuine adaptation | skill over R2, in bits, is the headline |
| Shuffled-history control | Correct target, history in scrambled order. | temporal order | skill should drop if order matters |
| Wrong-target control | History from a different target (permutation gate). | target identity | skill should collapse |
| Ablated-modality | Remove audio / video / location / text / metadata. | each modality's marginal value | the evidence-tier ablation |
| Retrieval-only | Surfaces what happened; no target model. | recall vs. understanding | little skill over R2 under the permutation gate |
| Human baseline | The target predicting itself, or an expert predicting the target. | the achievable range | anchor, not ranked |
| Oracle / context upper bound | Curated relevant context. | ceiling performance | approximate upper bound |
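The two history controls in the table can be built mechanically from data the benchmark already has. The sketch below is a minimal illustration, not the protocol's reference implementation: the function names, the list-of-events representation of a history, and the `seed` parameter are all assumptions made for this example.

```python
import random
from typing import Dict, List, Sequence


def shuffled_history(history: Sequence[str], seed: int = 0) -> List[str]:
    """Shuffled-history control: same target, same events, scrambled order.

    Hypothetical helper. Very short histories can come back in their
    original order by chance, so a caller may want to re-seed in that case.
    """
    rng = random.Random(seed)
    scrambled = list(history)  # copy so the caller's history is untouched
    rng.shuffle(scrambled)
    return scrambled


def wrong_target_assignment(target_ids: Sequence[str], seed: int = 0) -> Dict[str, str]:
    """Wrong-target control: map every target to a *different* target's history.

    Hypothetical helper. Shuffles the ids and pairs each one with its
    successor in the shuffled cycle, so no target is ever paired with itself.
    """
    ids = list(target_ids)
    if len(ids) < 2 or len(set(ids)) != len(ids):
        raise ValueError("need at least two distinct target ids")
    rng = random.Random(seed)
    rng.shuffle(ids)
    return {ids[i]: ids[(i + 1) % len(ids)] for i in range(len(ids))}
```

Pairing along a single shuffled cycle is one simple way to guarantee that no target keeps its own history; other permutation schemes would serve the same purpose.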
Why ablations matter
If the controls don't move
- If target history doesn't help, the result does not measure the ability to use that history.
- If temporal order doesn't matter, the system is not using dynamics.
- If the wrong target scores the same, the result is not target-specific.
- If modality removal is free, the evidence wasn't used.
Then the result should be read as null.
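These checks can be expressed as a small comparison against the longitudinal-history condition. The sketch below is illustrative only: the condition keys (`longitudinal`, `zero_history`, `shuffled_history`, `wrong_target`, `ablate_<modality>`), the function name, and the `min_delta` tolerance are assumptions, and the source does not state whether a single unmoved control is enough to call the result null, so the function only lists which controls failed to move.

```python
from typing import Dict, List


def unmoved_controls(skill: Dict[str, float], min_delta: float = 0.0) -> List[str]:
    """List the controls whose skill is not below the longitudinal condition.

    `skill` maps a (hypothetical) condition name to its skill score, where
    higher is better. A control has "moved" only if the longitudinal
    condition beats it by more than `min_delta`.
    """
    full = skill["longitudinal"]
    findings: List[str] = []
    if full - skill["zero_history"] <= min_delta:
        findings.append("target history does not help")
    if full - skill["shuffled_history"] <= min_delta:
        findings.append("temporal order does not matter")
    if full - skill["wrong_target"] <= min_delta:
        findings.append("wrong target scores the same")
    # Ablations are reported per modality, in alphabetical order.
    for name, value in sorted(skill.items()):
        if name.startswith("ablate_") and full - value <= min_delta:
            findings.append(f"removing {name[len('ablate_'):]} is free")
    return findings
```

An empty list means every control moved in the expected direction. In practice `min_delta` would be set from the run-to-run noise of the skill estimate rather than left at zero.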
What a positive result must show
- Skill improves under longitudinal, target-specific conditions.
- Skill degrades under shuffled-history and wrong-target controls.
- Calibration holds; confidence tracks uncertainty.
- The system is using target-specific dynamics, not generic priors.
This is the inference R2 and the permutation gate are designed to license.
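One way to make "skill over R2, in bits" concrete is as an average log-probability gain over the baseline. The sketch below assumes that definition, which the page itself does not spell out: skill is the mean of log2 of the probability the model gave the realized outcome, minus the same quantity for the baseline. The function name and input layout are hypothetical.

```python
import math
from typing import Sequence


def skill_bits(p_model: Sequence[float], p_baseline: Sequence[float]) -> float:
    """Mean log2 gain of the model over a baseline (assumed definition).

    Each entry is the probability assigned to the outcome that actually
    occurred for one prediction. Positive values mean the model beats the
    baseline; zero means no gain.
    """
    if len(p_model) != len(p_baseline) or len(p_model) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    total = 0.0
    for pm, pb in zip(p_model, p_baseline):
        if not (0.0 < pm <= 1.0 and 0.0 < pb <= 1.0):
            raise ValueError("probabilities must lie in (0, 1]")
        total += math.log2(pm) - math.log2(pb)
    return total / len(p_model)
```

Under this definition, a model that assigns probability 0.5 to every realized outcome where the baseline assigned 0.25 scores one bit of skill. Scoring the same function under the shuffled-history and wrong-target conditions gives the comparisons the list above asks for.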
See the battery on the leaderboard
Each mock baseline row reports adaptation gain, calibration, and specificity context.