Baselines & controls

TargetSpace is an evaluation protocol, not just a dataset. A result is meaningful only when reported across a battery of conditions, not as a single number.

The central signal is not merely whether a model predicts the future, but whether prediction improves when the model is given the correct target's history in the correct temporal order.

The battery

| Baseline / control | What it is | What it isolates | Expected signal |
|---|---|---|---|
| Zero-history (R1) | Population prior; current prompt/observation only. | the crowd base rate | entry condition: beat it to play |
| Short-history | A limited recent window. | where added history first pays | locates the onset of longitudinal value |
| Longitudinal-history | Extended target history. | the core regime | the condition the benchmark rewards |
| Own-routine (R2) | The target's recency-weighted routine. | routine vs. genuine adaptation | skill over R2, in bits, is the headline |
| Shuffled-history control | Correct target, history in scrambled order. | temporal order | skill should drop if order matters |
| Wrong-target control | History from a different target (permutation gate). | target identity | skill should collapse |
| Ablated-modality | Remove audio / video / location / text / metadata. | each modality's marginal value | the evidence-tier ablation |
| Retrieval-only | Surfaces what happened; no target model. | recall vs. understanding | little skill over R2 under the permutation gate |
| Human baseline | The target or an expert predicting itself. | the achievable range | anchor, not ranked |
| Oracle / context upper bound | Curated relevant context. | ceiling performance | approximate upper bound |
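As an illustration of how the history-based conditions in the battery relate to one another, the sketch below derives them from a single store of per-target histories. It is a minimal sketch, not the benchmark's implementation: the function name `build_conditions`, the event shape (a dict with a timestamp key `t`), the `short_window` size, and the random choice of donor target are all assumptions made for this example.

```python
import random
from typing import Dict, List

Event = dict  # hypothetical event record, e.g. {"t": 3, "obs": "..."}


def build_conditions(
    histories: Dict[str, List[Event]],
    target_id: str,
    short_window: int = 5,   # assumed window size, not specified by the protocol
    seed: int = 0,
) -> Dict[str, List[Event]]:
    """Derive the history-based conditions for one target (illustrative only)."""
    donors = [k for k in histories if k != target_id]
    if not donors:
        raise ValueError("wrong-target control needs at least one other target")

    rng = random.Random(seed)
    own = sorted(histories[target_id], key=lambda e: e["t"])

    # Shuffled-history: same target, same events, scrambled order.
    shuffled = own[:]
    rng.shuffle(shuffled)
    if shuffled == own and len(own) > 1:
        # Rare case: the shuffle returned the original order; rotate instead.
        shuffled = shuffled[1:] + shuffled[:1]

    # Wrong-target: correctly ordered history, but from a different target.
    donor = rng.choice(donors)
    wrong = sorted(histories[donor], key=lambda e: e["t"])

    return {
        "zero_history": [],
        "short_history": own[-short_window:],
        "longitudinal_history": own,
        "shuffled_history": shuffled,
        "wrong_target": wrong,
    }
```

The point of deriving every condition from the same store is that the shuffled-history control differs from the longitudinal condition only in order, and the wrong-target control differs only in identity, so any change in skill can be attributed to that one factor.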

Why ablations matter

If the controls don't move

  • If target history doesn't help: the result is not measuring the system's capability.
  • If temporal order doesn't matter: the system is not using dynamics.
  • If the wrong target scores the same: the result is not target-specific.
  • If modality removal is free: the evidence wasn't used.

Then the result should be read as null.
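The checks above can be sketched as a small function that compares each control against the longitudinal condition and reports which ones failed to move. This is a hedged sketch under stated assumptions: the function name, the condition keys, and the `min_gap` threshold are hypothetical, and the text does not say how large a drop counts as "moving" or how many failed checks make a result null.

```python
from typing import Dict, List


def controls_that_did_not_move(
    skill: Dict[str, float],     # skill per condition (hypothetical keys)
    ablated: Dict[str, float],   # skill with each modality removed
    min_gap: float = 0.05,       # assumed threshold for a meaningful drop
) -> List[str]:
    """List the controls whose skill stays within min_gap of the full condition."""
    full = skill["longitudinal_history"]
    findings: List[str] = []

    if full - skill["zero_history"] < min_gap:
        findings.append("target history does not help")
    if full - skill["shuffled_history"] < min_gap:
        findings.append("temporal order does not matter")
    if full - skill["wrong_target"] < min_gap:
        findings.append("wrong target scores the same")

    # One finding per modality whose removal costs (almost) nothing.
    for modality, s in sorted(ablated.items()):
        if full - s < min_gap:
            findings.append(f"removing {modality} is free")

    return findings
```

For example, a run whose wrong-target skill is nearly equal to its longitudinal skill would return "wrong target scores the same", flagging that the apparent skill is not target-specific.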

What a positive result must show

  • Skill improves under longitudinal, target-specific conditions.
  • Skill degrades under shuffled-history and wrong-target controls.
  • Calibration holds; confidence tracks uncertainty.
  • The system is using target-specific dynamics, not generic priors.

This is the inference R2 and the permutation gate are designed to license.
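To make "skill over R2, in bits" and the permutation gate concrete, the sketch below uses two common constructions. Both are assumptions, since the text does not define the scoring rule or the gate's procedure: skill is taken here as the mean log2 score difference between the model and the R2 routine baseline on the realised outcomes, and the gate as a one-sided permutation test that compares the correct-target skill with skills obtained from wrong-target histories. The names `skill_bits`, `permutation_gate`, and `alpha` are hypothetical.

```python
import math
from typing import Sequence, Tuple


def skill_bits(p_model: Sequence[float], p_ref: Sequence[float]) -> float:
    """Mean log2 score gain of the model over a reference baseline.

    Each entry is the probability assigned to the outcome that actually
    occurred. Positive values mean the model beats the reference.
    """
    if not p_model or len(p_model) != len(p_ref):
        raise ValueError("need two non-empty sequences of equal length")
    gains = [math.log2(m) - math.log2(r) for m, r in zip(p_model, p_ref)]
    return sum(gains) / len(gains)


def permutation_gate(
    observed_skill: float,
    wrong_target_skills: Sequence[float],
    alpha: float = 0.05,  # assumed significance level
) -> Tuple[float, bool]:
    """One-sided permutation p-value for the correct-target skill.

    Counts how often a wrong-target history does at least as well as the
    correct one; the +1 terms keep the p-value strictly positive.
    """
    n_ge = sum(s >= observed_skill for s in wrong_target_skills)
    p_value = (1 + n_ge) / (1 + len(wrong_target_skills))
    return p_value, p_value <= alpha
```

Under these assumptions, a model that assigns probability 0.5 to each realised outcome where R2 assigns 0.25 has a skill of 1 bit, and it passes the gate only if few or no wrong-target histories reach the same skill.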

See the battery on the leaderboard

Each mock baseline row reports adaptation gain, calibration, and specificity context.