A prospective AI benchmark framework

TargetSpace-Bench

Benchmarking personal intelligence through target-specific forecasting under partial observation.

Understanding is a capability. Forecasting is a measurement. The target is not a profile. The target is a lived trajectory.
Status: pre-pilot proposal · Results: none claimed · Implementation: synthetic only · Version: v8.1
The core idea

From profiles to lived trajectories

TargetSpace-Bench evaluates whether AI systems can form, update, and prospectively test calibrated models of specific targets over time.

A person is not a static object

Not a profile, a preference list, a demographic vector, or a memory store. A person is a lived trajectory: an unfolding stream of observable attention, context, constraint, pressure, and emerging goals from which the next action follows.

The benchmark forecasts the trajectory

TargetSpace-Bench asks whether a model can infer that trajectory from observable longitudinal traces well enough to forecast where it turns next — which commitment slips, which priority is displaced, which obligation is engaged or avoided.

What this is not. TargetSpace makes no claim about consciousness, qualia, private subjective states, or mind-reading, and a system accesses none of these. It evaluates only predictive alignment with future, externally observable target states — sealed forecasts scored against what actually happens.


Motivation

Why current benchmarks are insufficient

Most evaluations reward predicting what usually happens, or replaying a target's routine. Both can look skilled while revealing nothing specific to this target.

Conventional benchmarks test

  • General knowledge & static task performance
  • Chat quality and retrieval
  • Preference memory & recommendation
  • Style imitation
  • Short-horizon behavioural correlation

TargetSpace-Bench tests

  • Target-specific forecasting of future target-state transitions
  • Skill beyond population base rates (R1)
  • Skill beyond the target's own routine (R2)
  • Calibration of the probability forecast
  • Specificity: skill must collapse under target permutation
  • Evidence value measured by tier
The target-specificity stack

Seven controls, one question

Each layer removes one way a forecast can look target-specific without being so. The two anchors — R2 and the permutation gate — are the contribution; the rest are adopted from established practice.

  1. Prospective sealing. Forecasts are timestamped & hashed before outcomes exist. Removes: hindsight.
  2. Proper scoring. Log & Brier scoring of the whole distribution. Removes: cherry-picked accuracy.
  3. R1: population-prior baseline. Skill must exceed population base rates. Removes: base rates.
  4. R2: own-routine baseline (anchor). Skill must exceed the target's own strong routine. Removes: routine replay.
  5. Calibration gate. Confident lucky streaks are not understanding. Removes: overconfidence.
  6. Permutation specificity gate (anchor). Skill must collapse when forecasts are matched to the wrong target. Removes: non-target-specific skill.
  7. Evidence-tier ablation. Skill as a function of evidence tier. Removes: unsupported evidence claims.
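The permutation specificity gate in the stack above can be illustrated with a minimal sketch. It assumes a binary answer space, forecasts aligned by query across targets, and cyclic shifts as the mismatching scheme; the function names `mean_gain_bits` and `permutation_gate` and all numbers are hypothetical and are not taken from the TargetSpace harness.

```python
import math


def mean_gain_bits(forecasts, outcomes, baselines):
    """Mean log-score gain in bits of forecasts over baselines.

    forecasts, baselines: lists of dicts mapping outcome label -> probability.
    outcomes: list of realized outcome labels, aligned with the forecasts.
    """
    gains = [
        math.log2(f[o]) - math.log2(b[o])
        for f, o, b in zip(forecasts, outcomes, baselines)
    ]
    return sum(gains) / len(gains)


def permutation_gate(forecasts, outcomes, baselines):
    """Return (matched_gain, mean_mismatched_gain) in bits.

    Each argument maps target id -> list aligned by query, so query k is the
    same question for every target (an assumption of this sketch).
    Mismatching uses cyclic shifts of the target order, so no target is ever
    scored against its own forecasts.
    """
    targets = sorted(forecasts)
    n = len(targets)
    if n < 2:
        raise ValueError("need at least two targets to permute")

    def gain_with_shift(shift):
        f, o, b = [], [], []
        for i, target in enumerate(targets):
            source = targets[(i + shift) % n]  # whose forecasts get scored
            f.extend(forecasts[source])
            o.extend(outcomes[target])
            b.extend(baselines[target])
        return mean_gain_bits(f, o, b)

    matched = gain_with_shift(0)
    mismatched = [gain_with_shift(k) for k in range(1, n)]
    return matched, sum(mismatched) / len(mismatched)


# Illustrative numbers only: two synthetic targets, one shared query each.
routine = {"defers": 0.10, "completes": 0.90}
forecasts = {
    "a": [{"defers": 0.70, "completes": 0.30}],
    "b": [{"defers": 0.20, "completes": 0.80}],
}
outcomes = {"a": ["defers"], "b": ["completes"]}
baselines = {"a": [routine], "b": [routine]}

matched, mismatched = permutation_gate(forecasts, outcomes, baselines)
print(f"matched: {matched:+.2f} bits, mismatched: {mismatched:+.2f} bits")
```

A target-specific system should show a positive matched gain that falls toward or below zero once forecasts are paired with the wrong target. A real gate would also need a decision threshold, which this sketch leaves open.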

The evaluation loop

How a forecast is scored

Only the sealed forecast and the resolved outcome are scored — never the system's internal explanation or representation.

  1. Evidence: available to time t
  2. Model / system: emits a distribution
  3. Sealed forecast: over answer space A
  4. Outcome: resolves at time r
  5. Proper score: log · Brier (bits)
  6. Compare & certify: skill over R1 · skill over R2 · calibration gate · permutation gate · evidence-tier ablation
  Result: target-specific skill

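The proper-scoring step of the loop can be sketched as follows. This is a minimal illustration, assuming forecasts are dicts from outcome label to probability; the helper names `log_score_bits`, `brier_score`, and `skill_bits` are hypothetical, and the multiclass Brier form used here (sum of squared errors against the one-hot outcome) is one common convention, not necessarily the one the benchmark adopts.

```python
import math


def check_distribution(dist):
    """Reject forecasts that are not probability distributions."""
    if any(p < 0 for p in dist.values()):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(dist.values()), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")


def log_score_bits(dist, outcome):
    """Log score in bits: log2 of the probability on the realized outcome."""
    check_distribution(dist)
    p = dist[outcome]
    return float("-inf") if p <= 0 else math.log2(p)


def brier_score(dist, outcome):
    """Multiclass Brier score (lower is better, 0 is perfect)."""
    check_distribution(dist)
    return sum(
        (p - (1.0 if label == outcome else 0.0)) ** 2
        for label, p in dist.items()
    )


def skill_bits(model, baseline, outcome):
    """Log-score gain of the model over a baseline such as R1 or R2."""
    return log_score_bits(model, outcome) - log_score_bits(baseline, outcome)


# Illustrative numbers only.
model = {"defers": 0.70, "completes": 0.30}
own_routine = {"defers": 0.10, "completes": 0.90}
print(log_score_bits(model, "defers"))
print(brier_score(model, "defers"))
print(skill_bits(model, own_routine, "defers"))
```

Both scores are computed from the whole emitted distribution and the resolved outcome only, which matches the rule that explanations and internal representations are never scored.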
Benchmark design

The forecast unit

A benchmark instance is a tuple. The system outputs a calibrated distribution over a defined answer space; a deterministic rule resolves it later.

  • i (target-system instance): A specific partially observed adaptive system tracked over time; here, a consenting individual.
  • E≤t (evidence to time t): Only information timestamped at or before t. Strict walk-forward; the future never informs the past.
  • q (query): An organizer-issued question about a future target state, not chosen by the entrant.
  • A (answer space): A discrete set of outcomes over which the system emits a probability distribution.
  • r (resolution time): A future time r > t with a deterministic, externally observable resolution rule.
  • score (proper-scored forecast): Sealed before the outcome exists, then graded in bits against R1 and R2.
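The tuple above can be written down as a small data structure, together with the walk-forward evidence filter and a sealing step. This is a sketch under stated assumptions: the class and function names are hypothetical, and SHA-256 over a canonical JSON record is an assumed sealing scheme, since the page says only that forecasts are timestamped and hashed.

```python
import hashlib
import json
import math
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvidenceItem:
    timestamp: datetime
    payload: str


@dataclass(frozen=True)
class ForecastInstance:
    target_id: str              # i
    evidence_cutoff: datetime   # t
    query: str                  # q
    answer_space: tuple         # A
    resolution_time: datetime   # r

    def __post_init__(self):
        if self.resolution_time <= self.evidence_cutoff:
            raise ValueError("resolution time r must be later than cutoff t")


def walk_forward_evidence(items, cutoff):
    """Keep only evidence timestamped at or before the cutoff t."""
    return [item for item in items if item.timestamp <= cutoff]


def seal(instance, distribution, sealed_at):
    """Return a hex digest committing to the forecast before resolution."""
    if set(distribution) != set(instance.answer_space):
        raise ValueError("distribution must cover exactly the answer space")
    if not math.isclose(sum(distribution.values()), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    if sealed_at >= instance.resolution_time:
        raise ValueError("forecast must be sealed before resolution")
    record = json.dumps(
        {
            "target": instance.target_id,
            "query": instance.query,
            "cutoff": instance.evidence_cutoff.isoformat(),
            "resolves": instance.resolution_time.isoformat(),
            "sealed_at": sealed_at.isoformat(),
            "distribution": distribution,
        },
        sort_keys=True,  # canonical key order keeps the digest reproducible
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


# Illustrative synthetic instance.
t = datetime(2026, 1, 5, tzinfo=timezone.utc)
r = datetime(2026, 1, 7, tzinfo=timezone.utc)
instance = ForecastInstance("synthetic-001", t, "Is the review deferred?",
                            ("defers", "completes"), r)
evidence = [
    EvidenceItem(datetime(2026, 1, 4, tzinfo=timezone.utc), "calendar entry"),
    EvidenceItem(datetime(2026, 1, 6, tzinfo=timezone.utc), "later message"),
]
visible = walk_forward_evidence(evidence, instance.evidence_cutoff)
digest = seal(instance, {"defers": 0.70, "completes": 0.30}, sealed_at=t)
```

Publishing the digest before r lets anyone later verify that the revealed forecast is the one that was committed.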


Flagship track

Personal world modeling

The flagship application is personal intelligence: forecasting a specific person's future target-state transitions. It runs under strict governance.

Consent & federation

Consenting adults only. The harness runs where the data lives; only sealed forecasts and resolved outcomes leave the client.

Aggregate-only reporting

Reporting is aggregate-only. There is no public raw personal dataset, and none is planned.

Example forecast targets

Commitment — active vs. abandoned.

Priority — maintained vs. displaced.

Task — continuation vs. switch.

Response behaviour — reply, defer, or drop.

Meeting / event — realization vs. no-show.

Obligation — engagement vs. avoidance.

The evidence ladder

Does richer observation help? It is measured, not assumed.

Every tier is scored against the same sealed targets and the same R1/R2 controls. Lift over R2, as a function of tier, is the read-out.

  • L0: Calendar + communications metadata, routine, digital exhaust (lower sensitivity)
  • L1: Text traces (chat, notes, documents) (elevated)
  • L2: Audio / transcripts (sensitive)
  • L3: Passive multimodal (egocentric / scene video, ambient audio, screen) (high sensitivity)
  • L4: Location / behavioural / mobility traces (high sensitivity)
  • L5: Physiological / specialized sensors (most sensitive)
Higher tiers are more sensitive and require stronger consent, local/on-device processing, federation, and governance. Richer evidence is not automatically better — the benchmark measures whether it adds target-specific predictive lift over R2.
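The read-out described above, lift over R2 as a function of evidence tier, could be tabulated roughly as below. The record layout and the function name `tier_lift_over_r2` are assumptions made for this sketch, and the numbers are invented for illustration; they say nothing about whether richer evidence helps.

```python
import math
from collections import defaultdict


def tier_lift_over_r2(records):
    """Mean log-score gain over R2, in bits, per evidence tier.

    records: iterable of (tier, p_model, p_r2), where each probability is
    the mass placed on the outcome that actually resolved.
    """
    gains = defaultdict(list)
    for tier, p_model, p_r2 in records:
        gains[tier].append(math.log2(p_model) - math.log2(p_r2))
    return {tier: sum(g) / len(g) for tier, g in sorted(gains.items())}


# Illustrative records only: the same sealed targets scored at two tiers.
records = [
    ("L0", 0.20, 0.10),
    ("L0", 0.40, 0.10),
    ("L1", 0.80, 0.10),
]
lift = tier_lift_over_r2(records)
for tier, bits in lift.items():
    print(f"{tier}: {bits:+.2f} bits over R2")
```

Because every tier is scored against the same targets and the same R2, a flat or falling curve would be a legitimate finding rather than a failure of the method.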


Synthetic harness

A runnable demonstration — synthetic only

The current implementation is a minimal, deterministic synthetic harness that exercises the scoring spine end to end. It demonstrates the mechanics; it is not evidence about real people.

What it demonstrates

Synthetic target histories, walk-forward forecast instances, R1/R2 baselines, log & Brier scoring, a calibration diagnostic, a permutation specificity check, and evidence-tier reporting.

What it does not

No human data, no real personal intelligence, no evidence that passive observation helps, no cross-domain validation, no safety validation. The pre-pilot paper reports no empirical results.
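Of the components listed, the calibration diagnostic is the easiest to show in a few lines. The sketch below is not the harness's own code; it assumes binary events, equal-width probability bins, and expected calibration error as the summary statistic, and the names `reliability_table` and `expected_calibration_error` are hypothetical.

```python
def reliability_table(probs, outcomes, n_bins=5):
    """Group forecasts into equal-width bins and compare with reality.

    probs: forecast probability that the event occurs.
    outcomes: 1 if the event occurred, else 0.
    Returns one (count, mean_forecast, observed_frequency) row per
    non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in last bin
        bins[index].append((p, y))
    rows = []
    for members in bins:
        if members:
            count = len(members)
            mean_forecast = sum(p for p, _ in members) / count
            observed = sum(y for _, y in members) / count
            rows.append((count, mean_forecast, observed))
    return rows


def expected_calibration_error(probs, outcomes, n_bins=5):
    """Count-weighted mean gap between forecast and observed frequency."""
    total = len(probs)
    return sum(
        count / total * abs(mean_forecast - observed)
        for count, mean_forecast, observed
        in reliability_table(probs, outcomes, n_bins)
    )


# Synthetic illustration: a well-calibrated and an overconfident forecaster.
calibrated = expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
print(calibrated, overconfident)
```

A gate built on a diagnostic like this would reject a system whose confident forecasts resolve far less often than stated, however high its raw accuracy on a lucky streak.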

A worked synthetic instance

A recurring Wednesday review sits on the calendar, but passive signals show attention redirected to an urgent dependency. Will the review be completed? (all numbers illustrative)

  • R1 (population prior): P(defers) = 0.15
  • R2 (own routine): P(defers) = 0.10
  • Model (reads the shift): P(defers) = 0.70

Outcome: defers. The model gains ≈ +2.2 bits over R1 and ≈ +2.8 bits over R2 — skill the routine did not contain. Scored against a different person who kept their review, the same forecast earns ≈ −1.6 bits: the skill collapses under permutation, confirming it is specific to this target.
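The arithmetic behind these figures can be checked in a few lines. The sketch assumes a binary answer space, so P(completes) = 1 − P(defers), and it reads the permuted figure as a gain relative to R2 for a second person with the same routine prior; both are interpretive assumptions, chosen because they reproduce the stated numbers.

```python
import math

# Illustrative probabilities from the worked instance.
r1 = {"defers": 0.15, "completes": 0.85}     # population prior
r2 = {"defers": 0.10, "completes": 0.90}     # own routine
model = {"defers": 0.70, "completes": 0.30}  # reads the shift


def gain_bits(forecast, baseline, outcome):
    """Log-score gain in bits of a forecast over a baseline."""
    return math.log2(forecast[outcome] / baseline[outcome])


# Matched target: the review is deferred.
gain_over_r1 = gain_bits(model, r1, "defers")
gain_over_r2 = gain_bits(model, r2, "defers")

# Permuted target: same forecast, but this person completed the review.
permuted_gain = gain_bits(model, r2, "completes")

print(f"over R1: {gain_over_r1:+.1f} bits")
print(f"over R2: {gain_over_r2:+.1f} bits")
print(f"permuted, relative to R2: {permuted_gain:+.1f} bits")
```

The gains are log2(0.70/0.15), log2(0.70/0.10), and log2(0.30/0.90) respectively, so the sign flip under permutation comes entirely from the forecast having moved probability away from the routine outcome.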

The paper

TargetSpace: Benchmarking Personal Intelligence by Target-Specific Forecasting Under Partial Observation

Yuri Andrade Sylvester · Independent Researcher

AI systems increasingly claim to know, remember, personalize to, and act on behalf of particular users, yet current evaluations cannot establish whether a system has formed a durable, calibrated, prospectively useful model of an individual. We call this missing capability personal intelligence. The premise is that a person is not a profile to retrieve but a lived trajectory to forecast — inferred only from observable longitudinal traces. The benchmark scores sealed, proper-scored forecasts of a person's future target-state transitions under partial observation, certified by a stack of controls: a population-prior baseline (R1), a strong own-routine baseline (R2), a calibration gate, and a permutation specificity test, with an evidence-tier ablation and an architecture-neutral grid. The adopted ingredients are cited; the contribution is their conjunction, anchored on R2 and the permutation gate. This is a pre-pilot proposal: no empirical results are reported, only the personal track is implemented (synthetic), and raw data is never public.

  • version: v8.1
  • status: Pre-pilot proposal
  • results: None claimed
  • implementation: Synthetic harness only
  • pages: 39, incl. appendices A–D
  • framework: TargetSpace
  • flagship track: Personal intelligence
  • domain: Domain-general in formulation; personal track instantiated
Citation

Cite TargetSpace

@misc{targetspace2026,
  title   = {TargetSpace: Benchmarking Personal Intelligence by
             Target-Specific Forecasting Under Partial Observation},
  author  = {Sylvester, Yuri Andrade},
  year    = {2026},
  note    = {Preprint, v8.1},
  url     = {https://targetspace.org}
}
Code & reproducibility

Open, synthetic, and inspectable

Repository

github.com/yurisyl/targetspace-bench — the site and the synthetic demonstration harness.

Run the harness

Pure Python standard library, deterministic, runs in under a second:
python targetspace_synthetic_demo.py

Status

Pre-pilot. The reference implementation and a participant-data harness remain in preparation. Issues and contributions welcome via GitHub.


Ethics & limits

What a high score does — and does not — mean

No claim on inner life

TargetSpace does not claim access to consciousness, qualia, or private subjective states. It measures predictive alignment from observable longitudinal traces only.

Skill is not permission

Forecasting skill is not deployment legitimacy and confers no licence to act on a person. Benchmark validity is kept strictly separate from deployment legitimacy.

Data stays local

Raw personal data is not public and should not be centralized. Federation keeps raw data under the participant's control.

Governance before any pilot

Any future human pilot requires informed consent, privacy controls, data minimization, local/federated processing, retention limits, bystander handling, and ethics review.