Source / Discovery Tracking Schema
Purpose: record where each source was discovered (which account / surface), so high-yield curators are identified and can be revisited deliberately.
Optional by design: a missing `discovery:` block is normal and never an error (see §0).
Tooling: `aiwg corpus discovery-log` (record), `aiwg corpus curator-init` (scaffold PROF-S), `aiwg corpus curator-status` (yield + orphans). Read views: `by-source`, `by-curator` (rendered by `aiwg index build`).
0. Optionality & exemptions (read first)
Discovery metadata is best-effort signal, not a required field. There are three cases in which the block, or its `curator-id`, is legitimately absent. None are gaps, and audits flag none of them:
| Case | State | Treatment |
|---|---|---|
| Legacy refs (inducted before source-tracking adoption) | no `discovery:` block | Normal. Not backfilled. Audits ignore. |
| Operator-direct (you brought the paper/source directly) | `surface: direct`, `curator-id: null` (or block omitted) | First-class, curator-less. Never an orphan. |
| Curator unknown (found via search/feed with no clear account) | `surface: x-search`/`x-foryou`/…, `curator-id: null` | Surface recorded, curator left null. Fine. |
Only set a `curator-id` when a source genuinely came through a named, repeatable curator worth returning to. When in doubt, record the `surface` and leave `curator-id` null.
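The cases above can be sketched as a small classifier. This is an illustrative sketch only: the function name and the returned labels are hypothetical, and the block is assumed to arrive as a plain dict parsed from the sidecar frontmatter (or `None` when the block is missing).

```python
from typing import Optional


def classify_discovery(block: Optional[dict]) -> str:
    """Classify a sidecar's discovery block (labels are hypothetical)."""
    if not block:
        # Legacy ref or block omitted: normal, not backfilled, ignored by audits.
        return "absent"
    if block.get("surface") == "direct":
        # Operator supplied the source directly: curator-less, never an orphan.
        return "operator-direct"
    if block.get("curator-id") is None:
        # Surface is known but no named, repeatable curator was identified.
        return "surface-only"
    # Only this case participates in the curator orphan check.
    return "curated"
```

Only the last label corresponds to a source that came through a named curator; the other three are all valid states that need no follow-up.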
1. Per-paper: `discovery:` block (citation sidecar)
Added to `documentation/citations/REF-XXX-citations.md` frontmatter:
    discovery:
      date: 2026-05-25                      # when the source was first surfaced
      surface: x-account                    # controlled vocab (see below)
      via: "x.com/@askalphaxiv"             # human-readable origin (account/URL/feed)
      curator-id: PROF-S-askalphaxiv        # link to curator profile; null if no curator
      harvest-batch: 2026-05-25-morning     # optional: groups a harvesting session
      harvested-by: claude-opus-4-7         # agent/human that performed the harvest
When the block is present, all fields except `date` and `surface` are optional.
`surface` controlled vocabulary
| Value | Meaning |
|---|---|
| `x-account` | A specific X account's timeline (curator) |
| `x-search` | X search results (query-driven, often no curator) |
| `x-bookmarks` | Operator's own X bookmarks |
| `x-foryou` | X "For You" algorithmic feed |
| `x-following` | X "Following" feed |
| `rss` | RSS/Atom feed |
| `newsletter` | Email newsletter / digest |
| `web` | Direct web browsing / blog |
| `referral` | Cited by / linked from another corpus paper |
| `direct` | Operator supplied directly (no discovery surface) |
Not to be confused with radar `sources-searched`, which lists the surfaces queried during a freshness refresh: `discovery` records the surface through which a paper was originally found. The two are orthogonal.
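A minimal validation sketch for the block and vocabulary above. It is not the behavior of any `aiwg` command: the function name and message strings are hypothetical, and the ISO `YYYY-MM-DD` date check is an assumption based on the example value.

```python
from datetime import date

# Controlled vocabulary copied from the table above.
SURFACES = {
    "x-account", "x-search", "x-bookmarks", "x-foryou", "x-following",
    "rss", "newsletter", "web", "referral", "direct",
}
REQUIRED = ("date", "surface")


def validate_discovery(block):
    """Return a list of problems; an empty list means the block is acceptable."""
    if block is None:
        return []  # the whole block is optional, so absence is never an error
    problems = [f"missing required field: {key}"
                for key in REQUIRED if block.get(key) is None]
    surface = block.get("surface")
    if surface is not None and surface not in SURFACES:
        problems.append(f"unknown surface: {surface}")
    raw = block.get("date")
    # YAML loaders may already yield a date object; otherwise parse the string.
    if raw is not None and not isinstance(raw, date):
        try:
            date.fromisoformat(str(raw))
        except ValueError:
            problems.append(f"date is not YYYY-MM-DD: {raw}")
    return problems
```

Optional fields such as `via`, `curator-id`, `harvest-batch`, and `harvested-by` are deliberately left unchecked here.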
2. Curator: `PROF-S-` source profile
A curator profile is an entity profile whose `type` is `source`, stored in `documentation/profiles/sources/PROF-S-{slug}.md` (see the `source-profile` template).
- slug = handle lowercased, leading punctuation stripped, `_`→`-` (`@_akhaliq` → `PROF-S-akhaliq`).
- `corpus-refs` = inducted REFs discovered via this curator (NOT candidates).
- `signal-quality` = curator signal density, graded A–D (A = paper-per-post, high relevance; … D = low).
- `revisit-cadence` = `daily | weekly | biweekly | monthly | on-demand`.
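The slug rule above can be sketched in a few lines. The helper name `curator_id` is hypothetical, and treating "punctuation" as Python's ASCII `string.punctuation` set is an assumption.

```python
import string


def curator_id(handle: str) -> str:
    """Derive a PROF-S id from an account handle, following the slug rule."""
    slug = handle.strip().lower()
    slug = slug.lstrip(string.punctuation)  # drops leading "@", "_", and similar
    slug = slug.replace("_", "-")           # remaining underscores become hyphens
    return f"PROF-S-{slug}"
```

For example, `curator_id("@_akhaliq")` yields `PROF-S-akhaliq`, matching the case given in the rule.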
"Good accounts to return to" = PROF-S ranked by return-to score (inducted-ref count × avg surfaced-paper GRADE) — see `aiwg corpus curator-status`.
3. Bidirectionality + orphan rule
When a paper is inducted with `discovery.curator-id: PROF-S-x`:
1. Add the REF to `PROF-S-x` frontmatter `corpus-refs:` and its §2 "Sources Surfaced" table.
2. The sidecar's `discovery.curator-id` IS the backlink (no separate REF-doc edit).
3. Recompute the curator's yield stats.
A PROF-S referenced by a sidecar's `discovery.curator-id` but missing that REF in its `corpus-refs` is a curator orphan, flagged by `curator-status` (and `research-lint`). The check fires only when `curator-id` is set; a missing block, a `null` `curator-id`, or a `direct` surface is never an orphan (it is simply outside the discovery graph).
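The orphan rule can be sketched as a single pass over the sidecars. This is an illustration, not the `curator-status` implementation: the function name and the in-memory shapes (dicts keyed by REF id and curator id) are assumptions, as is treating a profile that does not exist at all as an orphan.

```python
def find_curator_orphans(sidecars: dict, profiles: dict) -> list:
    """Return (curator_id, ref_id) pairs that lack a backlink in corpus-refs.

    sidecars: REF id -> discovery block (dict) or None when the block is absent.
    profiles: curator id -> list of REF ids from that profile's corpus-refs.
    """
    orphans = []
    for ref_id, block in sorted(sidecars.items()):
        curator = (block or {}).get("curator-id")
        # Missing block, null curator, or direct surface: outside the graph.
        if not curator or block.get("surface") == "direct":
            continue
        if ref_id not in profiles.get(curator, []):
            orphans.append((curator, ref_id))
    return orphans
```

The sidecar is treated as the source of truth here, so the repair for each reported pair is to add the REF to the curator's `corpus-refs`, not to edit the sidecar.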
4. Candidate (pre-induction) curator records
Curator profiles may be seeded before their surfaced papers are inducted:
- `corpus-refs: []` (empty until induction)
- Record observed candidate yield under §2 "Candidate Sources Surfaced (not yet inducted)".
- As candidates are inducted, move them into `corpus-refs` and the "Sources Surfaced" table.
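The promotion step can be sketched as a pure function over a profile held in memory. Everything structural here is hypothetical: the source keeps candidates in a §2 table, whereas this sketch models them as a `candidates` list of free-form keys, and the function name is illustrative.

```python
def promote_candidate(profile: dict, candidate: str, ref_id: str) -> dict:
    """Return a copy of the profile with one candidate moved into corpus-refs."""
    # Drop the candidate entry now that it has been inducted.
    candidates = [c for c in profile.get("candidates", []) if c != candidate]
    refs = list(profile.get("corpus-refs", []))
    if ref_id not in refs:
        refs.append(ref_id)  # idempotent: promoting twice adds the REF once
    return {**profile, "candidates": candidates, "corpus-refs": refs}
```

Returning a new dict instead of mutating the input keeps the step easy to re-run, and the curator's yield stats can be recomputed from the result.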