Research Corpus Data Model

Authoritative reference for the research-corpus artifacts the `research-complete` framework reads and writes. Confirmed against the shared research corpus warehouse (the reference implementation) and consumed by the shared parser at `src/artifacts/corpus-views/ref-parser.ts` (#1497).

Corpus root. Artifacts live under `<corpusRoot>/documentation/`. The corpus root defaults to the project root and is overridable via `research.corpusRoot` in `.aiwg/aiwg.config` or the `AIWG_CORPUS_ROOT` environment variable.
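A minimal sketch of how the corpus root could be resolved. The source names the two overrides but not their precedence, so the order used here (environment variable, then config, then project root) is an assumption, as are the `AiwgConfig` shape and the function names `resolveCorpusRoot` and `documentationDir`.

```typescript
// Hypothetical shape of the relevant slice of `.aiwg/aiwg.config`.
interface AiwgConfig {
  research?: { corpusRoot?: string };
}

// ASSUMPTION: env var wins over config, which wins over the project-root
// default. The data model does not state the precedence.
function resolveCorpusRoot(
  projectRoot: string,
  config: AiwgConfig = {},
  env: Record<string, string | undefined> = {},
): string {
  const fromEnv = env["AIWG_CORPUS_ROOT"]?.trim();
  if (fromEnv) return fromEnv;
  const fromConfig = config.research?.corpusRoot?.trim();
  if (fromConfig) return fromConfig;
  return projectRoot;
}

// Artifacts live under <corpusRoot>/documentation/.
function documentationDir(corpusRoot: string): string {
  // Drop trailing slashes so the join does not produce "//".
  return `${corpusRoot.replace(/\/+$/, "")}/documentation`;
}
```

The environment is passed in as a plain record rather than read from a global, which keeps the sketch testable without touching the real process environment.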

Directory layout

<corpusRoot>/
├── documentation/
│   ├── references/   REF-NNN-<slug>.md          analysis docs (one per source)
│   ├── citations/    REF-NNN-citations.md        citation sidecars (edges + funders + discovery)
│   ├── radar/        REF-NNN-radar.md            freshness sidecars
│   ├── profiles/
│   │   ├── people/   PROF-P-<slug>.md            authors/researchers
│   │   ├── orgs/      PROF-O-<slug>.md            organizations
│   │   ├── groups/    PROF-G-<slug>.md            research groups / teams
│   │   ├── funders/   PROF-F-<slug>.md            funding bodies
│   │   └── sources/   PROF-S-<slug>.md            discovery curators
│   ├── areas/        AREA-<slug>.md               research-area taxonomy
│   ├── concepts/     REF-NNN-skos.{ttl,jsonld}    SKOS concept schemes
│   ├── provenance/   records/REF-NNN-prov.{ttl,jsonld}  PROV-O induction bundles
│   └── synthesis/    SYN-NNN-<slug>.md            cross-paper synthesis essays
└── indices/          generated markdown views (by `aiwg index build`, #1490)

Filename rule: references carry a slug (`REF-001-attention.md`); **citation and radar sidecars do NOT** (`REF-001-citations.md`, `REF-001-radar.md`). The parser resolves sidecars by `REF-NNN` id, not by slug.
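The filename rule can be illustrated with a small name parser. This is a sketch, not the logic in `ref-parser.ts`: `parseArtifactName` and `sidecarPaths` are hypothetical names, and the numeric part of `REF-NNN` is assumed to be one or more digits.

```typescript
// ASSUMPTION: the numeric part of REF-NNN is one or more digits.
const REFERENCE_FILE = /^(REF-\d+)-(.+)\.md$/;
const SIDECAR_FILE = /^(REF-\d+)-(citations|radar)\.md$/;

type ArtifactDir = "references" | "citations" | "radar";

interface ParsedName {
  refId: string;
  kind: "reference" | "citations" | "radar";
  slug?: string; // only references carry a slug
}

// The directory disambiguates: "REF-001-citations.md" would also match the
// reference pattern (slug "citations"), so the file name alone is not enough.
function parseArtifactName(dir: ArtifactDir, file: string): ParsedName | null {
  if (dir === "references") {
    const m = REFERENCE_FILE.exec(file);
    return m ? { refId: m[1], kind: "reference", slug: m[2] } : null;
  }
  const m = SIDECAR_FILE.exec(file);
  // Reject a citations sidecar placed in radar/ and vice versa.
  if (!m || m[2] !== dir) return null;
  return { refId: m[1], kind: dir };
}

// Sidecars are located from the REF id alone, never from the slug.
function sidecarPaths(refId: string): { citations: string; radar: string } {
  return {
    citations: `citations/${refId}-citations.md`,
    radar: `radar/${refId}-radar.md`,
  };
}
```

For example, `sidecarPaths("REF-001")` yields paths relative to `documentation/` without ever consulting the `attention` slug of the reference doc.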

ID conventions

`REF-NNN` — research papers / sources (citation-network nodes).
`PROF-{P|O|G|F|S}-<slug>` — entity profiles (person / org / group / funder / source-curator).
`AREA-<slug>`, `SYN-NNN` — taxonomy areas, synthesis essays.
Reference docs use `ref_id` (underscore) in frontmatter; every sidecar/profile uses hyphenated keys (`ref`, `prof-id`, …).
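The ID conventions above could be checked with a classifier along these lines. The function name `classifyId` and the exact slug character set (lowercase ASCII letters, digits, hyphens) are assumptions; the data model only fixes the prefixes.

```typescript
type IdKind =
  | "reference"
  | "person"
  | "org"
  | "group"
  | "funder"
  | "source-curator"
  | "area"
  | "synthesis";

// Profile type letter -> entity kind, per PROF-{P|O|G|F|S}-<slug>.
const PROFILE_KINDS = {
  P: "person",
  O: "org",
  G: "group",
  F: "funder",
  S: "source-curator",
} as const;

// ASSUMPTION: slugs are lowercase ASCII alphanumerics and hyphens, and the
// numeric part of REF-NNN / SYN-NNN is one or more digits.
function classifyId(id: string): IdKind | null {
  if (/^REF-\d+$/.test(id)) return "reference";
  if (/^SYN-\d+$/.test(id)) return "synthesis";
  if (/^AREA-[a-z0-9][a-z0-9-]*$/.test(id)) return "area";
  const m = /^PROF-([POGFS])-[a-z0-9][a-z0-9-]*$/.exec(id);
  return m ? PROFILE_KINDS[m[1] as keyof typeof PROFILE_KINDS] : null;
}
```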

Artifact schemas

Reference / analysis doc — `references/REF-NNN-<slug>.md`

Frontmatter: `ref_id`, `title`, `year`, `pdf_hash` (+ optional `frontmatter-backfilled[-by]`). Body carries `## GRADE Quality` (A/B/C/LOW), `## Referenced By`, `## Entity Profiles` (Role→Entity→Profile table), `## Document Classification`. Two templates by source type: full academic (`REFERENCE-TEMPLATE.md`) and condensed web/blog (`REFERENCE-TEMPLATE-blog.md`) — see the extensible source-type model (#1509).

Citation sidecar — `citations/REF-NNN-citations.md`

Frontmatter:

`ref`, `title`, `type: citations`
`authors[]` — strings (`"Last, First"`) or objects (`{name, orcid, prof-id}`)
`affiliation-primary`, `affiliation-status` (optional)
`funders[]` — `{id (→ PROF-F slug or raw name), grant-id}` (string entries allowed)
`discovery` (optional, SOURCE-TRACKING) — `{date, surface, via, curator-id (→ PROF-S), harvest-batch, harvested-by}`. `surface` vocab: `x-account|x-search|x-bookmarks|x-foryou|x-following|rss|newsletter|web|referral|direct`.

Body: `## Outgoing` / `## Incoming` edge tables; last column = `Inducted REF` (`REF-NNN` or `—`).
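One way to model the citation-sidecar frontmatter in TypeScript is sketched below. The key names come from the list above; which keys are required versus optional (beyond those marked optional) is an assumption, and `normalizeAuthors` / `normalizeFunders` are hypothetical helpers that fold the string form into the object form.

```typescript
type Surface =
  | "x-account" | "x-search" | "x-bookmarks" | "x-foryou" | "x-following"
  | "rss" | "newsletter" | "web" | "referral" | "direct";

interface AuthorObject { name: string; orcid?: string; "prof-id"?: string }
type Author = string | AuthorObject; // string form is "Last, First"

interface FunderObject { id: string; "grant-id"?: string }
type Funder = string | FunderObject;

interface Discovery {
  date?: string;
  surface?: Surface;
  via?: string;
  "curator-id"?: string; // points at a PROF-S profile
  "harvest-batch"?: string;
  "harvested-by"?: string;
}

// ASSUMPTION: only ref/title/type are treated as required here.
interface CitationFrontmatter {
  ref: string;
  title: string;
  type: "citations";
  authors?: Author[];
  "affiliation-primary"?: string;
  "affiliation-status"?: string;
  funders?: Funder[];
  discovery?: Discovery;
}

// Fold "Last, First" strings into the object shape.
function normalizeAuthors(authors: Author[] = []): AuthorObject[] {
  return authors.map((a) => (typeof a === "string" ? { name: a } : a));
}

// Fold raw-name strings into the object shape (id = the raw name).
function normalizeFunders(funders: Funder[] = []): FunderObject[] {
  return funders.map((f) => (typeof f === "string" ? { id: f } : f));
}
```

Normalizing both lists up front lets downstream code read `author["prof-id"]` or `funder.id` without branching on the entry shape.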

Radar sidecar — `radar/REF-NNN-radar.md`

Cadence→days: `monthly`=30, `quarterly`=90, `biannual`=180, `annual`=365, `on-demand`=never.
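The cadence table translates directly into a lookup plus a due-date check. This is a sketch: the document does not list the radar frontmatter keys, so `isDue` takes the last-checked date and the comparison date as plain ISO strings, and treating "elapsed days at or above the cadence" as due is an assumption.

```typescript
type Cadence = "monthly" | "quarterly" | "biannual" | "annual" | "on-demand";

// null encodes "never": on-demand sidecars are not scheduled.
const CADENCE_DAYS: Record<Cadence, number | null> = {
  monthly: 30,
  quarterly: 90,
  biannual: 180,
  annual: 365,
  "on-demand": null,
};

const MS_PER_DAY = 86_400_000;

// ASSUMPTION: a sidecar is due once the cadence's day count has fully
// elapsed. Both dates are ISO YYYY-MM-DD strings, parsed as UTC midnight.
function isDue(cadence: Cadence, lastChecked: string, today: string): boolean {
  const days = CADENCE_DAYS[cadence];
  if (days === null) return false;
  const elapsedDays = (Date.parse(today) - Date.parse(lastChecked)) / MS_PER_DAY;
  return elapsedDays >= days;
}
```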

Entity profiles — `profiles/{people,orgs,groups,funders}/PROF-{P,O,G,F}-<slug>.md`

Shared frontmatter: `prof-id`, `name`, `type`, `affiliation`, `aliases[]`, `corpus-refs[]`, `refresh-cadence`, `last-refreshed`, `grade-influence` (A–D), `grade-trajectory`, `sources-searched[]`. Orgs add `research-areas[]`; groups add `parent-org`.

`corpus-refs` has two shapes in the wild — list-of-strings (`['REF-1']`) and list-of-dicts (`[{ref, role}]`). The parser (`loadProfiles`) normalizes both to REF-id strings.
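A sketch of the normalization described above, under the assumption that entries matching neither shape are skipped. `normalizeCorpusRefs` is a hypothetical name; the actual logic inside `loadProfiles` is not shown in this document.

```typescript
// The two shapes seen in the wild: 'REF-1' or { ref: 'REF-1', role: '...' }.
type CorpusRefEntry = string | { ref: string; role?: string };

// Reduce either shape to a flat list of REF-id strings.
// ASSUMPTION: malformed entries are dropped rather than raising an error.
function normalizeCorpusRefs(entries: unknown): string[] {
  if (!Array.isArray(entries)) return [];
  const ids: string[] = [];
  for (const entry of entries) {
    if (typeof entry === "string") {
      ids.push(entry);
    } else if (entry !== null && typeof entry === "object") {
      const ref = (entry as { ref?: unknown }).ref;
      if (typeof ref === "string") ids.push(ref);
    }
  }
  return ids;
}
```

Accepting `unknown` rather than `CorpusRefEntry[]` reflects that the value comes straight from parsed YAML and may not match either documented shape.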

Curator profile — `profiles/sources/PROF-S-<slug>.md`

Distinct schema: `prof-id`, `name`, `type: source`, `platform`, `handle`, `url`, `corpus-refs[]`, `surfaces[]`, `focus-areas[]`, `signal-quality` (A–D), `revisit-cadence` (`daily|weekly|biweekly|monthly|on-demand`), `last-harvested`, `candidate-yield`. Slug = handle lowercased, leading punctuation stripped, `_`→`-`.
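The slug rule can be sketched as a one-line transform. Two assumptions are made here: "punctuation" is read as any leading run of characters other than ASCII letters and digits, and the steps are applied in the order listed. `curatorSlug` and `curatorProfId` are hypothetical names.

```typescript
// handle -> slug: lowercase, strip leading punctuation, "_" -> "-".
// ASSUMPTION: handles are ASCII; any leading non-alphanumeric run is stripped.
function curatorSlug(handle: string): string {
  return handle
    .toLowerCase()
    .replace(/^[^a-z0-9]+/, "")
    .replace(/_/g, "-");
}

// Full profile id for a curator handle.
function curatorProfId(handle: string): string {
  return `PROF-S-${curatorSlug(handle)}`;
}
```

Under these assumptions a handle such as `@Some_Handle` maps to the profile file `profiles/sources/PROF-S-some-handle.md`.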

Bidirectional invariants

Citation `authors[].prof-id` ↔ PROF-P `corpus-refs`.
Citation `discovery.curator-id` ↔ PROF-S `corpus-refs` (orphan-checked: a set `curator-id` requires a matching PROF-S corpus-ref).
Citation `funders[].id` ↔ PROF-F `corpus-refs`.
Reference `## Entity Profiles` table ↔ each profile's `corpus-refs`.
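The curator orphan check can be sketched as follows. The record shapes and the name `findOrphanCuratorLinks` are hypothetical, and `curator-id` is assumed to hold the full `PROF-S-<slug>` id rather than the bare slug.

```typescript
// Minimal views of the two artifacts involved in the invariant.
interface CitationLink { ref: string; curatorId?: string }
interface CuratorProfile { profId: string; corpusRefs: string[] }

// Report every citation whose curator-id has no matching PROF-S corpus-ref.
// ASSUMPTION: curator-id stores the full PROF-S id.
function findOrphanCuratorLinks(
  citations: CitationLink[],
  curators: CuratorProfile[],
): string[] {
  const refsByCurator = new Map<string, Set<string>>();
  for (const curator of curators) {
    refsByCurator.set(curator.profId, new Set(curator.corpusRefs));
  }
  const problems: string[] = [];
  for (const citation of citations) {
    if (!citation.curatorId) continue; // discovery is optional
    const refs = refsByCurator.get(citation.curatorId);
    if (!refs || !refs.has(citation.ref)) {
      problems.push(
        `${citation.ref}: curator-id ${citation.curatorId} has no matching corpus-ref`,
      );
    }
  }
  return problems;
}
```

The same pattern would apply to the author and funder invariants by swapping in `authors[].prof-id` or `funders[].id` for the curator id.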

Date / value formats

Dates: ISO `YYYY-MM-DD` (js-yaml parses bare dates as Date objects — the parser coerces to ISO strings).
Cadences: enum words, never durations.
GRADE: `A|B|C|D` (and `HIGH|MODERATE|LOW|VERY LOW` word forms in REF-doc bodies).
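The date coercion mentioned above might look like the sketch below. `toIsoDate` is a hypothetical name, and returning `null` for anything that is neither a valid `Date` nor a `YYYY-MM-DD` string is an assumption about error handling. The slice relies on bare YAML dates being constructed at UTC midnight.

```typescript
// Coerce a frontmatter date value to an ISO YYYY-MM-DD string.
// ASSUMPTION: unrecognized values yield null instead of throwing.
function toIsoDate(value: unknown): string | null {
  if (value instanceof Date) {
    // toISOString() is always UTC, so the first 10 chars are the UTC date.
    return Number.isNaN(value.getTime()) ? null : value.toISOString().slice(0, 10);
  }
  if (typeof value === "string" && /^\d{4}-\d{2}-\d{2}$/.test(value)) {
    return value; // quoted dates arrive as strings and pass through
  }
  return null;
}
```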