Research Corpus Data Model

Authoritative reference for the research-corpus artifacts the `research-complete` framework reads and writes. Confirmed against the shared research corpus warehouse (the reference implementation) and consumed by the shared parser at `src/artifacts/corpus-views/ref-parser.ts` (#1497).

Corpus root. Artifacts live under `<corpusRoot>/documentation/`. The corpus root defaults to the project root and is overridable via `research.corpusRoot` in `.aiwg/aiwg.config` or the `AIWG_CORPUS_ROOT` environment variable.
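The resolution order above can be sketched as follows. This is a minimal illustration, not the framework's actual code: `AiwgConfig`, `resolveCorpusRoot`, and `documentationDir` are hypothetical names, and the precedence of the environment variable over the config file is an assumption, since the text names both overrides without ranking them.

```typescript
// Hypothetical shape for the relevant slice of `.aiwg/aiwg.config`.
interface AiwgConfig {
  research?: { corpusRoot?: string };
}

// Resolve the corpus root.
// ASSUMPTION: the environment variable wins over the config file; the
// document lists both overrides but does not state which takes priority.
function resolveCorpusRoot(
  projectRoot: string,
  config: AiwgConfig,
  env: { [name: string]: string | undefined }
): string {
  const fromEnv = (env["AIWG_CORPUS_ROOT"] || "").trim();
  if (fromEnv) return fromEnv;
  const fromConfig = ((config.research && config.research.corpusRoot) || "").trim();
  if (fromConfig) return fromConfig;
  return projectRoot; // default: the project root
}

// Artifacts live under <corpusRoot>/documentation/.
function documentationDir(corpusRoot: string): string {
  return corpusRoot.replace(/\/+$/, "") + "/documentation";
}
```

Passing the environment in as a plain object keeps the sketch free of runtime-specific globals; a caller on Node would typically pass `process.env`.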

Directory layout

<corpusRoot>/
├── documentation/
│   ├── references/   REF-NNN-<slug>.md          analysis docs (one per source)
│   ├── citations/    REF-NNN-citations.md        citation sidecars (edges + funders + discovery)
│   ├── radar/        REF-NNN-radar.md            freshness sidecars
│   ├── profiles/
│   │   ├── people/   PROF-P-<slug>.md            authors/researchers
│   │   ├── orgs/     PROF-O-<slug>.md            organizations
│   │   ├── groups/   PROF-G-<slug>.md            research groups / teams
│   │   ├── funders/  PROF-F-<slug>.md            funding bodies
│   │   └── sources/  PROF-S-<slug>.md            discovery curators
│   ├── areas/        AREA-<slug>.md               research-area taxonomy
│   ├── concepts/     REF-NNN-skos.{ttl,jsonld}    SKOS concept schemes
│   ├── provenance/   records/REF-NNN-prov.{ttl,jsonld}  PROV-O induction bundles
│   └── synthesis/    SYN-NNN-<slug>.md            cross-paper synthesis essays
└── indices/          generated markdown views (by `aiwg index build`, #1490)

Filename rule: references carry a slug (`REF-001-attention.md`); **citation and radar sidecars do NOT** (`REF-001-citations.md`, `REF-001-radar.md`). The parser resolves sidecars by `REF-NNN` id, not by slug.
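As a rough illustration of the filename rule, the sketch below classifies a file name as a reference or a sidecar. `parseCorpusFileName` and the `CorpusFile` type are hypothetical and do not mirror `ref-parser.ts`; the sketch also assumes `NNN` is one or more digits and that slugs are lowercase kebab-case.

```typescript
type CorpusFile =
  | { kind: "reference"; refId: string; slug: string }
  | { kind: "citations" | "radar"; refId: string };

// Classify a file name from references/, citations/ or radar/.
// ASSUMPTIONS: NNN is one or more digits; slugs are lowercase kebab-case.
// Sidecar names are checked first, so a reference whose slug is literally
// "citations" or "radar" would be misread; a real parser would likely also
// consult the directory the file came from.
function parseCorpusFileName(fileName: string): CorpusFile | null {
  const sidecar = /^(REF-\d+)-(citations|radar)\.md$/.exec(fileName);
  if (sidecar) {
    return { kind: sidecar[2] as "citations" | "radar", refId: sidecar[1] };
  }
  const reference = /^(REF-\d+)-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$/.exec(fileName);
  if (reference) {
    return { kind: "reference", refId: reference[1], slug: reference[2] };
  }
  return null; // not a recognised corpus file name
}
```

Because sidecars carry no slug, the `refId` is the only key needed to join a reference to its citation and radar files.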

ID conventions

  • `REF-NNN` — research papers / sources (citation-network nodes).
  • `PROF-{P|O|G|F|S}-<slug>` — entity profiles (person / org / group / funder / source-curator).
  • `AREA-<slug>`, `SYN-NNN` — taxonomy areas, synthesis essays.
  • Reference docs use `ref_id` (underscore) in frontmatter; every sidecar/profile uses hyphenated keys (`ref`, `prof-id`, …).
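The ID families listed above could be recognised with a small classifier like the one below. The names `parseCorpusId`, `CorpusId`, and `PROFILE_KINDS` are hypothetical, and the sketch assumes that `NNN` is purely numeric and that a slug is any non-empty remainder.

```typescript
type ProfileKind = "person" | "org" | "group" | "funder" | "source-curator";

// Letter codes from the PROF-{P|O|G|F|S} convention.
const PROFILE_KINDS: { [letter: string]: ProfileKind } = {
  P: "person",
  O: "org",
  G: "group",
  F: "funder",
  S: "source-curator",
};

type CorpusId =
  | { family: "REF"; id: string }
  | { family: "SYN"; id: string }
  | { family: "PROF"; kind: ProfileKind; slug: string }
  | { family: "AREA"; slug: string };

// Classify an id string into one of the families above.
// ASSUMPTION: NNN is digits only and a slug is any non-empty remainder.
function parseCorpusId(id: string): CorpusId | null {
  if (/^REF-\d+$/.test(id)) return { family: "REF", id };
  if (/^SYN-\d+$/.test(id)) return { family: "SYN", id };
  const prof = /^PROF-([POGFS])-(.+)$/.exec(id);
  if (prof) return { family: "PROF", kind: PROFILE_KINDS[prof[1]], slug: prof[2] };
  const area = /^AREA-(.+)$/.exec(id);
  if (area) return { family: "AREA", slug: area[1] };
  return null;
}
```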

Artifact schemas

Reference / analysis doc — `references/REF-NNN-<slug>.md`

Frontmatter: `ref_id`, `title`, `year`, `pdf_hash` (+ optional `frontmatter-backfilled[-by]`). Body carries `## GRADE Quality` (A/B/C/LOW), `## Referenced By`, `## Entity Profiles` (Role→Entity→Profile table), `## Document Classification`. Two templates by source type: full academic (`REFERENCE-TEMPLATE.md`) and condensed web/blog (`REFERENCE-TEMPLATE-blog.md`) — see the extensible source-type model (#1509).

Citation sidecar — `citations/REF-NNN-citations.md`

Frontmatter:

  • `ref`, `title`, `type: citations`
  • `authors[]` — strings (`"Last, First"`) or objects (`{name, orcid, prof-id}`)
  • `affiliation-primary`, `affiliation-status` (optional)
  • `funders[]` — `{id (→ PROF-F slug or raw name), grant-id}` (string entries allowed)
  • `discovery` (optional, SOURCE-TRACKING) — `{date, surface, via, curator-id (→ PROF-S), harvest-batch, harvested-by}`. `surface` vocab: `x-account|x-search|x-bookmarks|x-foryou|x-following|rss|newsletter|web|referral|direct`.
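Because `authors[]` and `funders[]` accept both string and object entries, a consumer usually normalizes them to one shape first. The sketch below is one way to do that; the type and function names are hypothetical, and treating `orcid`, `prof-id`, and `grant-id` as optional is an assumption rather than something the schema above states.

```typescript
// Entry shapes as they may appear in citation-sidecar frontmatter.
// ASSUMPTION: orcid, prof-id and grant-id are optional on object entries.
type AuthorEntry = string | { name: string; orcid?: string; "prof-id"?: string };
type FunderEntry = string | { id: string; "grant-id"?: string };

interface NormalizedAuthor { name: string; orcid: string | null; profId: string | null }
interface NormalizedFunder { id: string; grantId: string | null }

// Collapse both author shapes into one record.
function normalizeAuthor(entry: AuthorEntry): NormalizedAuthor {
  if (typeof entry === "string") return { name: entry, orcid: null, profId: null };
  return { name: entry.name, orcid: entry.orcid || null, profId: entry["prof-id"] || null };
}

// Collapse both funder shapes; a string entry is kept as the id
// (a PROF-F slug or a raw funder name).
function normalizeFunder(entry: FunderEntry): NormalizedFunder {
  if (typeof entry === "string") return { id: entry, grantId: null };
  return { id: entry.id, grantId: entry["grant-id"] || null };
}

// The `surface` vocabulary from the discovery block.
const SURFACES = [
  "x-account", "x-search", "x-bookmarks", "x-foryou", "x-following",
  "rss", "newsletter", "web", "referral", "direct",
] as const;
type Surface = (typeof SURFACES)[number];

function isSurface(value: string): value is Surface {
  return (SURFACES as readonly string[]).indexOf(value) !== -1;
}
```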

Body: `## Outgoing` / `## Incoming` edge tables; last column = `Inducted REF` (`REF-NNN` or `—`).
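Reading the `Inducted REF` column from an edge table row might look like the sketch below. The function name `inductedRefOf` is hypothetical, only the last column is interpreted because the other columns are not specified here, and escaped pipes inside cells are not handled.

```typescript
// Extract the `Inducted REF` value from one markdown table row.
// Returns the REF id, or null when the last cell is the dash placeholder,
// a header label, or a separator row.
// ASSUMPTION: cells contain no escaped "|" characters.
function inductedRefOf(row: string): string | null {
  const cells = row
    .trim()
    .replace(/^\|/, "")
    .replace(/\|$/, "")
    .split("|")
    .map((cell) => cell.trim());
  const last = cells[cells.length - 1];
  return /^REF-\d+$/.test(last) ? last : null;
}
```

Mapping this function over the rows under `## Outgoing` and dropping the `null` results gives the set of cited sources that are already inducted into the corpus.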

Radar sidecar — `radar/REF-NNN-radar.md`

Frontmatter: `ref`, `title`, `type: radar`, `refresh-cadence` (enum word: `monthly|quarterly|biannual|annual|on-demand`, not a duration), `last-refreshed` (ISO date), `last-refreshed-by`, `cluster`, `grade-original`, `grade-current` (A–D), `grade-trajectory` (`rising|stable|declining|superseded|retracted`, though freeform values appear in practice), `sources-searched[]`.

Cadence→days: `monthly`=30, `quarterly`=90, `biannual`=180, `annual`=365, `on-demand`=never.
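The cadence mapping lends itself to a simple staleness check. The following is a minimal sketch with hypothetical names (`CADENCE_DAYS`, `isRefreshDue`); it assumes a sidecar is due once at least the mapped number of days has elapsed since `last-refreshed`, which is one reasonable reading of the table rather than a documented rule.

```typescript
type RefreshCadence = "monthly" | "quarterly" | "biannual" | "annual" | "on-demand";

// Cadence word to days; null means the sidecar never becomes due by itself.
const CADENCE_DAYS: Record<RefreshCadence, number | null> = {
  monthly: 30,
  quarterly: 90,
  biannual: 180,
  annual: 365,
  "on-demand": null,
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Both dates are ISO `YYYY-MM-DD` strings, which Date.parse reads as UTC.
// ASSUMPTION: "due" means at least `days` full days have elapsed.
function isRefreshDue(cadence: RefreshCadence, lastRefreshed: string, today: string): boolean {
  const days = CADENCE_DAYS[cadence];
  if (days === null) return false;
  const elapsedMs = Date.parse(today) - Date.parse(lastRefreshed);
  return elapsedMs >= days * MS_PER_DAY; // NaN (unparseable date) yields false
}
```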

Entity profiles — `profiles/{people,orgs,groups,funders}/PROF-{P,O,G,F}-<slug>.md`

Shared frontmatter: `prof-id`, `name`, `type`, `affiliation`, `aliases[]`, `corpus-refs[]`, `refresh-cadence`, `last-refreshed`, `grade-influence` (A–D), `grade-trajectory`, `sources-searched[]`. Orgs add `research-areas[]`; groups add `parent-org`.

`corpus-refs` has two shapes in the wild — list-of-strings (`['REF-1']`) and list-of-dicts (`[{ref, role}]`). The parser (`loadProfiles`) normalizes both to REF-id strings.
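The normalization of the two `corpus-refs` shapes can be illustrated as below. This is a sketch of the idea only, not the body of `loadProfiles`; `normalizeCorpusRefs` is a hypothetical helper, and silently skipping malformed entries is an assumption.

```typescript
// Normalize `corpus-refs` to a flat list of REF-id strings.
// Accepts list-of-strings and list-of-dicts ({ref, role}) entries, mixed or not.
// ASSUMPTION: entries of any other shape are skipped rather than reported.
function normalizeCorpusRefs(raw: unknown): string[] {
  if (!Array.isArray(raw)) return [];
  const refs: string[] = [];
  for (const entry of raw) {
    if (typeof entry === "string") {
      refs.push(entry);
    } else if (entry !== null && typeof entry === "object") {
      const ref = (entry as { ref?: unknown }).ref;
      if (typeof ref === "string") refs.push(ref);
    }
  }
  return refs;
}
```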

Curator profile — `profiles/sources/PROF-S-<slug>.md`

Distinct schema: `prof-id`, `name`, `type: source`, `platform`, `handle`, `url`, `corpus-refs[]`, `surfaces[]`, `focus-areas[]`, `signal-quality` (A–D), `revisit-cadence` (`daily|weekly|biweekly|monthly|on-demand`), `last-harvested`, `candidate-yield`. Slug = handle lowercased, leading punctuation stripped, `_`→`-`.
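The slug rule can be sketched in a few lines. `curatorSlug` and `curatorProfId` are hypothetical names, and the sketch assumes "leading punctuation" means any run of leading characters outside ASCII letters and digits (for example `@`).

```typescript
// Derive a PROF-S slug from a platform handle:
// lowercase, strip leading punctuation, then replace "_" with "-".
// ASSUMPTION: "punctuation" is any leading character outside [a-z0-9].
function curatorSlug(handle: string): string {
  return handle
    .toLowerCase()
    .replace(/^[^a-z0-9]+/, "")
    .replace(/_/g, "-");
}

// Full profile id for a handle, following the PROF-S-<slug> convention.
function curatorProfId(handle: string): string {
  return "PROF-S-" + curatorSlug(handle);
}
```

Under these assumptions a handle such as `@Some_Handle` maps to the slug `some-handle` and the profile file `PROF-S-some-handle.md`.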

Bidirectional invariants

  • Citation `authors[].prof-id` ↔ PROF-P `corpus-refs`.
  • Citation `discovery.curator-id` ↔ PROF-S `corpus-refs` (orphan-checked: a set `curator-id` requires a matching PROF-S corpus-ref).
  • Citation `funders[].id` ↔ PROF-F `corpus-refs`.
  • Reference `## Entity Profiles` table ↔ each profile's `corpus-refs`.
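As an example of checking one of these invariants, the sketch below reports citation sidecars whose `curator-id` has no matching PROF-S `corpus-refs` entry. The record shapes and the function `findOrphanCuratorLinks` are hypothetical, and the sketch assumes `curator-id` holds the full `PROF-S-<slug>` id.

```typescript
// Minimal views of the two artifacts involved in the curator invariant.
interface CitationLink { ref: string; curatorId: string | null }
interface CuratorProfile { profId: string; corpusRefs: string[] }

// Return the REF ids whose curator-id is set but not mirrored by a
// matching PROF-S profile listing that REF in its corpus-refs.
// ASSUMPTION: curator-id stores the full PROF-S-<slug> id.
function findOrphanCuratorLinks(
  citations: CitationLink[],
  curators: CuratorProfile[]
): string[] {
  const orphans: string[] = [];
  for (const citation of citations) {
    if (citation.curatorId === null) continue; // unset curator-id is allowed
    const matches = curators.filter((p) => p.profId === citation.curatorId);
    const mirrored = matches.some((p) => p.corpusRefs.indexOf(citation.ref) !== -1);
    if (!mirrored) orphans.push(citation.ref);
  }
  return orphans;
}
```

The same pattern would apply to the author and funder invariants by swapping in `authors[].prof-id` or `funders[].id` and the corresponding profile family.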

Date / value formats

  • Dates: ISO `YYYY-MM-DD` (js-yaml parses bare dates as Date objects — the parser coerces to ISO strings).
  • Cadences: enum words, never durations.
  • GRADE: `A|B|C|D` (and `HIGH|MODERATE|LOW|VERY LOW` word forms in REF-doc bodies).
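The date coercion mentioned above might be written as follows. `toIsoDate` is a hypothetical helper, not the parser's actual function; it assumes bare YAML dates arrive as `Date` objects at UTC midnight, so the first ten characters of `toISOString()` give the calendar date.

```typescript
// Coerce a frontmatter date value to an ISO `YYYY-MM-DD` string.
// Handles Date objects (bare YAML dates) and already-ISO strings;
// anything else yields null.
// ASSUMPTION: bare dates were parsed as UTC midnight.
function toIsoDate(value: unknown): string | null {
  if (value instanceof Date) {
    return isNaN(value.getTime()) ? null : value.toISOString().slice(0, 10);
  }
  if (typeof value === "string" && /^\d{4}-\d{2}-\d{2}$/.test(value)) {
    return value;
  }
  return null;
}
```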

See also

  • `src/artifacts/corpus-views/ref-parser.ts` — the shared parser (this model in code).
  • `aiwg index build` — renders the markdown index views from this model (#1490).
  • Source-type extensibility: #1509. Subsystem tooling: radar #1498, discovery #1499, funder #1500.