Eval Loop

The eval loop is nlp-prod's production quality gate. The most important property is strict isolation

Eval Loop: Isolation Protocol

The eval loop is `nlp-prod`'s production quality gate. The most important property is strict isolation between the generator and the evaluator.

Why Isolation Matters

If the evaluator has access to the generator's reasoning, system prompt, or chain-of-thought, it is not an independent evaluator — it is a rubber stamp. Contaminated evaluation produces inflated scores, misses real failures, and gives false confidence.

The eval loop is modelled on the principle that a good reviewer is one who knows nothing about how the answer was produced — only whether the answer is correct.

Isolation Invariants

These invariants are enforced and checked before every eval run:

Invariant	Enforcement
Evaluator prompt is a separate file	`prompts/evaluator.prompt.md` never merged with generator
Evaluator receives only `{{input}}` and `{{output}}`	No `{{steps}}`, `{{chain_of_thought}}`, `{{context}}`
Evaluator does not know generator model	Model selection is outside evaluator scope
Evaluator uses a different (cheaper) model	`eval_model: haiku` separate from generator model
Evaluator has no Write/Bash tools	`eval-reviewer` agent: Read-only tools only

Isolation Violations (Anti-Patterns)

Violation 1: Merged files

# WRONG — generator.prompt.md
## System
You are an extractor...

## Evaluator
Score the above output...

The evaluator section must be its own file. No exceptions.

Violation 2: Generator context in evaluator template

# WRONG — evaluator.prompt.md
Generator's system prompt: {{system_prompt}}
Generator used steps: {{intermediate_steps}}
Score the output: {{output}}

Only `{{input}}` and `{{output}}`. The evaluator scores what the user would see.

Violation 3: Chain-of-thought leakage

# WRONG — passing full generator response to evaluator
output = generator_response.content[0].text    # includes COT
evaluate(input, output)                         # evaluator sees COT

If the generator produces chain-of-thought before the final answer, extract ONLY the final answer before passing to evaluator.

Correct Eval Loop Implementation

# Correct isolation
def generate(client, input_text):
    system, user = load_prompt("generator.prompt.md", {"input": input_text})
    raw = client.messages.create(model=GENERATOR_MODEL, ...)
    output = extract_final_answer(raw.content[0].text)  # strip COT if present
    return output

def evaluate(client, input_text, output):
    # ONLY input and output — no generator context
    system, user = load_prompt("evaluator.prompt.md", {
        "input": input_text,
        "output": output,    # final answer only
    })
    raw = client.messages.create(model=EVALUATOR_MODEL, ...)  # separate model
    return json.loads(raw.content[0].text)

When the evaluator returns `pass: false`, the feedback is fed back to the generator for refinement:

Attempt 1: generate(input) → output → eval → fail (score 0.4)
Attempt 2: generate(input + "Previous feedback: " + feedback) → output → eval → fail (score 0.7)
Attempt 3: generate(input + "Previous feedback: " + feedback) → output → eval → pass (score 0.91)

Key: the feedback sent to the generator is the evaluator's `feedback` string — not a copy of the evaluator prompt. The generator does not learn the evaluator's rubric.

Scoring Schema

Every evaluation returns:

{
  "score": 0.0,
  "pass": false,
  "feedback": "specific, actionable description",
  "rubric_scores": {
    "criterion": 0.0
  },
  "failure_category": "format|content|hallucination|missing_field|other",
  "suggested_fix": "one-sentence prompt revision recommendation"
}

`score` is the weighted average of rubric scores
`pass` is `score >= pass_threshold`
`feedback` is addressed to the generator (actionable for refinement)
`suggested_fix` is addressed to the prompt engineer (for prompt revision)

Eval vs Ralph Loop

Dimension	Ralph Loop	nlp-prod Eval Loop
Purpose	Iterative development	Production quality gate
Isolation	Shared session	Strict — evaluator has no generator context
Output	Working implementation	Pass/fail + structured feedback
Persistence	`.aiwg/ralph/`	`eval/results.jsonl` (append-only)
Cost tracking	Session tokens	Per-call cost in results schema
Termination	Completion criteria	pass_threshold OR max_attempts
Human loop	Issue thread comments	Optional via `--interactive`

Use Ralph for development iteration. Use the eval loop as a production quality gate.

Contamination Detection

The `eval-reviewer` agent and eval loop runner both check for contamination before scoring:

CONTAMINATION_SIGNALS = [
    "{{steps}}",
    "{{chain_of_thought}}",
    "{{intermediate}}",
    "generator_context",
    "system_prompt",
]

def check_contamination(evaluator_prompt: str) -> bool:
    return any(signal in evaluator_prompt for signal in CONTAMINATION_SIGNALS)

If contamination is detected:

Run stops
Error message explains the violation
`contamination_warning: true` set in results if eval ran before detection