Eval Loop
The eval loop is nlp-prod's production quality gate. The most important property is strict isolation
Eval Loop: Isolation Protocol
The eval loop is `nlp-prod`'s production quality gate. The most important property is strict isolation between the generator and the evaluator.
Why Isolation Matters
If the evaluator has access to the generator's reasoning, system prompt, or chain-of-thought, it is not an independent evaluator — it is a rubber stamp. Contaminated evaluation produces inflated scores, misses real failures, and gives false confidence.
The eval loop is modelled on the principle that a good reviewer is one who knows nothing about how the answer was produced — only whether the answer is correct.
Isolation Invariants
These invariants are enforced and checked before every eval run:
| Invariant | Enforcement |
|---|---|
| Evaluator prompt is a separate file | `prompts/evaluator.prompt.md` never merged with generator |
| Evaluator receives only `{{input}}` and `{{output}}` | No `{{steps}}`, `{{chain_of_thought}}`, `{{context}}` |
| Evaluator does not know generator model | Model selection is outside evaluator scope |
| Evaluator uses a different (cheaper) model | `eval_model: haiku` separate from generator model |
| Evaluator has no Write/Bash tools | `eval-reviewer` agent: Read-only tools only |
Isolation Violations (Anti-Patterns)
Violation 1: Merged files
# WRONG — generator.prompt.md
## System
You are an extractor...
## Evaluator
Score the above output...
The evaluator section must be its own file. No exceptions.
Violation 2: Generator context in evaluator template
# WRONG — evaluator.prompt.md
Generator's system prompt: {{system_prompt}}
Generator used steps: {{intermediate_steps}}
Score the output: {{output}}
Only `{{input}}` and `{{output}}`. The evaluator scores what the user would see.
Violation 3: Chain-of-thought leakage
# WRONG — passing full generator response to evaluator
output = generator_response.content[0].text # includes COT
evaluate(input, output) # evaluator sees COT
If the generator produces chain-of-thought before the final answer, extract ONLY the final answer before passing to evaluator.
Correct Eval Loop Implementation
# Correct isolation
def generate(client, input_text):
system, user = load_prompt("generator.prompt.md", {"input": input_text})
raw = client.messages.create(model=GENERATOR_MODEL, ...)
output = extract_final_answer(raw.content[0].text) # strip COT if present
return output
def evaluate(client, input_text, output):
# ONLY input and output — no generator context
system, user = load_prompt("evaluator.prompt.md", {
"input": input_text,
"output": output, # final answer only
})
raw = client.messages.create(model=EVALUATOR_MODEL, ...) # separate model
return json.loads(raw.content[0].text)
Refinement Loop
When the evaluator returns `pass: false`, the feedback is fed back to the generator for refinement:
Attempt 1: generate(input) → output → eval → fail (score 0.4)
Attempt 2: generate(input + "Previous feedback: " + feedback) → output → eval → fail (score 0.7)
Attempt 3: generate(input + "Previous feedback: " + feedback) → output → eval → pass (score 0.91)
Key: the feedback sent to the generator is the evaluator's `feedback` string — not a copy of the evaluator prompt. The generator does not learn the evaluator's rubric.
Scoring Schema
Every evaluation returns:
{
"score": 0.0,
"pass": false,
"feedback": "specific, actionable description",
"rubric_scores": {
"criterion": 0.0
},
"failure_category": "format|content|hallucination|missing_field|other",
"suggested_fix": "one-sentence prompt revision recommendation"
}
- `score` is the weighted average of rubric scores
- `pass` is `score >= pass_threshold`
- `feedback` is addressed to the generator (actionable for refinement)
- `suggested_fix` is addressed to the prompt engineer (for prompt revision)
Eval vs Ralph Loop
| Dimension | Ralph Loop | nlp-prod Eval Loop |
|---|---|---|
| Purpose | Iterative development | Production quality gate |
| Isolation | Shared session | Strict — evaluator has no generator context |
| Output | Working implementation | Pass/fail + structured feedback |
| Persistence | `.aiwg/ralph/` | `eval/results.jsonl` (append-only) |
| Cost tracking | Session tokens | Per-call cost in results schema |
| Termination | Completion criteria | pass_threshold OR max_attempts |
| Human loop | Issue thread comments | Optional via `--interactive` |
Use Ralph for development iteration. Use the eval loop as a production quality gate.
Contamination Detection
The `eval-reviewer` agent and eval loop runner both check for contamination before scoring:
CONTAMINATION_SIGNALS = [
"{{steps}}",
"{{chain_of_thought}}",
"{{intermediate}}",
"generator_context",
"system_prompt",
]
def check_contamination(evaluator_prompt: str) -> bool:
return any(signal in evaluator_prompt for signal in CONTAMINATION_SIGNALS)
If contamination is detected:
- Run stops
- Error message explains the violation
- `contamination_warning: true` set in results if eval ran before detection