Progress Tracker Examples

Externalized from the agent definition per the few-shot-examples rule (#1587).

Progress Tracker — Worked Examples

Externalized from the agent definition per the few-shot-examples rule (#1587).

Example 1: Baseline Capture (Simple)

Input:

Agent loop starting. Capture baseline metrics for authentication module.

Thought Process:

Goal: Capture iteration 0 baseline metrics before any changes

Extraction: From test execution:
- test_count: 8
- tests_passed: 5
- tests_failed: 3
- coverage_percentage: 65%
- error_count: 12
- file_count: 3
- loc_total: 450

Reasoning: This baseline represents the initial state. All future iterations
will be compared against these metrics to detect progress or regression.

Synthesis: Baseline captured successfully. Ready to track iteration 1+.

Output:

iteration_metrics:
  iteration: 0
  timestamp: "2026-02-02T10:00:00Z"
  classification: baseline

  testing:
    test_count: 8
    tests_passed: 5
    tests_failed: 3
    tests_skipped: 0
    pass_rate: 62.5
    coverage_percentage: 65.0
    coverage_lines_covered: 292
    coverage_lines_total: 450

  quality:
    lint_errors: 8
    lint_warnings: 4
    type_errors: 0
    build_status: success

  codebase:
    file_count: 3
    loc_total: 450
    complexity_score: 12.5

  quality_score: 0.60
  is_best_iteration: false

baseline_stored: true
next_action: await_iteration_1

Why This Is Good:

  • Captures comprehensive baseline before any work
  • Documents all metric categories
  • Sets foundation for delta calculations
  • Clear classification as baseline

Example 2: Regression Detection (Moderate)

Input:

Iteration 3 complete. Previous iterations:

  • Iteration 1: 8 tests, 6 pass (75%), coverage 70%
  • Iteration 2: 10 tests, 8 pass (80%), coverage 75%
  • Iteration 3: 9 tests, 7 pass (77.8%), coverage 72%

Thought Process:

Goal: Assess iteration 3 progress and check for regression

Extraction: From iteration 3 metrics:
- test_count: 9 (was 10)
- tests_passed: 7 (was 8)
- pass_rate: 77.8% (was 80%)
- coverage: 72% (was 75%)

Exception: Wait, test count DECREASED from 10 to 9. This is a CRITICAL
regression - tests are being deleted, not added.

Reasoning: Multiple regression indicators detected:
1. Test count decreased 10→9 (CRITICAL: test deletion)
2. Pass count decreased 8→7 (HIGH: functionality lost)
3. Coverage dropped 75%→72% (HIGH: exceeds -2% threshold)

Synthesis: Iteration 3 shows CRITICAL regression across 3 dimensions.
Recommend immediate rollback to iteration 2 (best so far).

Output:

iteration_metrics:
  iteration: 3
  timestamp: "2026-02-02T10:15:00Z"
  classification: regression

  testing:
    test_count: 9
    tests_passed: 7
    tests_failed: 2
    pass_rate: 77.8
    coverage_percentage: 72.0

  deltas:
    from_previous:
      test_count_delta: -1
      pass_rate_delta: -2.2
      coverage_delta: -3.0
    from_baseline:
      test_count_delta: +1
      pass_rate_delta: +15.3
      coverage_delta: +7.0

  quality_score: 0.75

  alerts:
    - severity: CRITICAL
      type: test_deletion
      message: "Test count decreased from 10 to 9"
      triggered_at: iteration_3

    - severity: HIGH
      type: coverage_regression
      message: "Coverage dropped 75%→72% (exceeds -2% threshold)"
      triggered_at: iteration_3

    - severity: HIGH
      type: functionality_loss
      message: "Passing tests decreased from 8 to 7"
      triggered_at: iteration_3

best_iteration_tracker:
  current_best: iteration_2
  best_quality_score: 0.80
  current_iteration_score: 0.75

recommendation:
  action: rollback
  target_iteration: 2
  reason: "Multiple CRITICAL regressions detected"
  confidence: 0.95

Why This Is Good:

  • Detected test deletion immediately
  • Identified multiple regression dimensions
  • Provided clear rollback recommendation
  • Preserved best iteration reference
  • Calculated deltas from both previous and baseline

Example 3: Best Output Selection on Loop Completion (Complex)

Input:

Agent loop completed with 5 iterations. Select best output.

Iteration history:

  • Iteration 1: quality_score=0.65, tests=6/8, coverage=68%
  • Iteration 2: quality_score=0.82, tests=9/10, coverage=78%
  • Iteration 3: quality_score=0.88, tests=10/10, coverage=85% ← PEAK
  • Iteration 4: quality_score=0.85, tests=10/10, coverage=83%
  • Iteration 5: quality_score=0.81, tests=9/10, coverage=80% (final)

Thought Process:

Goal: Select the best iteration output from 5 iterations based on quality scores

Extraction: From iteration history:
- Highest quality: Iteration 3 (0.88)
- Final iteration: Iteration 5 (0.81)
- Quality degraded after iteration 3

Reasoning: Per REF-015 Self-Refine, quality fluctuates during refinement.
Peak quality occurred at iteration 3 (0.88), then degraded in iterations 4-5.
We MUST select iteration 3, not the final iteration.

Exception: The final iteration (5) has LOWER quality (0.81) than iteration 3.
This is exactly the pattern REF-015 warns about - over-refinement degradation.

Synthesis: Iteration 3 is the optimal output with quality_score=0.88,
exceeding final iteration by +8.6%. This demonstrates the importance of
non-monotonic selection.

Output:

best_output_selection:
  loop_id: ralph-001
  total_iterations: 5
  selected_iteration: 3
  final_iteration: 5

  quality_trajectory:
    iteration_1: 0.65
    iteration_2: 0.82
    iteration_3: 0.88  # ← SELECTED (PEAK)
    iteration_4: 0.85
    iteration_5: 0.81  # (final)

  selection_criteria:
    primary: highest_quality_score
    threshold: 0.70
    verification_required: true

  selected_iteration_details:
    iteration: 3
    quality_score: 0.88
    timestamp: "2026-02-02T10:20:00Z"
    testing:
      test_count: 10
      tests_passed: 10
      pass_rate: 100.0
      coverage_percentage: 85.0
    quality:
      lint_errors: 0
      build_status: success
    artifacts_path: ".aiwg/ralph/ralph-001/iterations/iteration-003/"

  comparison_to_final:
    final_quality: 0.81
    selected_quality: 0.88
    delta: +0.07
    improvement_percentage: 8.6
    reason_final_not_selected: "Quality degraded after iteration 3"

  selection_rationale:
    - "Iteration 3 achieved peak quality (0.88)"
    - "All tests passing (10/10)"
    - "Highest coverage (85%)"
    - "Quality degraded in iterations 4-5"
    - "Per REF-015, select peak quality, not final iteration"

  degradation_analysis:
    degradation_started: iteration_4
    pattern: over_refinement
    iterations_after_peak: 2
    quality_loss: -7.95%

  artifacts_applied:
    - source: ".aiwg/ralph/ralph-001/iterations/iteration-003/"
    - destination: ".aiwg/output/"
    - files_copied: ["src/auth/login.ts", "test/auth/login.test.ts"]

  report_generated: ".aiwg/ralph/ralph-001/reports/output-selection.md"

Why This Is Good:

  • Selected best iteration (3), not final (5)
  • Quantified improvement over final (+8.6%)
  • Identified degradation pattern (over-refinement)
  • Provided clear rationale with REF-015 citation
  • Detailed comparison and artifact paths
  • Demonstrates non-monotonic selection principle

Example 4: Conversation Pattern (Al Orchestrator integration)

Al Orchestrator → Progress Tracker: "Iteration 1 complete"
Progress Tracker → Al Orchestrator: "Forward progress detected, continue"

Al Orchestrator → Progress Tracker: "Iteration 3 complete"
Progress Tracker → Al Orchestrator: "ALERT: Coverage regression, recommend rollback"

Al Orchestrator → Progress Tracker: "Loop complete, select best output"
Progress Tracker → Al Orchestrator: "Selected iteration 2 (quality: 0.88 vs final 0.81)"

Reference Templates and Formulas

Externalized from the agent definition (#1600). The agent definition keeps the

rules and weights; these are the full demonstrative templates.

Quality Score Calculation (full formula)

quality_score_formula:
  dimensions:
    validation:
      weight: 0.30
      components:
        - all_tests_pass: 100 if all pass, else (passed/total)*100
        - build_success: 100 if success, else 0
        - no_lint_errors: 100 if 0 errors, else max(0, 100 - errors*5)

    completeness:
      weight: 0.25
      components:
        - coverage_percentage: coverage_percentage
        - test_count_vs_baseline: (current/baseline)*100

    correctness:
      weight: 0.25
      components:
        - pass_rate: pass_rate
        - error_count_inverted: max(0, 100 - error_count*2)

    readability:
      weight: 0.10
      components:
        - lint_warnings_inverted: max(0, 100 - warnings*3)
        - complexity_reasonable: max(0, 100 - complexity*5)

    efficiency:
      weight: 0.10
      components:
        - loc_appropriate: if loc within 20% of baseline: 100, else reduced
        - no_code_bloat: if loc_delta > 50%: 50, else 100

  calculation:
    1_compute_each_dimension_score
    2_weighted_sum = sum(dimension_score * weight)
    3_normalize_to_0_1_scale
    4_threshold_for_acceptance = 0.70

Iteration Report Template

## Iteration {N} Progress Report

**Timestamp**: {timestamp}
**Classification**: {forward|plateau|regression|stalled}
**Quality Score**: {quality_score}

### Metrics

| Metric | Current | Previous | Delta | Baseline | Delta from Baseline |
|--------|---------|----------|-------|----------|---------------------|
| Test Count | {curr} | {prev} | {delta} | {base} | {delta_base} |
| Pass Rate | {curr}% | {prev}% | {delta}% | {base}% | {delta_base}% |
| Coverage | {curr}% | {prev}% | {delta}% | {base}% | {delta_base}% |
| Errors | {curr} | {prev} | {delta} | {base} | {delta_base} |

### Quality Score Breakdown

| Dimension | Score | Weight | Contribution |
|-----------|-------|--------|--------------|
| Validation | {score} | 0.30 | {contrib} |
| Completeness | {score} | 0.25 | {contrib} |
| Correctness | {score} | 0.25 | {contrib} |
| Readability | {score} | 0.10 | {contrib} |
| Efficiency | {score} | 0.10 | {contrib} |
| **Total** | **{total}** | 1.00 | **{total}** |

### Alerts

{alert_list or "No alerts"}

### Best Iteration Tracker

- **Current Best**: Iteration {best_iteration} (quality: {best_quality})
- **This Iteration**: Iteration {curr} (quality: {curr_quality})
- **Best Preserved**: {yes|no}

### Recommendation

**Action**: {continue|stop|rollback|escalate}
**Reason**: {detailed_rationale}
**Confidence**: {confidence_score}

Loop Termination Recommendations (full logic)

termination_logic:
  recommend_stop:
    conditions:
      - stalled: true
      - infinite_loop_detected: true
      - critical_regression: true
    message: "Loop should terminate due to {reason}"

  recommend_continue:
    conditions:
      - forward_progress: true
      - quality_score < target_threshold
      - iteration_count < max_iterations
    message: "Continue - forward progress detected"

  recommend_rollback:
    conditions:
      - regression_detected: true
      - current_quality < best_quality - 0.1
    message: "Rollback to iteration {best} due to regression"

  escalate:
    conditions:
      - infinite_loop_pattern: true
      - metric_cycling: true
      - uncertainty_high: true
    message: "Escalate to human - {issue} detected"

Agent Loop Hook Points (full integration config)

ralph_integration:
  hooks:
    pre_loop:
      - progress_tracking.capture_baseline

    post_iteration:
      - progress_tracking.capture_metrics
      - progress_tracking.assess_progress
      - progress_tracking.update_best_iteration
      - progress_tracking.check_alerts
      - progress_tracking.generate_iteration_report

    loop_decision:
      - progress_tracking.recommend_termination
        # Returns: {action: continue|stop|rollback|escalate, reason: string}

    post_loop:
      - progress_tracking.select_best_output
      - progress_tracking.generate_final_report
      - progress_tracking.apply_selected_artifacts

Baseline Capture (full config)

baseline_capture:
  triggers:
    - ralph_loop_start
    - baseline_request

  metrics_to_capture:
    testing:
      - test_count
      - tests_passed
      - tests_failed
      - tests_skipped
      - pass_rate
      - coverage_percentage
      - coverage_lines_covered
      - coverage_lines_total

    quality:
      - lint_errors
      - lint_warnings
      - type_errors
      - build_status

    codebase:
      - file_count
      - loc_total
      - complexity_score

  storage:
    path: ".aiwg/ralph/{loop_id}/progress/iteration-000-baseline.json"
    format: yaml

Iteration Monitoring (full step config)

iteration_monitoring:
  steps:
    1_execute_tests:
      - run: npm test
      - capture: stdout/stderr
      - parse: test framework output

    2_capture_metrics:
      - test_count: from test output
      - pass_rate: calculated from results
      - coverage: from coverage report
      - error_count: from linter/compiler
      - complexity: from complexity tools

    3_calculate_deltas:
      - from_previous: iteration N vs N-1
      - from_baseline: iteration N vs iteration 0

    4_compute_quality_score:
      - validation: 0.30 weight
      - completeness: 0.25 weight
      - correctness: 0.25 weight
      - readability: 0.10 weight
      - efficiency: 0.10 weight

    5_classify_iteration:
      - forward: tests↑, coverage↑, errors↓
      - plateau: metrics stable
      - regression: tests↓, coverage↓, errors↑
      - stalled: no change for 3+ iterations

    6_update_best_tracker:
      - if quality_score > current_best:
          current_best = iteration_N

Best Iteration Tracking (full config, REF-015)

best_iteration_tracking:
  initialize:
    current_best: null
    best_quality_score: 0.0
    best_artifacts_path: null

  update_on_each_iteration:
    if quality_score > best_quality_score:
      current_best = iteration_N
      best_quality_score = quality_score
      best_artifacts_path = snapshot_path

  preserve_artifacts:
    snapshot_all_iterations: true
    snapshot_path: ".aiwg/ralph/{loop_id}/iterations/iteration-{N:03d}/"
    include:
      - all_modified_files
      - test_results
      - coverage_report
      - metrics.json

  selection_algorithm:
    # DO NOT use final iteration blindly
    on_loop_completion:
      - load_all_iterations
      - find_max_quality_score
      - select_that_iteration
      - log_selection_decision
      - apply_selected_artifacts

Infinite Loop Detection (full config, REF-076)

infinite_loop_detection:
  metric_signature:
    components:
      - test_count
      - pass_rate
      - coverage_percentage
      - error_count

  detection:
    window: 5  # Check last 5 iterations
    trigger:
      - current_signature matches previous_signature
      - iteration_count > 10

  action:
    severity: CRITICAL
    response: force_terminate
    message: "Infinite loop detected: metrics cycling"
    preserve_state: true

Stall Detection (full config)

stall_detection:
  criteria:
    - last_3_iterations.all(classification == plateau)
    - quality_score_variance < 0.02
    - no_metric_improvement

  recommendation:
    action: suggest_termination
    message: "No meaningful progress for 3 iterations"
    alternatives:
      - "Consider different approach"
      - "Request human intervention"
      - "Try alternative strategy"

Anti-Regression Alerts (full trigger config)

alert_triggers:
  CRITICAL:
    - test_count_decreased:
        condition: "test_count < previous_iteration.test_count"
        message: "Test count decreased from {prev} to {curr}"
        action: immediate_alert_and_rollback

    - working_tests_failing:
        condition: "tests_passed < previous_iteration.tests_passed"
        message: "Previously passing tests now failing"
        action: immediate_alert_and_rollback

  HIGH:
    - coverage_regression:
        condition: "coverage_delta < -2.0"
        message: "Coverage dropped {delta}%"
        action: alert_and_flag_iteration

    - error_increase:
        condition: "error_count > previous + 5"
        message: "Error count increased by {delta}"
        action: alert_regression

  MEDIUM:
    - file_deletion:
        condition: "file_count < previous_iteration.file_count"
        message: "File count decreased (potential code deletion)"
        action: alert_and_review

    - complexity_explosion:
        condition: "complexity_delta > 0.5"
        message: "Complexity increased >50%"
        action: alert_complexity