Executable Feedback Loop Guide
Practical guide for implementing the execute-before-return pattern in code-generating agents. Based on REF-013 MetaGPT research findings.
Overview
The executable feedback loop ensures code-generating agents test their output before returning it. The MetaGPT findings (REF-013) report that this pattern yields a 4.2 percentage-point Pass@1 improvement on HumanEval and reduces human revision cost by 63% (from 2.25 revision cycles to 0.83).
Core Principle
Generate code → Execute tests → Pass? → Return to user
                              → Fail? → Analyze → Fix → Re-execute
Never return untested code. Every code artifact must pass through the feedback loop before delivery.
Quick Start
Minimal Feedback Loop
For agents generating code, the minimum viable loop:
# 1. Generate code
code_artifact:
  path: "src/utils/validate.ts"
  language: typescript
  code_type: new_function
# 2. Generate tests
test_files:
  - "test/unit/utils/validate.test.ts"
# 3. Execute
execution_config:
  test_framework: jest
  test_command: "npx jest test/unit/utils/validate.test.ts"
# 4. Retry on failure (max 3 attempts)
retry_policy:
  max_attempts: 3
  escalation_on_max: true
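The retry policy above can be sketched as a small driver function. This is a minimal illustration rather than the framework's implementation: `runFeedbackLoop`, `TestRun`, `LoopOutcome`, and the injected `runTests` and `fix` callbacks are hypothetical names, and the stub callbacks stand in for a real Jest run and a real fix step.

```typescript
// Hypothetical shapes; the guide does not prescribe these names.
interface TestRun {
  passed: number;
  failed: number;
  output: string;
}

interface LoopOutcome {
  outcome: "passed" | "escalated";
  attemptsUsed: number;
  code: string;
  runs: TestRun[];
}

// Tests the candidate code, applies a fix between attempts, and stops
// at maxAttempts (retry_policy.max_attempts in the config above).
function runFeedbackLoop(
  initialCode: string,
  runTests: (code: string) => TestRun,
  fix: (code: string, failedRun: TestRun) => string,
  maxAttempts: number
): LoopOutcome {
  let code = initialCode;
  const runs: TestRun[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const run = runTests(code);
    runs.push(run);
    if (run.failed === 0) {
      return { outcome: "passed", attemptsUsed: attempt, code: code, runs: runs };
    }
    if (attempt < maxAttempts) {
      code = fix(code, run); // targeted fix derived from failure analysis
    }
  }
  // Attempts exhausted: hand over to a human (escalation_on_max: true).
  return { outcome: "escalated", attemptsUsed: maxAttempts, code: code, runs: runs };
}

// Stub callbacks: the "tests" pass once the code contains a null guard.
const demoOutcome = runFeedbackLoop(
  "function validateInput(input) { return input.length > 0; }",
  (code: string): TestRun =>
    code.indexOf("if (!input)") >= 0
      ? { passed: 8, failed: 0, output: "ok" }
      : { passed: 6, failed: 2, output: "TypeError" },
  (code: string): string => code.replace("{ ", "{ if (!input) return false; "),
  3
);
console.log(demoOutcome.outcome, demoOutcome.attemptsUsed); // passed 2
```

Injecting the test runner and the fixer keeps the loop independent of any particular test framework, so the same driver can be exercised with stubs as shown here.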
Integration with Agent Workflow
Agent receives task
│
├─ 1. Check debug memory for similar past work
│   └─ Load patterns from .aiwg/ralph/debug-memory/
│
├─ 2. Generate code
│   └─ Apply learnings from debug memory
│
├─ 3. Generate tests (if not present)
│   ├─ Happy path tests
│   ├─ Edge case tests
│   └─ Error handling tests
│
├─ 4. Execute tests
│   ├─ Capture all output
│   └─ Record in debug memory
│
├─ 5. Analyze results
│   ├─ PASS → Record success, return code
│   └─ FAIL → Analyze, fix, re-execute (up to max_attempts)
│
└─ 6. Finalize
    ├─ Update debug memory with learnings
    └─ Return code (or escalate if max attempts reached)
Phase-by-Phase Walkthrough
Phase 1: Pre-Generation (Check Debug Memory)
Before writing code, check for relevant past experience:
pre_generation:
  check_history: true
  lookback_window: 10  # Recent executions
  patterns_to_check:
    - similar_file_edits
    - same_test_failures
    - recurring_error_types
  apply_learnings: true
What to look for:
- Has this file been modified before? What went wrong?
- Are there known patterns for this type of code?
- What error types are common in this module?
Example:
Debug memory shows: src/auth/*.ts has had 3 past sessions
- Pattern: "Missing null check" occurred 2 times
- Fix template: "Add null/undefined guard at function entry"
- Preemptive action: Include null checks in initial generation
Phase 2: Code Generation
Generate code with awareness of past failures:
Agent Thought (Goal): Generate validateInput function for user registration
Agent Thought (Extraction): Debug memory shows null-check pattern for this module
Agent Thought (Reasoning): I'll include null guards upfront to avoid the known failure
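As an illustration of generating with the known failure in mind, a first draft of `validateInput` might look like the sketch below. Only the null guard comes from this guide; the `ValidationResult` shape, the `reason` field, and the empty-string rule are assumptions made for the example.

```typescript
// Hypothetical result shape; the guide only shows { valid: false }.
interface ValidationResult {
  valid: boolean;
  reason?: string;
}

function validateInput(input: string | null | undefined): ValidationResult {
  // Guard at function entry: the pattern debug memory flagged for this module.
  if (input === null || input === undefined) {
    return { valid: false, reason: "input is null or undefined" };
  }
  // Assumed rule for illustration: reject empty or whitespace-only strings.
  if (input.trim().length === 0) {
    return { valid: false, reason: "input is empty" };
  }
  return { valid: true };
}

console.log(validateInput(null).valid);    // false
console.log(validateInput("alice").valid); // true
```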
Phase 3: Test Generation
Generate tests based on code type:
| Code Type | Required Tests |
|---|---|
| New function | Happy path + edge cases + error handling |
| Bug fix | Regression test for the specific bug |
| Refactor | All existing tests must still pass |
| API endpoint | Integration + error cases + validation |
Coverage Requirements:
| Code Type | Minimum Coverage |
|---|---|
| New function | 80% |
| Bug fix | 100% of fix |
| Refactor | Match original |
| API endpoint | 90% |
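The coverage table can be expressed as a single check. The sketch below is an assumption-laden reading of it: `CodeType` values other than `new_function`, the `CoverageInput` fields, and the interpretation of "Match original" as "not below the pre-refactor baseline" are all hypothetical.

```typescript
type CodeType = "new_function" | "bug_fix" | "refactor" | "api_endpoint";

// Hypothetical measurement shape; all percentages are 0-100.
interface CoverageInput {
  overallPct: number;        // coverage of the artifact under test
  changedLinesPct?: number;  // coverage of the lines touched by a bug fix
  baselinePct?: number;      // coverage measured before a refactor
}

function meetsCoverage(codeType: CodeType, m: CoverageInput): boolean {
  if (codeType === "new_function") return m.overallPct >= 80;
  if (codeType === "api_endpoint") return m.overallPct >= 90;
  if (codeType === "bug_fix") {
    // "100% of fix": every changed line must be exercised by a test.
    return m.changedLinesPct !== undefined && m.changedLinesPct >= 100;
  }
  // Refactor: coverage must not drop below the original baseline.
  return m.baselinePct !== undefined && m.overallPct >= m.baselinePct;
}

console.log(meetsCoverage("new_function", { overallPct: 92 }));              // true
console.log(meetsCoverage("refactor", { overallPct: 74, baselinePct: 75 })); // false
```

Missing measurements (no baseline for a refactor, no changed-line figure for a bug fix) are treated as a failed check here, on the assumption that an unverifiable requirement should not pass silently.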
Phase 4: Test Execution
Execute tests and capture results:
execution:
  command: "npx jest test/unit/utils/validate.test.ts --verbose"
  timeout: 120s
  capture:
    - stdout
    - stderr
    - exit_code
    - coverage_report
Record everything — full output is needed for failure analysis.
Phase 5: Failure Analysis
When tests fail, perform structured analysis:
failure_analysis:
  - test: "test_null_input"
    error_type: "TypeError"
    error_message: "Cannot read property 'length' of null"
    stack_trace_snippet: "at validateInput (src/utils/validate.ts:42)"
    root_cause: "Missing null check in validateInput()"
    fix_strategy: "Add null/undefined guard at line 42"
    confidence: 0.95
Analysis protocol:
1. Parse error message — What type of error? Where did it occur?
2. Check debug memory — Has this pattern occurred before?
3. Identify root cause — Why did the test fail?
4. Design fix — What specific change will resolve it?
5. Assess confidence — How sure are we this fix is correct?
Do NOT:
- Retry with random changes
- Skip failure analysis
- Apply fixes without understanding root cause
- Ignore patterns from debug memory
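Step 1 of the analysis protocol (parse the error message) can be sketched with two regular expressions. This is a rough illustration that assumes V8-style output of the form `ErrorType: message` followed by `at fn (file:line)` frames; `ParsedFailure` and `parseFailure` are hypothetical names, and real test output may need a more tolerant parser.

```typescript
// Hypothetical structure mirroring part of the failure_analysis record.
interface ParsedFailure {
  errorType: string;
  errorMessage: string;
  functionName: string | null;
  file: string | null;
  line: number | null;
}

// Extracts "<ErrorType>: <message>" and the first "at fn (file:line)" frame.
function parseFailure(output: string): ParsedFailure | null {
  const head = /^\s*(\w*Error): (.+)$/m.exec(output);
  if (head === null) {
    return null; // unrecognized output: analyze manually rather than guess
  }
  const frame = /^\s*at (\S+) \(([^():]+):(\d+)(?::\d+)?\)/m.exec(output);
  return {
    errorType: head[1],
    errorMessage: head[2].trim(),
    functionName: frame ? frame[1] : null,
    file: frame ? frame[2] : null,
    line: frame ? parseInt(frame[3], 10) : null,
  };
}

const sampleOutput =
  "TypeError: Cannot read property 'length' of null\n" +
  "    at validateInput (src/utils/validate.ts:42)";
console.log(parseFailure(sampleOutput));
```

Returning `null` for unrecognized output is deliberate: it forces the caller to fall back to manual analysis instead of retrying with an unfounded fix.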
Phase 6: Fix and Verify
Apply targeted fix and re-execute:
fix_applied:
  description: "Added null check: if (!input) return { valid: false }"
  diff_summary: "+3/-0 lines"
  files_modified: ["src/utils/validate.ts"]
verification:
  command: "npx jest test/unit/utils/validate.test.ts --verbose"
  result: all_passing
  regression_check: no_new_failures
Regression guard: If previously passing tests start failing after a fix, ABORT. The fix introduced a regression.
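One way to implement the regression guard is to compare per-test outcomes before and after the fix, rather than comparing pass counts. The sketch below assumes test results are available as a name-to-boolean map; `TestOutcomes` and `findRegressions` are hypothetical names.

```typescript
// Maps test name -> whether it passed in a given run. Hypothetical shape.
type TestOutcomes = { [testName: string]: boolean };

// Returns tests that passed before the fix but fail (or are missing) after it.
function findRegressions(before: TestOutcomes, after: TestOutcomes): string[] {
  return Object.keys(before)
    .filter((t: string): boolean => before[t] === true && after[t] !== true)
    .sort();
}

const beforeFix: TestOutcomes = { test_basic_flow: true, test_null_input: false };
const afterFix: TestOutcomes = { test_basic_flow: false, test_null_input: true };
console.log(findRegressions(beforeFix, afterFix)); // [ 'test_basic_flow' ]
```

In this example the pass count is 1 of 2 both before and after the fix, so a count-based check would miss the regression that the name-based comparison catches.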
Phase 7: Completion or Escalation
On success:
completion:
  status: passed
  attempts_used: 2
  coverage_achieved: 92%
  debug_memory_updated: true
  learnings_recorded:
    - pattern: "Null check at module boundary"
      fix_template: "if (!input) return default"
On max attempts reached:
escalation:
  status: escalated
  attempts_used: 3
  include_in_report:
    - original_code
    - all_test_results
    - failure_analyses
    - fix_attempts
    - debug_memory_summary
  notification:
    channel: issue_comment
    message: |
      ## Execution Feedback Escalation
      **File**: src/utils/validate.ts
      **Attempts**: 3/3
      ### Failures
      [Summary of persistent failures]
      ### Analysis
      [Root cause analysis across attempts]
      ### Attempted Fixes
      [What was tried and why it didn't work]
      **Human review required**
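The escalation message can be assembled from the recorded attempts. The following sketch fills in the template above; `FixAttempt` and `formatEscalation` are hypothetical names, and the per-attempt fields are an assumption about what the loop has recorded by this point.

```typescript
// Hypothetical per-attempt record used to fill in the template.
interface FixAttempt {
  attempt: number;
  passed: number;
  total: number;
  rootCause: string;
  fix: string;
}

function formatEscalation(file: string, attempts: FixAttempt[], maxAttempts: number): string {
  const lines: string[] = [
    "## Execution Feedback Escalation",
    "**File**: " + file,
    "**Attempts**: " + attempts.length + "/" + maxAttempts,
    "### Failures",
  ];
  attempts.forEach((a: FixAttempt): void => {
    lines.push("- Attempt " + a.attempt + ": " + a.passed + "/" + a.total + " tests passed");
  });
  lines.push("### Analysis");
  attempts.forEach((a: FixAttempt): void => {
    lines.push("- Attempt " + a.attempt + ": " + a.rootCause);
  });
  lines.push("### Attempted Fixes");
  attempts.forEach((a: FixAttempt): void => {
    lines.push("- Attempt " + a.attempt + ": " + a.fix);
  });
  lines.push("**Human review required**");
  return lines.join("\n");
}

const demoReport = formatEscalation(
  "src/utils/validate.ts",
  [
    { attempt: 1, passed: 4, total: 10, rootCause: "Race condition in async handler", fix: "Add mutex" },
    { attempt: 2, passed: 6, total: 10, rootCause: "Mutex scope too narrow", fix: "Widen lock scope" },
    { attempt: 3, passed: 7, total: 10, rootCause: "Underlying architecture issue", fix: "None applied" },
  ],
  3
);
console.log(demoReport);
```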
Debug Memory
Structure
Debug memory persists across sessions in `.aiwg/ralph/debug-memory/`:
.aiwg/ralph/debug-memory/
├── session-abc123.yaml       # Individual session records
├── session-def456.yaml
└── patterns/
    └── learned-patterns.yaml # Cross-session learnings
Session Record
Each session records the full execution history:
session_id: "abc123"
file_path: "src/utils/validate.ts"
status: passed
created_at: "2026-01-25T10:00:00Z"
executions:
  - attempt: 1
    test_results:
      passed: 6
      failed: 2
    failures:
      - test: "test_null_input"
        error_type: "TypeError"
        root_cause: "Missing null check"
    fix_applied: "Added null guard"
  - attempt: 2
    test_results:
      passed: 8
      failed: 0
learnings:
  patterns_identified:
    - pattern: "Null check missing at module boundary"
      frequency: 1
      fix_template: "Add null/undefined guard at function entry"
Cross-Session Learning
Agents should query debug memory before generating code:
pre_generation_query:
  file: "src/utils/validate.ts"
  module: "src/utils/*"
results:
  past_sessions: 3
  common_patterns:
    - pattern: "Null check missing"
      frequency: 2
    - pattern: "Edge case not handled"
      frequency: 1
  recommended_preemptive_actions:
    - "Include null/undefined guards"
    - "Add edge case tests for empty string and whitespace"
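A query like the one above amounts to filtering session records by module and summing pattern frequencies. The sketch below works on in-memory records and is only an illustration: `SessionRecord`, `LearnedPattern`, `QueryResult`, and `queryDebugMemory` are hypothetical names, the session IDs and paths are made up, and loading the YAML files from disk is left out.

```typescript
interface LearnedPattern { pattern: string; frequency: number; fixTemplate: string; }
interface SessionRecord { sessionId: string; filePath: string; patterns: LearnedPattern[]; }
interface QueryResult {
  pastSessions: number;
  commonPatterns: { pattern: string; frequency: number }[];
  recommendedActions: string[];
}

// modulePrefix is a directory prefix such as "src/utils/" (assumed convention).
function queryDebugMemory(sessions: SessionRecord[], modulePrefix: string): QueryResult {
  const relevant = sessions.filter((s: SessionRecord): boolean => s.filePath.indexOf(modulePrefix) === 0);
  const totals: { [pattern: string]: number } = {};
  const templates: { [pattern: string]: string } = {};
  const has = (o: object, k: string): boolean => Object.prototype.hasOwnProperty.call(o, k);
  relevant.forEach((s: SessionRecord): void => {
    s.patterns.forEach((p: LearnedPattern): void => {
      totals[p.pattern] = (has(totals, p.pattern) ? totals[p.pattern] : 0) + p.frequency;
      templates[p.pattern] = p.fixTemplate;
    });
  });
  // Most frequent patterns first; ties broken alphabetically for stable output.
  const commonPatterns = Object.keys(totals)
    .map((k: string) => ({ pattern: k, frequency: totals[k] }))
    .sort((a, b) => b.frequency - a.frequency || (a.pattern < b.pattern ? -1 : 1));
  return {
    pastSessions: relevant.length,
    commonPatterns: commonPatterns,
    recommendedActions: commonPatterns.map((c) => templates[c.pattern]),
  };
}

const nullGuard: LearnedPattern = { pattern: "Null check missing", frequency: 1, fixTemplate: "Include null/undefined guards" };
const edgeCase: LearnedPattern = { pattern: "Edge case not handled", frequency: 1, fixTemplate: "Add edge case tests" };
const demoSessions: SessionRecord[] = [
  { sessionId: "abc123", filePath: "src/utils/validate.ts", patterns: [nullGuard] },
  { sessionId: "def456", filePath: "src/utils/format.ts", patterns: [nullGuard, edgeCase] },
  { sessionId: "ghi789", filePath: "src/auth/login.ts", patterns: [edgeCase] },
];
const demoQuery = queryDebugMemory(demoSessions, "src/utils/");
console.log(demoQuery.pastSessions, demoQuery.commonPatterns[0].pattern); // 2 Null check missing
```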
Integration with Agent Loop
When the executable feedback loop runs inside an agent loop:
ralph_integration:
  # Every agent loop iteration includes code execution
  execution_gate:
    require_passing_tests: true
    allow_skip: false
  # Debug memory persists across agent loop iterations
  debug_memory:
    persist_per_iteration: true
    cross_iteration_learning: true
  # Test pass rate contributes to the agent loop progress metric
  progress_metric:
    include_test_pass_rate: true
    weight: 0.3
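This guide does not define how `weight` enters the progress metric. One plausible reading, stated here purely as an assumption, is a linear blend in which the test pass rate contributes `weight` and all other progress signals contribute the remainder. `blendProgress` and `otherSignals` are hypothetical names.

```typescript
// Assumed formula: progress = (1 - weight) * otherSignals + weight * passRate.
// otherSignals and the result are both on a 0..1 scale.
function blendProgress(otherSignals: number, passed: number, total: number, weight: number): number {
  const passRate = total > 0 ? passed / total : 0; // no tests counts as zero progress
  return (1 - weight) * otherSignals + weight * passRate;
}

// 6 of 8 tests passing with other signals at 0.5:
// 0.7 * 0.5 + 0.3 * 0.75, which is approximately 0.575.
const demoProgress = blendProgress(0.5, 6, 8, 0.3);
console.log(demoProgress.toFixed(3));
```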
Agent loop + executable feedback flow:
Agent loop iteration 1:
├─ Generate code
├─ Run executable feedback loop (up to 3 attempts)
├─ Tests pass? → Agent loop marks progress
└─ Tests fail after 3 attempts? → Agent loop escalates
Agent loop iteration 2:
├─ Load debug memory from iteration 1
├─ Generate improved code (using learnings)
├─ Run executable feedback loop
└─ Continue...
Escalation Scenarios
Scenario 1: Simple Fix (Attempt 2 Success)
Attempt 1: 6/8 tests pass
Analysis: Missing null check → Fix: Add guard clause
Attempt 2: 8/8 tests pass ✓
Result: Return code to user
Scenario 2: Complex Bug (Escalation After 3 Attempts)
Attempt 1: 4/10 tests pass
Analysis: Race condition in async handler → Fix: Add mutex
Attempt 2: 6/10 tests pass (improved but not fixed)
Analysis: Mutex scope too narrow → Fix: Widen lock scope
Attempt 3: 7/10 tests pass (still failing)
Analysis: Underlying architecture issue
Result: ESCALATE to human with full context
Scenario 3: Regression Detected (Immediate Abort)
Attempt 1: 8/10 tests pass
Analysis: Edge case not handled → Fix: Add boundary check
Attempt 2: 7/10 tests pass (REGRESSION: test_basic_flow now failing)
Result: ABORT — fix introduced regression
Action: Revert to attempt 1 code, escalate
Metrics Dashboard
Track these metrics to monitor feedback loop effectiveness:
| Metric | Target | Current | Status |
|---|---|---|---|
| First-attempt pass rate | >70% | — | — |
| Average attempts to pass | <2.0 | — | — |
| Escalation rate | <10% | — | — |
| Debug memory reuse rate | >30% | — | — |
| Coverage met rate | >90% | — | — |
Metric Definitions
- First-attempt pass rate: % of code that passes all tests on first try
- Average attempts to pass: Mean attempts before success (excluding escalations)
- Escalation rate: % of workflows that exhaust all attempts
- Debug memory reuse rate: % of sessions that benefit from past learnings
- Coverage met rate: % of code meeting minimum coverage requirements
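The definitions above translate directly into arithmetic over completed workflows. The sketch below is one possible calculation; `WorkflowRecord`, `LoopMetrics`, `computeMetrics`, and the `reusedDebugMemory` and `coverageMet` flags are hypothetical, and how a session is judged to have "benefited" from past learnings is left to the caller.

```typescript
// Hypothetical summary of one completed feedback-loop workflow.
interface WorkflowRecord {
  attemptsUsed: number;
  outcome: "passed" | "escalated";
  reusedDebugMemory: boolean;
  coverageMet: boolean;
}

interface LoopMetrics {
  firstAttemptPassRate: number;   // percent
  averageAttemptsToPass: number;  // mean over passed workflows only
  escalationRate: number;         // percent
  debugMemoryReuseRate: number;   // percent
  coverageMetRate: number;        // percent
}

function computeMetrics(records: WorkflowRecord[]): LoopMetrics {
  const n = records.length;
  const pct = (count: number): number => (n === 0 ? 0 : (100 * count) / n);
  const passedRecs = records.filter((r: WorkflowRecord): boolean => r.outcome === "passed");
  const totalAttempts = passedRecs.reduce((sum: number, r: WorkflowRecord): number => sum + r.attemptsUsed, 0);
  return {
    firstAttemptPassRate: pct(passedRecs.filter((r: WorkflowRecord): boolean => r.attemptsUsed === 1).length),
    averageAttemptsToPass: passedRecs.length === 0 ? 0 : totalAttempts / passedRecs.length,
    escalationRate: pct(records.filter((r: WorkflowRecord): boolean => r.outcome === "escalated").length),
    debugMemoryReuseRate: pct(records.filter((r: WorkflowRecord): boolean => r.reusedDebugMemory).length),
    coverageMetRate: pct(records.filter((r: WorkflowRecord): boolean => r.coverageMet).length),
  };
}

const demoRecords: WorkflowRecord[] = [
  { attemptsUsed: 1, outcome: "passed", reusedDebugMemory: true, coverageMet: true },
  { attemptsUsed: 2, outcome: "passed", reusedDebugMemory: false, coverageMet: true },
  { attemptsUsed: 1, outcome: "passed", reusedDebugMemory: false, coverageMet: true },
  { attemptsUsed: 3, outcome: "escalated", reusedDebugMemory: true, coverageMet: false },
];
const demoMetrics = computeMetrics(demoRecords);
console.log(demoMetrics.firstAttemptPassRate, demoMetrics.escalationRate); // 50 25
```

Escalated workflows are excluded from the average-attempts figure but stay in the denominator of every percentage, matching the definitions above.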
Troubleshooting
"Tests keep failing with the same error"
1. Check if the root cause analysis is correct
2. Review debug memory for similar patterns
3. Verify the fix actually addresses the root cause (not a symptom)
4. If 3 attempts fail: escalate — the issue may be architectural
"Coverage requirement not met"
1. Check if test generation is comprehensive enough
2. Verify coverage tool is configured correctly
3. Add tests for uncovered branches/paths
4. For refactors: compare against original coverage baseline
"Regression detected after fix"
1. ABORT immediately — do not attempt further fixes
2. Revert to pre-fix code state
3. Analyze what the fix broke
4. Escalate with both the original failure and regression context
"Debug memory not finding relevant patterns"
1. Check file path matching (exact vs module-level)
2. Widen search to module or directory level
3. Check error type matching (exact vs category)
4. Debug memory may not have enough history yet
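Steps 1 and 2 (exact match first, then widen to the module) can be sketched as a two-level lookup. This is an illustration under the assumption that "module" means "same directory"; `MatchLevel`, `matchLevel`, and `findRelevant` are hypothetical names, and the paths in the example are made up.

```typescript
type MatchLevel = "exact" | "module" | "none";

// Directory part of a path, including the trailing slash ("" if there is none).
function dirOf(p: string): string {
  return p.substring(0, p.lastIndexOf("/") + 1);
}

function matchLevel(recordedPath: string, queryPath: string): MatchLevel {
  if (recordedPath === queryPath) return "exact";
  const dir = dirOf(queryPath);
  return dir !== "" && dirOf(recordedPath) === dir ? "module" : "none";
}

// Prefers exact matches and falls back to module-level matches.
function findRelevant(recordedPaths: string[], queryPath: string): { level: MatchLevel; paths: string[] } {
  const exact = recordedPaths.filter((p: string): boolean => matchLevel(p, queryPath) === "exact");
  if (exact.length > 0) return { level: "exact", paths: exact };
  const sameModule = recordedPaths.filter((p: string): boolean => matchLevel(p, queryPath) === "module");
  return sameModule.length > 0 ? { level: "module", paths: sameModule } : { level: "none", paths: [] };
}

const recordedPaths = ["src/utils/format.ts", "src/auth/login.ts"];
console.log(findRelevant(recordedPaths, "src/utils/validate.ts").level); // module
```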
Anti-Patterns
1. Skipping Execution
# WRONG — returning code without testing
Generate code → Return to user
# RIGHT — always execute before return
Generate code → Execute tests → Verify → Return to user
2. Random Retry
# WRONG — no analysis before retry
Test failed → Change something random → Retry
# RIGHT — structured analysis then targeted fix
Test failed → Analyze root cause → Design targeted fix → Retry
3. Ignoring Debug Memory
# WRONG — starting fresh every time
Generate code → Hit same null-check bug → Fix → Next session → Same bug
# RIGHT — learn from history
Check debug memory → Known null-check pattern → Include guard → Tests pass first try
4. Exceeding Retry Limit
# WRONG — infinite retries
max_attempts: 999 # Will waste tokens on unfixable issues
# RIGHT — bounded retries with escalation
max_attempts: 3
escalation_on_max: true # Human takes over
Schema Reference
The executable feedback loop conforms to:
- `@$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/schemas/flows/executable-feedback.yaml` — Workflow schema
- `@$AIWG_ROOT/agentic/code/addons/ralph/schemas/debug-memory.yaml` — Debug memory schema
- `@$AIWG_ROOT/agentic/code/addons/ralph/schemas/actionable-feedback.yaml` — Feedback schema
- `@$AIWG_ROOT/agentic/code/addons/ralph/schemas/iteration-analytics.yaml` — Analytics schema
References
- `@.aiwg/research/findings/REF-013-metagpt.md` — MetaGPT research findings
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/rules/executable-feedback.md — Executable feedback rules
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/schemas/flows/executable-feedback.yaml — Workflow schema
- @$AIWG_ROOT/agentic/code/addons/ralph/schemas/debug-memory.yaml — Debug memory schema
- @$AIWG_ROOT/agentic/code/addons/ralph/docs/reflection-memory-guide.md — Related: Reflexion memory guide
- #101 — Implementation issue