Reflection & Memory Guide

Comprehensive guide to Al's episodic memory system based on the Reflexion framework.

Reflexion Episodic Memory Guide

Comprehensive guide to Al's episodic memory system based on the Reflexion framework.

Overview

Al implements Reflexion's three-model architecture for verbal reinforcement learning:

1. Actor (Ma) - Executes actions and generates code changes

2. Evaluator (Me) - Verifies results using external tools (npm test, tsc, eslint)

3. Self-Reflection (Msr) - Analyzes failures and generates actionable insights

After each failed iteration, Al generates a structured reflection stored in episodic memory. These reflections are injected into retry attempts, enabling learning without model retraining.

Theoretical Foundation

Research Basis: REF-021 Reflexion - Language Agents with Verbal Reinforcement Learning (NeurIPS 2023)

Key Results:

  • 91% pass@1 on HumanEval (surpasses GPT-4's 80% baseline)
  • +24% task success through verbal reinforcement
  • Learning occurs at inference time through context injection

Core Insight: Natural language reflections provide more actionable guidance than scalar rewards, enabling rapid learning through episodic memory.

See `@$AIWG_ROOT/docs/references/REF-021-reflexion-verbal-reinforcement.md` for complete research documentation.

Three-Model Architecture

1. Actor Model (Ma)

Role: Generates text and actions based on current state and episodic memory.

Policy: `πθ(at|st)` where `θ = {Ma, mem}`

  • `at` = action or text generation at time t
  • `st` = current state (task + trajectory history)
  • `mem` = episodic memory buffer (sliding window of reflections)

Implementation in Al:

interface ActorOutput {
  actions: Action[];           // Sequence of actions taken
  rationale: string;           // Reasoning for approach
  strategy: string;            // High-level strategy
  files_modified: string[];    // Files touched
  total_changes: ChangeStats;  // Aggregate statistics
}

Actor Variants:

  • Chain-of-Thought (CoT): Step-by-step reasoning for single-generation tasks
  • ReAct: Interleaved reasoning and acting for multi-step tasks (Al's default)

2. Evaluator Model (Me)

Role: Scores generated outputs to produce reward signal.

Evaluation Strategies by Task Type:

Task TypeEvaluation MethodAl Implementation
ProgrammingUnit tests + execution`npm test`, `npm run test:coverage`
Type SafetyCompilation checks`tsc --noEmit`
Code QualityLinting`eslint`, `markdownlint`
IntegrationExternal API callsGitea API responses
CombinedMultiple toolsAll of the above

Implementation in Al:

interface EvaluatorOutput {
  passed: boolean;                    // Overall pass/fail
  verification_type: VerificationType; // Type of verification
  results: VerificationResult[];      // Individual tool results
  errors: StructuredError[];          // Parsed error information
  reward_signal: number;              // Scalar reward [0.0, 1.0]
  metrics: VerificationMetrics;       // Quantitative metrics
}

Reward Signal Calculation:

// Example: Combined verification
reward = (
  (tests_passed / tests_total) * 0.5 +
  (type_errors === 0 ? 0.3 : 0.0) +
  (lint_errors === 0 ? 0.2 : 0.0)
);

3. Self-Reflection Model (Msr)

Role: Converts sparse rewards into detailed verbal feedback.

Input: `{trajectory τt, reward rt, episodic memory mem}`

Output: Natural language reflection containing:

1. Credit assignment - Identification of specific failing actions

2. Causal reasoning - Explanation of why actions led to failure

3. Actionable insights - Concrete suggestions for improvement

Implementation in Al:

interface SelfReflection {
  reflection_text: string;           // First-person narrative reflection
  credit_assignment: {
    failing_action_indices: number[]; // Which actions failed
    root_cause: string;                // Identified root cause
    failure_category: FailureCategory; // Error classification
  };
  causal_reasoning: string;          // Why failure occurred
  actionable_insights: string[];     // What to do next
  lessons_learned: string[];         // General lessons
  confidence: number;                // Self-assessed confidence [0.0, 1.0]
  related_reflections: number[];     // Previous relevant reflections
}

Example Reflection (from schema):

"In my previous attempt, I added tests for the login function but didn't account for edge cases where the API response might be empty or undefined. The error 'Cannot read property map of undefined' occurred because I tried to call .map() on userData without first checking if it exists. In my next attempt, I will add null checks before accessing userData properties."

Memory Architecture

Short-Term Memory

Current trajectory history: `τt = [a0, o0, ..., ai, oi]`

  • Represents immediate context and recent decisions
  • Stored in current iteration state

Long-Term Memory (Episodic Buffer)

Reflection storage: `mem = [sr0, sr1, ..., srt]`

  • Maximum capacity Ω (omega) respects context limits
  • Most recent experiences inform future decisions
  • Provides "lessons learned" across trials

Storage Location: `.aiwg/ralph/reflections/<loop-id>/`

Sliding Window Behavior:

Ω=3 example (keep last 3 reflections):

Iteration 0 fails → reflection 001.json (in context)
Iteration 1 fails → reflection 002.json (in context)
Iteration 2 fails → reflection 003.json (in context)
Iteration 3 fails → reflection 004.json (in context)
                     001.json excluded from context
Iteration 4 fails → reflection 005.json (in context)
                     002.json excluded from context

Memory Operations:

// Initialize
let mem: Reflection[] = [];

// After each failed trial
async function afterTrial(iteration: number, trajectory: Trajectory, reward: number) {
  // 1. Generate reflection
  const reflection = await generateReflection(trajectory, reward, mem);

  // 2. Append to memory
  mem.push(reflection);

  // 3. Truncate to Ω capacity
  if (mem.length > OMEGA_CAPACITY) {
    mem = mem.slice(-OMEGA_CAPACITY);
  }

  // 4. Persist to disk
  await saveReflection(reflection);

  // 5. Update metadata
  await updateMemoryMetadata({
    omega_capacity: OMEGA_CAPACITY,
    current_memory_size: mem.length,
    reflections_in_context: mem.map(r => r.iteration),
    total_reflections_generated: iteration + 1
  });
}

When Reflections Are Generated

Reflections are created after failed verification only:

OutcomeGenerate Reflection?Next Action
All verifications passNOComplete loop successfully
Some verifications failYESGenerate reflection → retry
Max iterations reachedYESGenerate final reflection → abort
Critical errorYESGenerate error reflection → abort

Verification Flow:

Attempt → Execute actions → Verify results
                               ↓
                          All passed?
                          /         \
                        YES          NO
                         ↓            ↓
                      Success    Generate reflection
                                      ↓
                                 Add to memory
                                      ↓
                                 Inject into next attempt
                                      ↓
                                    Retry

Reflection Prompt Template

System Prompt for Self-Reflection Model:

You are the Self-Reflection component in a Reflexion-based learning system.

You will be given:
1. Your previous implementation attempt (actions taken)
2. Verification results from external tools (tests, linters, type checker)
3. Your past reflections from similar failures (if any)

Your task is to analyze what went wrong and provide actionable guidance for the next attempt.

## Reflection Structure

Write a first-person narrative reflection that includes:

1. **Credit Assignment**: Which specific actions or code changes caused the failure?
2. **Causal Reasoning**: Why did these actions lead to failure? What was the underlying issue?
3. **Actionable Insights**: What concrete steps should be taken in the next attempt?

Be specific. Avoid generic advice like "be more careful" - instead, identify exact code patterns, missing checks, or logic errors.

## Example Reflection

"In my previous attempt, I tried to map over userData without checking if it exists. The error occurred because the API response was empty in the test case. I should add a null check before the map operation. In the next attempt, I will verify userData exists and return an empty array if it doesn't."

## Previous Reflections

{{#each previous_reflections}}
### Iteration {{this.iteration}}
{{this.reflection_text}}

Lessons learned:
{{#each this.lessons_learned}}
- {{this}}
{{/each}}
{{/each}}

## Current Failure

**Task**: {{task_description}}

**Actions Taken**:
{{#each actions}}
{{@index}}. {{this.description}}
   File: {{this.file_path}}
   Changes: +{{this.changes.additions}} -{{this.changes.deletions}}
{{/each}}

**Verification Results**:
{{#each verification_results}}
- {{this.tool}}: {{this.status}}
  {{#if this.stderr}}
  Error: {{this.stderr}}
  {{/if}}
{{/each}}

**Errors**:
{{#each errors}}
- {{this.type}}: {{this.message}}
  Location: {{this.file}}:{{this.line}}
{{/each}}

Now write your reflection following the structure above.

How to Query Past Reflections

Loading Reflections for Current Task

import { loadReflections } from '@/ralph/memory';

// Load all reflections for a loop
const reflections = await loadReflections('ralph-task-123');

// Get reflections in current window (respects Ω)
const activeReflections = reflections.filter(r =>
  r.memory_metadata.reflections_in_context.includes(r.iteration)
);

// Inject into retry prompt
const context = buildRetryContext({
  task: currentTask,
  previousReflections: activeReflections.map(r => r.self_reflection.reflection_text),
  failedActions: currentFailure.actor_output.actions,
  errors: currentFailure.evaluator_output.errors
});

Cross-Task Learning Patterns

Find similar failures across loops:

import { searchReflections } from '@/ralph/memory';

// Query by failure category
const similarFailures = await searchReflections({
  failure_category: 'edge_case_miss',
  min_confidence: 0.8,
  limit: 5
});

// Extract lessons learned
const lessons = similarFailures.flatMap(r =>
  r.self_reflection.lessons_learned
);

// Inject as general knowledge
const enhancedContext = {
  ...baseContext,
  prior_knowledge: lessons
};

Analyze improvement patterns:

import { analyzePerformance } from '@/ralph/analytics';

// Track reward progression
const loopHistory = await loadReflections('ralph-task-123');
const rewards = loopHistory.map(r => r.evaluator_output.reward_signal);

// Calculate learning rate
const learningRate = (rewards[rewards.length - 1] - rewards[0]) / rewards.length;

// Identify breakthrough moments
const improvements = loopHistory.filter(r =>
  r.performance_delta?.is_improvement === true
);

Example: Learning from Past API Integration Failures

// Scenario: New API integration task
const task = "Integrate Gitea webhook API";

// Step 1: Find past API-related reflections
const apiReflections = await searchReflections({
  task_keywords: ['API', 'integration', 'webhook'],
  failure_category: ['edge_case_miss', 'integration_error'],
  min_confidence: 0.7
});

// Step 2: Extract common lessons
const commonPatterns = extractPatterns(apiReflections, {
  min_frequency: 2, // Lesson appears in ≥2 reflections
  categories: ['actionable_insights', 'lessons_learned']
});

// Step 3: Inject as prior knowledge
const taskContext = {
  task_description: task,
  prior_api_lessons: commonPatterns,
  similar_successes: apiReflections.filter(r =>
    r.evaluator_output.passed === true
  )
};

// Step 4: Execute with enhanced context
const result = await executeWithContext(taskContext);

Memory Capacity Tuning (Ω Parameter)

Choosing Ω based on task complexity:

Task TypeRecommended ΩRationale
Simple programming (single function)1Clear failure modes, quick fixes
Complex programming (multi-file)3Multiple error types, iterative refinement
Decision-making (multi-step)3Long trajectories, credit assignment needed
Reasoning (multi-hop)3Complex causal chains
Research/exploration5+Experimental, may exceed context limits

Empirical Evidence from Reflexion Paper:

  • HumanEval (programming): Ω=1 optimal
  • AlfWorld (decision-making): Ω=3 optimal
  • HotPotQA (reasoning): Ω=3 optimal

AIWG Defaults:

const OMEGA_DEFAULTS = {
  'unit_tests': 1,           // Simple test failures
  'integration_tests': 3,    // Complex integration issues
  'type_check': 1,           // Type errors are usually clear
  'lint': 1,                 // Lint errors are specific
  'combined': 3,             // Multiple verification types
  'manual_review': 5         // Subjective feedback needs history
};

Dynamic Tuning:

// Adjust Ω based on failure diversity
function calculateOptimalOmega(reflections: Reflection[]): number {
  const uniqueCategories = new Set(
    reflections.map(r => r.self_reflection.credit_assignment.failure_category)
  );

  // More diverse failures → larger window
  if (uniqueCategories.size >= 5) return 5;
  if (uniqueCategories.size >= 3) return 3;
  return 1;
}

Performance Analysis

Metrics to Track

Individual Reflection Quality:

  • `self_reflection.confidence` - Self-assessed accuracy
  • `performance_delta.is_improvement` - Did next iteration improve?
  • Correlation between confidence and actual improvement

Loop Performance:

  • Reward progression: `[r0, r1, r2, ..., rn]`
  • Error count reduction over iterations
  • Time to success (iterations needed)
  • Failure category distribution

Cross-Loop Learning:

  • Reuse rate of lessons learned
  • Success rate on tasks similar to past failures
  • Time to success improvement on repeated task types

Example Analysis Script

import { loadReflections, analyzeLoopPerformance } from '@/ralph/analytics';

async function analyzeRalphLearning() {
  // Load all completed loops
  const loops = await loadAllLoops();

  // Analyze each loop
  const analyses = await Promise.all(
    loops.map(async loop => {
      const reflections = await loadReflections(loop.id);

      return {
        loop_id: loop.id,
        total_iterations: reflections.length,
        final_success: loop.status === 'completed',
        learning_curve: reflections.map(r => r.evaluator_output.reward_signal),
        failure_categories: reflections.map(r =>
          r.self_reflection.credit_assignment.failure_category
        ),
        lessons_count: reflections.reduce((sum, r) =>
          sum + r.self_reflection.lessons_learned.length, 0
        ),
        avg_confidence: reflections.reduce((sum, r) =>
          sum + (r.self_reflection.confidence || 0), 0
        ) / reflections.length
      };
    })
  );

  // Aggregate insights
  const totalLearningRate = analyses.reduce((sum, a) => {
    const curve = a.learning_curve;
    const rate = curve.length > 1
      ? (curve[curve.length - 1] - curve[0]) / curve.length
      : 0;
    return sum + rate;
  }, 0) / analyses.length;

  console.log('Al Learning Analysis:');
  console.log(`- Total loops: ${analyses.length}`);
  console.log(`- Success rate: ${analyses.filter(a => a.final_success).length / analyses.length}`);
  console.log(`- Avg learning rate: ${totalLearningRate.toFixed(3)}`);
  console.log(`- Avg iterations: ${analyses.reduce((s, a) => s + a.total_iterations, 0) / analyses.length}`);
}

Integration with Al External

Al's external loop implementation (`tools/ralph-external/`) uses episodic memory for recovery:

Integration Points:

1. Initialization - Load past reflections if resuming

2. Iteration Start - Inject active reflections into context

3. Verification Failure - Generate and store reflection

4. Retry Attempt - Include reflection in next iteration's prompt

5. Completion - Analyze reflection quality and learning curve

File Mapping:

tools/ralph-external/
├── core/
│   ├── memory.ts         # Reflection loading/saving
│   ├── reflection.ts     # Reflection generation
│   └── evaluator.ts      # External verification
├── prompts/
│   ├── actor.hbs         # Includes {{previous_reflections}}
│   └── reflection.hbs    # Self-reflection template
└── state/
    └── <loop-id>/
        └── reflections/  # Symlink to .aiwg/ralph/reflections/<loop-id>/

Reflection Injection Example:

// tools/ralph-external/core/actor.ts
import { loadActiveReflections } from './memory';

async function executeIteration(loopId: string, iteration: number) {
  // Load reflections in window
  const reflections = await loadActiveReflections(loopId);

  // Build context with reflections
  const context = {
    task: currentTask,
    iteration,
    previous_reflections: reflections.map(r => ({
      iteration: r.iteration,
      reflection_text: r.self_reflection.reflection_text,
      lessons_learned: r.self_reflection.lessons_learned,
      actionable_insights: r.self_reflection.actionable_insights
    })),
    previous_errors: reflections.flatMap(r =>
      r.evaluator_output.errors
    )
  };

  // Execute with enhanced context
  const result = await actor.execute(context);

  return result;
}

Best Practices

Writing Quality Reflections

DO:

  • ✅ Use first person ("I tried...", "In my next attempt...")
  • ✅ Be specific about failing actions (cite line numbers, function names)
  • ✅ Explain causal chain (X led to Y because Z)
  • ✅ Provide concrete next steps ("Add null check at line 23")
  • ✅ Reference previous reflections when applicable
  • ✅ Assess confidence honestly

DON'T:

  • ❌ Write generic advice ("Be more careful", "Test thoroughly")
  • ❌ Blame external factors without analysis
  • ❌ Repeat previous reflections verbatim
  • ❌ Ignore verification errors in output
  • ❌ Claim high confidence without evidence

Optimizing Memory Usage

Context Length Management:

// Estimate reflection size
function estimateTokens(reflection: Reflection): number {
  const text = reflection.self_reflection.reflection_text;
  const insights = reflection.self_reflection.actionable_insights.join(' ');
  return Math.ceil((text + insights).length / 4); // Rough estimate
}

// Ensure reflections fit in context window
function pruneReflectionsToFit(
  reflections: Reflection[],
  maxTokens: number
): Reflection[] {
  const sorted = reflections.sort((a, b) => b.iteration - a.iteration);
  let totalTokens = 0;
  const result = [];

  for (const r of sorted) {
    const tokens = estimateTokens(r);
    if (totalTokens + tokens > maxTokens) break;
    result.push(r);
    totalTokens += tokens;
  }

  return result.reverse(); // Maintain chronological order
}

Debugging Reflection Quality

Low-Quality Reflection Indicators:

  • Confidence < 0.5 but is_improvement = false
  • No actionable insights provided
  • Reflection text < 100 characters
  • No credit assignment identified

Improvement Strategies:

1. Enhance reflection prompt with more context

2. Include specific error details in prompt

3. Show examples of high-quality reflections

4. Require minimum reflection length

5. Validate reflection structure before saving

Validation

Schema Validation:

# Validate reflection against schema
npx ajv validate \
  -s agentic/code/addons/ralph/schemas/reflection-memory.json \
  -d .aiwg/ralph/reflections/ralph-task-123/001.json

Runtime Validation:

import Ajv from 'ajv';
import schema from '@/agentic/code/addons/ralph/schemas/reflection-memory.json';

const ajv = new Ajv();
const validate = ajv.compile(schema);

function validateReflection(reflection: unknown): Reflection {
  if (!validate(reflection)) {
    throw new Error(`Invalid reflection: ${ajv.errorsText(validate.errors)}`);
  }
  return reflection as Reflection;
}

References

AIWG Documentation

  • @$AIWG_ROOT/agentic/code/addons/ralph/schemas/reflection-memory.json - JSON Schema definition
  • `@.aiwg/ralph/reflections/.gitkeep` - Directory structure documentation
  • `@.aiwg/ralph/reflections/example/001.json` - Example reflection
  • @$AIWG_ROOT/docs/references/REF-021-reflexion-verbal-reinforcement.md - Research foundation
  • @$AIWG_ROOT/tools/ralph-external/README.md - External loop implementation

Research References

  • Reflexion (NeurIPS 2023) - Shinn et al.
  • 91% HumanEval pass@1 (surpasses GPT-4 baseline)
  • Episodic memory with sliding window (Ω parameter)
  • Three-model architecture: Actor, Evaluator, Self-Reflection
  • arXiv: https://arxiv.org/abs/2303.11366
  • #94 - Parent epic: Reflexion Integration
  • #102 - This implementation
  • #103 - Self-reflection prompt optimization
  • #104 - Cross-task learning analytics

Document Version: 1.0.0

Last Updated: 2026-01-25

Status: IMPLEMENTED

Changelog

DateChange
2026-01-25Initial implementation following REF-021 specification