Discovery Agent Examples

Externalized from the agent definition per the few-shot-examples rule (#1587).

Example 1: Simple - Basic Semantic Search

Input:

"Search for papers on OAuth2 security best practices published in the last 3 years."

Thought (Goal):

I need to find recent papers on OAuth2 security from 2021-2024, focusing on best practices and security vulnerabilities.

Action:

Execute Semantic Scholar API query with parameters: query="OAuth2 security best practices", year_range="2021-2024", limit=100
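
The query in this action could be assembled as below. This is a minimal sketch, not the agent's actual client: it assumes the public Semantic Scholar Graph API paper-search endpoint, and it maps the example's `year_range` onto that API's `year` parameter. The `fields` list and the helper name `build_search_url` are illustrative assumptions; check them against the current API documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint: the public Semantic Scholar Graph API paper search.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query: str, year_range: str, limit: int = 100) -> str:
    """Return a GET URL for a keyword search restricted to a year range."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    params = {
        "query": query,
        "year": year_range,  # assumed API name for the example's year_range
        "limit": limit,
        # Assumed field selection covering the metadata shown in the output.
        "fields": "title,authors,year,venue,citationCount,externalIds,url",
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url("OAuth2 security best practices", "2021-2024")
print(url)
```

The sketch only builds the request URL, so it runs without network access; sending the request and handling rate limits are left out.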

Observation:

Found 87 papers matching criteria. Top result: "OAuth 2.0 Security Best Practices" (Smith et al., 2023) with 42 citations and relevance score 0.95.

Thought (Extraction):

Top papers focus on token rotation, PKCE extension, and authorization code interception. Venue distribution: 60% A-tier conferences, 30% journals, 10% workshops.

Output:

{
  "query": "OAuth2 security best practices",
  "timestamp": "2026-01-25T10:30:00Z",
  "total_results": 87,
  "filters_applied": {
    "year_range": "2021-2024",
    "venue": "all"
  },
  "papers": [
    {
      "paper_id": "abc123def456",
      "title": "OAuth 2.0 Security Best Practices",
      "authors": ["Smith, J.", "Doe, A."],
      "year": 2023,
      "venue": "ACM CCS",
      "venue_tier": "A*",
      "citations": 42,
      "doi": "10.1145/3576915.3623456",
      "relevance_score": 0.95,
      "url": "https://www.semanticscholar.org/paper/abc123def456"
    }
  ],
  "gap_analysis": {
    "under_researched_topics": ["OAuth PKCE adoption rates", "Token refresh security"],
    "contradictory_findings": [],
    "missing_integrations": ["OAuth + WebAuthn integration patterns"]
  }
}
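
A record shaped like the output above can be sanity-checked before it is handed downstream. The sketch below is an assumption about what such a check might look like: the source does not say which keys are mandatory or that `relevance_score` is bounded to 0..1, so both choices (and the name `validate_discovery_output`) are hypothetical.

```python
# Hypothetical required keys, taken from the example output above.
REQUIRED_TOP_LEVEL = {
    "query", "timestamp", "total_results",
    "filters_applied", "papers", "gap_analysis",
}
REQUIRED_PAPER = {
    "paper_id", "title", "authors", "year",
    "venue", "citations", "relevance_score",
}


def validate_discovery_output(result: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the record passed."""
    problems = []
    missing = REQUIRED_TOP_LEVEL - set(result)
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
    papers = result.get("papers", [])
    for index, paper in enumerate(papers):
        missing = REQUIRED_PAPER - set(paper)
        if missing:
            problems.append(f"paper {index}: missing keys {sorted(missing)}")
        score = paper.get("relevance_score")
        # Assumption: relevance scores are normalized to the range 0..1.
        if not isinstance(score, (int, float)) or not 0 <= score <= 1:
            problems.append(f"paper {index}: relevance_score out of range")
    # The papers array is a page of results, so it cannot exceed the total.
    if result.get("total_results", 0) < len(papers):
        problems.append("total_results is smaller than the papers array")
    return problems


example = {
    "query": "OAuth2 security best practices",
    "timestamp": "2026-01-25T10:30:00Z",
    "total_results": 87,
    "filters_applied": {"year_range": "2021-2024", "venue": "all"},
    "papers": [{
        "paper_id": "abc123def456",
        "title": "OAuth 2.0 Security Best Practices",
        "authors": ["Smith, J.", "Doe, A."],
        "year": 2023,
        "venue": "ACM CCS",
        "citations": 42,
        "relevance_score": 0.95,
    }],
    "gap_analysis": {},
}
print(validate_discovery_output(example))
```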

Why This Is Good:

- Uses structured thought protocol (Goal → Action → Observation → Extraction)
- Queries appropriate API with specific parameters
- Returns comprehensive metadata including venue tier and DOI
- Includes gap analysis identifying under-researched areas
- Relevance score calculated from multiple ranking factors

Example 2: Moderate - Citation Network Traversal

Input:

"Find papers related to 'Retrieval-Augmented Generation' and explore their citation network to discover foundational work and recent applications."

Thought (Goal):

I need to find RAG papers and traverse both backward citations (foundational work) and forward citations (recent applications using RAG).

Action:

1. Query Semantic Scholar for "Retrieval-Augmented Generation"

2. Extract top 10 papers by citation count

3. Query backward citations (references) for foundational papers

4. Query forward citations (citing papers) for recent applications
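
Steps 2 to 4 can be sketched on an in-memory citation graph. This is an illustration under stated assumptions: the `references` mapping stands in for whatever the citation API returns, the function names are hypothetical, and the toy paper IDs are invented for the demo.

```python
from collections import defaultdict


def invert(references: dict[str, list[str]]) -> dict[str, set[str]]:
    """Build a cited-by index from a paper -> references mapping."""
    cited_by: dict[str, set[str]] = defaultdict(set)
    for paper, refs in references.items():
        for ref in refs:
            cited_by[ref].add(paper)
    return cited_by


def traverse(seeds, references, citation_counts, top_n=10):
    """Return (top seeds, backward citations, forward citations)."""
    # Step 2: keep the most-cited seed papers.
    top = sorted(seeds, key=lambda p: citation_counts.get(p, 0), reverse=True)[:top_n]
    cited_by = invert(references)
    known = set(seeds)
    # Step 3: backward citations are the papers the top seeds reference.
    backward = {ref for p in top for ref in references.get(p, [])} - known
    # Step 4: forward citations are the papers that cite the top seeds.
    forward = {c for p in top for c in cited_by.get(p, set())} - known
    return top, backward, forward


# Toy graph with made-up IDs: each key lists the papers it references.
references = {
    "rag": ["dpr", "bert"],
    "rag-qa": ["rag", "dpr"],
    "rag-code": ["rag"],
}
top, backward, forward = traverse(["rag"], references, {"rag": 1250})
print(top, sorted(backward), sorted(forward))
```

Subtracting the seed set keeps papers already found by the initial query out of both result sets.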

Observation:

- Initial query: 45 papers on RAG
- Top paper: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020, 1250 citations)
- Backward citations discovered: 15 foundational papers on dense retrieval and knowledge bases
- Forward citations discovered: 230 papers applying RAG to various domains

Thought (Reasoning):

The citation network shows that RAG builds on dense retrieval (DPR, 2020) and has since expanded into applications: question answering, dialog systems, code generation, and the medical domain. The forward citations show rapid growth after 2022, coinciding with LLM adoption.

Thought (Extraction):

Foundational cluster: Dense retrieval, BERT, T5

Application clusters: QA (45 papers), Dialog (30), Code gen (25), Medical (18), Legal (12)

Output:

# Citation Network Discovery: Retrieval-Augmented Generation

## Initial Query Results
- Query: "Retrieval-Augmented Generation"
- Papers found: 45
- Date range: 2020-2024

## Foundational Papers (Backward Citations)
1. "Dense Passage Retrieval for Open-Domain Question Answering" (Karpukhin et al., 2020) - 1850 citations
2. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2019) - 15,000+ citations
3. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (Raffel et al., 2020) - 3200 citations

## Recent Applications (Forward Citations - 230 papers)

### By Domain
- **Question Answering**: 45 papers
  - Medical QA, Legal QA, Scientific QA
- **Dialog Systems**: 30 papers
  - Conversational AI, Customer service bots
- **Code Generation**: 25 papers
  - GitHub Copilot-style systems, bug fixing
- **Domain-Specific**: 48 papers
  - Medical (18), Legal (12), Finance (10), Education (8)

## Gap Analysis
- **Under-researched**: RAG for multilingual tasks (only 3 papers)
- **Contradictory**: Retrieval granularity (document vs passage vs sentence)
- **Missing**: RAG + reinforcement learning from human feedback integration

## Snowball Discovery Impact
- Original query: 45 papers
- Citation network: +245 papers (15 backward, 230 forward)
- Total unique: 290 papers after deduplication

Why This Is Good:

- Demonstrates citation network traversal (backward + forward)
- Organizes results by domain clusters
- Quantifies discovery impact (original vs citation network)
- Identifies gaps in the citation network
- Shows reasoning about research trends and growth patterns
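
The discovery-impact figures in the output (45 original, 245 added, 290 total unique) follow from a set union. The sketch below uses placeholder IDs and a hypothetical helper name. It also makes one property of the example explicit: 45 + 15 + 230 = 290 only holds when the three sets do not overlap, and any shared paper lowers the total.

```python
def snowball_impact(original: set[str], backward: set[str], forward: set[str]) -> dict[str, int]:
    """Count how many new unique papers the citation network contributes."""
    network = (backward | forward) - original  # drop papers already found
    return {
        "original": len(original),
        "added": len(network),
        "total_unique": len(original | network),
    }


# Placeholder IDs sized to match the example's counts, with no overlap.
original = {f"seed-{i}" for i in range(45)}
backward = {f"back-{i}" for i in range(15)}
forward = {f"fwd-{i}" for i in range(230)}
print(snowball_impact(original, backward, forward))

# If one forward citation was already in the original query, the total drops.
overlapping = snowball_impact(original, backward, forward | {"seed-0"})
print(overlapping)
```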

Example 3: Complex - Systematic Review with Gap Detection

Input:

"Conduct a systematic search for papers on 'LLM agent safety and alignment' from 2022-2024. Identify gaps in evaluation methods, missing failure mode coverage, and contradictory findings on alignment techniques. Generate PRISMA-compliant documentation."

Thought (Goal):

I need to execute a systematic review protocol: preregistered search strategy, inclusion/exclusion criteria, quality assessment, gap detection across evaluation methods and failure modes.

Action:

1. Formulate boolean query: ("LLM" OR "large language model") AND ("agent" OR "agentic") AND ("safety" OR "alignment" OR "risk")

2. Search Semantic Scholar + arXiv with year filter 2022-2024

3. Apply inclusion criteria: peer-reviewed or arXiv preprint, English, minimum 5 citations

4. Deduplicate and rank by relevance

5. Perform topic clustering for gap detection

6. Document PRISMA protocol
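
The boolean query in step 1 can be generated from groups of synonyms, which keeps the search terms and the documented search string in sync. This is a minimal sketch with a hypothetical helper name; whether a given search backend honors this exact AND/OR syntax is an assumption to verify per database.

```python
def build_boolean_query(term_groups: list[list[str]]) -> str:
    """OR the terms inside each group, then AND the groups together."""
    clauses = []
    for group in term_groups:
        quoted = " OR ".join(f'"{term}"' for term in group)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


query = build_boolean_query([
    ["LLM", "large language model"],
    ["agent", "agentic"],
    ["safety", "alignment", "risk"],
])
print(query)
```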

Observation:

- Semantic Scholar: 152 papers
- arXiv: 89 papers
- After deduplication: 187 unique papers
- After inclusion criteria: 142 papers (45 excluded: <5 citations or non-English)
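
The deduplication and screening steps behind these counts can be sketched as follows. Everything here is illustrative: the `Record` fields, the title-based duplicate key, and the toy records are assumptions, and the relaxed citation threshold for recent arXiv preprints mentioned in the output is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    title: str
    source: str
    language: str
    citations: int


def normalize(title: str) -> str:
    """Case-fold and collapse whitespace so near-identical titles match."""
    return " ".join(title.lower().split())


def deduplicate(records: list[Record]) -> list[Record]:
    """Keep the first record seen for each normalized title."""
    seen: dict[str, Record] = {}
    for record in records:
        seen.setdefault(normalize(record.title), record)
    return list(seen.values())


def screen(records: list[Record], min_citations: int = 5):
    """Split records into included ones and (record, reason) exclusions."""
    included, excluded = [], []
    for record in records:
        if record.language != "en":
            excluded.append((record, "non-English"))
        elif record.citations < min_citations:
            excluded.append((record, f"<{min_citations} citations"))
        else:
            included.append(record)
    return included, excluded


# Made-up toy records; the second is a duplicate of the first.
records = [
    Record("Agent Safety Benchmarks", "semantic_scholar", "en", 40),
    Record("agent  safety benchmarks", "arxiv", "en", 12),
    Record("Tool Misuse in LLM Agents", "arxiv", "en", 3),
    Record("Sicherheit von LLM-Agenten", "semantic_scholar", "de", 20),
    Record("Red Teaming Agentic Systems", "arxiv", "en", 9),
]
unique = deduplicate(records)
included, excluded = screen(unique)
print(len(records), len(unique), len(included), [reason for _, reason in excluded])
```

Recording a reason with each exclusion is what lets the PRISMA flow report a breakdown such as 29 low-citation and 16 non-English records.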

Thought (Extraction):

Topic clusters identified via abstract similarity:

- Evaluation methods (35 papers): Benchmarks, red teaming, adversarial testing
- Alignment techniques (48 papers): RLHF, constitutional AI, debate
- Failure modes (22 papers): Jailbreaking, prompt injection, tool misuse
- Theory/frameworks (37 papers): Goal misalignment, corrigibility, value learning
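
The cluster sizes above translate directly into the percentage shares reported in the output. The sketch below computes them and flags sparse clusters; the 20% sparsity threshold and the function names are hypothetical, since the source does not state how a cluster is judged sparse.

```python
def cluster_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each cluster's share of all included papers, in percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


def sparse_clusters(counts: dict[str, int], threshold: float = 20.0) -> list[str]:
    """List clusters whose share falls below the (assumed) threshold."""
    return [name for name, share in cluster_shares(counts).items() if share < threshold]


counts = {
    "Evaluation methods": 35,
    "Alignment techniques": 48,
    "Failure modes": 22,
    "Theory/frameworks": 37,
}
print(cluster_shares(counts))
print(sparse_clusters(counts))
```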

Thought (Reasoning):

The evaluation methods cluster is well-researched (35 papers), but the failure modes cluster is sparse (22 papers). Only 3 papers cover autonomous replication risks, and only 4 address long-horizon planning failures. Contradictory findings exist on RLHF effectiveness for agentic behavior.

Thought (Exception):

Found inconsistency: papers in the "Alignment techniques" cluster use different baseline models, which makes comparison difficult. Some use GPT-3.5, others GPT-4, Claude 2, or open models. This limits reproducibility and synthesis.

Thought (Synthesis):

The field is heavily focused on evaluation and RLHF-style alignment but under-researches failure modes specific to agentic systems (long-horizon planning, tool misuse cascades, autonomous replication). Need more empirical work on failure taxonomies and standardized evaluation protocols.

Output:

# Systematic Review: LLM Agent Safety and Alignment

## Search Strategy (PRISMA-Compliant)

**Date Executed:** 2026-01-25
**Databases:** Semantic Scholar, arXiv
**Search Terms:**
- Boolean query: ("LLM" OR "large language model") AND ("agent" OR "agentic") AND ("safety" OR "alignment" OR "risk")
- Year range: 2022-2024

**Inclusion Criteria:**
- Publication year: 2022-2024
- Peer-reviewed conference/journal or arXiv preprint
- English language
- Minimum 5 citations (relaxed for arXiv <6 months old)
- Focus on safety or alignment of LLM-based agents

**Exclusion Criteria:**
- Non-English papers
- Opinion pieces without empirical evaluation
- Papers focused solely on capability improvement without safety consideration

## Results

**PRISMA Flow:**
- Records identified via Semantic Scholar: 152
- Records identified via arXiv: 89
- Total records after deduplication: 187
- Records screened: 187
- Records excluded: 45 (29 <5 citations, 16 non-English)
- **Final included: 142 papers**

## Topic Clustering (K-means, k=4)

### Cluster 1: Evaluation Methods (35 papers, 24.6%)
**Characteristics:**
- Focus: Benchmarks, red teaming, adversarial testing
- Prominent papers:
  - "Red Teaming Language Models with Language Models" (Perez et al., 2022) - 180 citations
  - "TruthfulQA: Measuring How Models Mimic Human Falsehoods" (Lin et al., 2022) - 250 citations

**Sub-topics:**
- Benchmark datasets (15 papers)
- Red teaming methodologies (12 papers)
- Automated adversarial testing (8 papers)

### Cluster 2: Alignment Techniques (48 papers, 33.8%)
**Characteristics:**
- Focus: RLHF, constitutional AI, debate, scalable oversight
- Prominent papers:
  - "Constitutional AI: Harmlessness from AI Feedback" (Bai et al., 2022) - 320 citations
  - "Training Language Models to Follow Instructions with Human Feedback" (Ouyang et al., 2022) - 1500+ citations

**Sub-topics:**
- RLHF methods (22 papers)
- Constitutional AI (10 papers)
- Debate and amplification (8 papers)
- Scalable oversight (8 papers)

### Cluster 3: Failure Modes (22 papers, 15.5%) ⚠️ SPARSE
**Characteristics:**
- Focus: Jailbreaking, prompt injection, tool misuse
- Prominent papers:
  - "Jailbroken: How Does LLM Safety Training Fail?" (Wei et al., 2023) - 85 citations
  - "Prompt Injection Attacks on LLMs" (Willison, 2023) - 42 citations

**Sub-topics:**
- Jailbreaking (10 papers)
- Prompt injection (7 papers)
- Tool misuse (5 papers)

### Cluster 4: Theory and Frameworks (37 papers, 26.1%)
**Characteristics:**
- Focus: Goal misalignment, corrigibility, value learning theory
- Prominent papers:
  - "Open Problems in Cooperative AI" (Dafoe et al., 2020) - 180 citations
  - "Learning to Summarize from Human Feedback" (Stiennon et al., 2020) - 850 citations

## Gap Analysis

### Under-Researched Topics (Sparse Clusters)

1. **Autonomous Replication Risks** - Only 3 papers
   - Limited empirical work on self-replication prevention
   - No papers on detection methods for replication attempts

2. **Long-Horizon Planning Failures** - Only 4 papers
   - Most evaluation focuses on single-turn or short conversations
   - Missing: Multi-day planning failure modes

3. **Tool Misuse Cascades** - Only 5 papers in failure modes cluster
   - Gap: How misuse of one tool enables misuse of others
   - Missing: Automated detection of tool misuse patterns

4. **Multimodal Agent Safety** - Only 2 papers
   - Text-only focus dominates (95% of papers)
   - Missing: Image, video, audio modality risks

### Contradictory Findings

1. **RLHF Effectiveness for Agentic Behavior**
   - Pro: 12 papers report improved safety (e.g., Ouyang et al. 2022)
   - Con: 5 papers report RLHF increases deceptive capabilities (e.g., Hubinger et al. 2024)
   - Explanation: Different evaluation protocols, model sizes, and task types

2. **Red Teaming Comprehensiveness**
   - Some papers claim automated red teaming is sufficient (n=8)
   - Others argue human red teamers find unique failures (n=6)
   - Resolution: Likely both needed, but optimal mix unclear

### Missing Integrations

1. **Agent Safety + Robustness Testing** - No papers bridge these areas
2. **Constitutional AI + Tool Use** - Only 1 paper addresses this integration
3. **Multi-Agent Safety** - Only 2 papers on safety in multi-agent settings

## Contradictory Evidence Details

### Finding: RLHF Impact on Deception
- **Pro-RLHF papers (n=12)**: Report reduced harmful outputs, better instruction following
- **Critical papers (n=5)**: Report increased sophistication of deceptive behavior
- **Hypothesis**: RLHF optimizes for human approval, which can incentivize deception
- **Recommendation**: Further research on deception-aware RLHF variants needed

## Coverage Heatmap

| Topic Area | Paper Count | Coverage |
|------------|-------------|----------|
| Evaluation methods | 35 | ████████░░ 80% |
| Alignment techniques | 48 | ██████████ 100% |
| Failure modes (general) | 22 | ████░░░░░░ 40% |
| └─ Jailbreaking | 10 | ██████░░░░ 60% |
| └─ Autonomous replication | 3 | ██░░░░░░░░ 20% |
| └─ Tool misuse | 5 | ███░░░░░░░ 30% |
| Theory/frameworks | 37 | ████████░░ 75% |
| Multimodal safety | 2 | █░░░░░░░░░ 10% |

## Recommendations for Future Research

1. **High Priority Gaps:**
   - Failure taxonomies for agentic systems (only 22 papers vs 48 on alignment)
   - Autonomous replication detection and prevention (only 3 papers)
   - Long-horizon planning safety (only 4 papers)

2. **Methodological Improvements:**
   - Standardize evaluation baselines across papers
   - Develop benchmark for tool misuse detection
   - Create multimodal agent safety dataset

3. **Integration Work:**
   - Constitutional AI + tool use safety
   - Multi-agent coordination safety
   - Agent safety + adversarial robustness

## Acquisition Queue

Selected 50 papers for full acquisition based on:
- High citation count (>50 citations)
- Recent publication (2023-2024)
- Gap-filling potential (sparse cluster papers prioritized)
- Venue quality (A*/A tier conferences/journals)

Queue saved to: `.aiwg/research/discovery/acquisition-queue.json`

Why This Is Good:

- Follows the PRISMA systematic review protocol with a documented flow of record counts
- Quantitative topic clustering with paper counts and percentages
- Identifies specific research gaps with paper counts
- Documents contradictory findings with evidence from multiple sources
- Provides coverage heatmap showing research balance
- Offers actionable recommendations prioritized by gap severity
- Uses thought protocol throughout (Goal, Extraction, Reasoning, Exception, Synthesis)
- Generates acquisition queue for downstream Acquisition Agent
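
The bars in the coverage heatmap can be rendered from a percentage with a few lines of code. This sketch covers only the rendering: how a paper count maps to a coverage percentage is not defined in the example, so the percentages are taken as given. Rounding half up is an assumption, chosen because it reproduces the 8-block bar shown for the 75% row.

```python
def coverage_bar(percent: int, width: int = 10) -> str:
    """Render a fixed-width bar, rounding half up to whole blocks."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point surprises at the .5 boundary.
    filled = (percent * width + 50) // 100
    return "█" * filled + "░" * (width - filled)


for label, percent in [("Alignment techniques", 100), ("Theory/frameworks", 75), ("Multimodal safety", 10)]:
    print(f"| {label} | {coverage_bar(percent)} {percent}% |")
```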