Discovery Agent Examples

Externalized from the agent definition per the few-shot-examples rule (#1587).

Discovery Agent — Worked Examples

Externalized from the agent definition per the few-shot-examples rule (#1587).

Input:

"Search for papers on OAuth2 security best practices published in the last 3 years."

Thought (Goal):

I need to find recent papers on OAuth2 security from 2021-2024, focusing on best practices and security vulnerabilities.

Action:

Execute Semantic Scholar API query with parameters: query="OAuth2 security best practices", year_range="2021-2024", limit=100

Observation:

Found 87 papers matching criteria. Top result: "OAuth 2.0 Security Best Practices" (Smith et al., 2023) with 42 citations and relevance score 0.95.

Thought (Extraction):

Top papers focus on token rotation, PKCE extension, and authorization code interception. Venue distribution: 60% A-tier conferences, 30% journals, 10% workshops.

Output:

{
  "query": "OAuth2 security best practices",
  "timestamp": "2026-01-25T10:30:00Z",
  "total_results": 87,
  "filters_applied": {
    "year_range": "2021-2024",
    "venue": "all"
  },
  "papers": [
    {
      "paper_id": "abc123def456",
      "title": "OAuth 2.0 Security Best Practices",
      "authors": ["Smith, J.", "Doe, A."],
      "year": 2023,
      "venue": "ACM CCS",
      "venue_tier": "A*",
      "citations": 42,
      "doi": "10.1145/3576915.3623456",
      "relevance_score": 0.95,
      "url": "https://www.semanticscholar.org/paper/abc123def456"
    }
  ],
  "gap_analysis": {
    "under_researched_topics": ["OAuth PKCE adoption rates", "Token refresh security"],
    "contradictory_findings": [],
    "missing_integrations": ["OAuth + WebAuthn integration patterns"]
  }
}

Why This Is Good:

  • Uses structured thought protocol (Goal → Action → Observation → Extraction)
  • Queries appropriate API with specific parameters
  • Returns comprehensive metadata including venue tier and DOI
  • Includes gap analysis identifying under-researched areas
  • Relevance score calculated from multiple ranking factors

Example 2: Moderate - Citation Network Traversal

Input:

"Find papers related to 'Retrieval-Augmented Generation' and explore their citation network to discover foundational work and recent applications."

Thought (Goal):

I need to find RAG papers and traverse both backward citations (foundational work) and forward citations (recent applications using RAG).

Action:

1. Query Semantic Scholar for "Retrieval-Augmented Generation"

2. Extract top 10 papers by citation count

3. Query backward citations (references) for foundational papers

4. Query forward citations (citing papers) for recent applications

Observation:

  • Initial query: 45 papers on RAG
  • Top paper: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020, 1250 citations)
  • Backward citations discovered: 15 foundational papers on dense retrieval and knowledge bases
  • Forward citations discovered: 230 papers applying RAG to various domains

Thought (Reasoning):

The citation network reveals RAG's foundation in dense retrieval (DPR, 2020) and has exploded into applications: question answering, dialog systems, code generation, and medical domain. The forward citations show rapid growth post-2022 with LLM adoption.

Thought (Extraction):

Foundational cluster: Dense retrieval, BERT, T5

Application clusters: QA (45 papers), Dialog (30), Code gen (25), Medical (18), Legal (12)

Output:

# Citation Network Discovery: Retrieval-Augmented Generation

## Initial Query Results
- Query: "Retrieval-Augmented Generation"
- Papers found: 45
- Date range: 2020-2024

## Foundational Papers (Backward Citations)
1. "Dense Passage Retrieval for Open-Domain QA" (Karpukhin et al., 2020) - 1850 citations
2. "BERT: Pre-training of Deep Bidirectional Transformers" (Devlin et al., 2019) - 15,000+ citations
3. "Exploring the Limits of Transfer Learning with T5" (Raffel et al., 2020) - 3200 citations

## Recent Applications (Forward Citations - 230 papers)

### By Domain
- **Question Answering**: 45 papers
  - Medical QA, Legal QA, Scientific QA
- **Dialog Systems**: 30 papers
  - Conversational AI, Customer service bots
- **Code Generation**: 25 papers
  - GitHub Copilot-style systems, bug fixing
- **Domain-Specific**: 48 papers
  - Medical (18), Legal (12), Finance (10), Education (8)

## Gap Analysis
- **Under-researched**: RAG for multilingual tasks (only 3 papers)
- **Contradictory**: Retrieval granularity (document vs passage vs sentence)
- **Missing**: RAG + reinforcement learning from human feedback integration

## Snowball Discovery Impact
- Original query: 45 papers
- Citation network: +245 papers (15 backward, 230 forward)
- Total unique: 290 papers after deduplication

Why This Is Good:

  • Demonstrates citation network traversal (backward + forward)
  • Organizes results by domain clusters
  • Quantifies discovery impact (original vs citation network)
  • Identifies gaps in the citation network
  • Shows reasoning about research trends and growth patterns

Example 3: Complex - Systematic Review with Gap Detection

Input:

"Conduct a systematic search for papers on 'LLM agent safety and alignment' from 2022-2024. Identify gaps in evaluation methods, missing failure mode coverage, and contradictory findings on alignment techniques. Generate PRISMA-compliant documentation."

Thought (Goal):

I need to execute a systematic review protocol: preregistered search strategy, inclusion/exclusion criteria, quality assessment, gap detection across evaluation methods and failure modes.

Action:

1. Formulate boolean query: ("LLM" OR "large language model") AND ("agent" OR "agentic") AND ("safety" OR "alignment" OR "risk")

2. Search Semantic Scholar + arXiv with year filter 2022-2024

3. Apply inclusion criteria: peer-reviewed, English, minimum 5 citations

4. Deduplicate and rank by relevance

5. Perform topic clustering for gap detection

6. Document PRISMA protocol

Observation:

  • Semantic Scholar: 152 papers
  • arXiv: 89 papers
  • After deduplication: 187 unique papers
  • After inclusion criteria: 142 papers (45 excluded: <5 citations or non-English)

Thought (Extraction):

Topic clusters identified via abstract similarity:

  • Evaluation methods (35 papers): Benchmarks, red teaming, adversarial testing
  • Alignment techniques (48 papers): RLHF, constitutional AI, debate
  • Failure modes (22 papers): Jailbreaking, prompt injection, tool misuse
  • Theory/frameworks (37 papers): Goal misalignment, corrigibility, value learning

Thought (Reasoning):

Evaluation methods cluster is well-researched (35 papers) but failure modes cluster is sparse (22 papers). Only 3 papers cover autonomous replication risks, and no papers address long-horizon planning failures. Contradictory findings exist on RLHF effectiveness for agentic behavior.

Thought (Exception):

Found inconsistency: "Alignment techniques" papers cite different baselines making comparison difficult. Some use GPT-3.5, others GPT-4, Claude 2, or open models. This impacts reproducibility and synthesis.

Thought (Synthesis):

The field is heavily focused on evaluation and RLHF-style alignment but under-researches failure modes specific to agentic systems (long-horizon planning, tool misuse cascades, autonomous replication). Need more empirical work on failure taxonomies and standardized evaluation protocols.

Output:

# Systematic Review: LLM Agent Safety and Alignment

## Search Strategy (PRISMA-Compliant)

**Date Executed:** 2026-01-25
**Databases:** Semantic Scholar, arXiv
**Search Terms:**
- Boolean query: ("LLM" OR "large language model") AND ("agent" OR "agentic") AND ("safety" OR "alignment" OR "risk")
- Year range: 2022-2024

**Inclusion Criteria:**
- Publication year: 2022-2024
- Peer-reviewed conference/journal or arXiv preprint
- English language
- Minimum 5 citations (relaxed for arXiv <6 months old)
- Focus on safety or alignment of LLM-based agents

**Exclusion Criteria:**
- Non-English papers
- Opinion pieces without empirical evaluation
- Papers focused solely on capability improvement without safety consideration

## Results

**PRISMA Flow:**
- Records identified via Semantic Scholar: 152
- Records identified via arXiv: 89
- Total records after deduplication: 187
- Records screened: 187
- Records excluded: 45 (29 <5 citations, 16 non-English)
- **Final included: 142 papers**

## Topic Clustering (K-means, k=4)

### Cluster 1: Evaluation Methods (35 papers, 24.6%)
**Characteristics:**
- Focus: Benchmarks, red teaming, adversarial testing
- Prominent papers:
  - "Red Teaming Language Models" (Perez et al., 2022) - 180 citations
  - "TruthfulQA: Measuring How Models Mimic Human Falsehoods" (Lin et al., 2022) - 250 citations

**Sub-topics:**
- Benchmark datasets (15 papers)
- Red teaming methodologies (12 papers)
- Automated adversarial testing (8 papers)

### Cluster 2: Alignment Techniques (48 papers, 33.8%)
**Characteristics:**
- Focus: RLHF, constitutional AI, debate, scalable oversight
- Prominent papers:
  - "Constitutional AI: Harmlessness from AI Feedback" (Bai et al., 2022) - 320 citations
  - "Training Language Models to Follow Instructions with Human Feedback" (Ouyang et al., 2022) - 1500+ citations

**Sub-topics:**
- RLHF methods (22 papers)
- Constitutional AI (10 papers)
- Debate and amplification (8 papers)
- Scalable oversight (8 papers)

### Cluster 3: Failure Modes (22 papers, 15.5%) ⚠️ SPARSE
**Characteristics:**
- Focus: Jailbreaking, prompt injection, tool misuse
- Prominent papers:
  - "Jailbroken: How Does LLM Safety Training Fail?" (Wei et al., 2023) - 85 citations
  - "Prompt Injection Attacks on LLMs" (Willison, 2023) - 42 citations

**Sub-topics:**
- Jailbreaking (10 papers)
- Prompt injection (7 papers)
- Tool misuse (5 papers)

### Cluster 4: Theory and Frameworks (37 papers, 26.1%)
**Characteristics:**
- Focus: Goal misalignment, corrigibility, value learning theory
- Prominent papers:
  - "Open Problems in Cooperative AI" (Dafoe et al., 2020) - 180 citations
  - "Learning to Summarize from Human Feedback" (Stiennon et al., 2020) - 850 citations

## Gap Analysis

### Under-Researched Topics (Sparse Clusters)

1. **Autonomous Replication Risks** - Only 3 papers
   - Limited empirical work on self-replication prevention
   - No papers on detection methods for replication attempts

2. **Long-Horizon Planning Failures** - Only 4 papers
   - Most evaluation focuses on single-turn or short conversations
   - Missing: Multi-day planning failure modes

3. **Tool Misuse Cascades** - Only 5 papers in failure modes cluster
   - Gap: How misuse of one tool enables misuse of others
   - Missing: Automated detection of tool misuse patterns

4. **Multimodal Agent Safety** - Only 2 papers
   - Text-only focus dominates (95% of papers)
   - Missing: Image, video, audio modality risks

### Contradictory Findings

1. **RLHF Effectiveness for Agentic Behavior**
   - Pro: 12 papers report improved safety (e.g., Ouyang et al. 2022)
   - Con: 5 papers report RLHF increases deceptive capabilities (e.g., Hubinger et al. 2024)
   - Explanation: Different evaluation protocols, model sizes, and task types

2. **Red Teaming Comprehensiveness**
   - Some papers claim automated red teaming is sufficient (n=8)
   - Others argue human red teamers find unique failures (n=6)
   - Resolution: Likely both needed, but optimal mix unclear

### Missing Integrations

1. **Agent Safety + Robustness Testing** - No papers bridge these areas
2. **Constitutional AI + Tool Use** - Only 1 paper addresses this integration
3. **Multi-Agent Safety** - Only 2 papers on safety in multi-agent settings

## Contradictory Evidence Details

### Finding: RLHF Impact on Deception
- **Pro-RLHF papers (n=12)**: Report reduced harmful outputs, better instruction following
- **Critical papers (n=5)**: Report increased sophistication of deceptive behavior
- **Hypothesis**: RLHF optimizes for human approval, which can incentivize deception
- **Recommendation**: Further research on deception-aware RLHF variants needed

## Coverage Heatmap

| Topic Area | Paper Count | Coverage |
|------------|-------------|----------|
| Evaluation methods | 35 | ████████░░ 80% |
| Alignment techniques | 48 | ██████████ 100% |
| Failure modes (general) | 22 | ████░░░░░░ 40% |
| └─ Jailbreaking | 10 | ██████░░░░ 60% |
| └─ Autonomous replication | 3 | ██░░░░░░░░ 20% |
| └─ Tool misuse | 5 | ███░░░░░░░ 30% |
| Theory/frameworks | 37 | ████████░░ 75% |
| Multimodal safety | 2 | █░░░░░░░░░ 10% |

## Recommendations for Future Research

1. **High Priority Gaps:**
   - Failure taxonomies for agentic systems (only 22 papers vs 48 on alignment)
   - Autonomous replication detection and prevention (only 3 papers)
   - Long-horizon planning safety (only 4 papers)

2. **Methodological Improvements:**
   - Standardize evaluation baselines across papers
   - Develop benchmark for tool misuse detection
   - Create multimodal agent safety dataset

3. **Integration Work:**
   - Constitutional AI + tool use safety
   - Multi-agent coordination safety
   - Agent safety + adversarial robustness

## Acquisition Queue

Selected 50 papers for full acquisition based on:
- High citation count (>50 citations)
- Recent publication (2023-2024)
- Gap-filling potential (sparse cluster papers prioritized)
- Venue quality (A*/A tier conferences/journals)

Queue saved to: `.aiwg/research/discovery/acquisition-queue.json`

Why This Is Good:

  • Follows PRISMA systematic review protocol with proper flow diagram
  • Quantitative topic clustering with paper counts and percentages
  • Identifies specific research gaps with paper counts
  • Documents contradictory findings with evidence from multiple sources
  • Provides coverage heatmap showing research balance
  • Offers actionable recommendations prioritized by gap severity
  • Uses thought protocol throughout (Goal, Extraction, Reasoning, Exception, Synthesis)
  • Generates acquisition queue for downstream Acquisition Agent