Architecture Designer Examples
Externalized from the agent definition per the few-shot-examples rule (#1587).
Architecture Designer — Worked Examples
Externalized from the agent definition per the few-shot-examples rule (#1587).
Usage Examples
E-Commerce Platform
Design architecture for an e-commerce platform:
- Expected: 100K daily active users
- Features: Product catalog, cart, checkout, payments
- Requirements: PCI compliance, 99.9% uptime
- Integrations: Payment gateways, shipping providers
- Budget: Cloud-native, cost-optimized
Real-Time Analytics System
Design architecture for real-time analytics:
- Data volume: 1M events/second
- Processing: Stream processing with ML inference
- Storage: 90-day hot data, 2-year cold storage
- Query requirements: Sub-second dashboard updates
- Compliance: GDPR data handling
Microservices Migration
Design migration from monolith to microservices:
- Current: Django monolith with PostgreSQL
- Target: Containerized microservices
- Constraints: Zero downtime migration
- Timeline: 6-month gradual migration
- Team: 10 developers, mixed experience
Architecture Patterns
Microservices Architecture
graph TB
Gateway[API Gateway]
Auth[Auth Service]
User[User Service]
Product[Product Service]
Order[Order Service]
Payment[Payment Service]
Gateway --> Auth
Gateway --> User
Gateway --> Product
Gateway --> Order
Order --> Payment
Order --> Product
Event-Driven Architecture
graph LR
Producer[Event Producers]
Broker[Message Broker]
Consumer1[Service A]
Consumer2[Service B]
Consumer3[Service C]
Producer --> Broker
Broker --> Consumer1
Broker --> Consumer2
Broker --> Consumer3
Layered Architecture
┌─────────────────────────┐
│ Presentation Layer │
├─────────────────────────┤
│ Application Layer │
├─────────────────────────┤
│ Business Logic │
├─────────────────────────┤
│ Data Access Layer │
├─────────────────────────┤
│ Database │
└─────────────────────────┘
Deployment Sample: Container Orchestration
# Kubernetes deployment example
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-deployment
spec:
replicas: 3
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
spec:
containers:
- name: app
image: myapp:latest
ports:
- containerPort: 8080
ADR Template Sample
# ADR-001: [Decision Title]
## Status
Accepted/Rejected/Deprecated
## Context
[What is the issue we're addressing?]
## Decision
[What are we doing?]
## Consequences
[What are the trade-offs?]
## Alternatives Considered
[What other options were evaluated?]
Few-Shot Examples
Example 1: Simple - Authentication Service Component Design
Input:
Design the authentication service component for our healthcare portal application. We need to support healthcare providers logging in with their NPI numbers and passwords, with multi-factor authentication required for compliance. The service should integrate with our existing LDAP directory for provider verification.
Output:
## Component: Authentication Service
### Overview
Standalone authentication service handling provider credential verification, session management, and MFA enforcement for HIPAA-compliant healthcare portal access.
### Technology Stack
- **Runtime**: Node.js 20.x (LTS)
- **Framework**: Express.js with helmet security middleware
- **Authentication**: Passport.js with custom NPI strategy
- **MFA**: Speakeasy (TOTP) + Twilio (SMS backup)
- **Session Store**: Redis 7.x
- **Directory Integration**: ldapjs for LDAP connectivity
### Component Responsibilities
1. Validate NPI number format and existence
2. Authenticate against LDAP directory
3. Enforce MFA challenge/response
4. Issue JWT access tokens (15min) and refresh tokens (7 days)
5. Maintain active session registry in Redis
6. Audit all authentication attempts
### API Endpoints
- `POST /auth/login` - Initial credential submission
- `POST /auth/mfa/verify` - MFA code verification
- `POST /auth/refresh` - Token refresh
- `POST /auth/logout` - Session termination
- `GET /auth/session` - Session validation
### Data Flow
Provider → [NPI + Password] → Auth Service → LDAP Validation
↓
[MFA Challenge]
↓
Provider → [MFA Code]
↓
[JWT Token + Session] → Redis
### Security Measures
- Bcrypt password hashing (cost factor 12)
- Rate limiting: 5 attempts per NPI per 15 minutes
- Account lockout after 5 failed attempts
- TOTP window: 1 step (30 seconds)
- Session invalidation on password change
- Comprehensive audit logging to SIEM
### Sequence Diagram
sequenceDiagram
participant P as Provider
participant A as Auth Service
participant L as LDAP
participant R as Redis
participant M as MFA Service
P->>A: POST /auth/login (NPI, password)
A->>L: Validate NPI + password
L-->>A: Valid
A->>M: Generate MFA challenge
M->>P: Send TOTP/SMS code
P->>A: POST /auth/mfa/verify (code)
A->>M: Verify code
M-->>A: Valid
A->>R: Create session
A-->>P: JWT token + refresh token
### Non-Functional Requirements
- Response time: < 500ms (95th percentile)
- Availability: 99.9% uptime
- Concurrent sessions: Support 10,000 active providers
- HIPAA audit logging: All authentication events with timestamps
Why This Is Good:
- Uses domain-specific terminology (NPI, HIPAA, healthcare providers) showing real-world context
- Provides complete technology stack with specific versions
- Includes concrete security measures (bcrypt cost factor, rate limits, lockout policy)
- Sequence diagram visualizes the MFA flow clearly
- NFRs are quantified (< 500ms, 99.9% uptime, 10K concurrent users)
Example 2: Moderate - ADR for Caching Strategy
Input:
We need to decide on a caching strategy for our e-commerce product catalog. We have 500K products with frequent price updates, high read traffic (10K requests/second peak), and need to show real-time inventory. Options are Redis for centralized caching, in-memory application caching, or CDN for static content. Document this decision with trade-offs.
Output:
# ADR-003: Product Catalog Caching Strategy
## Status
Accepted
## Context
E-commerce platform experiencing high load on product catalog queries (10K req/s peak) with 500K products. Current direct database queries causing:
- Database CPU at 85% during peak hours
- P95 response time degraded to 2.3 seconds
- Price updates require immediate propagation to storefront
- Inventory must reflect within 5 seconds of change
## Decision Matrix
| Criterion (Weight) | Redis Cluster (35%) | In-Memory Cache (25%) | CDN + Edge Cache (40%) |
|-------------------|---------------------|----------------------|------------------------|
| **Read Performance** (30%) | 5 (sub-ms, 100K ops/s) | 4 (fast, memory limited) | 5 (edge latency < 50ms) |
| **Write Latency** (25%) | 4 (pub/sub invalidation) | 3 (requires app restarts) | 2 (TTL-based, stale risk) |
| **Scalability** (20%) | 5 (horizontal sharding) | 2 (vertical only) | 5 (global distribution) |
| **Consistency** (15%) | 5 (immediate propagation) | 3 (eventual consistency) | 2 (TTL-dependent) |
| **Cost** (10%) | 3 ($800/mo cluster) | 5 (no additional cost) | 3 ($600/mo CDN) |
| **Weighted Score** | **4.35** | **3.30** | **3.80** |
## Decision
Implement **hybrid approach**:
1. **CDN (CloudFront)** for static product images and descriptions (24h TTL)
2. **Redis Cluster** for dynamic data (prices, inventory) with pub/sub invalidation
3. **In-memory** (Node.js) for session data and user-specific caching
### Architecture
graph TB
Client[Client Browser]
CDN[CloudFront CDN]
LB[Load Balancer]
API1[API Server 1]
API2[API Server 2]
Redis[(Redis Cluster)]
DB[(PostgreSQL)]
Client -->|Static content| CDN
Client -->|API requests| LB
LB --> API1
LB --> API2
API1 --> Redis
API2 --> Redis
Redis -->|Cache miss| DB
API1 -.->|Pub/sub| Redis
API2 -.->|Pub/sub| Redis
## Consequences
### Positive
- **Performance**: 95% cache hit rate reduces database load to 500 req/s
- **Scalability**: Redis cluster handles 100K ops/s, supports horizontal scaling
- **Consistency**: Pub/sub ensures price updates propagate within 200ms
- **Cost efficiency**: Hybrid approach reduces CDN costs by 60% vs full CDN caching
- **Resilience**: Redis cluster with failover, fallback to database on cache failure
### Negative
- **Complexity**: Three-tier caching requires careful invalidation strategy
- **Operational overhead**: Redis cluster monitoring and maintenance
- **Memory cost**: Redis cluster requires 64GB RAM ($800/month)
- **Cold start**: Cache warmup takes 5 minutes after deployment
- **Eventual consistency**: CDN cached content may be stale up to TTL
### Mitigations
- Cache warmup script runs automatically on deployment
- Monitoring alerts on cache hit rate < 85%
- Automatic failover to database if Redis unavailable
- Gradual rollout with feature flag to validate before full traffic
## Alternatives Considered
### Option A: In-Memory Only (REJECTED)
**Pros**: No additional infrastructure, zero latency
**Cons**: Limited by application memory, no sharing across instances, requires app restart for updates
**Why Rejected**: Cannot scale beyond 100K products in memory, inconsistency across replicas
### Option B: CDN Only (REJECTED)
**Pros**: Global distribution, minimal infrastructure
**Cons**: TTL-based invalidation causes stale data, poor for frequently changing prices
**Why Rejected**: Price update propagation unacceptable (up to 1 hour with TTL)
## Implementation Plan
### Phase 1: Redis Setup (Week 1-2)
- Provision Redis cluster (3 nodes, 64GB each)
- Configure pub/sub channels for price/inventory updates
- Implement cache invalidation logic in API
### Phase 2: CDN Integration (Week 3-4)
- Configure CloudFront distribution
- Update image URLs to CDN endpoints
- Set TTL policies (24h for images, 1h for descriptions)
### Phase 3: Monitoring (Week 5)
- Deploy Grafana dashboards (cache hit rate, latency, memory usage)
- Configure alerts (hit rate < 85%, Redis memory > 80%)
- Load testing to validate 10K req/s target
## Success Metrics
- Cache hit rate: > 95% (currently 0%)
- P95 API latency: < 100ms (currently 2.3s)
- Database load: < 1000 req/s (currently 8000 req/s)
- Price update propagation: < 500ms (currently 30s)
Why This Is Good:
- Uses Tree-of-Thoughts decision matrix with weighted criteria showing systematic evaluation
- Provides specific quantitative metrics throughout (10K req/s, 95% cache hit, < 100ms latency)
- Documents both chosen approach AND rejected alternatives with clear rationale
- Includes implementation timeline showing phased approach reduces risk
- Trade-offs are explicit with both positive and negative consequences listed
Example 3: Complex - Event-Driven Microservices Integration Design
Input:
Design an event-driven architecture for our order fulfillment system that integrates inventory, warehouse, shipping, and notification services. We need to handle order placement, inventory reservation, picking/packing workflow, shipping label generation, and customer notifications. The system must be resilient to service failures, support idempotency, and provide eventual consistency across services. We expect 5000 orders per hour during peak periods.
Output:
## Event-Driven Order Fulfillment Architecture
### Overview
Event-driven microservices architecture using Kafka as the event backbone, enabling asynchronous order processing across inventory, warehouse, shipping, and notification services with guaranteed message delivery and idempotent processing.
### System Components
1. **Order Service** - Receives order placement, publishes `OrderCreated` events
2. **Inventory Service** - Manages stock levels, publishes `InventoryReserved` events
3. **Warehouse Service** - Handles pick/pack workflow, publishes `OrderPacked` events
4. **Shipping Service** - Generates labels, tracks shipments, publishes `ShipmentCreated` events
5. **Notification Service** - Sends customer emails/SMS for all order state changes
6. **Saga Orchestrator** - Coordinates distributed transaction across services
7. **Kafka Cluster** - Event bus (3 brokers, replication factor 3)
### Event Flow Architecture
graph TB
Order[Order Service]
Inventory[Inventory Service]
Warehouse[Warehouse Service]
Shipping[Shipping Service]
Notify[Notification Service]
Saga[Saga Orchestrator]
Kafka[(Kafka Event Bus)]
Order -->|OrderCreated| Kafka
Kafka -->|OrderCreated| Inventory
Kafka -->|OrderCreated| Saga
Kafka -->|OrderCreated| Notify
Inventory -->|InventoryReserved| Kafka
Kafka -->|InventoryReserved| Warehouse
Kafka -->|InventoryReserved| Saga
Kafka -->|InventoryReserved| Notify
Warehouse -->|OrderPacked| Kafka
Kafka -->|OrderPacked| Shipping
Kafka -->|OrderPacked| Saga
Kafka -->|OrderPacked| Notify
Shipping -->|ShipmentCreated| Kafka
Kafka -->|ShipmentCreated| Saga
Kafka -->|ShipmentCreated| Notify
Saga -.->|Compensate| Kafka
### Event Schema
OrderCreated Event
topic: orders.created
key: order_id (for partitioning)
schema:
event_id: uuid (idempotency key)
event_timestamp: datetime
order_id: uuid
customer_id: uuid
line_items:
- sku: string
quantity: integer
price: decimal
shipping_address: object
payment_status: confirmed
InventoryReserved Event
topic: inventory.reserved
key: order_id
schema:
event_id: uuid
event_timestamp: datetime
order_id: uuid
reservation_id: uuid
reserved_items:
- sku: string
quantity: integer
warehouse_location: string
OrderPacked Event
topic: warehouse.packed
key: order_id
schema:
event_id: uuid
event_timestamp: datetime
order_id: uuid
package_weight: decimal
package_dimensions: object
packed_at: datetime
picker_id: string
ShipmentCreated Event
topic: shipping.created
key: order_id
schema:
event_id: uuid
event_timestamp: datetime
order_id: uuid
tracking_number: string
carrier: string
estimated_delivery: datetime
label_url: string
### Kafka Topic Configuration
| Topic | Partitions | Replication | Retention | Purpose |
|-------|-----------|-------------|-----------|---------|
| `orders.created` | 10 | 3 | 7 days | Order placement events |
| `inventory.reserved` | 10 | 3 | 7 days | Inventory reservation confirmations |
| `warehouse.packed` | 10 | 3 | 7 days | Package ready events |
| `shipping.created` | 10 | 3 | 7 days | Shipment tracking created |
| `saga.compensate` | 5 | 3 | 30 days | Rollback/compensation events |
| `notifications.outbox` | 5 | 3 | 3 days | Notification delivery queue |
### Idempotency Strategy
**Problem:** Network retries can cause duplicate event processing, leading to double inventory reservations or duplicate shipments.
**Solution:** Idempotency tracking using `event_id` as deduplication key.
// Idempotent Event Handler Pattern
async function handleInventoryReservation(event: InventoryReserveEvent) {
const { event_id, order_id, items } = event;
// Check if already processed
const processed = await redis.get(`processed:${event_id}`);
if (processed) {
console.log(`Event ${event_id} already processed, skipping`);
return; // Idempotent - no action taken
}
try {
// Begin database transaction
await db.transaction(async (tx) => {
// Reserve inventory
for (const item of items) {
await tx.inventory.decrement(item.sku, item.quantity);
}
// Create reservation record
await tx.reservations.create({
reservation_id: uuid(),
order_id,
items,
status: 'reserved'
});
// Mark event as processed (in same transaction)
await tx.processed_events.create({ event_id, processed_at: new Date() });
});
// Cache in Redis for fast duplicate detection (7 day TTL)
await redis.setex(`processed:${event_id}`, 604800, 'true');
// Publish success event
await kafka.publish('inventory.reserved', {
event_id: uuid(),
order_id,
reservation_id,
items
});
} catch (error) {
// Failure - will retry with same event_id (idempotent)
console.error(`Reservation failed for ${order_id}:`, error);
throw error; // Kafka will redeliver
}
}
### Failure Handling & Compensating Transactions
**Saga Pattern:** Orchestrated saga manages distributed transaction lifecycle.
#### Happy Path Flow
OrderCreated → InventoryReserved → OrderPacked → ShipmentCreated → Complete
#### Failure Scenarios
**Scenario 1: Inventory Unavailable**
OrderCreated → Inventory Check FAILED
↓
Saga publishes: CompensateOrder
↓
Order Service: Cancel order, refund payment
Notification Service: Send "out of stock" email
**Scenario 2: Shipping Label Generation Failed**
OrderCreated → InventoryReserved → OrderPacked → Shipping FAILED
↓
Saga publishes: CompensateInventory, CompensateWarehouse
↓
Inventory Service: Release reservation
Warehouse Service: Return items to shelf
Order Service: Mark as "pending retry"
Notification Service: Send "delay" notification
**Scenario 3: Service Unavailable**
Event published → Kafka stores event (durable)
↓
Service offline (maintenance/crash)
↓
Kafka retains event for 7 days
↓
Service comes back online → Processes backlog
### Resilience Patterns
1. **Retry with Exponential Backoff**
- Initial retry: 1 second
- Max retry: 5 minutes
- Max attempts: 10
- After 10 failures → move to dead letter queue
2. **Dead Letter Queue (DLQ)**
- Topic: `{original-topic}.dlq`
- Manual review required for DLQ events
- Alert on DLQ message count > 10
3. **Circuit Breaker**
- Open circuit after 5 consecutive failures
- Half-open after 30 seconds (test request)
- Close circuit after 3 successful requests
4. **Bulkhead Isolation**
- Each service has dedicated consumer group
- Independent failure domains
- One service failure doesn't block others
### Consistency Model
**Eventual Consistency**: System progresses through states asynchronously, converging to consistent state.
**Consistency Guarantees:**
- **Per-Order Ordering**: Events for same `order_id` processed in sequence (Kafka partition key = order_id)
- **At-Least-Once Delivery**: Kafka guarantees message delivery (with idempotent processing)
- **Eventual State Convergence**: All services eventually reflect same order state (within 5 seconds under normal conditions)
**State Reconciliation:**
- Periodic reconciliation job (every 15 minutes) checks for inconsistencies
- Compares Order Service state vs Inventory/Warehouse/Shipping state
- Auto-corrects minor drifts, flags major inconsistencies for manual review
### Technology Stack
| Component | Technology | Version | Rationale |
|-----------|-----------|---------|-----------|
| Event Bus | Apache Kafka | 3.6 | High throughput (1M+ events/sec), durable storage, partition-based ordering |
| Schema Registry | Confluent Schema Registry | 7.5 | Schema evolution, compatibility checks |
| Orchestrator | Temporal | 1.22 | Saga workflow orchestration, built-in retry/compensation |
| Services Runtime | Node.js | 20.x LTS | Non-blocking I/O for high concurrency |
| Event Sourcing | PostgreSQL + Debezium | 15.4 / 2.4 | Change data capture to Kafka, audit trail |
| Idempotency Cache | Redis | 7.2 | Fast duplicate detection, 7-day TTL |
| Monitoring | Prometheus + Grafana | 2.47 / 10.2 | Kafka lag monitoring, service health |
### Performance Characteristics
| Metric | Target | Peak Capacity |
|--------|--------|---------------|
| Order throughput | 5000 orders/hour (steady) | 10000 orders/hour (burst) |
| End-to-end latency (P95) | < 5 seconds | < 10 seconds |
| Kafka event throughput | 15000 events/sec | 50000 events/sec |
| Event processing lag | < 1 second (P95) | < 5 seconds (P99) |
| Idempotency check | < 10ms (Redis) | N/A |
| Service availability | 99.9% uptime | N/A |
### Deployment Architecture
Kubernetes deployment (simplified)
services:
kafka:
replicas: 3
resources:
cpu: 4 cores
memory: 8GB
storage: 1TB SSD (per broker)
order-service:
replicas: 5
resources:
cpu: 2 cores
memory: 4GB
inventory-service:
replicas: 5
resources:
cpu: 2 cores
memory: 4GB
warehouse-service:
replicas: 3
resources:
cpu: 1 core
memory: 2GB
shipping-service:
replicas: 3
resources:
cpu: 1 core
memory: 2GB
notification-service:
replicas: 5
resources:
cpu: 1 core
memory: 2GB
saga-orchestrator:
replicas: 2
resources:
cpu: 2 cores
memory: 4GB
### Monitoring & Observability
**Key Metrics:**
- Kafka consumer lag (alert if > 1000 messages)
- Event processing duration (P50, P95, P99)
- Saga completion rate (target > 99%)
- Compensation event rate (alert if > 1%)
- Dead letter queue depth (alert if > 10)
**Distributed Tracing:**
- Each event carries `trace_id` for end-to-end tracking
- OpenTelemetry instrumentation across all services
- Jaeger UI for trace visualization
### Security Considerations
1. **Kafka ACLs** - Each service has read/write permissions only to required topics
2. **Event Encryption** - TLS 1.3 for in-transit, KMS for at-rest
3. **Schema Validation** - All events validated against registry schemas (reject invalid events)
4. **Audit Trail** - All events retained 7 days for compliance review
5. **PII Handling** - Customer data encrypted in events, key rotation every 90 days
### Migration Strategy
**Phase 1:** Deploy Kafka and core infrastructure (Week 1-2)
**Phase 2:** Migrate Order Service to publish events (maintain sync flow as fallback) (Week 3-4)
**Phase 3:** Migrate Inventory + Warehouse services to consume events (Week 5-6)
**Phase 4:** Migrate Shipping + Notification services (Week 7-8)
**Phase 5:** Enable saga orchestration, decommission sync flow (Week 9-10)
**Phase 6:** Load testing and tuning (Week 11-12)
**Rollback Plan:** Feature flag controls event flow vs legacy synchronous flow. Can disable event-driven mode and revert to sync in < 5 minutes.
Why This Is Good:
- Comprehensive event-driven architecture with 4+ services showing realistic microservices complexity
- Idempotency implementation includes actual code demonstrating pattern in practice (not just theory)
- Failure handling covers multiple scenarios (inventory unavailable, shipping failed, service offline) with compensating transactions
- Mermaid diagram visualizes complex event flows across services clearly
- Quantified performance targets (5000 orders/hour, P95 < 5s, 99.9% uptime)
- Security, monitoring, and migration strategy show production-readiness considerations
- Kafka topic configuration with specific partitions/replication shows deep technical understanding