Azure Specialist Examples

Externalized from the agent definition per the few-shot-examples rule (#1587).

Azure Specialist — Worked Examples

Your Process — Sample Commands and Templates

1. Subscription and RBAC Structure

# List management groups (the top of the subscription hierarchy)
az account management-group list --query '[*].{Name:displayName,ID:id}' --output table

# Check role assignments at subscription scope
az role assignment list \
  --scope /subscriptions/$(az account show --query id -o tsv) \
  --query '[*].{Role:roleDefinitionName,Principal:principalName,Type:principalType}' \
  --output table

# Create resource group with required tags
az group create \
  --name rg-production-eastus \
  --location eastus \
  --tags environment=production cost-center=platform owner=platform-team
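
A tag list can be checked for required keys before it is passed to `az group create`. The following is a minimal sketch, assuming the three keys used above (`environment`, `cost-center`, `owner`) are the required set. `REQUIRED_TAGS` and `missing_tags` are hypothetical names for illustration, not part of the Azure CLI.

```shell
# Hypothetical helper: report which required tag keys are absent from a
# space-separated "key=value" list such as the one passed to --tags.
REQUIRED_TAGS="environment cost-center owner"   # assumed required set

missing_tags() {
  tags="$1"
  missing=""
  for key in $REQUIRED_TAGS; do
    case " $tags " in
      *" $key="*) ;;                 # key present
      *) missing="$missing $key" ;;  # key absent
    esac
  done
  # Strip the leading space before printing
  printf '%s\n' "${missing# }"
}

missing_tags "environment=production cost-center=platform owner=platform-team"
missing_tags "environment=dev owner=platform-team"
```

The first call prints an empty line because nothing is missing; the second prints `cost-center`.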

2. Bicep Infrastructure Templates

// main.bicep — parameterized deployment
targetScope = 'resourceGroup'

@description('Environment name for resource naming')
@allowed(['dev', 'staging', 'prod'])
param environment string

@description('Primary location for all resources')
param location string = resourceGroup().location

@description('Application name used in resource naming')
param appName string

var prefix = '${appName}-${environment}'
var tags = {
  environment: environment
  application: appName
  managedBy: 'bicep'
}

// App Service Plan: PremiumV3 in prod, Basic elsewhere
resource appServicePlan 'Microsoft.Web/serverfarms@2023-01-01' = {
  name: 'plan-${prefix}'
  location: location
  tags: tags
  sku: {
    name: environment == 'prod' ? 'P2v3' : 'B2'
    tier: environment == 'prod' ? 'PremiumV3' : 'Basic'
  }
  properties: {
    reserved: true   // Linux
    zoneRedundant: environment == 'prod'  // Zone redundancy in prod
  }
}

// Key Vault with RBAC authorization (preferred over access policies)
resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' = {
  name: 'kv-${prefix}-${uniqueString(resourceGroup().id)}'
  location: location
  tags: tags
  properties: {
    sku: {
      family: 'A'
      name: 'standard'
    }
    tenantId: subscription().tenantId
    enableRbacAuthorization: true
    enableSoftDelete: true
    softDeleteRetentionInDays: environment == 'prod' ? 90 : 7
    enablePurgeProtection: environment == 'prod'
    publicNetworkAccess: 'Disabled'
    networkAcls: {
      defaultAction: 'Deny'
      bypass: 'AzureServices'
    }
  }
}

output keyVaultName string = keyVault.name
output appServicePlanId string = appServicePlan.id

# Deploy with parameter file
az deployment group create \
  --resource-group rg-production-eastus \
  --template-file main.bicep \
  --parameters @parameters/prod.json \
  --what-if   # Preview changes before applying

# Lint and validate before deploying
az bicep lint --file main.bicep
az deployment group validate \
  --resource-group rg-production-eastus \
  --template-file main.bicep \
  --parameters @parameters/prod.json
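
The lint, validate, and what-if commands can be chained so that a failure in one stops the rest. This is a sketch under stated assumptions: `run`, `predeploy_checks`, and the `DRY_RUN` switch are hypothetical names introduced here, and with `DRY_RUN=1` the commands are only printed, not executed.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Lint, then validate, then preview; && stops the chain at the first failure.
predeploy_checks() {
  rg="$1"; template="$2"; params="$3"
  run az bicep lint --file "$template" &&
  run az deployment group validate --resource-group "$rg" \
    --template-file "$template" --parameters "@$params" &&
  run az deployment group what-if --resource-group "$rg" \
    --template-file "$template" --parameters "@$params"
}

# Dry run: prints the three commands in order without calling az.
DRY_RUN=1
predeploy_checks rg-production-eastus main.bicep parameters/prod.json
```

Unset `DRY_RUN` to run the checks for real; the actual deployment stays a separate, deliberate step.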

3. Azure Functions Optimization

// Azure Functions on the plan above, with managed identity and VNet integration.
// Assumes appInsights and subnet are declared elsewhere in the template.
resource functionApp 'Microsoft.Web/sites@2023-01-01' = {
  name: 'func-${prefix}'
  location: location
  tags: tags
  kind: 'functionapp,linux'
  identity: {
    type: 'SystemAssigned'   // Use managed identity — no stored credentials
  }
  properties: {
    serverFarmId: appServicePlan.id
    httpsOnly: true
    siteConfig: {
      linuxFxVersion: 'Python|3.12'
      ftpsState: 'Disabled'
      minTlsVersion: '1.2'
      appSettings: [
        {
          name: 'FUNCTIONS_EXTENSION_VERSION'
          value: '~4'
        }
        {
          name: 'FUNCTIONS_WORKER_RUNTIME'
          value: 'python'
        }
        {
          name: 'WEBSITE_RUN_FROM_PACKAGE'
          value: '1'
        }
        {
          name: 'APPLICATIONINSIGHTS_CONNECTION_STRING'
          value: appInsights.properties.ConnectionString
        }
      ]
      preWarmedInstanceCount: 1   // Pre-warm to reduce cold starts on Premium plan
    }
    virtualNetworkSubnetId: subnet.id   // VNet integration for private backend access
  }
}

// Scale limits for the function app (site config, not a KEDA rule);
// minimumElasticInstanceCount applies to Elastic Premium plans only
resource scaleSettings 'Microsoft.Web/sites/config@2023-01-01' = {
  parent: functionApp
  name: 'web'
  properties: {
    functionAppScaleLimit: 100    // Cap maximum scale-out
    minimumElasticInstanceCount: 2  // Always-warm baseline
  }
}

# Check function app scaling events
az monitor activity-log list \
  --resource-group rg-production-eastus \
  --namespace Microsoft.Web \
  --query '[?operationName.value==`Microsoft.Web/sites/instances/write`].[eventTimestamp,description]' \
  --output table

# View function metrics: execution count and execution units
az monitor metrics list \
  --resource $(az functionapp show --name func-myapp-prod --resource-group rg-production-eastus --query id -o tsv) \
  --metric FunctionExecutionCount FunctionExecutionUnits \
  --interval PT1H \
  --output table

4. Cosmos DB Tuning

# Show provisioned throughput for the container (manual RU/s or autoscale maximum)
az cosmosdb sql container throughput show \
  --account-name cosmos-myapp-prod \
  --resource-group rg-production-eastus \
  --database-name appdb \
  --name orders \
  --query '{Throughput:resource.throughput,AutoscaleMax:resource.autoscaleSettings.maxThroughput}'

# Check partition key distribution via metrics
az monitor metrics list \
  --resource $(az cosmosdb show --name cosmos-myapp-prod --resource-group rg-production-eastus --query id -o tsv) \
  --metric NormalizedRUConsumption \
  --dimension DatabaseName CollectionName PartitionKeyRangeId \
  --interval PT1M \
  --output table

// Cosmos DB with autoscale and optimized indexing
resource cosmosAccount 'Microsoft.DocumentDB/databaseAccounts@2024-02-15-preview' = {
  name: 'cosmos-${prefix}'
  location: location
  tags: tags
  kind: 'GlobalDocumentDB'
  properties: {
    consistencyPolicy: {
      defaultConsistencyLevel: 'Session'   // Balanced consistency for most apps
    }
    locations: [
      {
        locationName: location
        failoverPriority: 0
        isZoneRedundant: environment == 'prod'
      }
    ]
    enableAutomaticFailover: true
    enableMultipleWriteLocations: false
    backupPolicy: {
      type: environment == 'prod' ? 'Continuous' : 'Periodic'
    }
  }
}

resource ordersContainer 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers@2024-02-15-preview' = {
  // Parent database omitted for brevity
  name: 'orders'
  properties: {
    resource: {
      id: 'orders'
      partitionKey: {
        paths: ['/customerId']    // High-cardinality key avoids hot partitions
        kind: 'Hash'
        version: 2
      }
      indexingPolicy: {
        indexingMode: 'consistent'
        automatic: true
        includedPaths: [
          { path: '/customerId/?' }
          { path: '/status/?' }
          { path: '/createdAt/?' }
        ]
        excludedPaths: [
          { path: '/largePayload/*' }  // Exclude large blobs from index
          { path: '/"_etag"/?' }
        ]
      }
      defaultTtl: -1   // TTL enabled with no default expiry; only items that set their own ttl expire
    }
    options: {
      autoscaleSettings: {
        maxThroughput: 10000    // Autoscale 1000-10000 RU/s based on demand
      }
    }
  }
}
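
The range in the `maxThroughput` comment follows from autoscale scaling between 10% of the configured maximum and the maximum itself. A minimal sketch of that arithmetic; `autoscale_range` is a hypothetical helper name, not an Azure CLI command.

```shell
# Hypothetical helper: print the autoscale range for a given maximum.
# Autoscale floors at 10% of the configured maxThroughput.
autoscale_range() {
  max="$1"
  echo "$(( max / 10 ))-${max} RU/s"
}

autoscale_range 10000   # prints: 1000-10000 RU/s
autoscale_range 20000   # prints: 2000-20000 RU/s
```

Because the floor is billed even when idle, raising the maximum also raises the minimum cost.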

5. AKS Cluster Management

# Check node pool utilization and autoscaler decisions
az aks nodepool list \
  --cluster-name aks-myapp-prod \
  --resource-group rg-production-eastus \
  --query '[*].{Name:name,VMSize:vmSize,Count:count,MinCount:minCount,MaxCount:maxCount,Mode:mode}' \
  --output table

# View cluster autoscaler logs
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Check for pending pods that triggered scale-out
kubectl get events --field-selector reason=TriggeredScaleUp --sort-by='.lastTimestamp'

// AKS cluster with system and user node pools
resource aksCluster 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
  name: 'aks-${prefix}'
  location: location
  tags: tags
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: '1.29'
    dnsPrefix: 'aks-${prefix}'
    agentPoolProfiles: [
      {
        name: 'system'
        count: 3
        minCount: 3
        maxCount: 5
        vmSize: 'Standard_D4s_v5'
        osType: 'Linux'
        mode: 'System'
        availabilityZones: ['1', '2', '3']
        enableAutoScaling: true
        nodeTaints: ['CriticalAddonsOnly=true:NoSchedule']
      }
      {
        name: 'apppool'
        count: 2
        minCount: 1
        maxCount: 20
        vmSize: 'Standard_D8s_v5'
        osType: 'Linux'
        mode: 'User'
        availabilityZones: ['1', '2', '3']
        enableAutoScaling: true
        nodeLabels: {
          'workload-type': 'application'
        }
      }
    ]
    networkProfile: {
      networkPlugin: 'azure'
      networkPolicy: 'calico'
      loadBalancerSku: 'standard'
    }
    oidcIssuerProfile: {
      enabled: true   // Required for Workload Identity
    }
    securityProfile: {
      workloadIdentity: {
        enabled: true  // Replace pod-level secrets with managed identity
      }
      imageCleaner: {
        enabled: true
        intervalHours: 24
      }
    }
    addonProfiles: {
      omsagent: {
        enabled: true
        config: {
          logAnalyticsWorkspaceResourceID: logAnalyticsWorkspace.id
        }
      }
      azurepolicy: {
        enabled: true
      }
    }
  }
}

6. Azure Monitor and Alerts

# Create action group for alert routing
az monitor action-group create \
  --name ag-platform-critical \
  --resource-group rg-production-eastus \
  --short-name platform \
  --action email oncall <oncall-email-address>

# Create metric alert on HTTP 5xx responses as a proxy for function failures
az monitor metrics alert create \
  --name "func-failure-rate-high" \
  --resource-group rg-production-eastus \
  --scopes $(az functionapp show --name func-myapp-prod --resource-group rg-production-eastus --query id -o tsv) \
  --condition "total Http5xx > 5" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action ag-platform-critical \
  --severity 1

Few-Shot Examples

Example 1: Bicep Template Review

Input:

"Review this Bicep template for a storage account. We're getting deployment errors and the security team flagged it."

resource storageAccount 'Microsoft.Storage/storageAccounts@2021-04-01' = {
  name: 'mystorage'
  location: 'eastus'
  sku: { name: 'Standard_LRS' }
  kind: 'StorageV2'
  properties: {
    allowBlobPublicAccess: true
    supportsHttpsTrafficOnly: false
    minimumTlsVersion: 'TLS1_0'
  }
}

Output:

Seven changes in total (five security, one reliability, one naming):

// Fixed storage account module
@description('Storage account name — must be globally unique, 3-24 lowercase alphanumeric')
param storageAccountName string = 'st${uniqueString(resourceGroup().id)}'

@description('Location for the storage account')
param location string = resourceGroup().location

@description('SKU — use ZRS or GZRS in production for zone resilience')
@allowed(['Standard_LRS', 'Standard_ZRS', 'Standard_GRS', 'Standard_GZRS'])
param skuName string = 'Standard_ZRS'   // CHANGE 1: ZRS for zone-redundant prod storage

resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {  // Updated API version
  name: storageAccountName   // CHANGE 2: parameterized, unique name (avoids hardcode conflict)
  location: location
  sku: { name: skuName }
  kind: 'StorageV2'
  properties: {
    allowBlobPublicAccess: false    // CHANGE 3: block public access — security critical
    supportsHttpsTrafficOnly: true  // CHANGE 4: enforce HTTPS — security critical
    minimumTlsVersion: 'TLS1_2'    // CHANGE 5: require TLS 1.2+
    allowSharedKeyAccess: false     // CHANGE 6: force Entra ID auth; disable shared keys
    networkAcls: {
      defaultAction: 'Deny'         // CHANGE 7: deny-by-default network access
      bypass: 'AzureServices'
    }
  }
}

output storageAccountId string = storageAccount.id
output storageAccountName string = storageAccount.name

Summary of changes:

1. Parameterized SKU defaulting to `Standard_ZRS`: zone-redundant storage; LRS was a single-datacenter failure risk

2. Unique name via `uniqueString()`: eliminates the hardcoded name that caused deployment conflicts

3. `allowBlobPublicAccess: false`: prevents anonymous internet access to blobs

4. `supportsHttpsTrafficOnly: true`: blocks HTTP connections

5. `minimumTlsVersion: 'TLS1_2'`: removes support for vulnerable TLS 1.0/1.1

6. `allowSharedKeyAccess: false`: requires Entra ID auth, so access is logged per identity

7. `networkAcls.defaultAction: 'Deny'`: denies network access by default, with a bypass for trusted Azure services
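
The naming rule in the parameter description (3-24 lowercase alphanumeric characters) can be checked locally before a deployment is attempted. This is a sketch, not an official validator; `valid_storage_name` is a hypothetical helper, and it checks format only, not global uniqueness.

```shell
# Use byte-wise ranges so [a-z] cannot match uppercase letters in some locales.
LC_ALL=C

# Hypothetical helper: succeed if the name is 3-24 lowercase letters or digits.
valid_storage_name() {
  name="$1"
  len=${#name}
  [ "$len" -ge 3 ] && [ "$len" -le 24 ] || return 1
  case "$name" in
    *[!a-z0-9]*) return 1 ;;   # any character outside a-z, 0-9 is rejected
  esac
  return 0
}

for n in mystorage My-Storage ab; do
  if valid_storage_name "$n"; then echo "$n: ok"; else echo "$n: invalid"; fi
done
```

Note that `mystorage` passes the format check; it fails in practice because storage account names must also be globally unique, which is what the `uniqueString()` suffix addresses.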

Example 2: Cosmos DB RU Optimization

Input:

"We're getting 429 throttling errors on our Cosmos DB orders container. It's provisioned at 4000 RU/s. What's wrong and how do we fix it?"

Output:

429s mean requests exceeded the provisioned RU/s, either for the container as a whole or for a single hot partition. Diagnose the root cause before scaling:

# 1. Check normalized RU consumption — should stay below 100%
az monitor metrics list \
  --resource /subscriptions/<sub>/resourceGroups/rg-prod/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-prod \
  --metric NormalizedRUConsumption \
  --dimension CollectionName \
  --interval PT1M \
  --start-time $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
  --output table

# 2. Check if a single partition key range is hot
az monitor metrics list \
  --resource /subscriptions/<sub>/resourceGroups/rg-prod/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-prod \
  --metric NormalizedRUConsumption \
  --dimension CollectionName PartitionKeyRangeId \
  --interval PT1M \
  --output table

# 3. Identify expensive queries in Diagnostics Logs
# (Requires diagnostic settings routing to Log Analytics)

// Log Analytics: find the operations consuming the most RUs and how often they are throttled
// (query text, if needed, is logged under the QueryRuntimeStatistics category)
AzureDiagnostics
| where Category == "DataPlaneRequests"
| where collectionName_s == "orders"
| summarize
    total_requests = count(),
    throttled = countif(statusCode_s == "429"),
    avg_ru = avg(todouble(requestCharge_s)),
    max_ru = max(todouble(requestCharge_s))
    by OperationName
| order by avg_ru desc
| take 20

Common causes and fixes:

Hot partition — all writes go to one partition key value (e.g., `status='pending'`):

// Wrong: low-cardinality partition key
partitionKey: { paths: ['/status'] }

// Right: high-cardinality key distributes load evenly
partitionKey: { paths: ['/customerId'] }
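
To tell a hot partition from uniform overload, compare each partition key range against the average. The sketch below assumes the per-range `NormalizedRUConsumption` values have already been extracted into lines of `<PartitionKeyRangeId> <percent>`; that input format, the `hot_ranges` name, and the 90%/50% thresholds are illustrative assumptions.

```shell
# Hypothetical helper: read "<rangeId> <percent>" lines on stdin and print
# ranges at or above 90% when the average across all ranges is below 50%.
hot_ranges() {
  awk '
    { id[NR] = $1; pct[NR] = $2; sum += $2 }
    END {
      if (NR == 0) exit
      avg = sum / NR
      for (i = 1; i <= NR; i++)
        if (pct[i] >= 90 && avg < 50) print id[i]
    }'
}

# One range saturated while the others idle: prints 1
printf '0 12\n1 97\n2 9\n3 14\n' | hot_ranges
```

If every range is near 100%, nothing is printed: the container is under-provisioned as a whole, and the partition key is not the problem.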

Missing index — a filter on an unindexed path causes a full partition scan costing 100s of RUs:

// Add to indexingPolicy.includedPaths
{ "path": "/createdAt/?" },
{ "path": "/status/?" }

Variable load: if traffic is bursty, switch from manual to autoscale throughput so capacity follows demand up to the configured maximum:

az cosmosdb sql container throughput migrate \
  --account-name cosmos-prod \
  --resource-group rg-prod \
  --database-name appdb \
  --name orders \
  --throughput-type autoscale

az cosmosdb sql container throughput update \
  --account-name cosmos-prod \
  --resource-group rg-prod \
  --database-name appdb \
  --name orders \
  --max-throughput 20000   # Autoscale 2000-20000 RU/s

Example 3: AKS Cluster Right-Sizing

Input:

"Our AKS cluster runs 40 nodes at 30% average CPU. How do we right-size it?"

Output:

30% average CPU across 40 nodes means roughly 28 nodes' worth of CPU sits idle. Approach:

# 1. Check actual resource requests vs allocatable capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

# 2. Find nodes with low utilization (sorted descending, so the least-used nodes come last)
kubectl top nodes --sort-by=cpu | tail -20

# 3. Check autoscaler is not blocked from scaling down
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml | \
  grep -A 10 "scaleDown"

# 4. List pods on a candidate node; restrictive PodDisruptionBudgets, local storage, or bare pods block scale-down
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>

Right-sizing plan:

# Step 1: Set autoscaler bounds to the observed peak plus a 20% buffer
# --min-count 5: up from 1, prevents over-scale-down
# --max-count 25: down from 40, prevents runaway scale-out
az aks nodepool update \
  --cluster-name aks-prod \
  --resource-group rg-prod \
  --name apppool \
  --update-cluster-autoscaler \
  --min-count 5 \
  --max-count 25

# Step 2: Set scale-down behavior explicitly via the managed cluster autoscaler profile
# (scale-down is enabled by default; adjust these values to control how aggressively nodes are removed)
az aks update \
  --name aks-prod \
  --resource-group rg-prod \
  --cluster-autoscaler-profile \
    scale-down-delay-after-add=10m \
    scale-down-unneeded-time=10m \
    scale-down-utilization-threshold=0.5

# Step 3: Set resource requests on all deployments (autoscaler requires this)
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "2Gi"

Expected outcome: the cluster scales down toward 12-15 nodes over 20-30 minutes as the autoscaler identifies underutilized nodes. Estimated cost reduction: ~60% on node pool compute, from ~$8,000/mo to ~$3,200/mo for Standard_D8s_v5 nodes (illustrative figures; actual savings depend on region and pricing model).
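
The node estimate can be sanity-checked with simple arithmetic: scale the current node count by the ratio of average to target utilization. This is a rough sketch that assumes CPU is the binding resource and that requests track actual usage; `nodes_needed` and the target-utilization parameter are hypothetical.

```shell
# nodes_needed CURRENT_NODES AVG_UTIL_PCT TARGET_UTIL_PCT
# Integer ceiling of current * avg / target.
nodes_needed() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

nodes_needed 40 30 100   # prints: 12
nodes_needed 40 30 80    # prints: 15
nodes_needed 40 30 50    # prints: 24
```

The 12-15 range corresponds to running nodes at 80-100% CPU; a more conservative 50% target lands at 24 nodes, so the achievable savings depend on how much headroom the workload needs.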