Azure Specialist Examples
Externalized from the agent definition per the few-shot-examples rule (#1587).
Azure Specialist — Worked Examples
Externalized from the agent definition per the few-shot-examples rule (#1587).
Your Process — Sample Commands and Templates
1. Subscription and RBAC Structure
# List management groups and subscription hierarchy
az account management-group list --query '[*].{Name:displayName,ID:id}' --output table
# Check role assignments at subscription scope
az role assignment list \
--scope /subscriptions/$(az account show --query id -o tsv) \
--query '[*].{Role:roleDefinitionName,Principal:principalName,Type:principalType}' \
--output table
# Create resource group with required tags
az group create \
--name rg-production-eastus \
--location eastus \
--tags environment=production cost-center=platform owner=platform-team
2. Bicep Infrastructure Templates
// main.bicep — parameterized deployment
targetScope = 'resourceGroup'
@description('Environment name for resource naming')
@allowed(['dev', 'staging', 'prod'])
param environment string
@description('Primary location for all resources')
param location string = resourceGroup().location
@description('Application name used in resource naming')
param appName string
var prefix = '${appName}-${environment}'
var tags = {
environment: environment
application: appName
managedBy: 'bicep'
}
// App Service Plan — Premium for VNet integration
resource appServicePlan 'Microsoft.Web/serverfarms@2023-01-01' = {
name: 'plan-${prefix}'
location: location
tags: tags
sku: {
name: environment == 'prod' ? 'P2v3' : 'B2'
tier: environment == 'prod' ? 'PremiumV3' : 'Basic'
}
properties: {
reserved: true // Linux
zoneRedundant: environment == 'prod' // Zone redundancy in prod
}
}
// Key Vault with RBAC authorization (preferred over access policies)
resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' = {
name: 'kv-${prefix}-${uniqueString(resourceGroup().id)}'
location: location
tags: tags
properties: {
sku: {
family: 'A'
name: 'standard'
}
tenantId: subscription().tenantId
enableRbacAuthorization: true
enableSoftDelete: true
softDeleteRetentionInDays: environment == 'prod' ? 90 : 7
enablePurgeProtection: environment == 'prod'
publicNetworkAccess: 'Disabled'
networkAcls: {
defaultAction: 'Deny'
bypass: 'AzureServices'
}
}
}
output keyVaultName string = keyVault.name
output appServicePlanId string = appServicePlan.id
# Deploy with parameter file
az deployment group create \
--resource-group rg-production-eastus \
--template-file main.bicep \
--parameters @parameters/prod.json \
--what-if # Preview changes before applying
# Lint and validate before deploying
az bicep lint --file main.bicep
az deployment group validate \
--resource-group rg-production-eastus \
--template-file main.bicep \
--parameters @parameters/prod.json
3. Azure Functions Optimization
// Azure Functions with Premium plan for no cold starts and VNet
resource functionApp 'Microsoft.Web/sites@2023-01-01' = {
name: 'func-${prefix}'
location: location
tags: tags
kind: 'functionapp,linux'
identity: {
type: 'SystemAssigned' // Use managed identity — no stored credentials
}
properties: {
serverFarmId: appServicePlan.id
httpsOnly: true
siteConfig: {
linuxFxVersion: 'Python|3.12'
ftpsState: 'Disabled'
minTlsVersion: '1.2'
appSettings: [
{
name: 'FUNCTIONS_EXTENSION_VERSION'
value: '~4'
}
{
name: 'FUNCTIONS_WORKER_RUNTIME'
value: 'python'
}
{
name: 'WEBSITE_RUN_FROM_PACKAGE'
value: '1'
}
{
name: 'APPLICATIONINSIGHTS_CONNECTION_STRING'
value: appInsights.properties.ConnectionString
}
]
preWarmedInstanceCount: 1 // Pre-warm to reduce cold starts on Premium plan
}
virtualNetworkSubnetId: subnet.id // VNet integration for private backend access
}
}
// KEDA-based scaling rule for queue-triggered functions
resource scaleSettings 'Microsoft.Web/sites/config@2023-01-01' = {
parent: functionApp
name: 'web'
properties: {
functionAppScaleLimit: 100 // Cap maximum scale-out
minimumElasticInstanceCount: 2 // Always-warm baseline
}
}
# Check function app scaling events
az monitor activity-log list \
--resource-group rg-production-eastus \
--resource-type Microsoft.Web/sites \
--query '[?operationName.value==`Microsoft.Web/sites/instances/write`].[eventTimestamp,description]' \
--output table
# View function metrics: execution count, failures, duration
az monitor metrics list \
--resource $(az functionapp show --name func-myapp-prod --resource-group rg-production-eastus --query id -o tsv) \
--metric FunctionExecutionCount FunctionExecutionUnits \
--interval PT1H \
--output table
4. Cosmos DB Tuning
# Analyze current RU consumption and identify hot partitions
az cosmosdb sql container throughput show \
--account-name cosmos-myapp-prod \
--resource-group rg-production-eastus \
--database-name appdb \
--name orders \
--query '{Throughput:resource.throughput,AutoscaleMax:resource.autoscaleSettings.maxThroughput}'
# Check partition key distribution via metrics
az monitor metrics list \
--resource $(az cosmosdb show --name cosmos-myapp-prod --resource-group rg-production-eastus --query id -o tsv) \
--metric NormalizedRUConsumption \
--dimension DatabaseName CollectionName PartitionKeyRangeId \
--interval PT1M \
--output table
// Cosmos DB with autoscale and optimized indexing
resource cosmosAccount 'Microsoft.DocumentDB/databaseAccounts@2024-02-15-preview' = {
name: 'cosmos-${prefix}'
location: location
tags: tags
kind: 'GlobalDocumentDB'
properties: {
consistencyPolicy: {
defaultConsistencyLevel: 'Session' // Balanced consistency for most apps
}
locations: [
{
locationName: location
failoverPriority: 0
isZoneRedundant: environment == 'prod'
}
]
enableAutomaticFailover: true
enableMultipleWriteLocations: false
backupPolicy: {
type: environment == 'prod' ? 'Continuous' : 'Periodic'
}
}
}
resource ordersContainer 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers@2024-02-15-preview' = {
// Parent database omitted for brevity
name: 'orders'
properties: {
resource: {
id: 'orders'
partitionKey: {
paths: ['/customerId'] // High-cardinality key avoids hot partitions
kind: 'Hash'
version: 2
}
indexingPolicy: {
indexingMode: 'consistent'
automatic: true
includedPaths: [
{ path: '/customerId/?' }
{ path: '/status/?' }
{ path: '/createdAt/?' }
]
excludedPaths: [
{ path: '/largePayload/*' } // Exclude large blobs from index
{ path: '/"_etag"/?' }
]
}
defaultTtl: -1 // Explicit TTL on items that should expire
}
options: {
autoscaleSettings: {
maxThroughput: 10000 // Autoscale 1000-10000 RU/s based on demand
}
}
}
}
5. AKS Cluster Management
# Check node pool utilization and autoscaler decisions
az aks nodepool list \
--cluster-name aks-myapp-prod \
--resource-group rg-production-eastus \
--query '[*].{Name:name,VMSize:vmSize,Count:count,MinCount:minCount,MaxCount:maxCount,Mode:mode}' \
--output table
# View cluster autoscaler logs
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
# Check for pending pods that triggered scale-out
kubectl get events --field-selector reason=TriggeredScaleUp --sort-by='.lastTimestamp'
// AKS cluster with system and user node pools
resource aksCluster 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
name: 'aks-${prefix}'
location: location
tags: tags
identity: {
type: 'SystemAssigned'
}
properties: {
kubernetesVersion: '1.29'
dnsPrefix: 'aks-${prefix}'
agentPoolProfiles: [
{
name: 'system'
count: 3
minCount: 3
maxCount: 5
vmSize: 'Standard_D4s_v5'
osType: 'Linux'
mode: 'System'
availabilityZones: ['1', '2', '3']
enableAutoScaling: true
nodeTaints: ['CriticalAddonsOnly=true:NoSchedule']
}
{
name: 'apppool'
count: 2
minCount: 1
maxCount: 20
vmSize: 'Standard_D8s_v5'
osType: 'Linux'
mode: 'User'
availabilityZones: ['1', '2', '3']
enableAutoScaling: true
nodeLabels: {
'workload-type': 'application'
}
}
]
networkProfile: {
networkPlugin: 'azure'
networkPolicy: 'calico'
loadBalancerSku: 'standard'
}
oidcIssuerProfile: {
enabled: true // Required for Workload Identity
}
securityProfile: {
workloadIdentity: {
enabled: true // Replace pod-level secrets with managed identity
}
imageCleaner: {
enabled: true
intervalHours: 24
}
}
addonProfiles: {
omsagent: {
enabled: true
config: {
logAnalyticsWorkspaceResourceID: logAnalyticsWorkspace.id
}
}
azurepolicy: {
enabled: true
}
}
}
}
6. Azure Monitor and Alerts
# Create action group for alert routing
az monitor action-group create \
--name ag-platform-critical \
--resource-group rg-production-eastus \
--short-name platform \
--email-receivers name=oncall [email protected]
# Create metric alert for function failure rate
az monitor metrics alert create \
--name "func-failure-rate-high" \
--resource-group rg-production-eastus \
--scopes $(az functionapp show --name func-myapp-prod --resource-group rg-production-eastus --query id -o tsv) \
--condition "avg FunctionExecutionUnits > 0 and avg Failures > 5" \
--window-size 5m \
--evaluation-frequency 1m \
--action ag-platform-critical \
--severity 1
Few-Shot Examples
Example 1: Bicep Template Review
Input:
"Review this Bicep template for a storage account. We're getting deployment errors and the security team flagged it."
resource storageAccount 'Microsoft.Storage/storageAccounts@2021-04-01' = {
name: 'mystorage'
location: 'eastus'
sku: { name: 'Standard_LRS' }
kind: 'StorageV2'
properties: {
allowBlobPublicAccess: true
supportsHttpsTrafficOnly: false
minimumTlsVersion: 'TLS1_0'
}
}
Output:
Four issues found — two security-critical, one reliability, one naming:
// Fixed storage account module
@description('Storage account name — must be globally unique, 3-24 lowercase alphanumeric')
param storageAccountName string = 'st${uniqueString(resourceGroup().id)}'
@description('Location for the storage account')
param location string = resourceGroup().location
@description('SKU — use ZRS or GZRS in production for zone resilience')
@allowed(['Standard_LRS', 'Standard_ZRS', 'Standard_GRS', 'Standard_GZRS'])
param skuName string = 'Standard_ZRS' // CHANGE 1: ZRS for zone-redundant prod storage
resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = { // Updated API version
name: storageAccountName // CHANGE 2: parameterized, unique name (avoids hardcode conflict)
location: location
sku: { name: skuName }
kind: 'StorageV2'
properties: {
allowBlobPublicAccess: false // CHANGE 3: block public access — security critical
supportsHttpsTrafficOnly: true // CHANGE 4: enforce HTTPS — security critical
minimumTlsVersion: 'TLS1_2' // CHANGE 5: require TLS 1.2+
allowSharedKeyAccess: false // CHANGE 6: force Entra ID auth; disable shared keys
networkAcls: {
defaultAction: 'Deny' // CHANGE 7: deny-by-default network access
bypass: 'AzureServices'
}
}
}
output storageAccountId string = storageAccount.id
output storageAccountName string = storageAccount.name
Summary of changes:
1. `allowBlobPublicAccess: false` — prevents anonymous internet access to blobs
2. `supportsHttpsTrafficOnly: true` — blocks HTTP connections
3. `minimumTlsVersion: 'TLS1_2'` — removes support for vulnerable TLS 1.0/1.1
4. `allowSharedKeyAccess: false` — requires Entra ID auth, enables access logging per identity
5. Unique name via `uniqueString()` — eliminates the hardcoded name that caused deployment conflicts
6. Parameterized SKU — defaults to ZRS for zone resilience; LRS was a single-datacenter failure risk
Example 2: Cosmos DB RU Optimization
Input:
"We're getting 429 throttling errors on our Cosmos DB orders container. It's provisioned at 4000 RU/s. What's wrong and how do we fix it?"
Output:
429s mean you're hitting the provisioned RU ceiling. Diagnose root cause before scaling:
# 1. Check normalized RU consumption — should stay below 100%
az monitor metrics list \
--resource /subscriptions/<sub>/resourceGroups/rg-prod/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-prod \
--metric NormalizedRUConsumption \
--dimension CollectionName \
--interval PT1M \
--start-time $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
--output table
# 2. Check if a single partition key range is hot
az monitor metrics list \
--resource /subscriptions/<sub>/resourceGroups/rg-prod/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-prod \
--metric NormalizedRUConsumption \
--dimension CollectionName PartitionKeyRangeId \
--interval PT1M \
--output table
# 3. Identify expensive queries in Diagnostics Logs
# (Requires diagnostic settings routing to Log Analytics)
// Log Analytics: Find top RU-consuming queries
AzureDiagnostics
| where Category == "DataPlaneRequests"
| where collectionName_s == "orders"
| where statusCode_s !in ("200", "201", "204")
| summarize
total_requests = count(),
avg_ru = avg(todouble(requestCharge_s)),
max_ru = max(todouble(requestCharge_s))
by activityId_g, queryText_s
| order by avg_ru desc
| take 20
Common causes and fixes:
Hot partition — all writes go to one partition key value (e.g., `status='pending'`):
// Wrong: low-cardinality partition key
partitionKey: { paths: ['/status'] }
// Right: high-cardinality key distributes load evenly
partitionKey: { paths: ['/customerId'] }
Missing index — a filter on an unindexed path causes a full partition scan costing 100s of RUs:
// Add to indexingPolicy.includedPaths
{ "path": "/createdAt/?" },
{ "path": "/status/?" }
Autoscale prevents future throttling — if load is variable, switch from manual to autoscale:
az cosmosdb sql container throughput migrate \
--account-name cosmos-prod \
--resource-group rg-prod \
--database-name appdb \
--name orders \
--throughput-type autoscale
az cosmosdb sql container throughput update \
--account-name cosmos-prod \
--resource-group rg-prod \
--database-name appdb \
--name orders \
--max-throughput 20000 # Autoscale 2000-20000 RU/s
Example 3: AKS Cluster Right-Sizing
Input:
"Our AKS cluster runs 40 nodes at 30% average CPU. How do we right-size it?"
Output:
30% average CPU with 40 nodes means you're spending for ~27 idle nodes. Approach:
# 1. Check actual resource requests vs allocatable capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
# 2. Find nodes with low utilization
kubectl top nodes --sort-by=cpu | head -20
# 3. Check autoscaler is not blocked from scaling down
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml | \
grep -A 10 "scaleDown"
# 4. List pods preventing scale-down (no PodDisruptionBudget, local storage, etc.)
kubectl get pods -A -o wide | grep <node-name>
Right-sizing plan:
# Step 1: Reduce max-count to current observed peak + 20% buffer
az aks nodepool update \
--cluster-name aks-prod \
--resource-group rg-prod \
--name apppool \
--min-count 5 \ # Up from 1 — prevents over-scale-down
--max-count 25 # Down from 40 — prevents runaway scale-out
# Step 2: Enable scale-down (often disabled by default)
# Edit cluster-autoscaler deployment or use managed profile
az aks update \
--name aks-prod \
--resource-group rg-prod \
--cluster-autoscaler-profile \
scale-down-delay-after-add=10m \
scale-down-unneeded-time=10m \
scale-down-utilization-threshold=0.5
# Step 3: Set resource requests on all deployments (autoscaler requires this)
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "2000m"
memory: "2Gi"
Expected outcome: Cluster scales down to 12-15 nodes over 20-30 minutes as autoscaler identifies underutilized nodes. Cost reduction: ~60% on node pool compute, from ~$8,000/mo to ~$3,200/mo for Standard_D8s_v5 nodes.