
afrexai-observability-engine


stars: 1,933 · forks: 367 · updated: March 4, 2026
SKILL.md Frontmatter

name: afrexai-observability-engine
model: standard
description: >
  Complete observability & reliability engineering system. Use when designing
  monitoring, implementing structured logging, setting up distributed tracing,
  building alerting systems, creating SLO/SLI frameworks, running incident
  response, conducting post-mortems, or auditing system reliability. Covers all
  three pillars (logs/metrics/traces), alert design, dashboard architecture,
  on-call operations, chaos engineering, and cost optimization.
version: 1.0.0
tags: observability, monitoring, logging, tracing, alerting, SRE, incident-response, SLO, metrics, devops, reliability, on-call, post-mortem, dashboards

Observability & Reliability Engineering

Complete system for building observable, reliable services — from structured logging to incident response to SLO-driven development.


Quick Health Check (/16)

Score your current observability posture:

| Signal | Healthy (2) | Weak (1) | Missing (0) |
|--------|-------------|----------|-------------|
| Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements |
| Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics |
| Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing |
| Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise |
| Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" |
| SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets |
| On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call |
| Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend |

- 12-16: Production-grade. Focus on optimization.
- 8-11: Foundation exists. Fill the gaps systematically.
- 4-7: Significant risk. Prioritize alerting + incident response.
- 0-3: Flying blind. Start with Phase 1 immediately.
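The scoring bands above can be applied mechanically. A minimal Python sketch, where the signal keys and the sample ratings are illustrative, not part of the skill:

```python
# Score bands from the table: (minimum score, verdict).
BANDS = [
    (12, "Production-grade. Focus on optimization."),
    (8, "Foundation exists. Fill the gaps systematically."),
    (4, "Significant risk. Prioritize alerting + incident response."),
    (0, "Flying blind. Start with Phase 1 immediately."),
]

def assess(ratings: dict[str, int]) -> tuple[int, str]:
    """ratings maps each of the 8 signals to 2 (healthy), 1 (weak), or 0 (missing)."""
    score = sum(ratings.values())
    for floor, verdict in BANDS:        # floor 0 always matches, so this returns
        if score >= floor:
            return score, verdict

score, verdict = assess({
    "structured_logging": 2, "metrics": 2, "tracing": 1, "alerting": 1,
    "incident_response": 1, "slos": 0, "on_call": 1, "cost": 0,
})
print(score, verdict)
```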


Phase 1: Structured Logging

Log Architecture

Application → Structured JSON → Log Router → Storage → Query Engine
                                    ↓
                              Alert Pipeline

Required Fields (Every Log Line)

| Field | Type | Purpose | Example |
|-------|------|---------|---------|
| timestamp | ISO-8601 UTC | When | 2026-02-22T18:30:00.123Z |
| level | enum | Severity | info, warn, error, fatal |
| service | string | Which service | payment-api |
| version | string | Which deploy | v2.3.1 |
| environment | string | Which env | production |
| message | string | What happened | Payment processed successfully |
| trace_id | string | Request correlation | abc123def456 |
| span_id | string | Operation within trace | span_789 |
| duration_ms | number | How long | 142 |
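A stdlib-only sketch of a formatter that emits these required fields as one JSON object per line. The `JsonFormatter` name and the service/version values are placeholders; real deployments would use the language-specific loggers shown later in this phase.

```python
import json
import logging
import time
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line carrying the required fields."""

    def __init__(self, service: str, version: str, environment: str):
        super().__init__()
        self.static = {"service": service, "version": version, "environment": environment}

    def format(self, record: logging.LogRecord) -> str:
        line = {
            # ISO-8601 UTC with millisecond precision
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created))
                         + f".{int(record.msecs):03d}Z",
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            **self.static,
            # span_id / duration_ms attach the same way via `extra={...}`
            "trace_id": getattr(record, "trace_id", None) or uuid.uuid4().hex,
        }
        return json.dumps(line)

logger = logging.getLogger("payment-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("payment-api", "v2.3.1", "production"))
logger.addHandler(handler)
logger.warning("Payment processed", extra={"trace_id": "abc123def456"})
```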

Contextual Fields (Add Per Domain)

# HTTP request context
http:
  method: POST
  path: /api/v1/orders
  status: 201
  client_ip: 203.0.113.42  # Anonymize in logs if needed
  user_agent: "Mozilla/5.0..."
  request_id: "req_abc123"

# Business context
business:
  user_id: "usr_456"
  tenant_id: "tenant_789"
  order_id: "ord_012"
  action: "checkout"
  amount_cents: 4999
  currency: "USD"

# Error context
error:
  type: "PaymentDeclinedError"
  message: "Card declined: insufficient funds"
  code: "CARD_DECLINED"
  stack: "..." # Only in non-production or DEBUG level
  retry_count: 2
  retryable: true

Log Level Decision Tree

Is the process about to crash?
  → FATAL (exit after logging)

Did an operation fail that needs human attention?
  → ERROR (page someone or create ticket)

Did something unexpected happen but we recovered?
  → WARN (review in daily triage)

Is this a normal business event worth recording?
  → INFO (audit trail, business metrics)

Is this useful for debugging but noisy in production?
  → DEBUG (off in prod, on in staging)

Is this only useful when stepping through code?
  → TRACE (never in production)

Log Level Rules

  1. ERROR means action required — if no one needs to act on it, it's WARN
  2. INFO is for business events — not internal implementation details
  3. No logging inside tight loops — aggregate and log summary
  4. Log at boundaries — API entry/exit, queue consume/publish, DB calls
  5. Never log secrets — API keys, tokens, passwords, PII (see scrubbing below)

PII & Secret Scrubbing

scrub_patterns:
  # Always redact
  - field_patterns: ["password", "secret", "token", "api_key", "authorization"]
    action: replace_with_redacted
  
  # Hash for correlation without exposure
  - field_patterns: ["email", "phone", "ssn", "national_id"]
    action: sha256_hash
  
  # Mask partially
  - field_patterns: ["credit_card", "card_number"]
    action: mask_last_4  # "****-****-****-1234"
  
  # IP anonymization
  - field_patterns: ["client_ip", "ip_address"]
    action: zero_last_octet  # 203.0.113.0
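The four scrub actions above can be implemented in a few lines. A stdlib-only sketch assuming flat log events; `scrub` and the field sets are illustrative names, and the substring-vs-exact matching choices are assumptions to tune:

```python
import hashlib
import ipaddress
import re

REDACT = {"password", "secret", "token", "api_key", "authorization"}  # substring match
HASH = {"email", "phone", "ssn", "national_id"}                       # exact match
MASK = {"credit_card", "card_number"}
ANON_IP = {"client_ip", "ip_address"}

def scrub(event: dict) -> dict:
    """Apply the scrub_patterns config to a flat log event (sketch)."""
    out = {}
    for key, value in event.items():
        k = key.lower()
        if any(p in k for p in REDACT):
            out[key] = "[REDACTED]"
        elif k in HASH:
            # Stable hash keeps correlation without exposing the value
            out[key] = hashlib.sha256(str(value).encode()).hexdigest()
        elif k in MASK:
            digits = re.sub(r"\D", "", str(value))
            out[key] = "****-****-****-" + digits[-4:]
        elif k in ANON_IP:
            ip = ipaddress.ip_address(value)
            # Zero the last octet for IPv4 (e.g. 203.0.113.42 -> 203.0.113.0)
            out[key] = str(ipaddress.ip_address(int(ip) & ~0xFF)) if ip.version == 4 else str(ip)
        else:
            out[key] = value
    return out
```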

Logger Setup (By Language)

Node.js (Pino):

import pino from 'pino';
import { randomUUID } from 'node:crypto';
import { AsyncLocalStorage } from 'node:async_hooks';

const als = new AsyncLocalStorage<Record<string, string>>();

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  // Merge per-request context (trace_id, request_id, ...) into every line
  mixin: () => als.getStore() ?? {},
  redact: ['req.headers.authorization', '*.password', '*.token'],
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Middleware: inject context (assumes an Express-style `app`)
app.use((req, res, next) => {
  const traceHeader = req.headers['x-trace-id'];
  const ctx = {
    trace_id: (Array.isArray(traceHeader) ? traceHeader[0] : traceHeader) ?? randomUUID(),
    request_id: randomUUID(),
    service: 'payment-api',
    version: process.env.APP_VERSION,
  };
  als.run(ctx, () => next());
});

Python (structlog):

import structlog
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ],
)
log = structlog.get_logger()
# Bind context per-request:
structlog.contextvars.bind_contextvars(trace_id=trace_id, user_id=user_id)

Go (zerolog):

log := zerolog.New(os.Stdout).With().
    Timestamp().
    Str("service", "payment-api").
    Str("version", version).
    Logger()
// Per-request:
reqLog := log.With().Str("trace_id", traceID).Logger()

Log Storage Decision

| Volume | Solution | Retention | Cost |
|--------|----------|-----------|------|
| <10 GB/day | Loki + Grafana | 30 days hot, 90 days cold | Low |
| 10-100 GB/day | Elasticsearch / OpenSearch | 14 days hot, 90 days S3 | Medium |
| 100+ GB/day | ClickHouse or Datadog | 7 days hot, 30 days archive | High |
| Budget-constrained | Loki + S3 backend | 90 days all cold | Very low |

10 Logging Anti-Patterns

| # | Anti-Pattern | Fix |
|---|--------------|-----|
| 1 | log.error(err) with no context | Always include: what operation, what input, what state |
| 2 | Logging request/response bodies | Log only in DEBUG; redact sensitive fields |
| 3 | String concatenation in log messages | Use structured fields: log.info("processed", { order_id, amount }) |
| 4 | Catch-and-log-and-rethrow | Log at the boundary where you handle it, not every layer |
| 5 | Different log formats per service | Standardize schema across all services |
| 6 | No log rotation / retention policy | Set max size + TTL; archive to cold storage |
| 7 | Logging inside hot paths | Aggregate: log summary every N items or every interval |
| 8 | Missing correlation IDs | Propagate trace_id from first entry point through all services |
| 9 | Boolean log levels (verbose: true) | Use standard levels with configurable minimum |
| 10 | Logging PII in plain text | Implement scrubbing at the logger level |

Phase 2: Metrics Collection

The RED Method (Request-Driven Services)

For every service endpoint, track:

| Metric | What | Prometheus Example |
|--------|------|--------------------|
| Rate | Requests per second | http_requests_total{method, path, status} |
| Errors | Failed requests per second | http_requests_total{status=~"5.."} / total |
| Duration | Latency distribution | http_request_duration_seconds{method, path} (histogram) |

The USE Method (Infrastructure Resources)

For every resource (CPU, memory, disk, network):

| Metric | What | Example |
|--------|------|---------|
| Utilization | % resource busy | CPU usage 78% |
| Saturation | Queue depth / backpressure | 12 requests queued |
| Errors | Resource errors | 3 disk I/O errors |

Golden Signals (Google SRE)

| Signal | Meaning | Source |
|--------|---------|--------|
| Latency | Time to serve requests | RED Duration |
| Traffic | Demand on the system | RED Rate |
| Errors | Rate of failed requests | RED Errors |
| Saturation | How "full" the service is | USE Saturation |

Metric Types & When to Use Each

| Type | Use Case | Example |
|------|----------|---------|
| Counter | Things that only go up | Total requests, errors, bytes sent |
| Gauge | Current value that goes up/down | Active connections, queue depth, temperature |
| Histogram | Distribution of values | Request latency, response size |
| Summary | Pre-calculated percentiles | Client-side latency (when you need exact percentiles) |

Rule: Use histograms over summaries in most cases — they're aggregatable across instances.
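Why "aggregatable": cumulative bucket counts from several instances can simply be summed, then a quantile estimated from the merged buckets, which is impossible with pre-computed percentiles. A sketch of the linear interpolation that PromQL's histogram_quantile applies; the bucket bounds and counts are illustrative:

```python
def merge(*histograms: dict[float, int]) -> dict[float, int]:
    """Sum cumulative bucket counts across instances (the aggregation step)."""
    merged: dict[float, int] = {}
    for h in histograms:
        for le, count in h.items():
            merged[le] = merged.get(le, 0) + count
    return merged

def quantile(q: float, buckets: dict[float, int]) -> float:
    """Linear-interpolation estimate over cumulative buckets (sketch)."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]          # last bucket holds the cumulative total
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for le in bounds:
        if buckets[le] >= rank:
            width = le - prev_bound
            inside = buckets[le] - prev_count
            return prev_bound + width * (rank - prev_count) / inside
        prev_bound, prev_count = le, buckets[le]
    return bounds[-1]

# Two instances' latency buckets (upper bound in seconds -> cumulative count):
a = {0.1: 80, 0.5: 95, 2.0: 100}
b = {0.1: 60, 0.5: 90, 2.0: 100}
p99 = quantile(0.99, merge(a, b))   # estimated from the merged distribution
```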

Naming Conventions

# Pattern: <namespace>_<subsystem>_<name>_<unit>
http_server_request_duration_seconds
http_server_requests_total
db_pool_connections_active
queue_messages_pending
cache_hit_ratio

# Rules:
# 1. Use snake_case
# 2. Include unit suffix (_seconds, _bytes, _total)
# 3. _total suffix for counters
# 4. Don't include label names in metric name
# 5. Use base units (seconds not milliseconds, bytes not kilobytes)
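These rules are easy to lint in CI. A sketch; `lint_metric_name` and the unit-suffix list are assumptions to adapt, not an exhaustive rule set:

```python
import re

SNAKE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
UNITS = ("_seconds", "_bytes", "_total", "_ratio", "_percent", "_celsius")

def lint_metric_name(name: str, is_counter: bool = False) -> list[str]:
    """Return a list of naming-rule violations (empty list means clean)."""
    problems = []
    if not SNAKE.match(name):
        problems.append("use snake_case")
    if not name.endswith(UNITS):
        problems.append("missing unit suffix (_seconds, _bytes, _total, ...)")
    if is_counter and not name.endswith("_total"):
        problems.append("counters need the _total suffix")
    if name.endswith(("_milliseconds", "_kilobytes")):
        problems.append("use base units (seconds, bytes)")
    return problems
```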

Label Design Rules

| Rule | Why | Example |
|------|-----|---------|
| Keep cardinality <100 per label | High cardinality kills performance | status="200" not status="200 OK" |
| No user IDs as labels | Unbounded cardinality | Use log correlation instead |
| No request paths with IDs | /api/users/123 creates millions of series | Normalize: /api/users/:id |
| Max 5-7 labels per metric | Each combo = a time series | {method, path, status, service} |
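Path normalization for labels can be sketched with a regex that collapses high-cardinality segments: numeric IDs, UUIDs, and 24-char hex ObjectIds here, as illustrative patterns to tune per API:

```python
import re

# Match a whole path segment that is a numeric ID, a UUID, or 24 hex chars.
ID_SEGMENT = re.compile(
    r"(?<=/)(\d+"
    r"|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
    r"|[0-9a-f]{24})(?=/|$)"
)

def normalize_path(path: str) -> str:
    """Collapse ID-like segments so the path is safe to use as a label."""
    return ID_SEGMENT.sub(":id", path)
```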

Instrumentation Checklist

application_metrics:
  # HTTP layer
  - http_request_duration_seconds: histogram {method, path, status}
  - http_request_size_bytes: histogram {method, path}
  - http_response_size_bytes: histogram {method, path}
  - http_requests_in_flight: gauge
  
  # Business logic
  - orders_processed_total: counter {status, payment_method}
  - order_value_dollars: histogram {payment_method}
  - user_signups_total: counter {source}
  
  # Dependencies
  - db_query_duration_seconds: histogram {query_type, table}
  - db_connections_active: gauge {pool}
  - db_connections_idle: gauge {pool}
  - cache_requests_total: counter {result: hit|miss}
  - external_api_duration_seconds: histogram {service, endpoint}
  - external_api_errors_total: counter {service, error_type}
  
  # Queue / async
  - queue_messages_published_total: counter {queue}
  - queue_messages_consumed_total: counter {queue, status}
  - queue_processing_duration_seconds: histogram {queue}
  - queue_depth: gauge {queue}
  - queue_consumer_lag: gauge {queue, consumer_group}

infrastructure_metrics:
  # Node exporter / cAdvisor provides these automatically
  - cpu_usage_percent: gauge {instance}
  - memory_usage_bytes: gauge {instance}
  - disk_usage_bytes: gauge {instance, mount}
  - disk_io_seconds: counter {instance, device}
  - network_bytes: counter {instance, direction}
  - container_cpu_usage: gauge {pod, container}
  - container_memory_usage: gauge {pod, container}

Stack Recommendations

| Component | Options | Recommendation |
|-----------|---------|----------------|
| Collection | Prometheus, OTEL Collector, Datadog Agent | Prometheus (free) or OTEL Collector (vendor-neutral) |
| Storage | Prometheus, Thanos, Mimir, VictoriaMetrics | VictoriaMetrics (best cost/perf) or Mimir (Grafana ecosystem) |
| Visualization | Grafana, Datadog, New Relic | Grafana (free, extensible) |
| Alerting | Alertmanager, Grafana Alerting, PagerDuty | Alertmanager + PagerDuty routing |

Phase 3: Distributed Tracing

Trace Architecture

Client Request
  → API Gateway (root span)
    → Auth Service (child span)
    → Order Service (child span)
      → Database Query (child span)
      → Payment Service (child span)
        → Stripe API (child span)
    → Notification Service (child span)
      → Email Provider (child span)

OpenTelemetry Setup

Auto-instrumentation (Node.js):

// tracing.ts — import BEFORE anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-http': { ignoreIncomingPaths: ['/health', '/ready'] },
    '@opentelemetry/instrumentation-express': { enabled: true },
  })],
  serviceName: process.env.OTEL_SERVICE_NAME || 'payment-api',
});
sdk.start();

Custom spans for business logic:

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(order: Order) {
  return tracer.startActiveSpan('process-payment', async (span) => {
    span.setAttributes({
      'order.id': order.id,
      'order.amount_cents': order.amountCents,
      'payment.method': order.paymentMethod,
    });
    try {
      const result = await chargeCard(order);
      span.setAttributes({ 'payment.status': result.status });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}

Sampling Strategies

| Strategy | When | Config |
|----------|------|--------|
| Always On | Dev/staging, low traffic (<100 rps) | ratio: 1.0 |
| Probabilistic | Moderate traffic (100-1000 rps) | ratio: 0.1 (10%) |
| Rate-limited | High traffic (>1000 rps) | max_traces_per_second: 100 |
| Tail-based | Want all errors + slow requests | Collector-side: keep if error OR duration > p99 |
| Parent-based | Respect upstream decisions | If parent sampled, child sampled |

Recommendation: Start with parent-based + probabilistic (10%). Add tail-based at the collector to capture all errors.

Context Propagation

| Header | Standard | Format |
|--------|----------|--------|
| traceparent | W3C Trace Context | 00-{trace_id}-{span_id}-{flags} |
| tracestate | W3C Trace Context | Vendor-specific key-value pairs |
| b3 | Zipkin B3 | {trace_id}-{span_id}-{sampled} |

Rule: Use W3C Trace Context (traceparent) as primary. Support B3 for legacy Zipkin systems.

Trace Storage

| Volume | Solution | Retention |
|--------|----------|-----------|
| <50 GB/day | Jaeger + Elasticsearch | 7 days |
| 50-500 GB/day | Tempo + S3 | 14 days |
| 500+ GB/day | Tempo + S3 with aggressive sampling | 7 days |
| Budget-constrained | Jaeger + Badger (local disk) | 3 days |

Phase 4: SLOs, SLIs & Error Budgets

SLI Selection by Service Type

| Service Type | Primary SLI | Secondary SLI | Measurement |
|--------------|-------------|---------------|-------------|
| API / Web | Availability + Latency | Error rate | Server-side + synthetic |
| Data pipeline | Freshness + Correctness | Throughput | Pipeline timestamps + checksums |
| Storage | Durability + Availability | Latency | Checksums + uptime monitoring |
| Streaming | Throughput + Latency | Message loss rate | Consumer lag + e2e latency |
| Batch jobs | Success rate + Freshness | Duration | Job scheduler metrics |

SLO Definition Template

slo:
  name: "Payment API Availability"
  service: payment-api
  owner: payments-team
  
  sli:
    type: availability
    definition: "Proportion of non-5xx responses"
    measurement: |
      sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
      /
      sum(rate(http_requests_total{service="payment-api"}[5m]))
    
  target: 99.95%  # 21.9 min downtime/month
  window: rolling_30d
  
  error_budget:
    total_minutes: 21.9  # per 30 days
    burn_rate_alerts:
      - severity: critical
        burn_rate: 14.4x  # Budget consumed in ~2 days
        short_window: 5m
        long_window: 1h
      - severity: warning
        burn_rate: 6x    # Budget consumed in 5 days
        short_window: 30m
        long_window: 6h
      - severity: ticket
        burn_rate: 1x    # Budget consumed in 30 days
        short_window: 6h
        long_window: 3d
  
  consequences:
    budget_remaining_above_50pct: "Normal development velocity"
    budget_remaining_20_to_50pct: "Prioritize reliability work"
    budget_remaining_below_20pct: "Feature freeze; reliability only"
    budget_exhausted: "All hands on reliability until budget recovers"
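The burn-rate tiers follow from simple arithmetic: burn rate is the observed error rate divided by the budgeted error rate, and a sustained burn of N exhausts a 30-day budget in 30/N days. A sketch with illustrative values:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than budget we are burning."""
    return error_rate / (1 - slo_target)

def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """A sustained burn rate of N empties a window_days budget in window/N days."""
    return window_days / rate

# For a 99.95% SLO, a sustained 0.72% error rate is a 14.4x burn,
# exhausting the monthly budget in roughly two days:
r = burn_rate(0.0072, 0.9995)
d = days_to_exhaustion(r)
```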

Common SLO Targets

| Service Tier | Availability | p50 Latency | p99 Latency | Monthly Downtime |
|--------------|--------------|-------------|-------------|------------------|
| Tier 0 (payments, auth) | 99.99% | <100ms | <500ms | 4.3 min |
| Tier 1 (core API) | 99.95% | <200ms | <1s | 21.9 min |
| Tier 2 (non-critical) | 99.9% | <500ms | <2s | 43.8 min |
| Tier 3 (internal tools) | 99.5% | <1s | <5s | 3.6 hours |
| Batch / pipeline | 99% (success rate) | N/A | N/A | N/A |
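The Monthly Downtime column assumes an average calendar month (365.25 / 12 ≈ 30.44 days). A few lines reproduce it:

```python
# Average calendar month in minutes: 365.25 / 12 days = 43,830 minutes.
MONTH_MINUTES = 365.25 / 12 * 24 * 60

def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per average month for a given availability target."""
    return (1 - availability) * MONTH_MINUTES

for target in (0.9999, 0.9995, 0.999, 0.995):
    print(f"{target:.4%}: {downtime_budget_minutes(target):.1f} min/month")
```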

Error Budget Tracking

# Weekly error budget review template
error_budget_review:
  week: "2026-W08"
  service: payment-api
  slo_target: 99.95%
  
  budget:
    total_minutes_this_period: 21.9
    consumed_minutes: 8.2
    remaining_minutes: 13.7
    remaining_percent: 62.6%
    
  incidents_consuming_budget:
    - date: "2026-02-18"
      duration_minutes: 5.1
      cause: "Database connection pool exhaustion"
      preventable: true
      action: "Increase pool size + add saturation alert"
    - date: "2026-02-20"
      duration_minutes: 3.1
      cause: "Upstream payment provider timeout"
      preventable: false
      action: "Add circuit breaker with fallback"
  
  velocity_decision: "Normal — 62.6% budget remaining"
  reliability_work_this_week:
    - "Add connection pool saturation alert"
    - "Implement circuit breaker for payment provider"

Phase 5: Alert Design

Alert Quality Principles

  1. Every alert must be actionable — if no one needs to act, it's not an alert
  2. Every alert needs a runbook — linked directly in the alert annotation
  3. Symptom-based over cause-based — alert on "users can't checkout" not "CPU high"
  4. Multi-window burn rate — not static thresholds (see SLO alerts above)
  5. Alert on absence, not just presence — "no orders in 15 min" catches silent failures

Alert Severity Levels

| Severity | Response Time | Channel | Who | Example |
|----------|---------------|---------|-----|---------|
| P0 — Critical | <5 min | Page (PagerDuty/Opsgenie) | On-call engineer | Payment system down |
| P1 — High | <30 min | Page during business hours, Slack 24/7 | On-call | Error rate >5% for 10 min |
| P2 — Medium | <4 hours | Slack channel | Team | p99 latency degraded 2x |
| P3 — Low | Next business day | Ticket auto-created | Team backlog | Disk usage >80% |
| Info | N/A | Dashboard only | No one | Deploy completed |

Alerting Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| Static CPU/memory thresholds | Noisy, not user-impacting | Use SLO-based burn rate alerts |
| Alert per instance | 50 instances = 50 alerts for same issue | Aggregate: alert on service-level error rate |
| No deduplication | Same alert fires 100 times | Group by service + alert name; set repeat interval |
| Missing runbook | Engineer gets paged, doesn't know what to do | Every alert links to a runbook |
| Threshold too sensitive | Fires on brief spikes | Use `for: 5m` to require a sustained condition |
| Too many P0s | Alert fatigue → ignoring real incidents | Audit monthly; demote or remove noisy alerts |

Alert Template (Prometheus Alertmanager)

groups:
  - name: payment-api-slo
    rules:
      - alert: PaymentAPIHighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="payment-api"}[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: critical
          service: payment-api
          team: payments
        annotations:
          summary: "Payment API error rate {{ $value | humanizePercentage }} (>1%)"
          description: "5xx error rate has exceeded 1% for 5 minutes"
          runbook: "https://wiki.internal/runbooks/payment-api-errors"
          dashboard: "https://grafana.internal/d/payment-api"
          
      - alert: PaymentAPINoTraffic
        expr: |
          sum(rate(http_requests_total{service="payment-api"}[15m])) == 0
        for: 5m
        labels:
          severity: critical
          service: payment-api
        annotations:
          summary: "Payment API receiving zero traffic for 5 minutes"
          runbook: "https://wiki.internal/runbooks/payment-api-no-traffic"

      - alert: PaymentAPILatencyHigh
        expr: |
          histogram_quantile(0.99, 
            sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m])) by (le)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Payment API p99 latency {{ $value }}s (>2s for 10min)"
          runbook: "https://wiki.internal/runbooks/payment-api-latency"

Runbook Template

# Runbook: PaymentAPIHighErrorRate

## What This Alert Means
The payment API is returning >1% 5xx errors over a 5-minute window.
Users are likely failing to complete checkouts.

## Impact
- Users cannot process payments
- Revenue loss: ~$X per minute (based on average traffic)
- SLO: Payment API availability (target: 99.95%)

## Immediate Actions
1. Check the error dashboard: [link]
2. Check recent deploys: `kubectl rollout history deployment/payment-api`
3. Check upstream dependencies:
   - Database: [dashboard link]
   - Stripe API: [status page]
   - Redis cache: [dashboard link]
4. Check application logs:

kubectl logs -l app=payment-api --since=10m | jq 'select(.level=="error")'


## Common Causes & Fixes
| Cause | Diagnosis | Fix |
|-------|-----------|-----|
| Bad deploy | Errors started at deploy time | `kubectl rollout undo deployment/payment-api` |
| DB connection exhaustion | `db_connections_active` at max | Restart pods (rolling) + increase pool size |
| Stripe outage | Stripe status page red | Enable fallback payment processor |
| Memory leak | Memory climbing, OOMKilled events | Rolling restart + investigate |

## Escalation
- If unresolved after 15 min: page payment team lead
- If revenue impact >$10K: page VP Engineering
- If Stripe outage: communicate to support team for customer messaging

## Resolution
- Confirm error rate <0.1% for 10 min
- Post in #incidents: root cause + duration + impact
- Schedule post-mortem if downtime >5 min

Phase 6: Dashboard Architecture

Dashboard Hierarchy

L1: Executive / Business Dashboard (non-technical stakeholders)
  ↓
L2: Service Overview Dashboard (on-call, quick triage)
  ↓
L3: Service Deep-Dive Dashboard (debugging specific service)
  ↓
L4: Infrastructure Dashboard (resource-level details)

L1: Business Dashboard

panels:
  - title: "Revenue per Minute"
    type: stat
    query: "sum(rate(orders_total{status='completed'}[5m])) * avg(order_value_dollars)"
  - title: "Active Users (5min)"
    type: stat
    query: "count(count by (user_id) (http_requests_total{...}[5m]))"
  - title: "Checkout Success Rate"
    type: gauge
    query: "sum(rate(checkout_total{status='success'}[1h])) / sum(rate(checkout_total[1h]))"
    thresholds: [95, 98, 99.5]
  - title: "Error Budget Remaining"
    type: gauge
    query: "1 - (error_budget_consumed / error_budget_total)"

L2: Service Overview Dashboard

Every service gets one of these with identical layout:

row_1_traffic:
  - "Request Rate (rps)" — timeseries, by status code
  - "Error Rate (%)" — timeseries, threshold line at SLO
  - "Active Requests" — gauge

row_2_latency:
  - "Latency Distribution" — heatmap
  - "p50 / p95 / p99" — timeseries, threshold lines
  - "Latency by Endpoint" — table, sorted by p99

row_3_dependencies:
  - "Downstream Latency" — timeseries per dependency
  - "Downstream Error Rate" — timeseries per dependency
  - "Database Query Duration" — timeseries by query type

row_4_resources:
  - "CPU Usage" — timeseries per pod
  - "Memory Usage" — timeseries per pod
  - "Pod Restarts" — stat

row_5_business:
  - "Business Metric 1" — service-specific
  - "Business Metric 2" — service-specific

Dashboard Rules

  1. Time range default: last 1 hour — most debugging happens in recent time
  2. Variable selectors at top: environment, service, instance
  3. Consistent color coding: green=good, yellow=degraded, red=bad across all dashboards
  4. Link alerts to dashboards — every alert annotation includes dashboard URL
  5. No more than 15 panels per dashboard — split into L3 if needed
  6. Include "as of" timestamp — so screenshots in incidents are unambiguous
  7. Dashboard as code — store Grafana JSON in git, provision via API

Phase 7: Incident Response

Incident Severity Classification

| Severity | Criteria | Response | Communication |
|----------|----------|----------|---------------|
| SEV-1 | Service down, data loss risk, security breach | All hands, war room | Status page update every 15 min |
| SEV-2 | Degraded service, SLO at risk, partial outage | On-call + backup | Status page update every 30 min |
| SEV-3 | Minor degradation, workaround exists | On-call during hours | Internal Slack update |
| SEV-4 | Cosmetic, low impact | Next sprint | None |

Incident Roles

| Role | Responsibility | Who |
|------|----------------|-----|
| Incident Commander (IC) | Owns the incident. Coordinates. Makes decisions. | On-call lead |
| Technical Lead | Diagnoses and fixes. Communicates technical status to IC. | Senior engineer |
| Communications Lead | Updates status page, Slack, stakeholders. | Product/support |
| Scribe | Documents timeline, actions, decisions in real-time. | Anyone available |

Incident Response Workflow

1. DETECT
   - Alert fires → on-call paged
   - Customer report → support escalates
   - Internal discovery → engineer reports
   
2. TRIAGE (first 5 minutes)
   - Confirm the issue is real (not false alert)
   - Classify severity (SEV-1 through SEV-4)
   - Open incident channel: #inc-YYYY-MM-DD-short-description
   - Assign roles (IC, Tech Lead, Comms)
   
3. MITIGATE (next 5-30 minutes)
   - Goal: STOP THE BLEEDING, not find root cause
   - Options (try in order):
     a. Rollback last deploy
     b. Scale up / restart pods
     c. Toggle feature flag off
     d. Redirect traffic / enable fallback
     e. Manual data fix
   - Document every action with timestamp
   
4. STABILIZE
   - Confirm mitigation is working (metrics back to normal)
   - Monitor for 15-30 min for recurrence
   - Update status page: "Monitoring fix"
   
5. RESOLVE
   - Confirm all metrics healthy for 30+ min
   - Update status page: "Resolved"
   - Schedule post-mortem (within 48 hours for SEV-1/2)
   - Send internal summary to stakeholders

Incident Channel Template

📋 Incident: Payment API 5xx Errors
🔴 Severity: SEV-2
🕐 Started: 2026-02-22 14:23 UTC
👤 IC: @alice
🔧 Tech Lead: @bob
📢 Comms: @charlie

Status: MITIGATING
Impact: ~5% of checkout requests failing
Customer-facing: Yes

Timeline:
14:23 — Alert fired: PaymentAPIHighErrorRate
14:25 — IC assigned: @alice, confirmed real via dashboard
14:28 — Tech Lead: error logs show connection pool exhaustion post-deploy
14:31 — Rolled back deployment v2.3.1 → v2.3.0
14:35 — Error rate dropping, monitoring
14:50 — Error rate <0.1%, marking resolved

Phase 8: Post-Mortem Framework

Blameless Post-Mortem Template

post_mortem:
  title: "Payment API Connection Pool Exhaustion"
  date: "2026-02-22"
  severity: SEV-2
  duration: 27 minutes (14:23 — 14:50 UTC)
  authors: ["@alice", "@bob"]
  reviewers: ["@engineering-leads"]
  status: action_items_in_progress
  
  summary: |
    A deployment at 14:15 introduced a connection leak in the payment API.
    Connection pool was exhausted by 14:23, causing 5xx errors for ~5% of
    checkout requests. Rolled back at 14:31; recovered by 14:50.
  
  impact:
    user_impact: "~340 users saw checkout failures over 27 minutes"
    revenue_impact: "$2,100 estimated (based on average order value × failed checkouts)"
    slo_impact: "Consumed 5.1 min of 21.9 min monthly error budget (23%)"
    data_impact: "No data loss. 12 orders failed; users could retry successfully."
  
  timeline:
    - time: "14:15"
      event: "Deploy v2.3.1 rolled out (3/3 pods updated)"
    - time: "14:23"
      event: "PaymentAPIHighErrorRate alert fired"
    - time: "14:25"
      event: "IC assigned, confirmed via dashboard"
    - time: "14:28"
      event: "Root cause identified: new ORM query not releasing connections"
    - time: "14:31"
      event: "Rollback initiated: v2.3.1 → v2.3.0"
    - time: "14:35"
      event: "Error rate declining"
    - time: "14:50"
      event: "Resolved: error rate <0.1% sustained"
  
  root_cause: |
    The v2.3.1 deploy introduced a new database query in the order validation
    path. The query used a raw connection instead of the pool's managed client,
    so connections were acquired but never released. Under load, the pool
    exhausted within 8 minutes.
  
  contributing_factors:
    - "No integration test for connection pool behavior under load"
    - "Connection pool saturation metric existed but had no alert"
    - "Code review didn't catch raw connection usage"
  
  what_went_well:
    - "Alert fired within 8 minutes of deploy"
    - "IC assigned in 2 minutes"
    - "Root cause identified in 3 minutes (clear in logs)"
    - "Rollback executed cleanly"
  
  what_went_wrong:
    - "8-minute detection gap after deploy"
    - "No canary deployment to catch before full rollout"
    - "Connection pool saturation had no alert"
  
  action_items:
    - action: "Add connection pool saturation alert (>80% for 2 min)"
      owner: "@bob"
      priority: P1
      due: "2026-02-25"
      status: in_progress
      ticket: "ENG-1234"
    - action: "Enable canary deployments for payment-api"
      owner: "@alice"
      priority: P1
      due: "2026-03-01"
      ticket: "ENG-1235"
    - action: "Add linting rule: no raw DB connections in application code"
      owner: "@charlie"
      priority: P2
      due: "2026-03-07"
      ticket: "ENG-1236"
    - action: "Load test payment-api connection pool in staging"
      owner: "@bob"
      priority: P2
      due: "2026-03-07"
      ticket: "ENG-1237"
  
  lessons_learned:
    - "Resource saturation metrics need alerts, not just dashboards"
    - "Canary deployments are mandatory for Tier 0 services"
    - "ORM abstractions don't guarantee connection safety — review raw queries"

Post-Mortem Meeting Agenda (60 minutes)

1. (5 min) Context setting — IC reads the summary
2. (15 min) Timeline walkthrough — what happened, when, by whom
3. (15 min) Root cause deep-dive — 5 Whys exercise
4. (5 min) What went well — celebrate good response
5. (15 min) Action items — assign owners, priorities, due dates
6. (5 min) Wrap-up — review date for action item check-in

5 Whys Exercise

Problem: 5xx errors in payment API

Why 1: Database connections were exhausted
Why 2: A new query acquired connections without releasing them
Why 3: The query used a raw connection instead of the pool manager
Why 4: The ORM's raw query API doesn't auto-release (by design)
Why 5: We don't have a linting rule or code review checklist item for this

Root cause: Missing guard against raw connection usage in application code
Systemic fix: Linting rule + connection pool saturation alerting

Phase 9: On-Call Operations

On-Call Structure

on_call:
  rotation: weekly
  handoff_day: Monday 10:00 UTC
  
  primary:
    response_time: 5 minutes (SEV-1/2), 30 minutes (SEV-3)
    escalation_after: 15 minutes no-ack
    
  secondary:
    response_time: 15 minutes (SEV-1), 1 hour (SEV-2/3)
    escalation_after: 30 minutes no-ack
    
  manager_escalation:
    trigger: SEV-1 unresolved after 30 minutes
    
  handoff_checklist:
    - Review open incidents and active alerts
    - Check error budget status for all services
    - Read post-mortems from previous week
    - Verify PagerDuty schedule and contact info
    - Test alert routing (send test page)

On-Call Health Metrics

| Metric | Healthy | Needs Attention | Unhealthy |
|--------|---------|-----------------|-----------|
| Pages per week | <5 | 5-15 | >15 |
| After-hours pages per week | <2 | 2-5 | >5 |
| False positive rate | <10% | 10-30% | >30% |
| Mean time to acknowledge | <5 min | 5-15 min | >15 min |
| Mean time to resolve | <30 min | 30-120 min | >120 min |
| Toil ratio (manual vs automated) | <30% | 30-60% | >60% |
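A small helper can grade a week's numbers against this table; the threshold pairs are transcribed from the columns, and `grade` is an illustrative name:

```python
# (healthy_below, unhealthy_above); values in between are "needs attention".
THRESHOLDS = {
    "pages_per_week": (5, 15),
    "after_hours_pages": (2, 5),
    "false_positive_rate": (0.10, 0.30),
    "mtta_minutes": (5, 15),
    "mttr_minutes": (30, 120),
    "toil_ratio": (0.30, 0.60),
}

def grade(metric: str, value: float) -> str:
    healthy_below, unhealthy_above = THRESHOLDS[metric]
    if value < healthy_below:
        return "healthy"
    if value > unhealthy_above:
        return "unhealthy"
    return "needs attention"
```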

Weekly On-Call Review Template

```yaml
on_call_review:
  week: "2026-W08"
  engineer: "@bob"

  incidents:
    total: 7
    sev_1: 0
    sev_2: 1
    sev_3: 4
    false_positives: 2
    after_hours: 3

  time_spent:
    incident_response: "4.5 hours"
    toil_automation: "2 hours"
    runbook_updates: "1 hour"

  improvements_made:
    - "Silenced noisy disk alert on dev servers"
    - "Added auto-remediation for pod restart threshold"

  improvements_needed:
    - "Cache expiry alert fires every Tuesday at 03:00 — needs investigation"
    - "Payment retry logic needs circuit breaker (caused 3 alerts)"

  handoff_notes: |
    Watch payment-api p99 latency — it's been creeping up since Wednesday.
    Stripe changed their sandbox endpoints; staging may throw errors.
```

## Phase 10: Chaos Engineering & Reliability Testing

### Chaos Principles

  1. Start with a hypothesis: "If X fails, the system should Y"
  2. Run in production (start small — one instance, one AZ)
  3. Minimize blast radius with automatic rollback
  4. Build confidence incrementally: staging → canary → production

### Chaos Experiment Template

```yaml
chaos_experiment:
  name: "Payment DB failover"
  hypothesis: "If the primary database becomes unavailable, traffic should
    failover to the replica within 30 seconds with <1% error rate spike"

  steady_state:
    - metric: "checkout_success_rate"
      expected: ">99.5%"
    - metric: "db_query_duration_p99"
      expected: "<200ms"

  injection:
    type: "network_partition"
    target: "payment-db-primary"
    duration: "5 minutes"
    blast_radius: "single AZ"

  abort_conditions:
    - "checkout_success_rate < 95% for > 60 seconds"
    - "revenue_per_minute drops > 50%"
    - "any SEV-1 incident declared"

  results:
    failover_time: "22 seconds"
    error_spike: "0.3% for 25 seconds"
    hypothesis_confirmed: true

  follow_up_actions:
    - "Document failover behavior in runbook"
    - "Add failover time as SLI (target: <30s)"
```
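Abort conditions should be evaluated continuously while the fault is injected, not checked once at the end. A sketch of the "success rate < 95% for > 60 seconds" check from the template above, assuming metric samples arrive every 5 seconds (the sampling interval and data are illustrative):

```python
def should_abort(window: list[float], threshold: float, min_seconds: int,
                 interval_s: int = 5) -> bool:
    """True if every sample in the trailing window has been below
    `threshold` for at least `min_seconds` — e.g. checkout success
    rate < 95% for > 60s, per the experiment's abort conditions."""
    needed = min_seconds // interval_s
    if len(window) < needed:
        return False
    return all(v < threshold for v in window[-needed:])

# Hypothetical 5s samples of checkout_success_rate during injection.
samples = [0.999, 0.98, 0.94, 0.93, 0.94, 0.94, 0.93, 0.94, 0.93, 0.94,
           0.93, 0.94, 0.93, 0.94]
print(should_abort(samples, threshold=0.95, min_seconds=60))  # True
```

Requiring a sustained breach (rather than a single bad sample) keeps one noisy data point from killing an otherwise healthy experiment.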

### Chaos Engineering Maturity Levels

| Level | What You Test | Tools |
|---|---|---|
| 1: Manual | Kill a pod, see what happens | `kubectl delete pod` |
| 2: Automated | Scheduled pod kills, network delays | Chaos Monkey, Litmus |
| 3: Game Days | Multi-failure scenarios with team exercise | Custom scripts + coordination |
| 4: Continuous | Automated chaos in production with auto-rollback | Gremlin, Chaos Mesh |

## Phase 11: Observability Cost Optimization

### Cost Drivers (Ranked)

| # | Driver | Typical % of Bill | Optimization |
|---|---|---|---|
| 1 | Log volume | 40-60% | Reduce verbosity, drop DEBUG, sample repetitive |
| 2 | Metric cardinality | 15-25% | Drop unused metrics, limit labels |
| 3 | Trace volume | 10-20% | Sampling, tail-based sampling |
| 4 | Retention | 10-15% | Tiered storage (hot → warm → cold) |
| 5 | Query cost | 5-10% | Optimize dashboard queries, set max scan limits |

### Cost Reduction Checklist

```yaml
cost_optimization:
  logs:
    - action: "Drop DEBUG/TRACE in production"
      savings: "30-50% of log volume"
    - action: "Sample health check logs (1:100)"
      savings: "5-15% of log volume"
    - action: "Deduplicate identical error bursts"
      savings: "10-20% during incidents"
    - action: "Move logs older than 7 days to S3/cold storage"
      savings: "60-80% of storage cost"
    - action: "Drop request/response body logging"
      savings: "20-40% of log volume"

  metrics:
    - action: "Audit unused metrics (no dashboard, no alert)"
      savings: "10-30% of series"
    - action: "Reduce histogram bucket count (default 11 → 8)"
      savings: "~27% of histogram series"
    - action: "Remove high-cardinality labels"
      savings: "Variable — can be massive"
    - action: "Increase scrape interval for non-critical metrics (15s → 60s)"
      savings: "75% of data points for those metrics"

  traces:
    - action: "Implement tail-based sampling"
      savings: "80-95% of trace volume"
    - action: "Drop internal health check traces"
      savings: "5-20% of trace volume"
    - action: "Reduce span attribute size (truncate long strings)"
      savings: "10-30% of trace storage"

  general:
    - action: "Review and right-size retention policies quarterly"
    - action: "Set query timeouts and result limits on dashboards"
    - action: "Use recording rules for expensive queries"
```
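Tail-based sampling — the biggest trace saving in the checklist — keeps every error and slow trace while dropping most of the healthy baseline. A sketch of an OpenTelemetry Collector `tail_sampling` processor configuration; the policy names, latency threshold, and 5% baseline rate are illustrative choices, not recommendations:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans before deciding per trace
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```

Because the decision happens after the whole trace is buffered, every interesting trace survives even though the vast majority of routine traffic is dropped.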

### Monthly Cost Review Template

```yaml
observability_cost_review:
  month: "February 2026"
  total_cost: "$X,XXX"

  breakdown:
    logs: { volume: "X TB", cost: "$X", pct: "X%" }
    metrics: { series: "X million", cost: "$X", pct: "X%" }
    traces: { volume: "X TB", cost: "$X", pct: "X%" }
    infrastructure: { instances: X, cost: "$X", pct: "X%" }

  cost_per:
    request: "$0.000X"
    service: "$X average"
    engineer: "$X per engineer"

  optimizations_applied: []
  optimizations_planned: []
  budget_status: "on_track | over_budget | under_budget"
```

## Phase 12: Advanced Patterns

### Correlation: Connecting the Three Pillars

```
Every log line includes: trace_id, span_id
Every trace span includes: service, operation
Every metric includes: service label

Correlation paths:
  Alert fires (metric) → Click → Dashboard (metric) → Filter by time window
    → Trace search (same service + time) → Find failing trace
    → Logs (filter by trace_id) → See exact error

  Support ticket (user report) → Find request_id in logs
    → Extract trace_id → View full trace → Identify slow span
    → Check span's service metrics → Confirm pattern
```
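The first rule ("every log line includes trace_id") can be sketched with nothing but the standard library: a `logging.Filter` that injects the active trace id from a context variable. In a real service you would read the id from your tracing SDK instead of setting a contextvar by hand — this is an illustration of the correlation mechanism, not a production pattern:

```python
import contextvars
import json
import logging

current_trace_id = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Attach the active trace_id to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
# %-style format string that renders each record as one JSON line.
handler.setFormatter(logging.Formatter(
    json.dumps({"level": "%(levelname)s", "msg": "%(message)s",
                "trace_id": "%(trace_id)s"})))

log = logging.getLogger("payment-api")
log.addHandler(handler)
log.setLevel(logging.INFO)

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("charge authorized")  # emits one JSON line carrying the trace_id
```

With the trace id on every line, the "filter logs by trace_id" step of the correlation paths above becomes a single query.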

### Synthetic Monitoring

```yaml
synthetic_checks:
  - name: "Checkout flow"
    type: browser
    frequency: 5m
    locations: [us-east, eu-west, ap-southeast]
    steps:
      - navigate: "https://app.example.com/products"
      - click: "Add to Cart"
      - click: "Checkout"
      - assert: "Order confirmation page loads in <3s"
    alert_on: "2 consecutive failures from same location"

  - name: "API health"
    type: api
    frequency: 1m
    endpoints:
      - url: "https://api.example.com/health"
        expected_status: 200
        max_latency_ms: 500
      - url: "https://api.example.com/v1/products?limit=1"
        expected_status: 200
        max_latency_ms: 1000
```
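The "2 consecutive failures from same location" rule is worth getting exactly right: a single blip from one region should never page anyone, and a success must reset the streak. A pure-Python sketch of that evaluation logic (the class is illustrative, not any vendor's API):

```python
from collections import defaultdict

class SyntheticAlerter:
    """Page only after N consecutive failures from the same location."""
    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.streaks = defaultdict(int)  # location -> consecutive failures

    def record(self, location: str, ok: bool) -> bool:
        """Record one check result; return True if we should alert."""
        if ok:
            self.streaks[location] = 0  # success resets the streak
            return False
        self.streaks[location] += 1
        return self.streaks[location] >= self.threshold

a = SyntheticAlerter()
print(a.record("us-east", ok=False))  # False — first failure
print(a.record("eu-west", ok=False))  # False — different location
print(a.record("us-east", ok=False))  # True — 2nd consecutive us-east failure
```

Tracking streaks per location also means a genuine regional outage still alerts even while other regions pass.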

### Feature Flag Observability

```yaml
# Correlate feature flags with metrics
feature_flag_monitoring:
  - flag: "new_checkout_flow"
    metrics_to_compare:
      - "checkout_conversion_rate" # by flag variant
      - "checkout_error_rate"
      - "checkout_latency_p99"
    alerts:
      - "If error rate for new variant > 2x control, auto-disable flag"
```
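The auto-disable rule above ("new variant error rate > 2x control") can be sketched as a guard the flag service polls. Names and thresholds are illustrative; note the minimum sample size, which a real system needs so a single early error cannot trip the guard:

```python
def should_disable(variant_errors: int, variant_total: int,
                   control_errors: int, control_total: int,
                   ratio: float = 2.0, min_samples: int = 500) -> bool:
    """Disable the new variant if its error rate exceeds `ratio` x control.

    Requires a minimum sample size on both arms before acting.
    """
    if variant_total < min_samples or control_total < min_samples:
        return False
    variant_rate = variant_errors / variant_total
    control_rate = control_errors / control_total
    # Treat a zero-error control as a tiny baseline to avoid a 0 * ratio trap.
    return variant_rate > ratio * max(control_rate, 1e-6)

print(should_disable(30, 1000, 10, 1000))  # True  (3% vs 1% control)
print(should_disable(12, 1000, 10, 1000))  # False (1.2% vs 1%)
print(should_disable(5, 100, 1, 100))      # False (below min_samples)
```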

## Observability Maturity Model

| Dimension | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| Logging | Unstructured logs | Structured JSON, centralized | Correlated with traces | Automated log analysis |
| Metrics | Basic infra metrics | RED/USE for services | SLO-based with error budgets | Predictive (anomaly detection) |
| Tracing | No tracing | Key services instrumented | Full distributed tracing | Trace-driven testing |
| Alerting | Static thresholds | Multi-signal alerts | Burn-rate based on SLOs | Auto-remediation |
| Incident Response | Ad hoc | Defined process + roles | Post-mortems with action tracking | Chaos engineering in prod |
| Culture | "Ops team handles it" | Shared ownership (you build it, you run it) | SLO-driven development velocity | Reliability as a feature |

## Quality Scoring Rubric (0-100)

| Dimension | Weight | 0 | 5 | 10 |
|---|---|---|---|---|
| Logging quality | 15% | Unstructured, no correlation | Structured JSON, missing fields | Full schema, trace correlation, PII scrubbing |
| Metrics coverage | 15% | No metrics | RED or USE, not both | RED + USE + business metrics + custom |
| Tracing completeness | 10% | No tracing | Key services | Full path, sampling strategy, tail-based |
| SLO maturity | 15% | No reliability targets | Informal targets | SLOs with error budgets, burn-rate alerts, weekly review |
| Alert quality | 15% | Noisy/missing | Actionable, some runbooks | SLO-based, full runbooks, low false positive |
| Incident response | 10% | Ad hoc | Defined process | Full process, roles, post-mortems, chaos engineering |
| Dashboard design | 10% | No dashboards | Basic panels | Hierarchical L1-L4, consistent, linked to alerts |
| Cost efficiency | 10% | Unknown cost | Tracked | Optimized, reviewed monthly, within budget |
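The rubric reduces to a weighted average: rate each dimension 0-10, scale by its weight, and sum. A sketch of the arithmetic (the sample scores are an invented team, not a benchmark):

```python
# Weights from the rubric table above; they sum to 100%.
WEIGHTS = {
    "logging": 0.15, "metrics": 0.15, "tracing": 0.10, "slo": 0.15,
    "alerting": 0.15, "incident_response": 0.10, "dashboards": 0.10,
    "cost": 0.10,
}

def rubric_score(scores: dict[str, float]) -> float:
    """Weighted 0-100 score: a 10 in a 15% dimension contributes 15 points."""
    return sum(scores[d] * 10 * w for d, w in WEIGHTS.items())

# Hypothetical team: strong logging/metrics, weak elsewhere, no cost tracking.
team = {
    "logging": 10, "metrics": 10, "tracing": 5, "slo": 5,
    "alerting": 5, "incident_response": 5, "dashboards": 5, "cost": 0,
}
print(round(rubric_score(team), 1))  # 60.0 — "functional but fragile"
```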

- **90-100:** World-class. Teach others.
- **70-89:** Production-ready. Fill specific gaps.
- **50-69:** Functional but fragile.
- **<50:** Significant reliability risk.


## 10 Observability Commandments

  1. Structured or it didn't happen — unstructured logs are technical debt
  2. Correlate everything — trace_id connects logs, traces, and metrics
  3. Alert on symptoms, not causes — users don't care about CPU, they care about latency
  4. Every alert gets a runbook — no runbook = no alert
  5. SLOs drive velocity — error budgets decide when to ship vs stabilize
  6. Dashboards have hierarchy — executives don't need pod CPU graphs
  7. Blameless post-mortems always — blame prevents learning
  8. Cost is a feature — observability that bankrupts you isn't observability
  9. You build it, you run it — the team that ships code owns its observability
  10. Practice failure — chaos engineering builds confidence

## 12 Natural Language Commands

| Command | What It Does |
|---|---|
| "Audit our observability" | Run the /16 health check, score each dimension, prioritize gaps |
| "Design logging for [service]" | Generate structured log schema with context fields for the service |
| "Set up metrics for [service]" | Create RED + USE + business metric instrumentation plan |
| "Create SLOs for [service]" | Define SLIs, targets, error budgets, and burn-rate alert rules |
| "Design alerts for [service]" | Create alert rules with severity, thresholds, and runbook templates |
| "Build dashboard for [service]" | Design L2 service overview dashboard with panel specifications |
| "Write a runbook for [alert]" | Generate structured runbook with diagnosis steps and fixes |
| "Run post-mortem for [incident]" | Generate blameless post-mortem document with timeline and action items |
| "Set up on-call for [team]" | Design rotation, escalation policy, handoff checklist |
| "Plan chaos experiment for [scenario]" | Design experiment with hypothesis, injection, abort conditions |
| "Optimize observability costs" | Audit current spend, identify top savings, create reduction plan |
| "Design tracing for [system]" | Create OpenTelemetry instrumentation plan with sampling strategy |

## ⚡ Level Up Your Observability

This skill gives you the methodology; for industry-specific implementation patterns, browse the full AfrexAI storefront linked below.

### 🔗 More Free Skills by AfrexAI

- **afrexai-devops-engine** — CI/CD, infrastructure, deployment strategies
- **afrexai-api-architect** — API design, security, versioning
- **afrexai-database-engineering** — Schema design, query optimization, migrations
- **afrexai-code-reviewer** — Code review methodology with SPEAR framework
- **afrexai-prompt-engineering** — System prompt design, testing, optimization

Browse all AfrexAI skills: clawhub.com | Full storefront