Alerting Rules

Alerting is a critical part of any monitoring setup. The PCA exam covers both Prometheus alerting rules and the basics of Alertmanager configuration.

Architecture

Prometheus alerting works in two stages:

  1. Prometheus server evaluates alerting rules and sends firing alerts to Alertmanager
  2. Alertmanager handles deduplication, grouping, routing, and notification delivery

Prometheus ──(alerts)──> Alertmanager ──(notifications)──> Email/Slack/PagerDuty

Alerting Rules

Alerting rules are defined in rule files and loaded via the Prometheus configuration:

# prometheus.yml
rule_files:
  - "rules/*.yml"

Rule File Format

groups:
  - name: example-alerts
    interval: 30s  # Override evaluation interval (optional)
    rules:
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "95th percentile latency is {{ $value }}s (threshold: 0.5s)"

Rule Fields

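Field        Description
alert        Alert name; must be a valid label value and becomes the alertname label (it need not be unique)
expr         PromQL expression to evaluate; every series it returns becomes an active alert
for          Optional. How long the expression must keep returning a result before the alert fires (default: 0s)
labels       Additional labels to attach to the alert; Alertmanager can route and group on them
annotations  Informational key-value pairs (summary, description) for notifications; not part of the alert's identity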

The for Clause

The for clause prevents alerts from firing on brief spikes. With it, an alert moves through these states:

  • Pending: Expression is true but for duration has not elapsed
  • Firing: Expression has been true for the entire for duration
  • Resolved: Expression is no longer true; the alert becomes inactive in Prometheus and is reported to Alertmanager as resolved

- alert: InstanceDown
  expr: up == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Instance {{ $labels.instance }} is down"

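The alert fires only if up == 0 holds at every rule evaluation for 5 consecutive minutes. If the target recovers during that window, the alert drops back to inactive and the timer starts over.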
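
The pending-then-firing behavior can be checked with a promtool unit test. The sketch below assumes the InstanceDown rule above is saved inside a groups: block in a hypothetical file rules/alerts.yml; the job and instance values are made up for illustration.

```yaml
# alerts_test.yml (hypothetical file name)
rule_files:
  - rules/alerts.yml          # assumed location of the InstanceDown rule

evaluation_interval: 1m

tests:
  - interval: 1m              # spacing between the sample values below
    input_series:
      # Target is up for two samples (t=0, t=1m), then down from t=2m onward.
      - series: 'up{job="node", instance="host1:9100"}'
        values: '1 1 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # At t=4m the expression has been true for only 2 minutes: pending, not firing.
      - eval_time: 4m
        alertname: InstanceDown
        exp_alerts: []
      # At t=8m the expression has been true for more than 5 minutes: firing.
      - eval_time: 8m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: host1:9100
            exp_annotations:
              summary: "Instance host1:9100 is down"
```

Such a file would typically be run with promtool test rules alerts_test.yml.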

Alertmanager Configuration

Basic Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: "default"
  group_by: ["alertname", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: "default"
    email_configs:
      - to: "team@example.com"

Routing Tree

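Alertmanager uses a tree-based routing structure. Every alert enters at the top-level route and is then checked against the child routes in order. The first matching child handles the alert, and matching stops there unless that route sets continue: true. An alert that matches no child is handled by the parent's receiver: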

route:
  receiver: "default"
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"
    - match:
        severity: warning
      receiver: "slack"
    - match_re:
        service: "^(web|api)$"
      receiver: "team-backend"
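
Alertmanager 0.22 and later deprecate match and match_re in favor of a single matchers list. A roughly equivalent sketch of the tree above follows; the continue: true on the first route is an addition for illustration and is not part of the original example.

```yaml
route:
  receiver: "default"
  routes:
    - matchers:
        - severity="critical"
      receiver: "pagerduty"
      continue: true            # illustration only: keep checking the sibling routes below
    - matchers:
        - severity="warning"
      receiver: "slack"
    - matchers:
        - service=~"web|api"    # regex matchers are fully anchored, so ^...$ is not needed
      receiver: "team-backend"
```

With this sketch, an alert labeled severity="critical" and service="web" would go to both pagerduty and team-backend. Without continue: true, it would stop at pagerduty.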

Key Parameters

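Parameter        Description
group_by         Labels to group alerts by; alerts sharing these label values are batched into one notification (reduces noise)
group_wait       Wait time before sending the first notification for a new group
group_interval   Wait time before sending a notification about new alerts added to a group that has already been notified
repeat_interval  Wait time before re-sending a notification for a group whose alerts are still firing
continue         If true, continue matching subsequent sibling routes after this route matches (default: false)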
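
To make the three timers concrete, here is the basic route again with a hypothetical timeline in the comments. The arrival times are invented for illustration; the values are the ones used earlier.

```yaml
route:
  receiver: "default"
  group_by: ["alertname", "job"]
  # t=0      first alert of a new group arrives; Alertmanager waits group_wait
  # t=30s    first notification is sent for the group
  group_wait: 30s
  # t=1m     a second alert joins the same group
  # t=5m30s  next notification, one group_interval after the first, includes both alerts
  group_interval: 5m
  # If nothing in the group changes, a reminder goes out once repeat_interval
  # has passed since the last notification (checked on group_interval ticks).
  repeat_interval: 4h
```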

Inhibition Rules

Inhibition rules suppress notifications for some alerts while related alerts are firing:

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]

This suppresses warning alerts when a critical alert with the same alertname and instance is firing.
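
As with routes, Alertmanager 0.22 and later deprecate source_match and target_match in favor of matcher lists. A sketch of the same rule in the newer syntax:

```yaml
inhibit_rules:
  - source_matchers:
      - severity="critical"     # alerts that cause the inhibition
    target_matchers:
      - severity="warning"      # alerts that get muted
    equal: ["alertname", "instance"]   # both alerts must agree on these labels
```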

Silences

Silences mute notifications for alerts that match a set of label matchers for a given time period. They are created through the Alertmanager web UI, the HTTP API, or the amtool CLI, not through configuration files.

Recording Rules

Recording rules precompute frequently used or expensive expressions:

groups:
  - name: request-rates
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

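Naming convention: level:metric:operations (e.g., job:http_requests_total:rate5m). The Prometheus best-practice guide also suggests dropping the _total suffix when applying rate(), which would give job:http_requests:rate5m.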

Recording rules:

  • Reduce query latency for dashboards
  • Are evaluated at the global evaluation_interval unless the group sets its own interval
  • Store results as new time series
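
Because the result is stored as an ordinary time series, an alerting rule can reference it by name. The sketch below places both rules in one group so the recording rule is evaluated before the alert that reads it; the alert name LowRequestRate and the threshold of 1 request per second are hypothetical.

```yaml
groups:
  - name: request-rates
    rules:
      # Evaluated first: precompute the per-job request rate.
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      # Evaluated second: alert on the precomputed series (hypothetical rule).
      - alert: LowRequestRate
        expr: job:http_requests_total:rate5m < 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Request rate for job {{ $labels.job }} is below 1 req/s"
```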

Connecting Prometheus to Alertmanager

# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
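
Putting the pieces together, a minimal prometheus.yml might look like the sketch below. The 15s intervals are assumed values chosen for illustration (both settings default to 1m), and scrape configuration is omitted.

```yaml
# prometheus.yml
global:
  scrape_interval: 15s        # assumed value
  evaluation_interval: 15s    # how often rule groups run unless a group sets interval

rule_files:
  - "rules/*.yml"             # alerting and recording rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
```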

Key Exam Tips

  1. for clause: An alert without for fires immediately when the expression is true. With for, it first enters "pending" state.
  2. Alertmanager grouping: group_by is critical for reducing notification noise. Group by alertname at minimum.
  3. Recording rules vs alerting rules: Recording rules use record, alerting rules use alert. Both live in rule files.
  4. Template variables: In annotations, use {{ $labels.labelname }} for labels and {{ $value }} for the expression value.
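  5. Resolve notifications: When an alert stops firing, Prometheus reports it as resolved and Alertmanager notifies receivers that have send_resolved: true (the default varies by receiver type). resolve_timeout only applies to alerts that arrive without an end time, which is not the case for alerts sent by Prometheus.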
  6. Rule evaluation order: Rules within a group are evaluated sequentially. Groups are evaluated concurrently.
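
As a sketch of tip 4, the rule below is a variant of the earlier HighRequestLatency example. Aggregating by le and job is an assumption made here for illustration. It also shows that $labels only contains labels that survive the expression, so after this aggregation $labels.job is available but $labels.instance is not.

```yaml
groups:
  - name: template-examples     # hypothetical group name
    rules:
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          # $labels.job survives the aggregation; $labels.instance would render empty here.
          summary: "High request latency for job {{ $labels.job }}"
          # printf is a standard Go template function; it rounds the value to two decimals.
          description: "95th percentile latency is {{ printf \"%.2f\" $value }}s (threshold: 0.5s)"
```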