Alerting Rules

Alerting is a critical part of any monitoring setup. The PCA exam covers both Prometheus alerting rules and the basics of Alertmanager configuration.

Architecture

Prometheus alerting works in two stages:

Prometheus server evaluates alerting rules and sends firing alerts to Alertmanager
Alertmanager handles deduplication, grouping, routing, and notification delivery

Prometheus ──(alerts)──> Alertmanager ──(notifications)──> Email/Slack/PagerDuty

Alerting Rules

Alerting rules are defined in rule files and loaded via the Prometheus configuration:

# prometheus.yml
rule_files:
  - "rules/*.yml"

Rule File Format

groups:
  - name: example-alerts
    interval: 30s  # Override evaluation interval (optional)
    rules:
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "95th percentile latency is {{ $value }}s (threshold: 0.5s)"

Rule Fields

Field	Description
`alert`	Alert name (must be a valid label value; it becomes the alertname label)
`expr`	PromQL expression that triggers the alert
`for`	Duration the expression must be true before firing
`labels`	Additional labels to attach to the alert
`annotations`	Informational key-value pairs (summary, description) for notifications; unlike labels, they do not identify the alert

The `for` Clause

The for clause prevents alerts from firing on brief spikes:

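Pending: Expression is true but the for duration has not elapsed; if the expression stops being true, the alert returns to inactive and the timer resets
Firing: Expression has been true for the entire for duration
Resolved: Expression is no longer true; Prometheus shows the alert as inactive and Alertmanager treats it as resolved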

- alert: InstanceDown
  expr: up == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Instance {{ $labels.instance }} is down"

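This alert fires only if up == 0 holds at every rule evaluation for 5 consecutive minutes. A single successful scrape in between returns the alert to inactive and restarts the timer.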
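The pending-to-firing transition can be checked with a rule unit test. The following is a minimal sketch for promtool test rules; the file names and the job and instance label values are assumptions made for illustration, and the referenced rule file is assumed to wrap the InstanceDown rule above in a groups block.

```yaml
# tests.yml (hypothetical name), run with: promtool test rules tests.yml
rule_files:
  - rules/instance-down.yml   # assumed to contain the InstanceDown rule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: up for 2 samples, then down from t=2m onward
      - series: 'up{job="node", instance="host1:9100"}'
        values: '1 1 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # At 4m the target has been down for only 2m: still pending, nothing fires
      - eval_time: 4m
        alertname: InstanceDown
        exp_alerts: []
      # At 8m the target has been down for 6m, longer than for: 5m
      - eval_time: 8m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: host1:9100
            exp_annotations:
              summary: "Instance host1:9100 is down"
```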

Alertmanager Configuration

Basic Configuration

# alertmanager.yml
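global:
  resolve_timeout: 5m
  # Email receivers need SMTP settings; the two values below are placeholders
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"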

route:
  receiver: "default"
  group_by: ["alertname", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: "default"
    email_configs:
      - to: "team@example.com"

Routing Tree

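Alertmanager uses a tree-based routing structure. Every alert enters at the top-level route and is checked against its child routes in order. The first matching child handles the alert unless continue is set, and an alert that matches no child is sent to the parent route's receiver: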

route:
  receiver: "default"
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"
    - match:
        severity: warning
      receiver: "slack"
    - match_re:
        service: "^(web|api)$"
      receiver: "team-backend"
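Newer Alertmanager releases (0.22 and later) deprecate match and match_re in favor of a single matchers list. As a sketch, the same tree could be written as follows; the receiver names are carried over from the example above and are assumed to be defined under receivers.

```yaml
route:
  receiver: "default"
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: "pagerduty"
    - matchers:
        - 'severity="warning"'
      receiver: "slack"
    - matchers:
        # Regex matchers are anchored, so no explicit ^ and $ are needed
        - 'service=~"web|api"'
      receiver: "team-backend"
```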

Key Parameters

Parameter	Description
`group_by`	Labels to group alerts by (reduces notification noise)
`group_wait`	Wait time before sending first notification for a new group
`group_interval`	Wait time before notifying about new alerts added to a group that has already been notified
`repeat_interval`	Wait time before re-sending a notification for alerts that are still firing
`continue`	If true, continue matching subsequent sibling routes
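To illustrate how these parameters combine on a child route, here is a small sketch. The timing values are illustrative only, and the receivers are assumed to be defined elsewhere in the file.

```yaml
route:
  receiver: "default"
  group_by: ["alertname", "job"]
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"
      group_wait: 10s       # notify sooner than the default of 30s
      repeat_interval: 1h   # re-notify more often than the default of 4h
      continue: true        # keep checking the sibling routes below
    - match_re:
        service: "^(web|api)$"
      receiver: "team-backend"
```

With continue set, a critical alert carrying service="web" is delivered to both pagerduty and team-backend. Without it, routing would stop at the first match.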

Inhibition Rules

Inhibition rules suppress notifications for some alerts while other, related alerts are firing:

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]

This suppresses warning alerts when a critical alert with the same alertname and instance is firing.

Silences

Silences mute notifications for alerts that match a set of label matchers for a given time period. They are managed through the Alertmanager web UI or API, not through configuration files.

Recording Rules

Recording rules precompute frequently used or expensive expressions:

groups:
  - name: request-rates
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

Naming convention: level:metric:operations (e.g., job:http_requests_total:rate5m)

Recording rules:

Reduce query latency for dashboards
Are evaluated at the global evaluation_interval, unless the group sets its own interval
Store results as new time series
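Because the result is stored as an ordinary time series, other rules can query it by name. A minimal sketch follows; the alert name and the threshold of 100 requests per second are made up for illustration.

```yaml
groups:
  - name: request-rates
    interval: 1m   # optional per-group override of the global evaluation_interval
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      # Rules in a group run in order, so this alert is evaluated
      # after the recording rule above
      - alert: HighRequestRate                        # hypothetical alert name
        expr: job:http_requests_total:rate5m > 100    # illustrative threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} is serving {{ $value }} requests/s"
```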

Connecting Prometheus to Alertmanager

# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

Key Exam Tips

for clause: An alert without for fires immediately when the expression is true. With for, it first enters "pending" state.
Alertmanager grouping: group_by is critical for reducing notification noise. Group by alertname at minimum.
Recording rules vs alerting rules: Recording rules use record, alerting rules use alert. Both live in rule files.
Template variables: In annotations, use {{ $labels.labelname }} for labels and {{ $value }} for the expression value.
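Resolve notifications: Alertmanager sends a resolved notification only to receivers that have send_resolved enabled (the default varies by receiver type). Prometheus marks an alert as resolved once its expression stops being true; resolve_timeout applies only to alerts that arrive without an end time, which is not the case for alerts sent by Prometheus.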
Rule evaluation order: Rules within a group are evaluated sequentially. Groups are evaluated concurrently.
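As a sketch of the template variables tip, the rule below reuses the HighRequestLatency example and formats the value with two functions available in Prometheus templates, humanizeDuration and printf. The group name is hypothetical.

```yaml
groups:
  - name: template-examples   # hypothetical group name
    rules:
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          # $labels holds the labels of the series that triggered the alert
          summary: "High request latency on {{ $labels.instance }} (job {{ $labels.job }})"
          # $value is the numeric result of expr at evaluation time, in seconds here
          description: "p95 latency is {{ $value | humanizeDuration }} ({{ printf \"%.3f\" $value }}s, threshold 0.5s)"
```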
