# Alerting with Alertmanager

This document describes the alerting infrastructure using Prometheus Alertmanager for alert routing, grouping, and notification management.
## Architecture

```mermaid
graph LR
    subgraph "Metrics & Rules"
        Prometheus[Prometheus :9093]
        Rules[Alert Rules]
    end
    subgraph "Alert Management"
        AM[Alertmanager :9094]
        Routing[Routing Rules]
        Silences[Silences]
        Inhibitions[Inhibitions]
    end
    subgraph "Notifications"
        Console[Console/Logs]
        Webhook[Webhook]
        Email[Email]
        Slack[Slack]
    end
    Prometheus --> Rules
    Rules -->|firing| AM
    AM --> Routing
    Routing --> Console
    Routing -.->|optional| Webhook
    Routing -.->|optional| Email
    Routing -.->|optional| Slack
```
## Components

### Prometheus Alert Rules

Alert rules are defined in `alertmanager/alert_rules.yml` and loaded by Prometheus.
```yaml
# alert_rules.yml
groups:
  - name: ServiceAlerts
    rules:
      # Service availability
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "The service has been down for more than 1 minute."
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value }} over the last 5 minutes."
      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s over the last 5 minutes."
```
### Alertmanager Configuration

```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'console'
  routes:
    - match:
        severity: critical
      receiver: 'console'
      continue: true

receivers:
  - name: 'console'
    # Logs to Alertmanager console/logs
    # Add webhook_configs, email_configs, or slack_configs for notifications

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'job']
```
## Alert Rules Reference

### Current Alerts

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| ServiceDown | `up == 0` | 1m | critical |
| HighErrorRate | Error rate > 1% | 5m | warning |
| HighLatency | P95 > 2s | 5m | warning |
### Alert States

```mermaid
stateDiagram-v2
    [*] --> Inactive
    Inactive --> Pending: condition true
    Pending --> Firing: duration exceeded
    Pending --> Inactive: condition false
    Firing --> Resolved: condition false
    Resolved --> Inactive: resolve_timeout
```

| State | Description |
|---|---|
| Inactive | Alert condition not met |
| Pending | Condition met, waiting for the `for` duration |
| Firing | Alert active, notifications sent |
| Resolved | Condition no longer met |
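The transitions above can be sketched as a small state function. This is an illustration of the semantics, not Prometheus's actual implementation; the state names and the `for`-duration check mirror the table above.

```python
# Illustration of the alert state transitions (not the real Prometheus code).
# An alert moves Inactive -> Pending when its expression first matches,
# Pending -> Firing once the condition has held for the full `for` duration,
# and back toward Inactive when the condition clears.

def next_state(state: str, condition_true: bool,
               held_for: float, for_duration: float) -> str:
    """Return the next alert state given the current evaluation."""
    if state == "Inactive":
        return "Pending" if condition_true else "Inactive"
    if state == "Pending":
        if not condition_true:
            return "Inactive"
        return "Firing" if held_for >= for_duration else "Pending"
    if state == "Firing":
        return "Firing" if condition_true else "Resolved"
    if state == "Resolved":
        # Alertmanager forgets the alert after resolve_timeout
        return "Inactive"
    raise ValueError(f"unknown state: {state}")
```

For example, with `for: 1m` a condition that has only held for 30 seconds stays Pending and sends no notification.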
## Routing Configuration

### Route Matching

```yaml
route:
  # Root route (default)
  receiver: 'default'
  routes:
    # Critical alerts go to on-call
    - match:
        severity: critical
      receiver: 'on-call'
    # Warning alerts during business hours
    - match:
        severity: warning
      receiver: 'team'
      active_time_intervals:
        - business_hours
    # Specific service routing
    - match_re:
        job: 'chatbot|backend'
      receiver: 'app-team'
```
### Grouping

Group related alerts to reduce noise:

```yaml
route:
  group_by: ['alertname', 'job', 'severity']
  group_wait: 30s       # Wait before the first notification for a new group
  group_interval: 5m    # Minimum wait before notifying about changes to an existing group
  repeat_interval: 3h   # Re-notify for ongoing, unchanged alerts
```
## Notification Receivers

### Console (Default)

The current setup logs to the Alertmanager console:

```bash
# View alert notifications
docker compose logs alertmanager
```
### Webhook (Custom Integration)

```yaml
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://webhook-service:8080/alerts'
        send_resolved: true
        http_config:
          basic_auth:
            username: 'alertmanager'
            password_file: '/etc/alertmanager/secrets/webhook_password'
```
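A receiving service only needs to accept a JSON POST. Below is a minimal Python sketch of such an endpoint, assuming the port (8080) and path (`/alerts`) from the example config above; the payload fields used (`alerts`, `status`, `labels`) follow Alertmanager's webhook format.

```python
# Minimal sketch of a webhook receiver for Alertmanager notifications.
# Port and path mirror the example webhook_configs above; adapt as needed.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def summarize(payload: dict) -> list[str]:
    """One line per alert from an Alertmanager webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append(f'[{alert.get("status")}] {labels.get("alertname")} '
                     f'job={labels.get("job")}')
    return lines

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/alerts":
            self.send_response(404)
            self.end_headers()
            return
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        for line in summarize(json.loads(body)):
            print(line)  # replace with real handling (ticketing, paging, ...)
        self.send_response(200)
        self.end_headers()

def main(port: int = 8080) -> None:
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Call `main()` to run it; in production you would add the basic-auth check matching the `http_config` above.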
### Email

```yaml
receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: '$SMTP_PASSWORD'
        send_resolved: true
```
### Slack

```yaml
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
```
## Managing Alerts

### View Active Alerts

Prometheus UI: http://localhost:9093/alerts

Alertmanager UI: http://localhost:9094/#/alerts
### Silence Alerts

Create silences for maintenance windows.

Via UI:

1. Go to http://localhost:9094/#/silences
2. Click "New Silence"
3. Add matchers (e.g., `alertname="ServiceDown", job="backend"`)
4. Set duration
5. Add comment
Via API:

```bash
curl -X POST http://localhost:9094/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "ServiceDown", "isRegex": false}
    ],
    "startsAt": "2024-01-15T10:00:00Z",
    "endsAt": "2024-01-15T12:00:00Z",
    "createdBy": "admin",
    "comment": "Planned maintenance"
  }'
```
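The same request can be built programmatically. A standard-library-only sketch, assuming the Alertmanager URL from the example above:

```python
# Sketch: build and submit a silence via the Alertmanager v2 API.
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def build_silence(matchers: dict, hours: float,
                  created_by: str, comment: str) -> dict:
    """Construct a silence payload; matchers maps label name -> exact value."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": k, "value": v, "isRegex": False}
            for k, v in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }

def post_silence(payload: dict,
                 url: str = "http://localhost:9094/api/v2/silences") -> None:
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"}, method="POST",
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses
```

Example: `post_silence(build_silence({"alertname": "ServiceDown"}, 2, "admin", "Planned maintenance"))`.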
### Inhibition Rules

Suppress alerts when a related critical alert is already firing:

```yaml
inhibit_rules:
  # If ServiceDown is firing, suppress HighErrorRate and HighLatency
  - source_match:
      alertname: ServiceDown
    target_match_re:
      alertname: 'HighErrorRate|HighLatency'
    equal: ['job']
```
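To illustrate how this rule behaves, here is a simplified Python model of the inhibition check (not Alertmanager's actual code): a target alert is suppressed when some firing alert matches `source_match`, the target matches `target_match_re`, and both agree on every label listed in `equal`.

```python
# Simplified model of Alertmanager inhibition matching (illustrative only).
import re

def inhibited(target: dict, firing: list, rule: dict) -> bool:
    """True if `target` (a label set) is suppressed by `rule` given `firing` alerts."""
    def matches(labels: dict, exact: dict, regex: dict) -> bool:
        return (all(labels.get(k) == v for k, v in exact.items()) and
                all(re.fullmatch(p, labels.get(k, "")) for k, p in regex.items()))

    if not matches(target, rule.get("target_match", {}),
                   rule.get("target_match_re", {})):
        return False
    for source in firing:
        if (matches(source, rule.get("source_match", {}),
                    rule.get("source_match_re", {}))
                and all(source.get(k) == target.get(k)
                        for k in rule.get("equal", []))):
            return True
    return False
```

With the rule above, a firing `ServiceDown` on `job=backend` suppresses `HighErrorRate` on `job=backend` but not on `job=chatbot`, since the `equal: ['job']` labels differ.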
## Custom Alert Rules

### Adding New Rules

1. Edit `alertmanager/alert_rules.yml`
2. Add the rule to the appropriate group
3. Reload Prometheus:

```bash
curl -X POST http://localhost:9093/-/reload
```
### Rule Syntax

```yaml
- alert: AlertName
  expr: prometheus_query > threshold
  for: duration
  labels:
    severity: critical|warning|info
    team: backend|frontend|infra
  annotations:
    summary: "Short description with {{ $labels.job }}"
    description: "Detailed description with {{ $value }}"
    runbook_url: "https://wiki.example.com/runbooks/AlertName"
```
### Common Patterns

Memory Usage:

```yaml
- alert: HighMemoryUsage
  expr: |
    (container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100 > 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage on {{ $labels.name }}"
```
Request Rate Drop:

```yaml
- alert: LowTraffic
  expr: |
    rate(http_requests_total[5m]) < 1
  for: 10m
  labels:
    severity: info
  annotations:
    summary: "Low traffic on {{ $labels.job }}"
```
Database Connection Issues:

```yaml
- alert: DatabaseConnectionFailure
  expr: |
    mongodb_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "MongoDB connection lost"
```
## Testing Alerts

### Manually Trigger Alert

1. Create a test rule:

   ```yaml
   - alert: TestAlert
     expr: vector(1)
     labels:
       severity: info
     annotations:
       summary: "Test alert"
   ```

2. Add it to the rules file and reload
3. Check the alert fires in Prometheus/Alertmanager
4. Remove the test rule when done
### Validate Rules

```bash
# Using promtool (bundled with Prometheus)
docker compose exec prometheus promtool check rules /etc/prometheus/alert_rules.yml

# Check Prometheus config
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
```
## Alerting on Logs

While Alertmanager works with Prometheus metrics, you can also alert on log patterns:

### Using the Loki Ruler

```yaml
# In Loki config
ruler:
  alertmanager_url: http://alertmanager:9094
  rule_files:
    - /etc/loki/rules/*.yml
```
### Log-based Alert Rule

```yaml
# loki-rules.yml
groups:
  - name: LogAlerts
    rules:
      - alert: HighErrorLogRate
        expr: |
          sum(rate({project="tfg-chatbot"} |= "ERROR" [5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate of error logs"
```
## Grafana Integration

### Alert Annotations

View alerts as annotations on Grafana dashboards:

1. Dashboard Settings → Annotations
2. Add an annotation query from Alertmanager
3. Alerts appear as vertical lines on panels
### Alert Panel

Add an alert status panel:

```json
{
  "title": "Active Alerts",
  "type": "alertlist",
  "datasource": "Alertmanager",
  "options": {
    "alertInstanceLabelFilter": "job=~\"backend|chatbot|rag_service\"",
    "alertName": "",
    "dashboardAlerts": false,
    "groupBy": ["alertname"],
    "stateFilter": {
      "firing": true,
      "pending": true
    }
  }
}
```
## Best Practices

### Alert Design

- Alert on symptoms, not causes
  - Good: "API error rate > 1%"
  - Avoid: "CPU > 90%" (unless it causes issues)
- Include runbook links
  - Add a `runbook_url` annotation
  - Document troubleshooting steps
- Use appropriate severity
  - `critical`: immediate action required
  - `warning`: investigate soon
  - `info`: informational, no action
- Use meaningful names
  - Descriptive alert names
  - Include the service/component
### Reducing Noise

- Tune thresholds based on baseline
- Use an appropriate `for` duration
- Group related alerts
- Implement inhibition rules
- Review and refine regularly
## Troubleshooting

### Alerts Not Firing

1. Check the rule in Prometheus: http://localhost:9093/rules
2. Evaluate the expression manually: http://localhost:9093/graph?g0.expr=up==0
3. Check for syntax errors:

```bash
docker compose logs prometheus | grep "rule"
```
### Notifications Not Sent

1. Check Alertmanager logs: `docker compose logs alertmanager`
2. Verify routing matches: http://localhost:9094/#/status
3. Test receiver connectivity:

```bash
curl -X POST http://localhost:9094/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"test"}}]'
```
## Related Documentation

- Monitoring - Prometheus metrics
- Logging - Loki log aggregation
- Docker Compose - Service configuration