# Monitoring with Prometheus and Grafana

This document describes the monitoring infrastructure using Prometheus for metrics collection and Grafana for visualization.
## Architecture

```mermaid
graph TB
    subgraph "Application Services"
        BE[Backend :8000/metrics]
        CB[Chatbot :8080/metrics]
        RAG[RAG Service :8081/metrics]
    end
    subgraph "Monitoring Stack"
        Prometheus[Prometheus :9093]
        Grafana[Grafana :3001]
        Alert[Alertmanager :9094]
    end
    BE -->|scrape| Prometheus
    CB -->|scrape| Prometheus
    RAG -->|scrape| Prometheus
    Prometheus -->|query| Grafana
    Prometheus -->|alerts| Alert
```
## Prometheus

### Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate rules

rule_files:
  - /etc/prometheus/alert_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'backend-gateway'
    static_configs:
      - targets: ['backend:8000']
  - job_name: 'rag_service'
    static_configs:
      - targets: ['rag_service:8081']
  - job_name: 'chatbot'
    static_configs:
      - targets: ['tfg-chatbot:8080']
```
### Scrape Jobs

| Job Name | Target | Metrics Endpoint |
|---|---|---|
| prometheus | localhost:9090 | /metrics (self-monitoring) |
| backend-gateway | backend:8000 | /metrics |
| rag_service | rag_service:8081 | /metrics |
| chatbot | tfg-chatbot:8080 | /metrics |
### Accessing Prometheus

- URL: http://localhost:9093
- Expression Browser: Query metrics directly
- Targets: http://localhost:9093/targets (check scrape status)
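Target health can also be checked programmatically through Prometheus's `/api/v1/targets` HTTP API. A minimal stdlib sketch, assuming the port mapping above; the canned `payload` below imitates the shape of that API's JSON response:

```python
import json
from urllib.request import urlopen  # used for the live call shown in the comment

def down_targets(payload: dict) -> list[str]:
    """Return scrape URLs of targets whose health is not 'up'.

    `payload` is the parsed JSON body of GET /api/v1/targets.
    """
    return [
        t["scrapeUrl"]
        for t in payload["data"]["activeTargets"]
        if t["health"] != "up"
    ]

# Against a live instance:
#   payload = json.load(urlopen("http://localhost:9093/api/v1/targets"))
# Canned example response:
payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://backend:8000/metrics", "health": "up"},
            {"scrapeUrl": "http://rag_service:8081/metrics", "health": "down"},
        ]
    },
}
print(down_targets(payload))  # ['http://rag_service:8081/metrics']
```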
### Common Queries

```promql
# Service availability
up{job="backend-gateway"}

# Request rate (last 5 minutes)
rate(http_requests_total[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# Active sessions
tfg_active_sessions_total

# Chat messages per minute
rate(tfg_chat_messages_total[1m]) * 60
```
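To make the `rate()` queries concrete, here is a stdlib Python sketch of what `rate()` computes from two counter samples (a simplification: the real function also handles counter resets and extrapolation; the sample values are made up):

```python
def per_second_rate(c0: float, c1: float, t0: float, t1: float) -> float:
    """Per-second increase between two counter samples (ignoring resets)."""
    return (c1 - c0) / (t1 - t0)

# tfg_chat_messages_total sampled 60 s apart: 1200 -> 1230
r = per_second_rate(1200, 1230, 0, 60)
print(r)       # 0.5 messages/second
print(r * 60)  # 30.0 messages/minute, as in the last query above

# Error-rate percentage, mirroring the PromQL division above
errors_ps, total_ps = 0.2, 40.0
print(round(errors_ps / total_ps * 100, 2))  # 0.5 (%)
```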
## Grafana

### Access

- URL: http://localhost:3001
- Username: `admin`
- Password: Set via `GRAFANA_ADMIN_PASSWORD` (default: `admin`)
### Provisioned Datasources

Datasources are automatically configured via provisioning:

```yaml
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      httpMethod: POST
      manageAlerts: true
      prometheusType: Prometheus
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
```
### Provisioned Dashboards

Dashboards are automatically loaded from JSON files:

```yaml
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'TFG Chatbot Dashboards'
    orgId: 1
    folder: 'TFG Chatbot'
    folderUid: 'tfg-chatbot'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    options:
      path: /etc/grafana/provisioning/dashboards/json
```
## Available Dashboards

### System Health Dashboard

Location: `grafana/provisioning/dashboards/json/system-health.json`

```mermaid
graph TB
    subgraph "System Health Dashboard"
        subgraph "Service Status"
            S1[Backend Status]
            S2[Chatbot Status]
            S3[RAG Status]
            S4[Prometheus Status]
        end
        subgraph "Request Metrics"
            R1[Request Rate]
            R2[Error Rate]
            R3[Latency P95]
        end
        subgraph "Resource Usage"
            U1[Memory]
            U2[CPU]
            U3[Goroutines]
        end
    end
```
Panels:
- Service Status: UP/DOWN indicators for each service
- Request Rate: Requests per second graph
- Error Rate: 5xx errors over time
- Latency: P50, P95, P99 latency graphs
- Resource Usage: Memory and CPU per container
### Logs Dashboard

Location: `grafana/provisioning/dashboards/json/logs.json`
Panels:
- Log Volume: Logs per service over time
- Log Stream: Live log viewer
- Error Logs: Filtered error messages
- Service Filter: Filter by container/service
## Metrics Reference

### Application Metrics

FastAPI services expose metrics via `prometheus-fastapi-instrumentator`:

| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total HTTP requests |
| http_request_duration_seconds | Histogram | Request latency |
| http_requests_in_progress | Gauge | Current active requests |
### Custom Chatbot Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| tfg_chat_messages_total | Counter | role, subject | Messages sent |
| tfg_chat_response_time_seconds | Histogram | subject | LLM response time |
| tfg_test_sessions_total | Counter | subject | Tests started |
| tfg_rag_queries_total | Counter | subject | RAG queries |
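On the `/metrics` endpoint these metrics appear in the Prometheus text exposition format, one `name{labels} value` line per series. A minimal stdlib sketch that parses such output (the sample values below are illustrative, and the parser is a simplification that assumes no spaces inside label values):

```python
def parse_samples(exposition: str) -> dict[str, float]:
    """Map 'metric{labels}' -> value from Prometheus text exposition format."""
    samples = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments and blanks
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# What /metrics might expose for tfg_chat_messages_total (made-up values):
text = """\
# HELP tfg_chat_messages_total Messages sent
# TYPE tfg_chat_messages_total counter
tfg_chat_messages_total{role="user",subject="math"} 42.0
tfg_chat_messages_total{role="assistant",subject="math"} 41.0
"""
s = parse_samples(text)
print(s['tfg_chat_messages_total{role="user",subject="math"}'])  # 42.0
```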
### Phoenix LLM Metrics

Phoenix provides additional LLM-specific observability:

- Token usage per request
- LLM latency breakdown
- Tool call frequency
- Error categorization

Access the Phoenix UI at http://localhost:6006.
## Setting Up Alerts

Alerts are configured in `alertmanager/alert_rules.yml` and evaluated by Prometheus.

### Service Health Alerts

```yaml
groups:
  - name: service_health
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
```
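The `for: 1m` clause means the condition must hold continuously for a full minute before the alert fires; until then it is only *pending*. A minimal Python sketch of that pending-to-firing logic (not Prometheus code, just the idea, assuming a 15 s scrape interval):

```python
def alert_state(samples, for_seconds=60, step=15):
    """samples: booleans (condition true/false), one per scrape, `step` s apart.
    Returns the final alert state: 'inactive', 'pending', or 'firing'."""
    held = 0
    state = "inactive"
    for cond in samples:
        if not cond:
            held, state = 0, "inactive"  # condition broke: timer resets
            continue
        held += step
        state = "firing" if held >= for_seconds else "pending"
    return state

# up == 0 for three 15 s scrapes: 45 s held, still pending (< 1 m)
print(alert_state([True, True, True]))        # pending
# a fourth failing scrape crosses the 1 m threshold
print(alert_state([True, True, True, True]))  # firing
```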
Performance Alerts
- alert: HighErrorRate
expr: |
(sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/ sum(rate(http_requests_total[5m])) by (job)) > 0.01
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate on "
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "High latency on "
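`histogram_quantile` estimates the quantile by locating the bucket that contains the target rank and interpolating linearly inside it. A simplified stdlib sketch under that assumption (the bucket counts are made up):

```python
def histogram_quantile(q, buckets):
    """buckets: (upper_bound, cumulative_count) pairs, sorted, ending at +Inf.
    Linear interpolation inside the bucket holding the target rank
    (a simplification of PromQL's histogram_quantile)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # fall back to the last finite bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count

# http_request_duration_seconds buckets (illustrative cumulative counts):
buckets = [(0.1, 50), (0.5, 80), (1.0, 95), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 1.0 (seconds)
```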
## Creating Custom Dashboards

### 1. Using Grafana UI

- Navigate to http://localhost:3001
- Click + → Dashboard
- Add panels with Prometheus queries
- Save and optionally export as JSON

### 2. Export to Provisioning

- Create dashboard in UI
- Dashboard Settings → JSON Model → Copy
- Save to `grafana/provisioning/dashboards/json/`
- Restart Grafana or wait for auto-reload
Example Panel Configuration
{
"title": "Chat Messages Rate",
"type": "timeseries",
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"targets": [
{
"expr": "rate(tfg_chat_messages_total[5m])",
"legendFormat": " - "
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps"
}
}
}
## Troubleshooting

### Prometheus Not Scraping

- Check target status: http://localhost:9093/targets
- Verify the service is running: `docker compose ps`
- Check network connectivity: `docker exec prometheus wget -O- http://backend:8000/metrics`
### Grafana Cannot Connect to Prometheus

- Verify datasource: Configuration → Data Sources → Prometheus → Test
- Check the URL is `http://prometheus:9090` (Docker network)
- Verify Prometheus is healthy: `docker inspect --format='{{.State.Health.Status}}' prometheus`
### Missing Metrics

- Verify metrics endpoint: `curl http://localhost:8000/metrics`
- Check for a `/metrics` route in the application
- Ensure `prometheus-fastapi-instrumentator` is configured
### Dashboard Not Loading

- Check Grafana logs: `docker compose logs grafana`
- Verify JSON syntax in dashboard files
- Check folder permissions in container
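A quick way to catch the JSON-syntax problem above is to parse every dashboard file before restarting Grafana. A stdlib sketch; the demo uses a throwaway directory, but in real use you would point it at the provisioning path shown earlier:

```python
import json
import tempfile
from pathlib import Path

def invalid_dashboards(directory: str) -> list[str]:
    """Return names of dashboard files that fail to parse as JSON."""
    bad = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            json.loads(path.read_text())
        except json.JSONDecodeError:
            bad.append(path.name)
    return bad

# Demo against a temporary directory (in real use:
# invalid_dashboards("grafana/provisioning/dashboards/json")):
with tempfile.TemporaryDirectory() as d:
    Path(d, "good.json").write_text('{"title": "ok"}')
    Path(d, "broken.json").write_text('{"title": ')  # truncated JSON
    result = invalid_dashboards(d)
print(result)  # ['broken.json']
```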
## Best Practices

### Metric Naming

Follow Prometheus naming conventions:

- Use the `_total` suffix for counters
- Use `_seconds` or `_bytes` unit suffixes
- Use `_info` for metadata labels

### Label Cardinality

Avoid high-cardinality labels:

- ❌ User IDs, session IDs, timestamps
- ✅ Service names, status codes, methods
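Why this matters: the number of time series a metric can produce is the product of its label cardinalities, which a quick stdlib calculation makes concrete (the counts below are illustrative, not measured):

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())

# tfg_chat_messages_total with bounded labels (illustrative counts):
print(series_count({"role": 2, "subject": 10}))  # 20 series
# the same metric with a hypothetical user_id label added:
print(series_count({"role": 2, "subject": 10, "user_id": 5000}))  # 100000
```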
### Retention

Configure retention based on needs:

```yaml
# In docker-compose.yml
prometheus:
  command:
    - '--storage.tsdb.retention.time=15d'
    - '--storage.tsdb.retention.size=10GB'
```
## Related Documentation
- Alerting - Alert configuration
- Logging - Loki integration
- Docker Compose - Service configuration