Monitoring with Prometheus and Grafana

This document describes the monitoring infrastructure using Prometheus for metrics collection and Grafana for visualization.

Architecture

graph TB
    subgraph "Application Services"
        BE[Backend :8000/metrics]
        CB[Chatbot :8080/metrics]
        RAG[RAG Service :8081/metrics]
    end
    
    subgraph "Monitoring Stack"
        Prometheus[Prometheus :9093]
        Grafana[Grafana :3001]
        Alert[Alertmanager :9094]
    end
    
    Prometheus --> |scrape| BE
    Prometheus --> |scrape| CB
    Prometheus --> |scrape| RAG
    Grafana --> |query| Prometheus
    Prometheus --> |alerts| Alert

Prometheus

Configuration

# prometheus.yml
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate rules

rule_files:
  - /etc/prometheus/alert_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'backend-gateway'
    static_configs:
      - targets: ['backend:8000']

  - job_name: 'rag_service'
    static_configs:
      - targets: ['rag_service:8081']

  - job_name: 'chatbot'
    static_configs:
      - targets: ['tfg-chatbot:8080']

Scrape Jobs

Job Name          Target             Metrics Endpoint
prometheus        localhost:9090     Self-monitoring
backend-gateway   backend:8000       /metrics
rag_service       rag_service:8081   /metrics
chatbot           tfg-chatbot:8080   /metrics

Accessing Prometheus

  • URL: http://localhost:9093 (host port mapped from the container's default 9090)
  • Expression Browser: Query metrics directly
  • Targets: http://localhost:9093/targets (check scrape status)

Common Queries

# Service availability
up{job="backend-gateway"}

# Request rate (last 5 minutes)
rate(http_requests_total[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ 
sum(rate(http_requests_total[5m])) * 100

# Active sessions
tfg_active_sessions_total

# Chat messages per minute
rate(tfg_chat_messages_total[1m]) * 60
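To make the error-rate query concrete, this is the arithmetic that rate() performs, sketched in plain Python with two made-up counter samples (the numbers are illustrative, not taken from the running system):

```python
# Two samples of the http_requests_total counter, 300s (i.e. [5m]) apart.
# Counters only increase, so rate() divides the delta by the window.
window = 300  # seconds

total_t0, total_t1 = 1000, 1600   # all requests
errors_t0, errors_t1 = 20, 26     # requests with status=~"5.."

rate_total = (total_t1 - total_t0) / window     # 2.0 req/s
rate_errors = (errors_t1 - errors_t0) / window  # 0.02 req/s

error_pct = rate_errors / rate_total * 100
print(error_pct)  # 1.0 -> a 1% error rate
```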

Grafana

Access

  • URL: http://localhost:3001
  • Username: admin
  • Password: Set via GRAFANA_ADMIN_PASSWORD (default: admin)

Provisioned Datasources

Datasources are automatically configured via provisioning:

# grafana/provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      httpMethod: POST
      manageAlerts: true
      prometheusType: Prometheus

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000

Provisioned Dashboards

Dashboards are automatically loaded from JSON files:

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'TFG Chatbot Dashboards'
    orgId: 1
    folder: 'TFG Chatbot'
    folderUid: 'tfg-chatbot'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    options:
      path: /etc/grafana/provisioning/dashboards/json

Available Dashboards

System Health Dashboard

Location: grafana/provisioning/dashboards/json/system-health.json

graph TB
    subgraph "System Health Dashboard"
        subgraph "Service Status"
            S1[Backend Status]
            S2[Chatbot Status]
            S3[RAG Status]
            S4[Prometheus Status]
        end
        
        subgraph "Request Metrics"
            R1[Request Rate]
            R2[Error Rate]
            R3[Latency P95]
        end
        
        subgraph "Resource Usage"
            U1[Memory]
            U2[CPU]
            U3[Goroutines]
        end
    end

Panels:

  • Service Status: UP/DOWN indicators for each service
  • Request Rate: Requests per second graph
  • Error Rate: 5xx errors over time
  • Latency: P50, P95, P99 latency graphs
  • Resource Usage: Memory and CPU per container

Logs Dashboard

Location: grafana/provisioning/dashboards/json/logs.json

Panels:

  • Log Volume: Logs per service over time
  • Log Stream: Live log viewer
  • Error Logs: Filtered error messages
  • Service Filter: Filter by container/service

Metrics Reference

Application Metrics

FastAPI services expose metrics via prometheus-fastapi-instrumentator:

Metric                          Type       Description
http_requests_total             Counter    Total HTTP requests
http_request_duration_seconds   Histogram  Request latency
http_requests_in_progress       Gauge      Current active requests

Custom Chatbot Metrics

Metric                           Type       Labels          Description
tfg_chat_messages_total          Counter    role, subject   Messages sent
tfg_chat_response_time_seconds   Histogram  subject         LLM response time
tfg_test_sessions_total          Counter    subject         Tests started
tfg_rag_queries_total            Counter    subject         RAG queries
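Conceptually, each of these custom metrics is a labelled counter or histogram: every distinct label combination becomes its own monotonically increasing series. A toy sketch of that data model (illustrative only; the actual chatbot code would use prometheus_client, and the label values below are invented):

```python
from collections import defaultdict

class LabelledCounter:
    """Toy stand-in for a Prometheus counter with labels."""
    def __init__(self, name):
        self.name = name
        self._series = defaultdict(float)  # label set -> running total

    def inc(self, **labels):
        # Each distinct label combination is a separate time series.
        self._series[frozenset(labels.items())] += 1

    def value(self, **labels):
        return self._series[frozenset(labels.items())]

chat_messages = LabelledCounter("tfg_chat_messages_total")
chat_messages.inc(role="user", subject="algebra")
chat_messages.inc(role="assistant", subject="algebra")
chat_messages.inc(role="user", subject="algebra")

print(chat_messages.value(role="user", subject="algebra"))  # 2.0
```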

Phoenix LLM Metrics

Phoenix provides additional LLM-specific observability:

  • Token usage per request
  • LLM latency breakdown
  • Tool call frequency
  • Error categorization

Access Phoenix UI at http://localhost:6006


Setting Up Alerts

Alerts are defined in alertmanager/alert_rules.yml (mounted into the Prometheus container as /etc/prometheus/alert_rules.yml, per the rule_files setting above) and evaluated by Prometheus.

Service Health Alerts

groups:
  - name: service_health
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"

Performance Alerts

      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
           / sum(rate(http_requests_total[5m])) by (job)) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.job }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.job }}"
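The HighLatency rule depends on histogram_quantile, which linearly interpolates within cumulative (le) buckets. A simplified sketch of that interpolation, with invented bucket counts (Prometheus's real implementation handles more edge cases):

```python
import math

def hist_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, ending with +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # quantile falls in the +Inf bucket
            # Linear interpolation inside the bucket that holds the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# 1000 requests: 800 finished under 0.5s, 950 under 1s, 990 under 2s.
buckets = [(0.5, 800), (1.0, 950), (2.0, 990), (math.inf, 1000)]
print(hist_quantile(0.95, buckets))  # 1.0 -> an estimated p95 of one second
```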

Creating Custom Dashboards

1. Using Grafana UI

  1. Navigate to http://localhost:3001
  2. Click + and select Dashboard
  3. Add panels with Prometheus queries
  4. Save and optionally export as JSON

2. Export to Provisioning

  1. Create dashboard in UI
  2. Dashboard Settings → JSON Model → Copy
  3. Save to grafana/provisioning/dashboards/json/
  4. Restart Grafana or wait for auto-reload

Example Panel Configuration

{
  "title": "Chat Messages Rate",
  "type": "timeseries",
  "datasource": {
    "type": "prometheus",
    "uid": "prometheus"
  },
  "targets": [
    {
      "expr": "rate(tfg_chat_messages_total[5m])",
      "legendFormat": "{{role}} - {{subject}}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "reqps"
    }
  }
}

Troubleshooting

Prometheus Not Scraping

  1. Check target status: http://localhost:9093/targets
  2. Verify service is running: docker compose ps
  3. Check network connectivity:
    docker exec prometheus wget -O- http://backend:8000/metrics
    

Grafana Cannot Connect to Prometheus

  1. Verify datasource: Configuration → Data Sources → Prometheus → Test
  2. Check URL is http://prometheus:9090 (Docker network)
  3. Verify Prometheus is healthy:
    docker inspect --format='{{.State.Health.Status}}' prometheus
    

Missing Metrics

  1. Verify metrics endpoint:
    curl http://localhost:8000/metrics
    
  2. Check for /metrics route in application
  3. Ensure prometheus-fastapi-instrumentator is configured

Dashboard Not Loading

  1. Check Grafana logs: docker compose logs grafana
  2. Verify JSON syntax in dashboard files
  3. Check folder permissions in container
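Step 2 is easy to script. A small helper for checking provisioned dashboard files before restarting Grafana (illustrative; the path matches the provisioning layout described earlier):

```python
import json
from pathlib import Path

def check_dashboard(path):
    """Return True if the file parses as JSON, False otherwise."""
    try:
        json.loads(Path(path).read_text())
        return True
    except (ValueError, OSError):
        return False

# Validate every provisioned dashboard file.
for f in Path("grafana/provisioning/dashboards/json").glob("*.json"):
    status = "OK" if check_dashboard(f) else "BROKEN"
    print(f"{f.name}: {status}")
```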

Best Practices

Metric Naming

Follow Prometheus naming conventions:

  • Use _total suffix for counters
  • Use _seconds or _bytes for units
  • Use _info for metadata labels
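Names can be sanity-checked against the metric-name grammar from the Prometheus data model (letters, digits, underscores, and colons, not starting with a digit; colons are conventionally reserved for recording rules):

```python
import re

# Prometheus metric-name grammar: [a-zA-Z_:][a-zA-Z0-9_:]*
METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")

for name in ["tfg_chat_messages_total",       # good: counter with _total suffix
             "http_request_duration_seconds", # good: base unit in the name
             "tfg-chat-messages"]:            # bad: hyphens are not allowed
    print(name, bool(METRIC_NAME.match(name)))
```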

Label Cardinality

Avoid high-cardinality labels:

  • ❌ User IDs, session IDs, timestamps
  • ✅ Service names, status codes, methods
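The reason the ❌ labels are dangerous is multiplicative: every distinct label combination is stored as a separate time series. Illustrative arithmetic (all counts assumed, not measured):

```python
# Bounded labels keep the series count small...
services, status_codes, methods = 3, 5, 4
bounded = services * status_codes * methods
print(bounded)          # 60 series

# ...but one unbounded label (e.g. a user ID) multiplies everything.
users = 100_000
print(bounded * users)  # 6,000,000 series
```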

Retention

Configure retention based on needs:

# In docker-compose.yml
prometheus:
  command:
    - '--storage.tsdb.retention.time=15d'
    - '--storage.tsdb.retention.size=10GB'
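A back-of-the-envelope disk estimate helps when choosing these limits. All numbers below are assumptions for illustration (series counts are guesses, and ~2 bytes per compressed sample is a commonly quoted ballpark, not a guarantee):

```python
# Four scrape jobs, assumed ~500 series each, scraped every 15s.
series = 4 * 500
samples_per_sec = series / 15

retention_days = 15
retention_sec = retention_days * 24 * 3600

bytes_per_sample = 2  # rough post-compression ballpark
disk_gb = samples_per_sec * retention_sec * bytes_per_sample / 1e9
print(round(disk_gb, 2))  # ~0.35 GB, comfortably under the 10GB size cap
```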