Monitoring Guide

The TFG-Chatbot uses a comprehensive observability stack with Grafana for metrics/logs and Phoenix for LLM tracing.


Monitoring Stack Overview

Service        Purpose                        Port   URL
Grafana        Dashboards & visualization     3001   http://localhost:3001
Phoenix        LLM observability & tracing    6006   http://localhost:6006
Prometheus     Metrics collection             9093   http://localhost:9093
Loki           Log aggregation                3100   http://localhost:3100
Alertmanager   Alert routing                  9094   http://localhost:9094

Starting the Monitoring Stack

All monitoring services are defined in docker-compose.yml:

# Start all services including monitoring
docker compose up -d

# Verify monitoring services
docker compose ps | grep -E "grafana|phoenix|prometheus|loki"

Phoenix - LLM Observability

Phoenix (by Arize AI) provides detailed tracing for LangChain/LangGraph operations.

Accessing Phoenix

Open http://localhost:6006 in your browser.

What Phoenix Tracks

  • LLM Calls: Every call to Gemini/vLLM with prompts and responses
  • Chain Execution: Full LangGraph node execution traces
  • Tool Calls: RAG searches, course guide (guía docente) lookups, web searches
  • Token Usage: Input/output token counts per request
  • Latency: Time spent in each component

Phoenix Dashboard Features

  1. Traces View: See all LLM interactions with full context
  2. Spans: Drill down into individual operations
  3. Latency Distribution: Identify slow operations
  4. Error Tracking: Find failed LLM calls

Example Trace

A typical chat interaction shows:

📦 Chat Request
├── 🔗 LangGraph: should_continue
├── 🤖 LLM Call: Gemini (tool selection)
├── 🔧 Tool: rag_search
│   └── 📊 Embedding: nomic-embed-text
├── 🤖 LLM Call: Gemini (response generation)
└── ✅ Response returned

Configuration

Phoenix is configured via environment variables in docker-compose.yml:

chatbot:
  environment:
    PHOENIX_HOST: phoenix
    PHOENIX_PORT: "6006"
    PHOENIX_PROJECT_NAME: tfg-chatbot

To disable Phoenix tracing:

PHOENIX_ENABLED=false docker compose up -d chatbot
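How the chatbot consumes these variables is project-specific; as an illustrative sketch (the helper name `phoenix_settings` is hypothetical, not actual chatbot code), reading the compose environment and the `PHOENIX_ENABLED` kill switch could look like:

```python
import os

def phoenix_settings(env=None):
    """Read the Phoenix-related variables set in docker-compose.yml.

    `enabled` gates instrumentation (PHOENIX_ENABLED=false disables it);
    the collector endpoint is built from PHOENIX_HOST/PHOENIX_PORT.
    Defaults mirror the compose file above.
    """
    env = os.environ if env is None else env
    return {
        "enabled": env.get("PHOENIX_ENABLED", "true").lower() != "false",
        "endpoint": f'http://{env.get("PHOENIX_HOST", "phoenix")}:{env.get("PHOENIX_PORT", "6006")}',
        "project": env.get("PHOENIX_PROJECT_NAME", "tfg-chatbot"),
    }

# With the compose environment shown above:
settings = phoenix_settings({"PHOENIX_HOST": "phoenix", "PHOENIX_PORT": "6006",
                             "PHOENIX_PROJECT_NAME": "tfg-chatbot"})
```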

Grafana - Metrics & Logs

Grafana provides unified dashboards for system metrics and logs.

Accessing Grafana

  1. Open http://localhost:3001
  2. Login with:
    • Username: admin
    • Password: admin (or GRAFANA_ADMIN_PASSWORD from .env)

Pre-configured Datasources

Datasource   Purpose
Prometheus   Service metrics (HTTP requests, latency, errors)
Loki         Aggregated container logs

Available Dashboards

System Health Dashboard

Located at: Dashboards > System Health

Panels include:

  • Service Status: Up/down status for all services
  • Request Rate: HTTP requests per second by service
  • Error Rate: 5xx errors percentage
  • Latency (P95): 95th percentile response times

Logs Dashboard

Located at: Dashboards > Logs

Features:

  • Log Stream: Real-time logs from all containers
  • Filter by Service: {container_name="tfg-chatbot"}
  • Search: Full-text search across logs
  • Error Highlighting: Automatic error detection

Useful Queries

Prometheus (Metrics)

# Request rate by service
sum(rate(http_requests_total[5m])) by (job)

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/ sum(rate(http_requests_total[5m])) by (job) * 100

# P95 latency
histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
)
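In `histogram_quantile`, Prometheus finds the first bucket whose cumulative count reaches the target rank and interpolates linearly within it. A self-contained sketch of that arithmetic (illustrative, not Prometheus source):

```python
def histogram_quantile(q, buckets):
    """Approximate Prometheus histogram_quantile().

    `buckets` is a sorted list of (upper_bound, cumulative_count) pairs,
    ending with (float('inf'), total) -- i.e. the `le` buckets of a
    Prometheus histogram. Interpolates linearly inside the bucket that
    contains the q-th rank.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # rank falls in the +Inf bucket: return last finite bound
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Cumulative counts for le="0.1", "0.5", "1", "+Inf":
buckets = [(0.1, 40), (0.5, 90), (1.0, 99), (float("inf"), 100)]
p95 = histogram_quantile(0.95, buckets)  # rank 95 falls in the (0.5, 1.0] bucket
```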

Loki (Logs)

# All chatbot logs
{container_name="tfg-chatbot"}

# Error logs only
{container_name=~"tfg-.*"} |= "ERROR"

# Search for specific user
{container_name="tfg-gateway"} |~ "user_id=estudiante"

# JSON parsing (structured logs)
{container_name="tfg-chatbot"} | json | level="error"
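The `| json | level="error"` pipeline parses each log line as JSON and filters on the extracted field. The same logic in plain Python, as a rough equivalent (the function name is illustrative):

```python
import json

def filter_error_logs(lines):
    """Rough Python equivalent of the LogQL pipeline
    {container_name="tfg-chatbot"} | json | level="error":
    parse each line as JSON, keep entries whose `level` is "error".
    Non-JSON lines are skipped here; LogQL instead tags them
    with a __error__ label.
    """
    out = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("level") == "error":
            out.append(entry)
    return out

logs = [
    '{"level": "info", "msg": "request handled"}',
    '{"level": "error", "msg": "LLM call failed"}',
    'plain text line',
]
errors = filter_error_logs(logs)
```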

Creating Custom Dashboards

  1. Click + > New Dashboard
  2. Add a panel
  3. Select datasource (Prometheus or Loki)
  4. Write your query
  5. Configure visualization
  6. Save dashboard

Prometheus - Metrics Collection

Prometheus scrapes metrics from all services.

Accessing Prometheus

Open http://localhost:9093 for the Prometheus UI.

Scraped Targets

Configured in prometheus.yml:

scrape_configs:
  - job_name: 'backend-gateway'
    static_configs:
      - targets: ['backend:8000']

  - job_name: 'rag_service'
    static_configs:
      - targets: ['rag_service:8081']

  - job_name: 'chatbot'
    static_configs:
      - targets: ['tfg-chatbot:8080']
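Each target listed above must expose a /metrics endpoint in the Prometheus text exposition format. The real services presumably use a client library, but a minimal stdlib-only sketch shows what Prometheus actually scrapes (metric name and label values taken from the queries in this guide):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS = 0  # a counter the application would increment per request

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Prometheus text exposition format: HELP/TYPE comments, then samples
        body = (
            "# HELP http_requests_total Total HTTP requests.\n"
            "# TYPE http_requests_total counter\n"
            f'http_requests_total{{job="chatbot"}} {REQUESTS}\n'
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve on an ephemeral port and scrape it once, as Prometheus would
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics") as resp:
    text = resp.read().decode()
server.shutdown()
```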

Checking Target Health

  1. Go to Status > Targets
  2. All targets should show UP in green

Alerting

Alerts are configured in alertmanager/alert_rules.yml.

Pre-configured Alerts

Alert           Condition                       Severity
ServiceDown     Service unreachable for 1 min   Critical
HighErrorRate   Error rate > 1% for 5 min       Warning
HighLatency     P95 latency > 2s for 5 min      Warning

Viewing Active Alerts

  1. Prometheus: Alerts tab → shows firing alerts
  2. Alertmanager: http://localhost:9094 → alert routing

Customizing Alerts

Edit alertmanager/alert_rules.yml:

groups:
  - name: custom_alerts
    rules:
      - alert: ChatbotHighLatency
        expr: |
          histogram_quantile(0.95, 
            sum(rate(http_request_duration_seconds_bucket{job="chatbot"}[5m])) by (le)
          ) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Chatbot latency is high"

Log Collection with Promtail

Promtail collects Docker container logs and sends them to Loki.

Configuration

See promtail-config.yml:

scrape_configs:
  - job_name: containers
    static_configs:
      - targets:
          - localhost
        labels:
          job: containerlogs
    docker_sd_configs:
      - host: unix:///var/run/docker.sock

Troubleshooting Logs

# Check Promtail is collecting logs
docker logs promtail

# Verify Loki is receiving logs
curl http://localhost:3100/ready

Monitoring Best Practices

For Development

  1. Use Phoenix for debugging LLM behavior
  2. Check Grafana logs when errors occur
  3. Monitor latency panels during testing

For Production

  1. Set up alert notifications (email, Slack)
  2. Create SLO dashboards (availability, latency)
  3. Enable long-term storage for Prometheus
  4. Configure log retention in Loki

Quick Health Check

# Check all monitoring services
curl -s http://localhost:9093/-/healthy && echo "Prometheus OK"
curl -s http://localhost:3100/ready && echo "Loki OK"
curl -s http://localhost:3001/api/health && echo "Grafana OK"
curl -s http://localhost:6006 > /dev/null && echo "Phoenix OK"

Troubleshooting

Phoenix not receiving traces

  1. Check chatbot can reach Phoenix:
    docker exec tfg-chatbot curl -s http://phoenix:6006
    
  2. Verify instrumentation is enabled:
    docker logs tfg-chatbot | grep -i phoenix
    

Grafana shows no data

  1. Check Prometheus targets are UP
  2. Verify datasource configuration in Grafana
  3. Ensure time range includes recent data

Logs not appearing in Loki

  1. Check Promtail has access to Docker socket
  2. Verify containers are producing logs:
    docker logs tfg-chatbot --tail 10
    
  3. Test Loki query in Grafana Explore