Monitoring Guide
The TFG-Chatbot uses a comprehensive observability stack with Grafana for metrics/logs and Phoenix for LLM tracing.
Monitoring Stack Overview
| Service | Purpose | Port | URL |
|---|---|---|---|
| Grafana | Dashboards & visualization | 3001 | http://localhost:3001 |
| Phoenix | LLM observability & tracing | 6006 | http://localhost:6006 |
| Prometheus | Metrics collection | 9093 | http://localhost:9093 |
| Loki | Log aggregation | 3100 | http://localhost:3100 |
| Alertmanager | Alert routing | 9094 | http://localhost:9094 |
Starting the Monitoring Stack
All monitoring services are included in docker-compose:
```bash
# Start all services including monitoring
docker compose up -d

# Verify monitoring services
docker compose ps | grep -E "grafana|phoenix|prometheus|loki"
```
Phoenix - LLM Observability
Phoenix (by Arize AI) provides detailed tracing for LangChain/LangGraph operations.
Accessing Phoenix
Open http://localhost:6006 in your browser.
What Phoenix Tracks
- LLM Calls: Every call to Gemini/vLLM with prompts and responses
- Chain Execution: Full LangGraph node execution traces
- Tool Calls: RAG searches, guía docente (course guide) lookups, web searches
- Token Usage: Input/output token counts per request
- Latency: Time spent in each component
Phoenix Dashboard Features
- Traces View: See all LLM interactions with full context
- Spans: Drill down into individual operations
- Latency Distribution: Identify slow operations
- Error Tracking: Find failed LLM calls
Example Trace
A typical chat interaction shows:
```
📦 Chat Request
├── 🔗 LangGraph: should_continue
├── 🤖 LLM Call: Gemini (tool selection)
├── 🔧 Tool: rag_search
│   └── 📊 Embedding: nomic-embed-text
├── 🤖 LLM Call: Gemini (response generation)
└── ✅ Response returned
```
Configuration
Phoenix is configured via environment variables in docker-compose.yml:
```yaml
chatbot:
  environment:
    PHOENIX_HOST: phoenix
    PHOENIX_PORT: "6006"
    PHOENIX_PROJECT_NAME: tfg-chatbot
```
To disable Phoenix tracing:
```bash
PHOENIX_ENABLED=false docker compose up -d chatbot
```
Grafana - Metrics & Logs
Grafana provides unified dashboards for system metrics and logs.
Accessing Grafana
- Open http://localhost:3001
- Log in with:
  - Username: `admin`
  - Password: `admin` (or `GRAFANA_ADMIN_PASSWORD` from `.env`)
Pre-configured Datasources
| Datasource | Purpose |
|---|---|
| Prometheus | Service metrics (HTTP requests, latency, errors) |
| Loki | Aggregated container logs |
Available Dashboards
System Health Dashboard
Located at: Dashboards > System Health
Panels include:
- Service Status: Up/down status for all services
- Request Rate: HTTP requests per second by service
- Error Rate: 5xx errors percentage
- Latency (P95): 95th percentile response times
Logs Dashboard
Located at: Dashboards > Logs
Features:
- Log Stream: Real-time logs from all containers
- Filter by Service: `{container_name="tfg-chatbot"}`
- Search: Full-text search across logs
- Error Highlighting: Automatic error detection
Useful Queries
Prometheus (Metrics)
```promql
# Request rate by service
sum(rate(http_requests_total[5m])) by (job)

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
  / sum(rate(http_requests_total[5m])) by (job) * 100

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
)
```
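For availability-style panels, the standard `up` metric that Prometheus records for every scrape target is useful; a sketch:

```promql
# Fraction of scrapes in the last hour where each target was reachable
avg_over_time(up[1h])

# Same, aggregated to a percentage per job
avg(avg_over_time(up[1h])) by (job) * 100
```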
Loki (Logs)
```logql
# All chatbot logs
{container_name="tfg-chatbot"}

# Error logs only
{container_name=~"tfg-.*"} |= "ERROR"

# Search for specific user
{container_name="tfg-gateway"} |~ "user_id=estudiante"

# JSON parsing (structured logs)
{container_name="tfg-chatbot"} | json | level="error"
```
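LogQL can also turn logs into metrics for graph panels; a sketch, using the same `container_name` label as the queries above:

```logql
# Error log rate per container over 5-minute windows
sum(rate({container_name=~"tfg-.*"} |= "ERROR" [5m])) by (container_name)
```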
Creating Custom Dashboards
- Click + → New Dashboard
- Add a panel
- Select datasource (Prometheus or Loki)
- Write your query
- Configure visualization
- Save dashboard
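Dashboards built in the UI can also be exported as JSON and provisioned as code, so they survive container rebuilds. A minimal sketch of a Grafana provisioning file (the file path and folder name are assumptions):

```yaml
# e.g. grafana/provisioning/dashboards/tfg.yml (hypothetical path)
apiVersion: 1
providers:
  - name: 'tfg-dashboards'
    folder: 'TFG'
    type: file
    options:
      path: /var/lib/grafana/dashboards   # mount exported dashboard JSON here
```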
Prometheus - Metrics Collection
Prometheus scrapes metrics from all services.
Accessing Prometheus
Open http://localhost:9093 for the Prometheus UI.
Scraped Targets
Configured in prometheus.yml:
```yaml
scrape_configs:
  - job_name: 'backend-gateway'
    static_configs:
      - targets: ['backend:8000']
  - job_name: 'rag_service'
    static_configs:
      - targets: ['rag_service:8081']
  - job_name: 'chatbot'
    static_configs:
      - targets: ['tfg-chatbot:8080']
```
Checking Target Health
- Go to Status → Targets
- All targets should show UP in green
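Target health can also be read from the Prometheus HTTP API. The sketch below runs the extraction step against an abridged, hypothetical response body so the parsing is visible; against the live stack you would pipe `curl -s http://localhost:9093/api/v1/targets` into the same `grep`:

```shell
# Abridged sample of an /api/v1/targets response body (hypothetical data)
resp='{"data":{"activeTargets":[{"labels":{"job":"chatbot"},"health":"up"}]}}'

# Extract each target's health field
echo "$resp" | grep -o '"health":"[a-z]*"'
```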
Alerting
Alerts are configured in alertmanager/alert_rules.yml.
Pre-configured Alerts
| Alert | Condition | Severity |
|---|---|---|
| ServiceDown | Service unreachable for 1 min | Critical |
| HighErrorRate | Error rate > 1% for 5 min | Warning |
| HighLatency | P95 latency > 2s for 5 min | Warning |
Viewing Active Alerts
- Prometheus: Alerts tab → shows firing alerts
- Alertmanager: http://localhost:9094 → alert routing
Customizing Alerts
Edit alertmanager/alert_rules.yml:
```yaml
groups:
  - name: custom_alerts
    rules:
      - alert: ChatbotHighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="chatbot"}[5m])) by (le)
          ) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Chatbot latency is high"
```
Log Collection with Promtail
Promtail collects Docker container logs and sends them to Loki.
Configuration
See promtail-config.yml:
```yaml
scrape_configs:
  - job_name: containers
    static_configs:
      - targets:
          - localhost
        labels:
          job: containerlogs
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
```
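With `docker_sd_configs`, Promtail exposes the Docker container name as the `__meta_docker_container_name` meta-label (with a leading slash). A relabel rule like the sketch below is typically what makes the `container_name` label used in the Loki queries available:

```yaml
relabel_configs:
  # "/tfg-chatbot" -> container_name="tfg-chatbot"
  - source_labels: ['__meta_docker_container_name']
    regex: '/(.*)'
    target_label: 'container_name'
```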
Troubleshooting Logs
```bash
# Check Promtail is collecting logs
docker logs promtail

# Verify Loki is receiving logs
curl http://localhost:3100/ready
```
Monitoring Best Practices
For Development
- Use Phoenix for debugging LLM behavior
- Check Grafana logs when errors occur
- Monitor latency panels during testing
For Production
- Set up alert notifications (email, Slack)
- Create SLO dashboards (availability, latency)
- Enable long-term storage for Prometheus
- Configure log retention in Loki
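As a starting point for alert notifications, an Alertmanager route with a Slack receiver might look like the sketch below (the webhook URL and channel name are placeholders, not values from this project):

```yaml
# alertmanager.yml (sketch)
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'job']
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder
        channel: '#tfg-alerts'
        send_resolved: true
```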
Quick Health Check
```bash
# Check all monitoring services
curl -s http://localhost:9093/-/healthy && echo "Prometheus OK"
curl -s http://localhost:3100/ready && echo "Loki OK"
curl -s http://localhost:3001/api/health && echo "Grafana OK"
curl -s http://localhost:6006 > /dev/null && echo "Phoenix OK"
```
Troubleshooting
Phoenix not receiving traces
- Check the chatbot can reach Phoenix: `docker exec tfg-chatbot curl -s http://phoenix:6006`
- Verify instrumentation is enabled: `docker logs tfg-chatbot | grep -i phoenix`
Grafana shows no data
- Check Prometheus targets are UP
- Verify datasource configuration in Grafana
- Ensure time range includes recent data
Logs not appearing in Loki
- Check Promtail has access to Docker socket
- Verify containers are producing logs: `docker logs tfg-chatbot --tail 10`
- Test the Loki query in Grafana Explore