RAG Service Deployment

This document covers deploying the RAG service in Docker and production environments.

Docker Deployment

Dockerfile

The service uses a multi-stage build so that build tooling and pip caches stay out of the runtime image:

# Stage 1: Build dependencies
FROM python:3.14-slim AS builder
WORKDIR /build
RUN pip install --no-cache-dir --upgrade pip setuptools wheel
COPY rag_service/pyproject.toml rag_service/
COPY rag_service/__init__.py rag_service/
ARG INSTALL_DEV=false
RUN if [ "$INSTALL_DEV" = "true" ]; then \
        pip install --no-cache-dir "./rag_service[dev]"; \
    else \
        pip install --no-cache-dir ./rag_service; \
    fi

# Stage 2: Runtime
FROM python:3.14-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.14/site-packages /usr/local/lib/python3.14/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY rag_service rag_service/
RUN useradd -m -u 1000 appuser && \
    mkdir -p /app/documents && \
    chown -R appuser:appuser /app
USER appuser
EXPOSE 8081
CMD ["uvicorn", "rag_service.api:app", "--host", "0.0.0.0", "--port", "8081"]

Build Commands

# Production build
docker build -f rag_service/Dockerfile -t rag-service:latest .

# Development build (with test dependencies)
docker build -f rag_service/Dockerfile \
  --build-arg INSTALL_DEV=true \
  -t rag-service:dev .

Docker Compose

Service Configuration

# docker-compose.yml
rag_service:
  build:
    context: .
    dockerfile: ./rag_service/Dockerfile
  container_name: rag_service
  environment:
    - QDRANT_HOST=qdrant
    - QDRANT_PORT=6333
    - OLLAMA_HOST=ollama
    - OLLAMA_PORT=11434
    - OLLAMA_MODEL=nomic-embed-text
    - DOCUMENTS_PATH=/app/documents
    - TOP_K_RESULTS=5
    - SIMILARITY_THRESHOLD=0.5
  ports:
    - "8081:8081"
  volumes:
    - ./rag_service/documents:/app/documents
  depends_on:
    - qdrant
    - ollama
  restart: unless-stopped
  healthcheck:
    # python:*-slim images do not ship curl, so probe with the stdlib instead
    test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8081/health')"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 40s

Dependent Services

qdrant:
  image: qdrant/qdrant:latest
  container_name: qdrant
  ports:
    - "6333:6333"
    - "6334:6334"  # gRPC
  volumes:
    - qdrant_storage:/qdrant/storage
  restart: unless-stopped

ollama:
  image: ollama/ollama:latest
  container_name: ollama
  ports:
    - "11435:11434"  # host 11435 → container 11434 (avoids clashing with a host-level Ollama)
  volumes:
    - ollama_models:/root/.ollama
  restart: unless-stopped

volumes:
  qdrant_storage:
  ollama_models:

Deploy Commands

# Start all services
docker compose up -d qdrant ollama rag_service

# Initialize embedding model
docker exec ollama ollama pull nomic-embed-text

# Check status
docker compose ps
docker compose logs -f rag_service

Production Considerations

Resource Limits

rag_service:
  deploy:
    resources:
      limits:
        memory: 2G
        cpus: '1.0'
      reservations:
        memory: 512M
        cpus: '0.25'

Health Checks

The service exposes a health endpoint:

# Check health
curl http://localhost:8081/health

# Response
{
  "status": "healthy",
  "qdrant_connected": true,
  "collection": {
    "name": "academic_documents",
    "points_count": 156,
    "status": "green"
  }
}
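
Orchestration hooks (readiness gates, smoke tests) can interpret this payload. A minimal sketch, assuming the response shape shown above; service_ready is a hypothetical helper, not part of the service:

```python
def service_ready(payload: dict) -> bool:
    """Treat the service as ready only when it reports healthy,
    Qdrant is reachable, and the collection status is green."""
    return (
        payload.get("status") == "healthy"
        and payload.get("qdrant_connected") is True
        and payload.get("collection", {}).get("status") == "green"
    )
```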

Logging

Configure structured JSON logging for production:

# logging_config.py
import logging
import json

class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module
        })
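
One way to attach the formatter to the root logger (a sketch; the formatter is repeated here so the example is self-contained, and the actual entry point of the service may wire logging differently):

```python
import json
import logging
import sys

class JSONFormatter(logging.Formatter):
    """Same formatter as in logging_config.py above."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module
        })

# Emit JSON lines to stdout, which Docker captures as container logs
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("rag_service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```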

Volumes and Persistence

Document Storage

volumes:
  - ./documents:/app/documents        # Local dev
  - documents_data:/app/documents     # Named volume

Qdrant Data

volumes:
  - qdrant_storage:/qdrant/storage    # Vector data

Ollama Models

volumes:
  - ollama_models:/root/.ollama       # Model cache

Environment Configuration

Production Environment

# .env.production
QDRANT_HOST=qdrant
QDRANT_PORT=6333
QDRANT_COLLECTION_NAME=academic_documents

OLLAMA_HOST=ollama
OLLAMA_PORT=11434
OLLAMA_MODEL=nomic-embed-text

EMBEDDING_DIMENSION=768
TOP_K_RESULTS=5
SIMILARITY_THRESHOLD=0.5

CHUNK_SIZE=1000
CHUNK_OVERLAP=200

DOCUMENTS_PATH=/app/documents
CORS_ORIGINS=["https://your-domain.com"]
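
CHUNK_SIZE and CHUNK_OVERLAP control how documents are split before embedding. A hypothetical sliding-window sketch of how the two settings interact (the service's actual chunker may split on sentence or token boundaries instead):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters, where each
    window starts chunk_size - chunk_overlap after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With the defaults above, consecutive chunks share their last/first 200 characters, so a sentence cut by one window boundary is still intact in the neighboring chunk.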

Using Environment File

rag_service:
  env_file:
    - .env.production

Scaling

Horizontal Scaling

The RAG service is stateless and can be scaled horizontally. Remove container_name and the fixed host-port mapping first; both must be unique per container:

rag_service:
  deploy:
    replicas: 3

Load Balancing

Use Nginx or Traefik for load balancing:

upstream rag_service {
    server rag_service_1:8081;
    server rag_service_2:8081;
    server rag_service_3:8081;
}

server {
    location /rag/ {
        proxy_pass http://rag_service/;
    }
}

Qdrant Scaling

For high-volume deployments, consider Qdrant cluster mode:

qdrant:
  environment:
    - QDRANT__CLUSTER__ENABLED=true

Monitoring

Prometheus Metrics

The service exposes metrics at /metrics:

curl http://localhost:8081/metrics

Available metrics:

  • http_requests_total - Request count by path/status
  • http_request_duration_seconds - Latency histogram
  • http_requests_in_progress - Active requests

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'rag_service'
    static_configs:
      - targets: ['rag_service:8081']
    metrics_path: /metrics

Grafana Dashboard

Key panels:

  • Request rate by endpoint
  • Error rate
  • P95 latency
  • Active connections

Security

Non-Root User

The container runs as a non-root user:

RUN useradd -m -u 1000 appuser
USER appuser

CORS Configuration

Restrict origins in production:

CORS_ORIGINS=["https://your-frontend.com"]

Network Isolation

Use Docker networks:

networks:
  backend:
    driver: bridge

services:
  rag_service:
    networks:
      - backend
  qdrant:
    networks:
      - backend

Secrets Management

Use Docker secrets for sensitive data:

secrets:
  api_key:
    file: ./secrets/api_key.txt

services:
  rag_service:
    secrets:
      - api_key

Backup and Recovery

Qdrant Backup

# Create snapshot
curl -X POST http://localhost:6333/collections/academic_documents/snapshots

# List snapshots
curl http://localhost:6333/collections/academic_documents/snapshots

# Restore from snapshot
curl -X PUT http://localhost:6333/collections/academic_documents/snapshots/recover \
  -H "Content-Type: application/json" \
  -d '{"location": "http://storage/snapshot.tar"}'
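
For scheduled backups, the snapshot call above can be scripted. A sketch wrapping the same POST endpoint with the stdlib; the injectable opener parameter is an illustration device for testing, not a Qdrant API:

```python
import json
import urllib.request

def create_snapshot(collection: str, host: str = "localhost", port: int = 6333,
                    opener=urllib.request.urlopen) -> dict:
    """POST to Qdrant's collection snapshot endpoint and return
    the parsed JSON response."""
    url = f"http://{host}:{port}/collections/{collection}/snapshots"
    req = urllib.request.Request(url, method="POST")
    with opener(req) as resp:
        return json.loads(resp.read())
```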

Document Backup

# Backup documents volume
docker run --rm \
  -v rag_documents:/data \
  -v $(pwd)/backup:/backup \
  alpine tar czf /backup/documents.tar.gz -C /data .

# Restore
docker run --rm \
  -v rag_documents:/data \
  -v $(pwd)/backup:/backup \
  alpine tar xzf /backup/documents.tar.gz -C /data

CI/CD Integration

GitHub Actions

# .github/workflows/build-rag.yml
name: Build RAG Service

on:
  push:
    paths:
      - 'rag_service/**'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Build image
        run: |
          docker build -f rag_service/Dockerfile \
            -t rag-service:${{ github.sha }} .
      
      - name: Run tests
        run: |
          docker build -f rag_service/Dockerfile \
            --build-arg INSTALL_DEV=true \
            -t rag-service:test .
          docker run --rm rag-service:test \
            pytest tests/ -m "not integration"

Image Publishing

- name: Push to registry
  run: |
    docker tag rag-service:${{ github.sha }} \
      ghcr.io/${{ github.repository_owner }}/rag-service:latest
    docker push ghcr.io/${{ github.repository_owner }}/rag-service:latest

Troubleshooting

Container Won't Start

# Check logs
docker compose logs rag_service

# Common issues:
# - Qdrant not ready: Check depends_on and health
# - Port conflict: Change port mapping
# - Missing model: Run ollama pull

Connection Issues

# Verify networking
docker exec rag_service ping qdrant
docker exec rag_service ping ollama

# Check service discovery
docker exec rag_service nslookup qdrant

Performance Issues

# Check resource usage
docker stats rag_service

# Increase memory if needed
deploy:
  resources:
    limits:
      memory: 4G