ADR 0008 — Use Ollama for local embedding model serving (RAG)

Status

Accepted

Context

This project uses a Retrieval-Augmented Generation (RAG) pipeline that depends on producing vector embeddings for documents and queries, indexing them in a vector store, and performing similarity search to retrieve relevant context for downstream LLM prompts.

For day-to-day development, reproducible local testing, and privacy-sensitive work, the team wants a straightforward way to run embedding-capable models locally and expose a stable API for the ingestion and query pipelines.

Ollama provides an easy-to-install local model-serving runtime (CLI + HTTP interface) that can host embedding-capable models and present a consistent endpoint for generating embeddings during development and small experiments.

Decision

Recommend using Ollama as the default local-serving runtime for embedding models used by the RAG pipeline during development and testing. Ollama will be used to:

  • Host local embedding-capable model checkpoints used during development.
  • Provide a consistent HTTP/CLI endpoint for the ingestion process and local integration tests.

Production embedding serving (for scale, throughput or managed SLAs) should use dedicated runtimes or hosted embedding services (vLLM where applicable, a managed cloud embedding API, or a dedicated embedding microservice backed by GPUs/accelerators).

Consequences

  • Pros:
    • Low friction for developers: quick local setup and consistent API for embeddings.
    • Reproducible local runs for ingestion tests and RAG experiments without relying on external APIs.
    • Helps preserve data privacy by avoiding sending sensitive documents to third-party embedding services during development.
  • Cons / Trade-offs:
    • Ollama is primarily intended as a local/dev runtime and is not a full production embedding service for high-throughput workloads.
    • Not all embedding model checkpoints may be readily available or supported; some models may require conversion or packaging.
    • Embedding throughput will depend on local hardware; CI and production should use scaled runtimes when needed.
    • API differences compared to cloud embedding providers may require adapters to keep client code portable.

Alternatives considered

  • Run embedding models directly using Hugging Face Transformers / sentence-transformers locally: offers maximum control but increases setup complexity and environment variability across developers.
  • Use cloud embedding APIs (OpenAI, Cohere, Anthropic): scalable and easy to integrate, but adds cost and sends data offsite (privacy/cost trade-offs).
  • Use vLLM, Triton, or other high-performance serving stacks for embeddings in production: better throughput and latency but more operational overhead.

Mitigations / Recommendations

  • Add a small docs/ADR/README.md (or extend existing docs) with recommended Ollama commands to run the chosen embedding model and example client calls (curl/Python) used by the ingestion pipeline.
  • Implement an adapter layer for embedding requests and responses so code can switch between Ollama, cloud APIs, or other local runtimes with minimal changes.
  • Provide CI fallbacks: for lightweight CI runners, use a small embedding model or a mocked embedding endpoint to keep tests fast and deterministic.

References

  • https://ollama.ai/