ADR 0008 — Use Ollama for local embedding model serving (RAG)

Status

Accepted

Context

This project uses a Retrieval-Augmented Generation (RAG) pipeline that depends on producing vector embeddings for documents and queries, indexing them in a vector store, and performing similarity search to retrieve relevant context for downstream LLM prompts.
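
As a rough illustration of the retrieval step described above, the sketch below ranks stored vectors by cosine similarity against a query vector. It is a brute-force, pure-Python stand-in for a vector store, not the project's implementation; `cosine_similarity`, `top_k`, and the in-memory `index` layout are hypothetical names chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=3):
    """Return the k best (doc_id, score) pairs.

    `index` is a list of (doc_id, vector) tuples, standing in for a vector store.
    """
    scored = [(doc_id, cosine_similarity(query_vector, vec)) for doc_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A real vector store would replace the linear scan with an approximate nearest-neighbour index, but the contract (query vector in, ranked document IDs out) stays the same.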

For day-to-day development, reproducible local testing, and privacy-sensitive work, the team wants a straightforward way to run embedding-capable models locally and expose a stable API for the ingestion and query pipelines.

Ollama provides an easy-to-install local model-serving runtime (CLI + HTTP interface) that can host embedding-capable models and present a consistent endpoint for generating embeddings during development and small experiments.

Decision

Use Ollama as the default local serving runtime for the RAG pipeline's embedding models during development and testing. Ollama will be used to:

- Host the local embedding-capable model checkpoints used during development.
- Provide a consistent HTTP/CLI endpoint for the ingestion process and local integration tests.
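
A minimal sketch of what a client call against the local endpoint might look like, using only the Python standard library. It assumes an Ollama server listening on its default port (11434), the `/api/embed` route, and a model that has already been fetched with `ollama pull`; the model name `nomic-embed-text` is only an example, and the function names are hypothetical rather than part of this project.

```python
import json
import urllib.request

# Assumptions for illustration: default local port and an example model name.
OLLAMA_URL = "http://localhost:11434/api/embed"
MODEL = "nomic-embed-text"


def build_embed_request(texts, model=MODEL, url=OLLAMA_URL):
    """Build (without sending) the POST request for a batch of texts."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embed_response(body):
    """Extract the list of vectors (one per input text) from the JSON body."""
    return json.loads(body)["embeddings"]


def embed(texts, model=MODEL, url=OLLAMA_URL, timeout=30.0):
    """Send texts to a running Ollama server and return one vector per text."""
    request = build_embed_request(texts, model=model, url=url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embed_response(response.read())
```

Keeping request construction and response parsing separate from the network call lets both be unit-tested without a running server; only `embed` needs Ollama to be up.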

Production embedding serving (for scale, throughput or managed SLAs) should use dedicated runtimes or hosted embedding services (vLLM where applicable, a managed cloud embedding API, or a dedicated embedding microservice backed by GPUs/accelerators).

Consequences

Pros:
- Low friction for developers: quick local setup and consistent API for embeddings.
- Reproducible local runs for ingestion tests and RAG experiments without relying on external APIs.
- Helps preserve data privacy by avoiding sending sensitive documents to third-party embedding services during development.

Cons / Trade-offs:
- Ollama is primarily intended as a local/dev runtime and is not a full production embedding service for high-throughput workloads.
- Some embedding model checkpoints may not be readily available or supported in Ollama and may require conversion or packaging first.
- Embedding throughput depends on local hardware; CI and production should move to scaled runtimes when local throughput is insufficient.
- API differences compared to cloud embedding providers may require adapters to keep client code portable.

Alternatives considered

- Run embedding models directly with Hugging Face Transformers or sentence-transformers locally: offers maximum control, but increases setup complexity and environment variability across developers.
- Use cloud embedding APIs (e.g. OpenAI, Cohere): scalable and easy to integrate, but adds cost and sends data off-site.
- Use vLLM, Triton, or another high-performance serving stack for embeddings in production: better throughput and latency, but more operational overhead.

Mitigations / Recommendations

- Add a small docs/ADR/README.md (or extend existing docs) with recommended Ollama commands to run the chosen embedding model and example client calls (curl/Python) used by the ingestion pipeline.
- Implement an adapter layer for embedding requests and responses so code can switch between Ollama, cloud APIs, or other local runtimes with minimal changes.
- Provide CI fallbacks: for lightweight CI runners, use a small embedding model or a mocked embedding endpoint to keep tests fast and deterministic.
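
One possible shape for the adapter layer and the CI fallback is sketched below. Every name here (`Embedder`, `OllamaEmbedder`, `FakeEmbedder`, `embed_chunks`) is hypothetical, and the HTTP call is injected as a plain function so the example makes no claim about which HTTP client the project uses. The fake embedder derives vectors from a SHA-256 hash, so its output is deterministic but carries no semantic meaning; it is only suitable for exercising pipeline plumbing.

```python
import hashlib
import math
from typing import Callable, Protocol, Sequence

Vector = list[float]


class Embedder(Protocol):
    """Hypothetical interface that ingestion and query code would depend on."""

    def embed(self, texts: Sequence[str]) -> list[Vector]: ...


class OllamaEmbedder:
    """Adapter for a local Ollama server; `post_json(url, payload)` is injected."""

    def __init__(self, post_json: Callable[[str, dict], dict], model: str,
                 base_url: str = "http://localhost:11434"):
        self._post_json = post_json
        self._model = model
        self._url = f"{base_url}/api/embed"

    def embed(self, texts: Sequence[str]) -> list[Vector]:
        reply = self._post_json(self._url, {"model": self._model, "input": list(texts)})
        return reply["embeddings"]


class FakeEmbedder:
    """Deterministic CI stand-in: hashes each text into a unit-length vector."""

    def __init__(self, dim: int = 8):
        if not 1 <= dim <= 32:  # a SHA-256 digest provides 32 bytes
            raise ValueError("dim must be between 1 and 32")
        self._dim = dim

    def embed(self, texts: Sequence[str]) -> list[Vector]:
        return [self._embed_one(text) for text in texts]

    def _embed_one(self, text: str) -> Vector:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Centre byte values around zero; the 0.5 offset keeps every component non-zero.
        raw = [digest[i] - 127.5 for i in range(self._dim)]
        norm = math.sqrt(sum(x * x for x in raw))
        return [x / norm for x in raw]


def embed_chunks(chunks: Sequence[str], embedder: Embedder) -> list[tuple[str, Vector]]:
    """Pipeline code sees only the Embedder interface, never a concrete backend."""
    return list(zip(chunks, embedder.embed(chunks)))
```

With this split, local runs would construct an `OllamaEmbedder`, CI would pass a `FakeEmbedder`, and a cloud-backed adapter could be added later without touching `embed_chunks` or its callers.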

References

https://ollama.ai/