ADR 0007 — Use vLLM for high-throughput, low-latency inference where applicable

Status

Accepted

Context

The project may run local or hosted LLM inference for chat and batch tasks. For scenarios requiring high throughput and efficient CPU/GPU utilization, we need an inference stack optimized for serving transformer models at scale.

Decision

Adopt vLLM for high-throughput, low-latency inference workloads where it is appropriate. Use it in production-like environments and for benchmarks; for simpler local development or models vLLM does not support, fall back to lighter runtimes (e.g., Hugging Face Transformers, Ollama).
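
The sketch below shows vLLM's offline batch API as a point of reference; the checkpoint name is illustrative, and any Hugging Face-compatible model the project actually uses could be substituted.

    # Minimal vLLM offline-inference sketch (Python).
    # The model name is an example, not a project decision.
    from vllm import LLM, SamplingParams

    prompts = [
        "Summarize this support ticket in one sentence: ...",
        "Translate to French: 'The build failed.'",
    ]
    params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

    # vLLM schedules and batches these prompts internally (continuous batching).
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)

In server mode, the same engine can be exposed through vLLM's OpenAI-compatible HTTP server (vllm serve <model>), which keeps client code portable across runtimes.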

Consequences

  • Pros:
    • vLLM provides continuous batching, request scheduling, and PagedAttention-based KV-cache management, which improve throughput and latency for many transformer inference workloads.
    • Loads Hugging Face checkpoints directly and supports GPU acceleration, including tensor parallelism across multiple GPUs.
  • Cons / Trade-offs:
    • Additional operational complexity compared to single-process runtimes, so the fallback policy should be made explicit (see the runtime-selection sketch after this list).
    • Model compatibility and integration work may be required for some checkpoints.
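
Given these trade-offs, the intended fallback policy is easy to encode. The helper below is a hypothetical sketch: the environment variable and the allowlist are illustrative, not part of any existing project configuration.

    import os

    # Illustrative allowlist; in practice, consult vLLM's supported-models docs.
    VLLM_FRIENDLY_PREFIXES = ("meta-llama/", "mistralai/", "Qwen/")

    def choose_runtime(model_name: str) -> str:
        """Prefer vLLM when a GPU is available and the checkpoint is known
        to work with it; otherwise fall back to a single-process runtime."""
        wants_gpu = os.environ.get("INFERENCE_USE_GPU", "0") == "1"  # assumed flag
        supported = model_name.startswith(VLLM_FRIENDLY_PREFIXES)
        return "vllm" if wants_gpu and supported else "transformers"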

Alternatives considered

  • Hugging Face Transformers: simpler single-process inference that is easier for development but less efficient at scale (a minimal fallback example follows this list).
  • Triton and other custom serving stacks: powerful, but they add more infrastructure complexity.
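
For contrast with the vLLM sketch above, a minimal single-process fallback using the Transformers pipeline API might look like the following; the checkpoint is again only illustrative.

    from transformers import pipeline

    # Single-process generation: simple to run locally, but no continuous
    # batching or paged KV-cache, so it scales poorly under concurrent load.
    generator = pipeline("text-generation", model="gpt2")  # example model
    result = generator("Summarize: the deploy failed twice.", max_new_tokens=64)
    print(result[0]["generated_text"])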

References

  • https://github.com/vllm-project/vllm