ADR 0007 — Use vLLM for high-throughput, low-latency inference where applicable
Status
Accepted
Context
The project may run local or hosted LLM inference for chat and batch tasks. For scenarios requiring high throughput and efficient CPU/GPU utilization, we need an inference stack optimized for serving transformer models at scale.
Decision
Adopt vLLM (where appropriate) for high-throughput or low-latency inference workloads. Use it in production-like environments and benchmarks; for simpler local development or models vLLM does not support, fall back to other runtimes (e.g., Hugging Face Transformers, Ollama).
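For orientation, below is a minimal sketch of vLLM's offline (in-process) Python API. The model name, prompt, and sampling parameters are placeholders for illustration, not project defaults.

```python
from vllm import LLM, SamplingParams

# Placeholder model and sampling settings, chosen only for illustration.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

# generate() batches the supplied prompts internally before running the model,
# which is where much of the throughput gain over naive per-request inference comes from.
outputs = llm.generate(["Summarize the benefits of continuous batching."], params)
for out in outputs:
    print(out.outputs[0].text)
```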
Consequences
- Pros:
  - vLLM provides continuous batching, request scheduling, and PagedAttention-based KV-cache management for transformer inference, improving throughput and latency for many workloads.
  - Loads standard Hugging Face-format checkpoints, supports GPU acceleration, and exposes an OpenAI-compatible HTTP server (see the sketch after this list).
- Cons / Trade-offs:
  - Additional operational complexity compared to single-process runtimes.
  - Model compatibility and integration work may be required for some checkpoints.
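In production-like environments, vLLM is typically run as a standalone server behind its OpenAI-compatible API, which keeps the client-side integration surface small. A hedged sketch of such a client follows; it assumes a vLLM server is already running locally, and the host, port, and model name are placeholders rather than project settings.

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running, e.g. started with:
#   vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000
# The base_url, api_key, and model name below are illustrative placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Explain in one sentence why batching helps throughput."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```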
Alternatives considered
- Hugging Face Transformers: simpler single-process inference; easier for development, but it lacks continuous batching and paged KV-cache management, so it is less efficient at scale (see the sketch after this list).
- NVIDIA Triton Inference Server or other custom serving stacks: powerful, but they add more infrastructure complexity.
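For comparison, the single-process fallback path looks roughly like the sketch below. The model name is a placeholder and this is not the project's actual development harness.

```python
from transformers import pipeline

# Single-process fallback for local development or checkpoints vLLM does not serve.
# No continuous batching or paged KV-cache management is involved here.
generator = pipeline("text-generation", model="distilgpt2")
print(generator("Paged attention improves", max_new_tokens=32)[0]["generated_text"])
```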
References
- https://github.com/vllm-project/vllm