How to Serve LLMs Locally with vLLM and uv

vLLM is a high-throughput inference engine for large language models. GPU inference runs on Linux, most commonly with NVIDIA CUDA; vLLM also supports AMD ROCm and Intel XPU backends, and ships experimental CPU builds for x86, ARM, and Apple Silicon. Windows users run vLLM under WSL 2. This guide covers the most common path: NVIDIA GPUs on Linux, where vLLM depends on PyTorch built with CUDA support. uv’s --torch-backend flag ensures PyTorch is pulled from the CUDA index that matches your driver.

Install vLLM for quick experimentation

Create a virtual environment and install vLLM in one pass:

uv venv --python 3.12 --seed --managed-python
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

The --torch-backend flag tells uv which PyTorch CUDA index to pull from. auto detects your system’s CUDA driver and picks the matching index. You can also pass cu130, cu129, cu126, or cpu explicitly. Run nvidia-smi to check your driver version.
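
As a quick check, nvidia-smi reports both the installed driver version and the highest CUDA version that driver supports; the values in the comments below are illustrative:

nvidia-smi --query-gpu=driver_version --format=csv,noheader   # e.g. 550.54.15
nvidia-smi | grep "CUDA Version"                               # e.g. CUDA Version: 12.4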

vLLM’s default PyPI wheel is compiled against CUDA 13.0 (vLLM 0.20.0 and later). --torch-backend=auto only chooses the PyTorch index; it does not switch the vLLM wheel. On a host whose NVIDIA driver only supports CUDA 12.x (still common on T4, V100, and many older datacenter GPUs), import vllm works but vllm serve fails with ImportError: libcudart.so.13: cannot open shared object file.

Important

If nvidia-smi reports a CUDA 12.x driver, swap the default install for the matching +cu129 wheel:

uv pip install \
  "https://github.com/vllm-project/vllm/releases/download/v0.21.0/vllm-0.21.0+cu129-cp38-abi3-manylinux_2_34_x86_64.whl" \
  --torch-backend=cu129

A +cpu wheel is also published on the same release page for non-GPU installs.
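
Whichever wheel you install, a quick way to confirm which CUDA build actually landed in the environment is to ask PyTorch and vLLM directly (an optional sanity check; the example outputs are illustrative):

python -c "import torch; print(torch.version.cuda)"   # e.g. 12.9 for a cu129 build, None for CPU
python -c "import vllm; print(vllm.__version__)"      # e.g. 0.21.0+cu129 for the wheel above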

Run without a persistent environment

For one-off model serving without creating a project, install into a temporary venv with uv pip and run the server directly:

uv venv --python 3.12 --seed --managed-python /tmp/vllm-env
source /tmp/vllm-env/bin/activate
uv pip install vllm --torch-backend=auto
vllm serve Qwen/Qwen2.5-1.5B-Instruct

Note

--torch-backend only works with uv pip commands, not with uv run or uv sync. For one-off use, create a throwaway venv as shown here. Delete it when you’re done with rm -rf /tmp/vllm-env.

Configure a vLLM project with pyproject.toml

For reproducible deployments, configure CUDA index routing in your project file. This ensures every team member and CI runner installs the same GPU-enabled build:

pyproject.toml
[project]
name = "my-inference-service"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "vllm>=0.21",
]

[[tool.uv.index]]
name = "pytorch-cu130"
url = "https://download.pytorch.org/whl/cu130"
explicit = true

[tool.uv.sources]
torch = [
  { index = "pytorch-cu130", marker = "sys_platform == 'linux'" },
]
torchvision = [
  { index = "pytorch-cu130", marker = "sys_platform == 'linux'" },
]

Then lock and sync:

uv lock
uv sync

Setting explicit = true prevents uv from searching the PyTorch index for unrelated packages. The platform marker restricts CUDA builds to Linux because vLLM’s GPU wheels target Linux only.
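
With the lockfile in place, the server can be started straight from the project environment; uv run resolves the vllm console script from .venv without manual activation (the model name is the same one used elsewhere in this guide):

uv run vllm serve Qwen/Qwen2.5-1.5B-Instruct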

For a CUDA 12.9 deployment, swapping the PyTorch index to cu129 is not enough on its own. vllm>=0.21 still resolves to the default CUDA 13.0 PyPI wheel, so the vllm._C extension will fail to load libcudart.so.13 at runtime. Pin the matching +cu129 build via tool.uv.sources:

pyproject.toml
[tool.uv.sources]
vllm = { url = "https://github.com/vllm-project/vllm/releases/download/v0.21.0/vllm-0.21.0+cu129-cp38-abi3-manylinux_2_34_x86_64.whl" }
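
After adding the source, re-lock and sync so the pinned wheel replaces the PyPI build. Note that the [[tool.uv.index]] entry shown earlier would also need to point at the matching cu129 PyTorch index (https://download.pytorch.org/whl/cu129) so the torch wheels line up with the vLLM build:

uv lock
uv sync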

Serve a model with the OpenAI-compatible API

vLLM includes a server that exposes OpenAI-compatible endpoints. Start it with:

vllm serve Qwen/Qwen2.5-1.5B-Instruct

Once the server is running, query it with any OpenAI-compatible client:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 20
  }'
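
The server also exposes /v1/chat/completions, which applies the model’s chat template and is usually the right endpoint for instruct-tuned models; a minimal example:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 20
  }'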

Run offline batch inference

For processing prompts without a running server, use vLLM’s Python API directly. Instruct-tuned models expect chat-formatted input, so use llm.chat() instead of llm.generate() (which passes raw text without applying the model’s chat template):

batch_inference.py
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=100)

conversations = [
    [{"role": "user", "content": "Explain virtual environments in one sentence."}],
    [{"role": "user", "content": "What does uv pip install do?"}],
]

outputs = llm.chat(conversations, params)
for output in outputs:
    print(output.outputs[0].text)

Run it with:

uv run batch_inference.py

Diagnose common errors

  • “No CUDA GPUs are available.” The system either lacks an NVIDIA GPU or the CUDA driver is not installed. Run nvidia-smi to verify driver availability.
  • ImportError: libcudart.so.13: cannot open shared object file. The default PyPI wheel of vLLM is compiled against CUDA 13, but the PyTorch CUDA 12.x wheels only ship the CUDA 12 runtime, so vLLM’s compiled extension cannot find libcudart.so.13. Install the matching +cu129 vLLM wheel as shown in the install section.
  • Out of memory on large models. Lower --gpu-memory-utilization (default 0.9) to leave headroom for other processes on the GPU, or use --tensor-parallel-size to shard the model across multiple GPUs, as in the sketch below.
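
A sketch of both options; the 0.8 value and the two-GPU count are illustrative, not recommendations:

# Cap vLLM’s share of GPU memory to leave headroom for other processes
vllm serve Qwen/Qwen2.5-1.5B-Instruct --gpu-memory-utilization 0.8

# Shard a model that does not fit on a single GPU across two GPUs
vllm serve Qwen/Qwen2.5-1.5B-Instruct --tensor-parallel-size 2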
