# How to Serve LLMs Locally with vLLM and uv
vLLM is a high-throughput inference engine for large language models. GPU inference runs on Linux with NVIDIA CUDA. vLLM also supports AMD ROCm and Intel XPU backends, and ships experimental CPU builds for x86, ARM, and Apple Silicon. Windows users run vLLM under WSL 2. This guide covers the most common path: NVIDIA GPUs on Linux, where vLLM depends on PyTorch with CUDA support. uv's `--torch-backend` flag ensures PyTorch pulls from the correct CUDA index to match your driver.
## Install vLLM for quick experimentation
Create a virtual environment and install vLLM in one pass:
```shell
uv venv --python 3.12 --seed --managed-python
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
```

The `--torch-backend` flag tells uv which PyTorch CUDA index to pull from. `auto` detects your system's CUDA driver and picks the matching index. You can also pass `cu130`, `cu129`, `cu126`, or `cpu` explicitly. Run `nvidia-smi` to check your driver version.
vLLM's default PyPI wheel is compiled against CUDA 13.0 (vLLM 0.20.0 and later). `--torch-backend=auto` only chooses the PyTorch index; it does not switch the vLLM wheel. On a host whose NVIDIA driver only supports CUDA 12.x (still common on T4, V100, and many older datacenter GPUs), `import vllm` works but `vllm serve` fails with `ImportError: libcudart.so.13: cannot open shared object file`.
**Important:** If `nvidia-smi` reports a CUDA 12.x driver, swap the default install for the matching `+cu129` wheel:

```shell
uv pip install \
  "https://github.com/vllm-project/vllm/releases/download/v0.21.0/vllm-0.21.0+cu129-cp38-abi3-manylinux_2_34_x86_64.whl" \
  --torch-backend=cu129
```

A `+cpu` wheel is also published on the same release page for non-GPU installs.
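As a sketch of the decision above, the install target can be derived from the CUDA version `nvidia-smi` reports. This is a hypothetical helper, not part of vLLM or uv; the URL template simply mirrors the release wheel shown here:

```python
# Hypothetical helper: map the driver's CUDA version (as printed by
# nvidia-smi) to the vLLM install target described above.
def pick_vllm_target(cuda_version: str, vllm_version: str = "0.21.0") -> str:
    major = int(cuda_version.split(".")[0])
    if major >= 13:
        # Driver supports CUDA 13: the default PyPI wheel works as-is.
        return "vllm"
    # CUDA 12.x driver: use the +cu129 wheel from the GitHub release page.
    base = "https://github.com/vllm-project/vllm/releases/download"
    return (
        f"{base}/v{vllm_version}/"
        f"vllm-{vllm_version}+cu129-cp38-abi3-manylinux_2_34_x86_64.whl"
    )

print(pick_vllm_target("13.0"))  # → vllm
print(pick_vllm_target("12.4"))  # → the +cu129 wheel URL
```

Pass the returned string to `uv pip install` together with the matching `--torch-backend` value.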
## Run without a persistent environment
For one-off model serving without creating a project, install into a temporary venv with `uv pip` and run the server directly:

```shell
uv venv --python 3.12 --seed --managed-python /tmp/vllm-env
source /tmp/vllm-env/bin/activate
uv pip install vllm --torch-backend=auto
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```

**Note:** `--torch-backend` only works with `uv pip` commands, not with `uv run` or `uv sync`. For one-off use, create a throwaway venv as shown here. Delete it when you're done with `rm -rf /tmp/vllm-env`.
## Configure a vLLM project with `pyproject.toml`
For reproducible deployments, configure CUDA index routing in your project file. This ensures every team member and CI runner installs the same GPU-enabled build:
```toml
[project]
name = "my-inference-service"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "vllm>=0.21",
]

[[tool.uv.index]]
name = "pytorch-cu130"
url = "https://download.pytorch.org/whl/cu130"
explicit = true

[tool.uv.sources]
torch = [
    { index = "pytorch-cu130", marker = "sys_platform == 'linux'" },
]
torchvision = [
    { index = "pytorch-cu130", marker = "sys_platform == 'linux'" },
]
```

Then lock and sync:

```shell
uv lock
uv sync
```

Setting `explicit = true` prevents uv from searching the PyTorch index for unrelated packages. The platform marker restricts CUDA builds to Linux because vLLM's GPU wheels target Linux only.
For a CUDA 12.9 deployment, swapping the PyTorch index to `cu129` is not enough on its own. `vllm>=0.21` still resolves to the default CUDA 13.0 PyPI wheel, so the `vllm._C` extension will fail to load `libcudart.so.13` at runtime. Pin the matching `+cu129` build via `tool.uv.sources`:
```toml
[tool.uv.sources]
vllm = { url = "https://github.com/vllm-project/vllm/releases/download/v0.21.0/vllm-0.21.0+cu129-cp38-abi3-manylinux_2_34_x86_64.whl" }
```

## Serve a model with the OpenAI-compatible API
vLLM includes a server that exposes OpenAI-compatible endpoints. Start it with:
```shell
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```

Once the server is running, query it with any OpenAI-compatible client:

```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 20
  }'
```

## Run offline batch inference
For processing prompts without a running server, use vLLM's Python API directly. Instruct-tuned models expect chat-formatted input, so use `llm.chat()` instead of `llm.generate()` (which passes raw text without applying the model's chat template):
```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=100)

conversations = [
    [{"role": "user", "content": "Explain virtual environments in one sentence."}],
    [{"role": "user", "content": "What does uv pip install do?"}],
]

outputs = llm.chat(conversations, params)
for output in outputs:
    print(output.outputs[0].text)
```

Run it with:

```shell
uv run batch_inference.py
```

## Diagnose common errors
- **`No CUDA GPUs are available`.** The system either lacks an NVIDIA GPU or the CUDA driver is not installed. Run `nvidia-smi` to verify driver availability.
- **`ImportError: libcudart.so.13`.** vLLM's compiled extension needs a newer CUDA runtime than the PyTorch wheels provide. Add `nvidia-cuda-runtime-cu13` (or `nvidia-cuda-runtime-cu12` if pinned to `+cu129`) to your dependencies, or install the matching `+cu129` vLLM wheel as shown in the install section.
- **Out of memory on large models.** Reduce `--gpu-memory-utilization` (default `0.9`) or use `--tensor-parallel-size` to shard across multiple GPUs.
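For the out-of-memory case, a back-of-envelope estimate helps choose between those two flags: weight memory is roughly parameter count × bytes per parameter (2 bytes for bf16/fp16), before the KV cache that `--gpu-memory-utilization` also has to cover. This is a rough sketch, not vLLM's actual memory accounting:

```python
def weight_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough GPU memory for model weights alone (bf16/fp16 = 2 bytes each)."""
    return n_params * bytes_per_param / 1024**3


# A 1.5B-parameter model needs ~2.8 GiB for weights; a 70B model ~130 GiB,
# which is why large models need --tensor-parallel-size across several GPUs.
print(round(weight_gib(1.5e9), 1))
print(round(weight_gib(70e9), 1))
```

If the estimate alone approaches your card's VRAM, lowering `--gpu-memory-utilization` will not help; shard the model instead.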
## Learn more
- vLLM GPU installation docs for NVIDIA CUDA, AMD ROCm, and Intel XPU builds
- vLLM CPU installation docs for x86, ARM, and Apple Silicon backends
- Why Installing GPU Python Packages Is So Complicated for background on CUDA wheel distribution