# How to Serve LLMs Locally with vLLM and uv


[vLLM](https://github.com/vllm-project/vllm) is a high-throughput inference engine for large language models. GPU inference runs on Linux with NVIDIA CUDA. vLLM also supports AMD ROCm and Intel XPU backends, and ships experimental CPU builds for x86, ARM, and Apple Silicon. Windows users run vLLM under [WSL 2](https://learn.microsoft.com/en-us/windows/wsl/install). This guide covers the most common path: NVIDIA GPUs on Linux, where vLLM depends on [PyTorch](https://pydevtools.com/handbook/how-to/how-to-install-pytorch-with-uv.md) with CUDA support. [uv](https://pydevtools.com/handbook/reference/uv.md)'s `--torch-backend` flag ensures PyTorch pulls from the correct CUDA index to match your driver.

## Install vLLM for quick experimentation

Create a virtual environment and install vLLM in one pass:

```bash
uv venv --python 3.12 --seed --managed-python
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
```

On macOS, vLLM has no official GPU support. CPU builds are experimental and require building from source per the [vLLM CPU docs](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/). For Apple Silicon GPU acceleration, see the community [vllm-metal](https://github.com/vllm-project/vllm-metal) plugin, which routes inference through MLX. For everyday Mac development, run vLLM on a Linux box (Modal, RunPod, a local server) and call it remotely.

vLLM does not publish wheels for Windows. Install [WSL 2](https://learn.microsoft.com/en-us/windows/wsl/install) and run the Linux instructions inside it:

```bash
uv venv --python 3.12 --seed --managed-python
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
```

The `--torch-backend` flag tells uv which PyTorch CUDA index to pull from. `auto` detects your system's CUDA driver and picks the matching index. You can also pass `cu130`, `cu129`, `cu126`, or `cpu` explicitly. Run `nvidia-smi` to see the driver version and the highest CUDA version that driver supports.

vLLM's default PyPI wheel is compiled against CUDA 13.0 (vLLM 0.20.0 and later). `--torch-backend=auto` only chooses the PyTorch index; it does not switch the vLLM wheel. On a host whose NVIDIA driver only supports CUDA 12.x (still common on T4, V100, and many older datacenter GPUs), `import vllm` works but `vllm serve` fails with `ImportError: libcudart.so.13: cannot open shared object file`.

> [!IMPORTANT]
> If `nvidia-smi` reports a CUDA 12.x driver, swap the default install for the matching `+cu129` wheel:
> ```bash
> uv pip install \
>   "https://github.com/vllm-project/vllm/releases/download/v0.21.0/vllm-0.21.0+cu129-cp38-abi3-manylinux_2_34_x86_64.whl" \
>   --torch-backend=cu129
> ```

A `+cpu` wheel is also published on the same [release page](https://github.com/vllm-project/vllm/releases/tag/v0.21.0) for non-GPU installs.
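The choice between the default wheel, the `+cu129` wheel, and the `+cpu` wheel can be sketched as a small check on the `nvidia-smi` header. This is an illustration of the rule described above, not part of vLLM or uv. The function name `choose_vllm_build` and its return labels are hypothetical, and the `unsupported` branch for drivers older than CUDA 12.x is an assumption, since this guide does not cover them.

```python
import re


def choose_vllm_build(nvidia_smi_output: str) -> str:
    """Map nvidia-smi header text to a wheel label (hypothetical labels)."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        # No CUDA version reported: treat the host as having no usable GPU.
        return "cpu"
    major = int(match.group(1))
    if major >= 13:
        # The default PyPI wheel is compiled against CUDA 13.0.
        return "default"
    if major == 12:
        # CUDA 12.x drivers need the +cu129 wheel from the release page.
        return "cu129"
    # Assumption: older drivers are outside the scope of this guide.
    return "unsupported"


header = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
print(choose_vllm_build(header))  # cu129
```

In practice the input would come from running `nvidia-smi` and capturing its standard output.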

## Run without a persistent environment

For one-off model serving without creating a project, install into a temporary venv with `uv pip` and run the server directly:

```bash
uv venv --python 3.12 --seed --managed-python /tmp/vllm-env
source /tmp/vllm-env/bin/activate
uv pip install vllm --torch-backend=auto
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```

> [!NOTE]
> `--torch-backend` only works with `uv pip` commands, not with `uv run` or `uv sync`. For one-off use, create a throwaway venv as shown here. Delete it when you're done with `rm -rf /tmp/vllm-env`.

## Configure a vLLM project with pyproject.toml

For reproducible deployments, configure CUDA index routing in your project file. This ensures every team member and CI runner installs the same GPU-enabled build:

```toml {filename="pyproject.toml"}
[project]
name = "my-inference-service"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "vllm>=0.21",
]

[[tool.uv.index]]
name = "pytorch-cu130"
url = "https://download.pytorch.org/whl/cu130"
explicit = true

[tool.uv.sources]
torch = [
  { index = "pytorch-cu130", marker = "sys_platform == 'linux'" },
]
torchvision = [
  { index = "pytorch-cu130", marker = "sys_platform == 'linux'" },
]
```

Then lock and sync:

```bash
uv lock
uv sync
```

Setting `explicit = true` prevents uv from searching the PyTorch index for unrelated packages. The platform marker restricts CUDA builds to Linux because vLLM's GPU wheels target Linux only.

For a CUDA 12.9 deployment, swapping the PyTorch index to `cu129` is not enough on its own. `vllm>=0.21` still resolves to the default CUDA 13.0 PyPI wheel, so the `vllm._C` extension will fail to load `libcudart.so.13` at runtime. Pin the matching `+cu129` build via [`tool.uv.sources`](https://docs.astral.sh/uv/concepts/projects/dependencies/#dependency-sources):

```toml {filename="pyproject.toml"}
[tool.uv.sources]
vllm = { url = "https://github.com/vllm-project/vllm/releases/download/v0.21.0/vllm-0.21.0+cu129-cp38-abi3-manylinux_2_34_x86_64.whl" }
```

## Serve a model with the OpenAI-compatible API

vLLM includes a server that exposes OpenAI-compatible endpoints. Start it with:

```bash
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```

Once the server is running, query it with any OpenAI-compatible client:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 20
  }'
```
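The same request can be sent from Python without extra dependencies. The following is a minimal sketch using only the standard library. It assumes the server from the previous step is listening on `localhost:8000`. The helper names `build_completion_request` and `complete` are made up for this example, and the response is read using the OpenAI completions layout (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    model: str = "Qwen/Qwen2.5-1.5B-Instruct",
    max_tokens: int = 20,
    base_url: str = "http://localhost:8000",  # assumed default address
) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("The capital of France is")` returns the generated continuation as a string.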

## Run offline batch inference

For processing prompts without a running server, use vLLM's Python API directly. Instruct-tuned models expect chat-formatted input, so use `llm.chat()` instead of `llm.generate()` (which passes raw text without applying the model's chat template):

```python {filename="batch_inference.py"}
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=100)

conversations = [
    [{"role": "user", "content": "Explain virtual environments in one sentence."}],
    [{"role": "user", "content": "What does uv pip install do?"}],
]

outputs = llm.chat(conversations, params)
for output in outputs:
    print(output.outputs[0].text)
```

Run it with:

```bash
uv run batch_inference.py
```
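When the prompts start as plain strings, they need to be wrapped in the chat message structure before being passed to `llm.chat()`. A small helper along these lines could do that. The name `to_conversations` and the optional system prompt are assumptions made for illustration and are not part of vLLM.

```python
def to_conversations(prompts, system_prompt=None):
    """Wrap plain prompt strings as single-turn chat conversations."""
    conversations = []
    for prompt in prompts:
        messages = []
        if system_prompt is not None:
            # Optional shared instruction placed before each user turn.
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": prompt})
        conversations.append(messages)
    return conversations


conversations = to_conversations(
    ["Explain virtual environments.", "What does uv pip install do?"],
    system_prompt="Answer in one sentence.",
)
print(len(conversations), conversations[0][0]["role"])  # 2 system
```

The resulting list has the same shape as the `conversations` variable in `batch_inference.py`.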

## Diagnose common errors

- **"No CUDA GPUs are available."** The system either lacks an NVIDIA GPU or the CUDA driver is not installed. Run `nvidia-smi` to verify driver availability.
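- **`ImportError: libcudart.so`.** vLLM's compiled extension cannot find the CUDA runtime it was built against. On a CUDA 12.x driver, switch to the matching `+cu129` wheel as shown in the install section. If the driver already supports the wheel's CUDA version, adding `nvidia-cuda-runtime-cu13` (or `-cu12` if pinned to `+cu129`) may supply the missing library.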
- **Out of memory on large models.** Reduce `--gpu-memory-utilization` (default `0.9`) when other processes share the GPU, cap the context length with `--max-model-len`, or use `--tensor-parallel-size` to shard across multiple GPUs.
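The checklist above can be turned into a rough log scanner for use in scripts or CI. This is only a sketch: the substrings and hints mirror the bullets in this section, the name `diagnose` is hypothetical, and real error messages may be worded differently.

```python
# (substring to look for, hint to show); both taken from the list above.
HINTS = [
    ("No CUDA GPUs are available",
     "Run nvidia-smi to verify that an NVIDIA GPU and driver are present."),
    ("libcudart.so",
     "Match the vLLM wheel to the driver, e.g. the +cu129 wheel on CUDA 12.x."),
    ("out of memory",
     "Lower --gpu-memory-utilization or raise --tensor-parallel-size."),
]


def diagnose(log_text: str) -> list[str]:
    """Return the hints whose trigger substring appears in the log text."""
    lowered = log_text.lower()
    return [hint for needle, hint in HINTS if needle.lower() in lowered]


for hint in diagnose("ImportError: libcudart.so.13: cannot open shared object file"):
    print(hint)
```

Matching is case-insensitive so that variations such as `CUDA out of memory` are still caught.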

## Learn more

- [uv: A Complete Guide](https://pydevtools.com/handbook/explanation/uv-complete-guide.md) for what uv does, how fast it is, the core workflows, and recent releases
- [vLLM GPU installation docs](https://docs.vllm.ai/en/stable/getting_started/installation/gpu/) for NVIDIA CUDA, AMD ROCm, and Intel XPU builds
- [vLLM CPU installation docs](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/) for x86, ARM, and Apple Silicon backends
- [Why Installing GPU Python Packages Is So Complicated](https://pydevtools.com/handbook/explanation/installing-cuda-python-packages.md) for background on CUDA wheel distribution
