# How to Install DeepSpeed


DeepSpeed publishes only a source distribution on [PyPI](https://pydevtools.com/handbook/explanation/what-is-pypi.md), with no prebuilt [wheels](https://pydevtools.com/handbook/reference/wheel.md). The `setup.py` requires both PyTorch and a CUDA toolkit to generate metadata, so even a basic install needs `CUDA_HOME` set and `nvcc` on `PATH`. Once installed, individual ops (fused Adam, CPU offloading, transformer kernels) are compiled on first use through PyTorch's JIT C++ extension system. See [Why Installing GPU Python Packages Is So Complicated](https://pydevtools.com/handbook/explanation/installing-cuda-python-packages.md) for background.

## Requirements

- Platform: Linux (x86_64). Windows has partial support through WSL2. No macOS GPU support.
- Software: PyTorch already installed, a C++ compiler (such as `g++`), the CUDA toolkit with `nvcc` on `PATH`, and `CUDA_HOME` set to the toolkit root (e.g. `/usr/local/cuda`).
- System libraries: `libaio-dev` is required for the async I/O op used by ZeRO-Infinity and NVMe offloading. Install it with `apt install libaio-dev` on Debian/Ubuntu.
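
The requirements above can be checked before installing. The following is a minimal sketch, not part of DeepSpeed: `check_deepspeed_prereqs` is a hypothetical helper name, it assumes `g++` is the C++ compiler and that `python` is the interpreter of the target environment, and it does not check for `libaio-dev`.

```sh
# Preflight check for the software requirements (hypothetical helper).
# Prints one line per requirement and returns the number of missing items.
check_deepspeed_prereqs() {
    missing=0
    # nvcc and a C++ compiler must be on PATH
    for tool in nvcc g++; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool found"
        else
            echo "missing: $tool not on PATH"
            missing=$((missing + 1))
        fi
    done
    # CUDA_HOME must be set and point to an existing directory
    if [ -n "${CUDA_HOME:-}" ] && [ -d "$CUDA_HOME" ]; then
        echo "ok: CUDA_HOME=$CUDA_HOME"
    else
        echo "missing: CUDA_HOME unset or not a directory"
        missing=$((missing + 1))
    fi
    # PyTorch must be importable in the current environment
    if python -c "import torch" >/dev/null 2>&1; then
        echo "ok: torch importable"
    else
        echo "missing: torch not importable"
        missing=$((missing + 1))
    fi
    return "$missing"
}

check_deepspeed_prereqs || echo "fix the items marked 'missing' before installing"
```

Because the function returns a non-zero status when anything is missing, it can also gate an install step in a script or CI job.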

## Install from PyPI

DeepSpeed's `setup.py` imports `torch` at the top level, so PyTorch must be present before installation. The `--no-build-isolation` flag tells the installer to use the current environment's torch instead of creating a clean build environment:

With uv:

```sh
uv pip install deepspeed --no-build-isolation
```

With pip:

```sh
pip install deepspeed --no-build-isolation
```

This installs the Python package without compiling any CUDA kernels. Ops are compiled at first use via JIT, which adds a one-time delay (seconds to minutes depending on the op) the first time DeepSpeed runs a training job.

## Pre-compile ops at install time

To avoid JIT compilation delays at runtime, set `DS_BUILD_OPS=1` to compile all compatible ops during installation:

With uv:

```sh
DS_BUILD_OPS=1 uv pip install deepspeed --no-build-isolation
```

With pip:

```sh
DS_BUILD_OPS=1 pip install deepspeed --no-build-isolation
```

This requires `nvcc` on `PATH` and a working C++ compiler. The build takes several minutes.

To compile only specific ops, use individual environment variables instead:

| Variable | Op |
|---|---|
| `DS_BUILD_CPU_ADAM` | CPU Adam optimizer |
| `DS_BUILD_FUSED_ADAM` | Fused Adam (CUDA) |
| `DS_BUILD_AIO` | Async I/O for NVMe offload |
| `DS_BUILD_TRANSFORMER_INFERENCE` | Transformer inference kernels |
| `DS_BUILD_SPARSE_ATTN` | Sparse attention |

Set any of these to `1` to pre-compile that op. For example, to compile only the fused Adam optimizer:

```sh
DS_BUILD_FUSED_ADAM=1 pip install deepspeed --no-build-isolation
```
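
Several of these variables can be set on the same command line. As a rough sketch, the hypothetical helper `ds_build_vars` below turns lowercase op names into the matching `DS_BUILD_*` assignments. It does not validate the names, so it is only meaningful for ops that actually exist, such as those in the table above.

```sh
# Turn op names (lowercase, e.g. cpu_adam) into DS_BUILD_* assignments,
# one per line. Hypothetical helper; it does not check that the op exists.
ds_build_vars() {
    for op in "$@"; do
        printf 'DS_BUILD_%s=1\n' "$(printf '%s' "$op" | tr '[:lower:]' '[:upper:]')"
    done
}

# Preview the assignments; nothing is installed here.
ds_build_vars cpu_adam aio
```

To apply them, pass the output to `env`, leaving the substitution unquoted so that each assignment becomes its own argument: `env $(ds_build_vars cpu_adam aio) pip install deepspeed --no-build-isolation`.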

## Add to a uv project

For projects managed with [uv](https://pydevtools.com/handbook/reference/uv.md) using `uv add` and `uv sync`, use [`extra-build-dependencies`](https://docs.astral.sh/uv/concepts/projects/config/#augmenting-build-dependencies) to inject `torch` into the isolated build environment. The `match-runtime = true` option ensures the build uses the same torch version the project resolves at runtime:

```toml
[project]
dependencies = ["deepspeed", "torch"]

[tool.uv.extra-build-dependencies]
deepspeed = [{ requirement = "torch", match-runtime = true }]
```

Then run `uv sync` as normal. uv handles build isolation and torch injection automatically.

To pre-compile ops during the build, pass environment variables with `extra-build-variables`:

```toml
[tool.uv.extra-build-variables]
deepspeed = { DS_BUILD_OPS = "1" }
```

## Install with conda-forge or pixi

DeepSpeed is available on [conda-forge](https://pydevtools.com/handbook/reference/conda-forge.md), though the version may lag behind PyPI. The conda-forge build handles CUDA toolkit dependencies through the solver:

With pixi:

```sh
pixi add deepspeed
```

With conda:

```sh
conda install -c conda-forge deepspeed
```

For more on when conda-based tools are the better choice for GPU workloads, see [uv vs pixi vs conda for Scientific Python](https://pydevtools.com/handbook/explanation/uv-vs-pixi-vs-conda-for-scientific-python.md).

## Verify the installation

After installing, confirm DeepSpeed loads and can report on the build environment:

```sh
python -c "import deepspeed; print(deepspeed.__version__)"
ds_report
```

`ds_report` prints a table showing which ops are installed (pre-compiled) versus available for JIT compilation. If a required system library is missing, the report flags it.
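
In a Dockerfile or CI job it can help to turn these checks into a single pass/fail step. The sketch below wraps them in a hypothetical function, `verify_deepspeed`. It only confirms that the import works and that `ds_report` is on `PATH`, and it does not inspect the report's contents.

```sh
# Smoke test: succeed only if deepspeed imports and ds_report is on PATH.
# Hypothetical helper; assumes `python` is the environment's interpreter.
verify_deepspeed() {
    if ! python -c "import deepspeed" >/dev/null 2>&1; then
        echo "deepspeed: import failed"
        return 1
    fi
    echo "deepspeed: $(python -c 'import deepspeed; print(deepspeed.__version__)')"
    if command -v ds_report >/dev/null 2>&1; then
        echo "ds_report: found"
    else
        echo "ds_report: not on PATH"
        return 1
    fi
}

verify_deepspeed || echo "verification failed"
```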

## Troubleshooting

`CUDA_HOME does not exist, unable to compile CUDA op(s)` during install. The `setup.py` checks for `CUDA_HOME` at metadata generation time, before any ops are compiled. Set the environment variable to point to your CUDA toolkit root: `export CUDA_HOME=/usr/local/cuda`. If using a Docker image, the NVIDIA CUDA devel images set this automatically, but slim Python images do not.
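
When `nvcc` is already on `PATH`, the toolkit root can often be derived from its location instead of being hard-coded. This is a sketch under one assumption: `nvcc` lives at `<root>/bin/nvcc`, which holds for a standard toolkit install but may not hold when `nvcc` is a symlink or comes from a distribution package. `cuda_root_from_nvcc` is a hypothetical helper name.

```sh
# Print the toolkit root for a given nvcc path: <root>/bin/nvcc -> <root>.
# Assumes the standard layout; symlinks are not resolved.
cuda_root_from_nvcc() {
    dirname "$(dirname "$1")"
}

# Only set CUDA_HOME if it is unset and nvcc can be found.
if [ -z "${CUDA_HOME:-}" ] && command -v nvcc >/dev/null 2>&1; then
    CUDA_HOME=$(cuda_root_from_nvcc "$(command -v nvcc)")
    export CUDA_HOME
fi
echo "CUDA_HOME=${CUDA_HOME:-(not set)}"
```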

`ModuleNotFoundError: No module named 'torch'` during install. PyTorch must be installed before DeepSpeed. The `setup.py` imports `torch` at the top level. Install PyTorch first, then retry with `--no-build-isolation`.

`RuntimeError: ninja is not available` at runtime. DeepSpeed's JIT compilation uses ninja as its build backend. Install it with `pip install ninja` or `apt install ninja-build`.

`libaio.h: No such file or directory` when building the async I/O op. Install the development headers: `apt install libaio-dev` on Debian/Ubuntu, or `yum install libaio-devel` on RHEL/CentOS.

CUDA version mismatch errors. The CUDA toolkit version used to compile ops must be compatible with the CUDA version PyTorch was built against. Compare the output of `python -c "import torch; print(torch.version.cuda)"` with the release reported by `nvcc --version` and make sure the two are compatible.
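
The comparison can be scripted. The sketch below extracts the release number from `nvcc --version` and compares it with `torch.version.cuda`. Treating a matching major version as "compatible" is an assumption made for this example, not a rule stated by DeepSpeed. The helper names `nvcc_release` and `same_cuda_major` are hypothetical.

```sh
# Extract "MAJOR.MINOR" from `nvcc --version` output read on stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeed when two "MAJOR.MINOR" strings share the same major version.
# (Assumption for this sketch: same major version means compatible.)
same_cuda_major() {
    [ "${1%%.*}" = "${2%%.*}" ]
}

if command -v nvcc >/dev/null 2>&1 && python -c "import torch" >/dev/null 2>&1; then
    toolkit=$(nvcc --version | nvcc_release)
    torch_cuda=$(python -c "import torch; print(torch.version.cuda)")
    if same_cuda_major "$toolkit" "$torch_cuda"; then
        echo "ok: nvcc $toolkit, torch built for CUDA $torch_cuda"
    else
        echo "mismatch: nvcc $toolkit, torch built for CUDA $torch_cuda"
    fi
else
    echo "skipped: nvcc or torch not available"
fi
```

A CPU-only PyTorch build reports `None` for `torch.version.cuda`, which this script shows as a mismatch.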

`error: invalid command 'bdist_wheel'` during install. The `wheel` package is missing from the environment. Run `pip install wheel` first, then retry. This happens on minimal base images that don't ship `wheel` by default.

`DS_BUILD_OPS=1` fails on a machine without a GPU. Pre-compilation requires CUDA headers and a GPU-compatible toolchain even if no physical GPU is present. On CPU-only machines, skip `DS_BUILD_OPS` and let ops JIT-compile on the GPU machine at runtime.

## Related

Handbook articles:

- [Why Installing GPU Python Packages Is So Complicated](https://pydevtools.com/handbook/explanation/installing-cuda-python-packages.md) explains the wheel format limitations that affect DeepSpeed packaging
- [How to Install PyTorch with uv](https://pydevtools.com/handbook/how-to/how-to-install-pytorch-with-uv.md) covers getting PyTorch installed before adding DeepSpeed
- [uv vs pixi vs conda for Scientific Python](https://pydevtools.com/handbook/explanation/uv-vs-pixi-vs-conda-for-scientific-python.md) compares tooling choices for GPU workloads

External resources:

- [DeepSpeed GitHub repository](https://github.com/microsoft/DeepSpeed) for documentation and issue tracker
- [deepspeed on PyPI](https://pypi.org/project/deepspeed/) (source distributions only)
- [DeepSpeed installation guide](https://www.deepspeed.ai/getting-started/) for the official docs
