
How to Install DeepSpeed

DeepSpeed publishes only a source distribution on PyPI, with no prebuilt wheels. The setup.py requires both PyTorch and a CUDA toolkit to generate metadata, so even a basic install needs CUDA_HOME set and nvcc on PATH. Once installed, individual ops (fused Adam, CPU offloading, transformer kernels) are compiled on first use through PyTorch’s JIT C++ extension system. See Why Installing GPU Python Packages Is So Complicated for background.

Requirements

  • Platform: Linux (x86_64). Windows has partial support through WSL2. No macOS GPU support.
  • Software: PyTorch already installed, a C++ compiler (g++ or clang++), the CUDA toolkit with nvcc on PATH, and CUDA_HOME set to the toolkit root (e.g. /usr/local/cuda).
  • System libraries: libaio-dev is required for the async I/O op used by ZeRO-Infinity and NVMe offloading. Install it with apt install libaio-dev on Debian/Ubuntu.
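These prerequisites can be checked up front. The short script below mirrors what DeepSpeed's setup.py looks for; it is a standalone sketch, not part of DeepSpeed itself:

```python
import os
import shutil

# Pre-flight sketch: verify the toolchain DeepSpeed's setup.py expects.
# The check names are illustrative; only the underlying tools matter.
checks = {
    "CUDA_HOME set": bool(os.environ.get("CUDA_HOME")),
    "nvcc on PATH": shutil.which("nvcc") is not None,
    "C++ compiler (g++)": shutil.which("g++") is not None,
}
for name, ok in checks.items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Anything reported MISSING here will surface later as one of the install-time errors listed under Troubleshooting.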

Install from PyPI

DeepSpeed’s setup.py imports torch at the top level, so PyTorch must be present before installation. The --no-build-isolation flag tells the installer to use the current environment’s torch instead of creating a clean build environment:

uv pip install deepspeed --no-build-isolation

This installs the Python package without compiling any CUDA kernels. Ops are compiled at first use via JIT, which adds a one-time delay (seconds to minutes depending on the op) the first time DeepSpeed runs a training job.
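The JIT-built kernels are cached by PyTorch's C++ extension loader, so the delay is paid once per environment rather than on every run. A small sketch of where the artifacts land (this default path comes from PyTorch, not DeepSpeed):

```python
import os

# PyTorch's cpp_extension machinery caches JIT builds under
# TORCH_EXTENSIONS_DIR, falling back to ~/.cache/torch_extensions.
# Deleting this directory forces DeepSpeed to recompile ops on next use.
cache_root = os.environ.get(
    "TORCH_EXTENSIONS_DIR",
    os.path.expanduser("~/.cache/torch_extensions"),
)
print(cache_root)
```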

Pre-compile ops at install time

To avoid JIT compilation delays at runtime, set DS_BUILD_OPS=1 to compile all compatible ops during installation:

DS_BUILD_OPS=1 uv pip install deepspeed --no-build-isolation

This requires nvcc on PATH and a working C++ compiler. The build takes several minutes.

To compile only specific ops, use individual environment variables instead:

Variable                         Op
DS_BUILD_CPU_ADAM                CPU Adam optimizer
DS_BUILD_FUSED_ADAM              Fused Adam (CUDA)
DS_BUILD_AIO                     Async I/O for NVMe offload
DS_BUILD_TRANSFORMER_INFERENCE   Transformer inference kernels
DS_BUILD_SPARSE_ATTN             Sparse attention

Set any of these to 1 to pre-compile that op. For example, to compile only the fused Adam optimizer:

DS_BUILD_FUSED_ADAM=1 pip install deepspeed --no-build-isolation

Add to a uv project

For projects managed with uv using uv add and uv sync, use extra-build-dependencies to inject torch into the isolated build environment. The match-runtime = true option ensures the build uses the same torch version the project resolves at runtime:

[project]
dependencies = ["deepspeed", "torch"]

[tool.uv.extra-build-dependencies]
deepspeed = [{ requirement = "torch", match-runtime = true }]

Then run uv sync as normal. uv handles build isolation and torch injection automatically.

To pre-compile ops during the build, pass environment variables with extra-build-variables:

[tool.uv.extra-build-variables]
deepspeed = { DS_BUILD_OPS = "1" }
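The two settings compose. A minimal pyproject.toml combining them might look like this (the project name and version are placeholders):

```toml
[project]
name = "my-training-project"   # placeholder
version = "0.1.0"
dependencies = ["deepspeed", "torch"]

# Inject the project's own torch into deepspeed's isolated build environment.
[tool.uv.extra-build-dependencies]
deepspeed = [{ requirement = "torch", match-runtime = true }]

# Pre-compile all compatible ops during the build instead of at first use.
[tool.uv.extra-build-variables]
deepspeed = { DS_BUILD_OPS = "1" }
```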

Install with conda-forge or pixi

DeepSpeed is available on conda-forge, though the version may lag behind PyPI. The conda-forge build handles CUDA toolkit dependencies through the solver:

pixi add deepspeed
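In a pixi project, the same dependency can be declared in the manifest instead. A minimal fragment, assuming a recent pixi manifest schema (field names may differ across pixi versions):

```toml
# pixi.toml fragment; project name and platform list are illustrative.
[project]
name = "ds-env"
channels = ["conda-forge"]
platforms = ["linux-64"]

[dependencies]
deepspeed = "*"
```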

For more on when conda-based tools are the better choice for GPU workloads, see uv vs pixi vs conda for Scientific Python.

Verify the installation

After installing, confirm DeepSpeed loads and can report on the build environment:

python -c "import deepspeed; print(deepspeed.__version__)"
ds_report

ds_report prints a table showing which ops are installed (pre-compiled) versus available for JIT compilation. If a required system library is missing, the report flags it.

Troubleshooting

CUDA_HOME does not exist, unable to compile CUDA op(s) during install. The setup.py checks for CUDA_HOME at metadata generation time, before any ops are compiled. Set the environment variable to point to your CUDA toolkit root: export CUDA_HOME=/usr/local/cuda. If using a Docker image, the NVIDIA CUDA devel images set this automatically, but slim Python images do not.

ModuleNotFoundError: No module named 'torch' during install. PyTorch must be installed before DeepSpeed. The setup.py imports torch at the top level. Install PyTorch first, then retry with --no-build-isolation.

RuntimeError: ninja is not available at runtime. DeepSpeed’s JIT compilation uses ninja as its build backend. Install it with pip install ninja or apt install ninja-build.

libaio.h: No such file or directory when building the async I/O op. Install the development headers: apt install libaio-dev on Debian/Ubuntu, or yum install libaio-devel on RHEL/CentOS.

CUDA version mismatch errors. The CUDA toolkit version used to compile ops must be compatible with the CUDA version PyTorch was built against. Check python -c "import torch; print(torch.version.cuda)" and ensure nvcc --version reports a compatible version.
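Compatibility here roughly means the same CUDA major version; minor-version skew between nvcc and PyTorch's CUDA is usually tolerated. A sketch of that heuristic (the function name and exact policy are illustrative, not DeepSpeed's):

```python
def cuda_major_matches(torch_cuda: str, nvcc_cuda: str) -> bool:
    # Illustrative policy: require matching major versions ("12.x" vs "12.y").
    # Strictness varies across DeepSpeed releases; treat this as a heuristic.
    return torch_cuda.split(".")[0] == nvcc_cuda.split(".")[0]

# torch.version.cuda might report "12.1" while nvcc --version reports 12.4:
print(cuda_major_matches("12.1", "12.4"))  # True: same major version
print(cuda_major_matches("11.8", "12.1"))  # False: 11 vs 12
```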

error: invalid command 'bdist_wheel' during install. The wheel package is missing from the environment. Run pip install wheel first, then retry. This happens on minimal base images that don’t ship wheel by default.

DS_BUILD_OPS=1 fails on a machine without a GPU. Pre-compilation requires CUDA headers and a GPU-compatible toolchain even if no physical GPU is present, and without a visible GPU the build cannot auto-detect which compute architectures to target. Either set TORCH_CUDA_ARCH_LIST explicitly (e.g. export TORCH_CUDA_ARCH_LIST="8.0;9.0" for A100/H100 targets) so nvcc knows what to build, or skip DS_BUILD_OPS and let ops JIT-compile on the GPU machine at runtime.
