How to Install DeepSpeed
DeepSpeed publishes only a source distribution on PyPI, with no prebuilt wheels. The setup.py requires both PyTorch and a CUDA toolkit to generate metadata, so even a basic install needs CUDA_HOME set and nvcc on PATH. Once installed, individual ops (fused Adam, CPU offloading, transformer kernels) are compiled on first use through PyTorch’s JIT C++ extension system. See Why Installing GPU Python Packages Is So Complicated for background.
Requirements
- Platform: Linux (x86_64). Windows has partial support through WSL2. No macOS GPU support.
- Software: PyTorch already installed, a C++ compiler (`gcc` or `g++`), the CUDA toolkit with `nvcc` on `PATH`, and `CUDA_HOME` set to the toolkit root (e.g. `/usr/local/cuda`).
- System libraries: `libaio-dev` is required for the async I/O op used by ZeRO-Infinity and NVMe offloading. Install it with `apt install libaio-dev` on Debian/Ubuntu.
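As a quick sanity check before installing, the prerequisites above can be probed from Python. This is a minimal sketch (the tool and variable names come from the list above); it only reports what is visible on `PATH` and in the environment, it does not validate versions:

```python
import os
import shutil

def check_build_prereqs():
    """Report which DeepSpeed build prerequisites are visible (sketch)."""
    found = {tool: shutil.which(tool) is not None
             for tool in ("gcc", "g++", "nvcc", "ninja")}
    # CUDA_HOME must point at the toolkit root, e.g. /usr/local/cuda
    cuda_home = os.environ.get("CUDA_HOME")
    found["CUDA_HOME"] = cuda_home is not None and os.path.isdir(cuda_home)
    return found

for name, ok in check_build_prereqs().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Run this in the environment you intend to install into; a `MISSING` entry for `nvcc` or `CUDA_HOME` predicts the install-time failures described under Troubleshooting below.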
Install from PyPI
DeepSpeed’s setup.py imports torch at the top level, so PyTorch must be present before installation. The --no-build-isolation flag tells the installer to use the current environment’s torch instead of creating a clean build environment:
```bash
uv pip install deepspeed --no-build-isolation
```

This installs the Python package without compiling any CUDA kernels. Ops are compiled at first use via JIT, which adds a one-time delay (seconds to minutes depending on the op) the first time DeepSpeed runs a training job.
Pre-compile ops at install time
To avoid JIT compilation delays at runtime, set DS_BUILD_OPS=1 to compile all compatible ops during installation:
```bash
DS_BUILD_OPS=1 uv pip install deepspeed --no-build-isolation
```

This requires `nvcc` on `PATH` and a working C++ compiler. The build takes several minutes.
To compile only specific ops, use individual environment variables instead:
| Variable | Op |
|---|---|
| `DS_BUILD_CPU_ADAM` | CPU Adam optimizer |
| `DS_BUILD_FUSED_ADAM` | Fused Adam (CUDA) |
| `DS_BUILD_AIO` | Async I/O for NVMe offload |
| `DS_BUILD_TRANSFORMER_INFERENCE` | Transformer inference kernels |
| `DS_BUILD_SPARSE_ATTN` | Sparse attention |
Set any of these to 1 to pre-compile that op. For example, to compile only the fused Adam optimizer:
```bash
DS_BUILD_FUSED_ADAM=1 pip install deepspeed --no-build-isolation
```

Add to a uv project
For projects managed with uv using uv add and uv sync, use extra-build-dependencies to inject torch into the isolated build environment. The match-runtime = true option ensures the build uses the same torch version the project resolves at runtime:
```toml
[project]
dependencies = ["deepspeed", "torch"]

[tool.uv.extra-build-dependencies]
deepspeed = [{ requirement = "torch", match-runtime = true }]
```

Then run `uv sync` as normal. uv handles build isolation and torch injection automatically.
To pre-compile ops during the build, pass environment variables with extra-build-variables:
```toml
[tool.uv.extra-build-variables]
deepspeed = { DS_BUILD_OPS = "1" }
```

Install with conda-forge or pixi
DeepSpeed is available on conda-forge, though the version may lag behind PyPI. The conda-forge build handles CUDA toolkit dependencies through the solver:
```bash
pixi add deepspeed
```

For more on when conda-based tools are the better choice for GPU workloads, see uv vs pixi vs conda for Scientific Python.
Verify the installation
After installing, confirm DeepSpeed loads and can report on the build environment:
```bash
python -c "import deepspeed; print(deepspeed.__version__)"
ds_report
```

`ds_report` prints a table showing which ops are installed (pre-compiled) versus available for JIT compilation. If a required system library is missing, the report flags it.
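If you need this op status programmatically (for example, to gate a CI job on a pre-compiled op), you can scrape the report output. The parser below is a sketch against an assumed line format of the form `op_name ..... [YES] ..... [OKAY]`; the sample text is illustrative, so verify the pattern against your actual `ds_report` output before relying on it:

```python
import re

def parse_op_status(report_text):
    """Sketch: map op names to True (pre-compiled) or False (JIT-only).

    Assumes lines shaped like "fused_adam ..... [NO] ..... [OKAY]".
    """
    ops = {}
    for line in report_text.splitlines():
        m = re.match(r"(\w+)\s*\.+\s*\[(YES|NO)\]", line)
        if m:
            ops[m.group(1)] = m.group(2) == "YES"
    return ops

# Illustrative sample, not verbatim ds_report output:
sample = """\
cpu_adam ............ [YES] ...... [OKAY]
fused_adam .......... [NO] ....... [OKAY]
"""
print(parse_op_status(sample))  # {'cpu_adam': True, 'fused_adam': False}
```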
Troubleshooting
CUDA_HOME does not exist, unable to compile CUDA op(s) during install. The setup.py checks for CUDA_HOME at metadata generation time, before any ops are compiled. Set the environment variable to point to your CUDA toolkit root: export CUDA_HOME=/usr/local/cuda. If using a Docker image, the NVIDIA CUDA devel images set this automatically, but slim Python images do not.
ModuleNotFoundError: No module named 'torch' during install. PyTorch must be installed before DeepSpeed. The setup.py imports torch at the top level. Install PyTorch first, then retry with --no-build-isolation.
RuntimeError: ninja is not available at runtime. DeepSpeed’s JIT compilation uses ninja as its build backend. Install it with pip install ninja or apt install ninja-build.
libaio.h: No such file or directory when building the async I/O op. Install the development headers: apt install libaio-dev on Debian/Ubuntu, or yum install libaio-devel on RHEL/CentOS.
CUDA version mismatch errors. The CUDA toolkit version used to compile ops must be compatible with the CUDA version PyTorch was built against. Check python -c "import torch; print(torch.version.cuda)" and ensure nvcc --version reports a compatible version.
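The coarse rule can be encoded in a small helper. This sketch only requires the CUDA major versions to match, which is a common convention (DeepSpeed's exact compatibility check may differ, e.g. in how much minor-version skew it tolerates); the version strings would come from `torch.version.cuda` and `nvcc --version`:

```python
def cuda_versions_compatible(torch_cuda: str, nvcc_cuda: str) -> bool:
    """Coarse check: CUDA major versions must match.

    Sketch only, not DeepSpeed's exact rule. Inputs are version strings
    like "12.1" (from torch.version.cuda) and "12.4" (from nvcc).
    """
    torch_major = int(torch_cuda.split(".")[0])
    nvcc_major = int(nvcc_cuda.split(".")[0])
    return torch_major == nvcc_major

print(cuda_versions_compatible("12.1", "12.4"))  # True: same major version
print(cuda_versions_compatible("11.8", "12.4"))  # False: major mismatch
```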
error: invalid command 'bdist_wheel' during install. The wheel package is missing from the environment. Run pip install wheel first, then retry. This happens on minimal base images that don’t ship wheel by default.
DS_BUILD_OPS=1 fails on a machine without a GPU. Pre-compilation requires CUDA headers and a GPU-compatible toolchain even if no physical GPU is present. On CPU-only machines, skip DS_BUILD_OPS and let ops JIT-compile on the GPU machine at runtime.
Related
Handbook articles:
- Why Installing GPU Python Packages Is So Complicated explains the wheel format limitations that affect DeepSpeed packaging
- How to Install PyTorch with uv covers getting PyTorch installed before adding DeepSpeed
- uv vs pixi vs conda for Scientific Python compares tooling choices for GPU workloads
External resources:
- DeepSpeed GitHub repository for documentation and issue tracker
- deepspeed on PyPI (source distributions only)
- DeepSpeed installation guide for the official docs