# How to Install Flash-Attention


[PyPI](https://pydevtools.com/handbook/explanation/what-is-pypi.md) has no [wheels](https://pydevtools.com/handbook/reference/wheel.md) for `flash-attn`, only a source distribution. A `pip install flash-attn` that finds no matching prebuilt wheel compiles the CUDA extensions from source, which can take over two hours (see [Why Installing GPU Python Packages Is So Complicated](https://pydevtools.com/handbook/explanation/installing-cuda-python-packages.md) for background). Prebuilt wheels do exist on GitHub Releases, and the package's `setup.py` fetches one automatically if your environment matches.

## Requirements

- Platform: Linux on NVIDIA Ampere (A100, RTX 3090), Ada Lovelace (RTX 4090), or Hopper (H100). Windows has experimental support since v2.3.2. No macOS support.
- Software: CUDA toolkit >=12.0 with `nvcc` on `PATH`, PyTorch >=2.2 already installed in the target environment.
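
The software requirements above can be checked before starting an install. The following is a minimal sketch, not part of `flash-attn`: the function name `check_requirements` and its result keys are hypothetical. It only looks for `nvcc` on `PATH` and for an installed `torch` of at least 2.2. It does not verify the CUDA toolkit version or the GPU architecture, and it reports Windows as unsupported even though experimental support exists.

```python
import importlib.metadata
import platform
import shutil


def check_requirements(min_torch=(2, 2)):
    """Return a dict of requirement name -> bool (hypothetical helper)."""
    results = {
        "linux": platform.system() == "Linux",
        # A source build needs the CUDA compiler on PATH.
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "torch_installed": False,
        "torch_new_enough": False,
    }
    try:
        torch_version = importlib.metadata.version("torch")
    except importlib.metadata.PackageNotFoundError:
        return results
    results["torch_installed"] = True
    # Drop any local suffix such as "+cu128", then compare (major, minor).
    release = torch_version.split("+")[0].split(".")
    try:
        found = (int(release[0]), int(release[1]))
        results["torch_new_enough"] = found >= min_torch
    except (ValueError, IndexError):
        pass  # Unrecognized version string: leave the check as failed.
    return results


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Run it with the same interpreter that will run the install, so that it inspects the target environment.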

## Install with prebuilt wheels (recommended)

The `flash-attn` package includes a `CachedWheelsCommand` in its `setup.py` that tries to download a matching prebuilt wheel from GitHub Releases before falling back to compilation. The `--no-build-isolation` flag is required because `setup.py` imports `torch` and `packaging` at the top level. Both must be installed in the environment before running the install:

```sh
uv pip install packaging
uv pip install flash-attn --no-build-isolation
```
```sh
pip install packaging
pip install flash-attn --no-build-isolation
```
[pip](https://pydevtools.com/handbook/reference/pip.md) and [uv](https://pydevtools.com/handbook/reference/uv.md) both support `--no-build-isolation`. The flag tells the installer to use packages from the current environment during the build instead of creating an isolated one.

When a prebuilt wheel matches, installation completes in seconds. When none matches, the command silently falls back to compiling from source, which takes much longer. If the install takes more than a minute, a prebuilt wheel was most likely not found for your configuration.

### Prebuilt wheel coverage (v2.8.3)

All prebuilt wheels target CUDA 12 on Linux, almost all of them for x86_64. Two aarch64 wheels exist for torch 2.9.

| PyTorch | Python 3.9 | Python 3.10 | Python 3.11 | Python 3.12 | Python 3.13 |
|---------|------------|-------------|-------------|-------------|-------------|
| 2.4     | yes        | yes         | yes         | yes         |             |
| 2.5     | yes        | yes         | yes         | yes         | yes         |
| 2.6     | yes        | yes         | yes         | yes         | yes         |
| 2.7     | yes        | yes         | yes         | yes         | yes         |
| 2.8     | yes        | yes         | yes         | yes         | yes         |
| 2.9     |            |             |             | yes         |             |

Each cell has both CXX11 ABI `TRUE` and `FALSE` variants. If your combination is not in this table, skip to [Build from source](#build-from-source).
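
For scripted checks, the table above can be transcribed into a small lookup. This is a sketch that assumes the v2.8.3 coverage shown here; the names `PREBUILT_V2_8_3` and `has_prebuilt_wheel` are hypothetical, and the data needs updating for any other release.

```python
# Python versions with a prebuilt wheel, keyed by PyTorch minor version.
# Transcribed from the v2.8.3 table; other releases will differ.
_PY_39_TO_312 = frozenset({"3.9", "3.10", "3.11", "3.12"})
_PY_39_TO_313 = _PY_39_TO_312 | {"3.13"}

PREBUILT_V2_8_3 = {
    "2.4": _PY_39_TO_312,
    "2.5": _PY_39_TO_313,
    "2.6": _PY_39_TO_313,
    "2.7": _PY_39_TO_313,
    "2.8": _PY_39_TO_313,
    "2.9": frozenset({"3.12"}),
}


def has_prebuilt_wheel(torch_minor, python_minor):
    """True if the table lists a wheel for this torch/Python pair."""
    return python_minor in PREBUILT_V2_8_3.get(torch_minor, frozenset())


print(has_prebuilt_wheel("2.7", "3.12"))   # listed in the table
print(has_prebuilt_wheel("2.4", "3.13"))   # empty cell in the table
```

Both arguments are `major.minor` strings, so `"2.10"` and `"2.1"` stay distinct.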

> [!WARNING]
> Prebuilt wheels lag behind PyTorch releases. If you are on a newer version of PyTorch than what is listed above (e.g. torch 2.10 or 2.11), no prebuilt wheel exists and the install will fall back to compiling from source. Either pin a supported PyTorch version or follow the [Build from source](#build-from-source) instructions.

## Add to a uv project

For projects managed with [uv](https://pydevtools.com/handbook/reference/uv.md) using `uv add` and `uv sync`, the `--no-build-isolation` approach above does not apply. Instead, uv provides [`extra-build-dependencies`](https://docs.astral.sh/uv/concepts/projects/config/#augmenting-build-dependencies) to inject `torch` into the isolated build environment. The `match-runtime = true` option ensures the build uses the same torch version the project resolves at runtime:

```toml
[project]
dependencies = ["flash-attn", "torch"]

[tool.uv.extra-build-dependencies]
flash-attn = ["packaging", { requirement = "torch", match-runtime = true }]
```

Then run `uv sync` as normal. uv handles the build isolation, torch injection, and version matching automatically. The build will compile CUDA extensions from source, which requires `nvcc` on `PATH` and takes several minutes.

To control the build, pass environment variables with `extra-build-variables`. For example, to limit parallel compilation jobs on low-memory machines:

```toml
[tool.uv.extra-build-variables]
flash-attn = { MAX_JOBS = "4" }
```

## Install from a direct wheel URL

When automatic download fails or when pinning a specific wheel in a requirements file, construct the URL and install it directly. The URL pattern is:

```
https://github.com/Dao-AILab/flash-attention/releases/download/v{version}/flash_attn-{version}+cu{cuda}torch{torch}cxx11abi{abi}-cp{py}-cp{py}-linux_x86_64.whl
```

To fill in the blanks, check your environment:

```sh
python -c "import torch; print('torch:', '.'.join(torch.__version__.split('+')[0].split('.')[:2]))"
python -c "import torch; print('cxx11abi:', torch._C._GLIBCXX_USE_CXX11_ABI)"
python -c "import sys; print('python: cp' + ''.join(map(str, sys.version_info[:2])))"
```

Then install the matching wheel. For example, with Python 3.12, PyTorch 2.7, and CXX11 ABI `True` (written `TRUE` in the filename):

```sh
uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
```
```sh
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
```
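
When the URL needs to be generated in a script, the pattern above can be filled in with plain string formatting. This is a minimal sketch; the function name `wheel_url` and its parameters are hypothetical, and it only builds the string without checking that the wheel exists on GitHub Releases.

```python
WHEEL_URL = (
    "https://github.com/Dao-AILab/flash-attention/releases/download/"
    "v{version}/flash_attn-{version}+cu{cuda}torch{torch}"
    "cxx11abi{abi}-cp{py}-cp{py}-linux_x86_64.whl"
)


def wheel_url(version, cuda, torch, cxx11_abi, python):
    """Fill in the release URL pattern for one x86_64 wheel."""
    return WHEEL_URL.format(
        version=version,
        cuda=cuda,
        torch=torch,
        # Python prints True/False; the filename uses TRUE/FALSE.
        abi="TRUE" if cxx11_abi else "FALSE",
        # "3.12" becomes the tag digits "312".
        py=python.replace(".", ""),
    )


print(wheel_url("2.8.3", "12", "2.7", True, "3.12"))
```

With these arguments the function reproduces the URL used in the install commands above.
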
## Build from source

When no prebuilt wheel matches your environment, install the build dependencies into the same environment as PyTorch and then compile with a constrained job count:

```sh
uv pip install ninja packaging
MAX_JOBS=4 uv pip install flash-attn --no-build-isolation
```
```sh
pip install ninja packaging
MAX_JOBS=4 pip install flash-attn --no-build-isolation
```
`ninja` is critical. Without it, the build falls back to single-threaded compilation that takes roughly two hours instead of minutes. `MAX_JOBS=4` prevents out-of-memory kills on machines with less than 96 GB of RAM; on machines with more memory, raise the number to speed up the build.
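
The memory guidance can be turned into a rough starting value for `MAX_JOBS`. The sketch below is a heuristic, not something `flash-attn` provides: `suggest_max_jobs` and `total_ram_gb` are hypothetical names, and using one job per CPU core above the 96 GB threshold is an assumption. Lower the result if the build is still killed.

```python
import os


def suggest_max_jobs(ram_gb, cpu_count, low_memory_jobs=4, threshold_gb=96):
    """Pick a MAX_JOBS value from total RAM and core count (heuristic)."""
    if ram_gb < threshold_gb:
        # Below the threshold, cap at 4 jobs as recommended above.
        return max(1, min(low_memory_jobs, cpu_count))
    # Assumption: with enough memory, one job per core is acceptable.
    return max(1, cpu_count)


def total_ram_gb():
    """Physical memory via POSIX sysconf (works on Linux)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 1024**3


if __name__ == "__main__":
    jobs = suggest_max_jobs(total_ram_gb(), os.cpu_count() or 1)
    print(f"MAX_JOBS={jobs}")
```

The kernel usually reports slightly less memory than the nominal amount, so a machine sold as 96 GB will typically land on the conservative side of the threshold.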

See [How to Install PyTorch with uv](https://pydevtools.com/handbook/how-to/how-to-install-pytorch-with-uv.md) for getting a compatible PyTorch installation in place first.

## Install with conda-forge or pixi

[conda](https://pydevtools.com/handbook/reference/conda.md) and [pixi](https://pydevtools.com/handbook/reference/pixi.md) users can skip the wheel and compilation complexity entirely. The [conda-forge](https://pydevtools.com/handbook/reference/conda-forge.md) build handles the CUDA toolkit dependency as part of the solver, so there is no need to manage `nvcc`, ABI variants, or `--no-build-isolation`.

```sh
pixi add flash-attn
```
```sh
conda install -c conda-forge flash-attn
```
Packages are available for `linux-64` and `linux-aarch64`. For more on when conda-based tools are the better choice for GPU workloads, see [uv vs pixi vs conda for Scientific Python](https://pydevtools.com/handbook/explanation/uv-vs-pixi-vs-conda-for-scientific-python.md).

## Verify the installation

After installing, confirm that flash-attention imports and reports its version:

```sh
python -c "import flash_attn; print(flash_attn.__version__)"
```

If this fails with `ModuleNotFoundError`, the installation did not complete. Check the install output for errors. If it fails with a CUDA-related error at import time, the installed build may not match your driver or GPU architecture.
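
The version check only proves that the package imports. To also confirm that the GPU is visible and a kernel runs, a short script along these lines can help. It is a sketch: `check_flash_attn` and the report keys are hypothetical names, and the tensor sizes are arbitrary. It relies on `torch.cuda.is_available()`, `torch.cuda.get_device_capability()`, and `flash_attn.flash_attn_func`, which takes fp16 or bf16 tensors shaped `(batch, seqlen, heads, head_dim)`.

```python
def check_flash_attn():
    """Collect install facts; a failed import leaves the defaults in place."""
    report = {
        "flash_attn_version": None,
        "cuda_available": False,
        "compute_capability": None,
        "kernel_ran": False,
    }
    try:
        import torch
        import flash_attn
        from flash_attn import flash_attn_func
    except ImportError:
        return report

    report["flash_attn_version"] = flash_attn.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if not report["cuda_available"]:
        return report

    # Ampere, Ada Lovelace, and Hopper report capability (8, 0) or higher.
    report["compute_capability"] = torch.cuda.get_device_capability()

    # Tiny forward pass. Errors raised by the kernel itself propagate,
    # which is useful when diagnosing a mismatched build.
    q, k, v = (
        torch.randn(1, 16, 2, 64, device="cuda", dtype=torch.float16)
        for _ in range(3)
    )
    out = flash_attn_func(q, k, v, causal=True)
    report["kernel_ran"] = out.shape == q.shape
    return report


if __name__ == "__main__":
    for key, value in check_flash_attn().items():
        print(f"{key}: {value}")
```

A report with a version but `cuda_available: False` points at the driver or PyTorch build rather than at flash-attention.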

## Troubleshooting

`ModuleNotFoundError: No module named 'packaging'` during install. The `setup.py` imports `packaging` before it does anything else, including downloading prebuilt wheels. Run `pip install packaging` first, then retry.

`ModuleNotFoundError: No module named 'torch'` during install. PyTorch must be installed in the environment before running `pip install flash-attn`. The `--no-build-isolation` flag tells pip to use the current environment's packages during the build, and `setup.py` imports `torch` immediately. Install PyTorch first, then retry.

Build starts compiling instead of downloading a wheel. Most likely your combination of Python version, PyTorch version, and CXX11 ABI does not have a prebuilt wheel. Check the [compatibility table](#prebuilt-wheel-coverage-v283) above. If your combination is listed, check what `torch.__version__` reports. PyTorch installed from a CUDA-specific index (e.g. `download.pytorch.org/whl/cu128`) reports a version like `2.7.1+cu128`. The `+cu128` local version suffix can cause the auto-download to construct the wrong wheel URL. If this happens, use the [direct wheel URL method](#install-from-a-direct-wheel-url) instead.
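
When building a direct URL by hand, the local suffix has to be dropped before taking the `major.minor` part. The helper below is a sketch with a hypothetical name, `torch_minor`; it illustrates the normalization and is not the logic `setup.py` uses.

```python
def torch_minor(version_string):
    """Reduce a torch version such as '2.7.1+cu128' to 'major.minor'."""
    # Everything after "+" is the local version suffix.
    public = version_string.split("+", 1)[0]
    major, minor = public.split(".")[:2]
    return f"{major}.{minor}"


print(torch_minor("2.7.1+cu128"))
print(torch_minor("2.10.0"))
```

Splitting on the dots matters for two-digit minor versions: taking the first three characters of `2.10.0` would yield `2.1`.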


Build killed by OOM. Set `MAX_JOBS=2` or `MAX_JOBS=1` to reduce parallel compilation. Each compilation job can consume several gigabytes of memory.

`nvcc` not found. The CUDA toolkit is not on `PATH`. Install it from [NVIDIA's CUDA toolkit archive](https://developer.nvidia.com/cuda-toolkit-archive) or use a Docker image with CUDA pre-installed (such as `nvidia/cuda:12.8.0-devel-ubuntu22.04`).

Build takes hours. Install `ninja` (`pip install ninja`) and retry. Without `ninja`, the CUDA extensions compile one file at a time.

> [!NOTE]
> FlashAttention-4 is a separate package (`pip install --pre flash-attn-4`) that uses JIT compilation and ships as a pure Python wheel. No CUDA compiler or `--no-build-isolation` flag needed. It requires a Hopper or Blackwell GPU and CUDA >=12.3.

## Related

Handbook articles:

- [Why Installing GPU Python Packages Is So Complicated](https://pydevtools.com/handbook/explanation/installing-cuda-python-packages.md) explains the wheel format limitations that make `flash-attn` packaging so difficult
- [How to Install PyTorch with uv](https://pydevtools.com/handbook/how-to/how-to-install-pytorch-with-uv.md) covers getting PyTorch installed before adding flash-attention
- [How to Install RAPIDS with uv](https://pydevtools.com/handbook/how-to/how-to-install-rapids-with-uv.md) covers another GPU package that requires custom index configuration
- [uv vs pixi vs conda for Scientific Python](https://pydevtools.com/handbook/explanation/uv-vs-pixi-vs-conda-for-scientific-python.md) compares tooling choices for GPU workloads

External resources:

- [flash-attention GitHub repository](https://github.com/Dao-AILab/flash-attention) for release notes and issue tracker
- [flash-attn on PyPI](https://pypi.org/project/flash-attn/) (source distributions only)
- [flash-attn on conda-forge](https://github.com/conda-forge/flash-attn-feedstock) for the conda-forge build recipe
- [v2.8.3 prebuilt wheels](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.8.3) on GitHub Releases
