Set up a multi-GPU training environment with uv
This tutorial builds a PyTorch training project managed by uv that runs identically on one GPU or eight. You will install PyTorch with CUDA support, add Hugging Face Accelerate for distributed training, write a CIFAR-10 classifier, and launch it across multiple GPUs.
Important
Multi-GPU CUDA training requires Linux with two or more NVIDIA GPUs. This tutorial does not work on macOS (no CUDA support) or Windows (limited multi-GPU support outside WSL2).
Prerequisites
- uv installed (see the installation guide)
- A Linux machine with two or more NVIDIA GPUs
- NVIDIA drivers installed
Run nvidia-smi to confirm your GPUs are visible:
nvidia-smi --query-gpu=index,name --format=csv,noheader
You should see one line per GPU:
0, NVIDIA Tesla T4
1, NVIDIA Tesla T4
If nvidia-smi is not found or shows no GPUs, install or update your NVIDIA drivers before continuing.
Creating the project
$ uv init --python 3.13 distributed_training
Initialized project `distributed-training` at `/path/to/distributed_training`
$ cd distributed_training
The --python 3.13 flag sets requires-python = ">=3.13" in pyproject.toml, which ensures compatibility with current PyTorch CUDA builds.
Configuring PyTorch with CUDA support
PyTorch publishes separate wheel builds for each CUDA version on its own package index. To get GPU-accelerated builds, tell uv to fetch PyTorch packages from the CUDA 12.8 index. Open pyproject.toml and add these sections after the [project] table:
[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
[tool.uv.sources]
torch = [{ index = "pytorch-cu128" }]
torchvision = [{ index = "pytorch-cu128" }]
Setting explicit = true prevents uv from searching this index for unrelated packages. See How to Install PyTorch with uv for other CUDA versions and cross-platform configurations with platform markers.
Now add PyTorch and Accelerate:
$ uv add torch torchvision accelerate
Resolved 56 packages in 548ms
Prepared 54 packages in 1m 15s
Installed 54 packages in 535ms
+ accelerate==1.13.0
+ torch==2.11.0+cu128
+ torchvision==0.26.0+cu128
...
The +cu128 suffix on torch and torchvision confirms uv pulled CUDA 12.8 builds from the PyTorch index. If you see versions without this suffix, the [tool.uv.sources] section is missing or misspelled.
Verify that PyTorch detects your GPUs:
$ uv run python -c "import torch; print(f'GPUs: {torch.cuda.device_count()}')"
GPUs: 2
The number should match what nvidia-smi reported. If it prints GPUs: 0, PyTorch installed a CPU-only build. Double-check that [tool.uv.sources] routes torch to the CUDA index.
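For a more detailed check, the standard torch APIs below report the CUDA version the wheel was built against and each visible device. This is an optional diagnostic sketch, not part of the project files:
import torch

# Optional diagnostic: confirm the wheel is a CUDA build and list devices.
print(torch.version.cuda)  # e.g. "12.8" for a cu128 wheel; None on CPU-only builds
print(torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))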
Writing the training script
Hugging Face Accelerate wraps PyTorch’s distributed training APIs behind a few function calls. You write your training loop once, and Accelerate handles distributing data, synchronizing gradients, and assigning GPUs. The same script runs on a single GPU during development and across multiple GPUs in production.
Create train.py:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from accelerate import Accelerator


def build_model():
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, 256),
        nn.ReLU(),
        nn.Linear(256, 10),
    )


def train():
    accelerator = Accelerator()

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])
    dataset = datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform,
    )
    dataloader = DataLoader(
        dataset, batch_size=128, shuffle=True, num_workers=2,
    )

    model = build_model()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    model, optimizer, dataloader = accelerator.prepare(
        model, optimizer, dataloader,
    )

    for epoch in range(3):
        model.train()
        total_loss = 0.0
        correct = 0
        total = 0
        for images, labels in dataloader:
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            accelerator.backward(loss)
            optimizer.step()

            total_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

        accuracy = 100.0 * correct / total
        avg_loss = total_loss / len(dataloader)
        accelerator.print(
            f"Epoch {epoch + 1}: loss={avg_loss:.4f}, accuracy={accuracy:.1f}%"
        )


if __name__ == "__main__":
    train()
Three patterns make this script distributed-ready:
- accelerator.prepare(model, optimizer, dataloader) wraps each object for distributed execution. The model gets automatic gradient synchronization, and the dataloader shards batches across GPUs so each process sees a different slice of the data.
- accelerator.backward(loss) replaces loss.backward() to handle gradient scaling.
- accelerator.print() only prints from the main process, preventing duplicate output when running on multiple GPUs.
When launched as a regular Python script, Accelerator() detects no distributed environment and runs everything on a single GPU. No code changes are needed to switch between single-GPU and multi-GPU execution.
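You can observe this detection directly. As a sketch (not part of train.py), printing a few Accelerator attributes shows how the same object adapts to its launch environment:
from accelerate import Accelerator

accelerator = Accelerator()
# Under plain `python`, this reports one process on a single device; under
# `accelerate launch` or torchrun, each process reports its own rank.
print(f"processes={accelerator.num_processes}")
print(f"rank={accelerator.process_index}, device={accelerator.device}")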
Training on a single GPU
Download the CIFAR-10 dataset (~170 MB) before the first training run:
uv run python -c "from torchvision.datasets import CIFAR10; CIFAR10('./data', download=True)"
Tip
Downloading data before training avoids a race condition where multiple GPU processes try to download the same files simultaneously. Pre-fetching is a good habit for distributed training projects.
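If you prefer to keep the download inside the script instead, Accelerate's main_process_first() context manager avoids the same race. A sketch:
from accelerate import Accelerator
from torchvision import datasets

accelerator = Accelerator()
# The main process enters the block first and downloads the files; the other
# processes wait, then find the data already on disk.
with accelerator.main_process_first():
    dataset = datasets.CIFAR10(root="./data", train=True, download=True)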
Run the training script:
uv run python train.py
Expected output after three epochs:
Epoch 1: loss=1.4009, accuracy=49.7%
Epoch 2: loss=1.0130, accuracy=64.3%
Epoch 3: loss=0.8505, accuracy=70.2%
Your exact numbers will differ from run to run, since the script does not fix a random seed for weight initialization or shuffling. The model trains on all 50,000 CIFAR-10 images using one GPU.
Scaling to multiple GPUs
Launch the same script across two GPUs:
uv run accelerate launch --num_processes 2 train.py
Expected output:
Epoch 1: loss=1.5035, accuracy=46.1%
Epoch 2: loss=1.1370, accuracy=59.7%
Epoch 3: loss=0.9677, accuracy=65.8%
The output looks identical in structure to the single-GPU run: three lines, one per epoch, from the main process only. Behind the scenes, Accelerate launched two processes (one per GPU), divided the batches between them, and synchronized gradients after each step. Each GPU processed half the data per epoch.
Note
Accuracy and loss values change slightly between single-GPU and multi-GPU runs because each GPU sees a different data subset per batch. This is expected behavior for distributed data-parallel training.
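To see the batch sharding directly, the following sketch (using a toy dataset rather than the CIFAR-10 loader from train.py) prints how many batches each process receives. Run it with accelerate launch --num_processes 2:
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
dataset = TensorDataset(torch.arange(1000))
# prepare() shards the dataloader: with two processes, each rank iterates
# roughly half of the 10 total batches.
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=100))
print(f"rank {accelerator.process_index}: {len(dataloader)} batches")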
Generating a persistent Accelerate config
Instead of passing --num_processes every time you launch, generate a configuration file:
uv run accelerate config
Accelerate asks a series of questions about your hardware. For a single machine with two GPUs:
In which compute environment are you running? This machine
Which type of machine are you using? multi-GPU
How many different machines will you use? 1
How many processes in total will you use? 2
Do you wish to use mixed precision? fp16
Selecting fp16 mixed precision cuts GPU memory usage and speeds up training by performing some operations in 16-bit floating point. The Accelerator() in your script reads this config automatically.
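If you would rather keep the setting in code than in the config file, Accelerator also accepts it as a constructor argument. A sketch:
from accelerate import Accelerator

# Equivalent to answering fp16 in `accelerate config`; when both are set,
# the explicit constructor argument takes precedence over the config file.
accelerator = Accelerator(mixed_precision="fp16")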
Running accelerate config creates ~/.cache/huggingface/accelerate/default_config.yaml. From now on, launch without flags:
uv run accelerate launch train.py
Launching with torchrun
accelerate launch wraps PyTorch’s torchrun launcher. You can use torchrun directly:
uv run torchrun --nproc_per_node=2 train.py
--nproc_per_node sets the number of GPU processes. torchrun configures the MASTER_ADDR, MASTER_PORT, RANK, and LOCAL_RANK environment variables that PyTorch’s distributed backend requires. Accelerate’s Accelerator() detects these variables and configures itself automatically.
For single-machine multi-GPU training, both launchers produce identical results. accelerate launch adds the persistent config file and flags like --mixed_precision.
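To confirm what torchrun provides, a small diagnostic sketch can print those variables from inside the script; under a plain python run they are unset:
import os

# Diagnostic sketch: inspect the rendezvous variables torchrun exports.
for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE"):
    print(var, "=", os.environ.get(var))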
Reviewing the project structure
- pyproject.toml
- uv.lock
- train.py
- README.md
- .python-version
A collaborator can clone this project, run uv sync on their own GPU machine, and launch training with uv run accelerate launch train.py. The lockfile guarantees they get identical package versions.
Next steps
- Why Installing GPU Python Packages Is So Complicated explains the index URL workaround and how upcoming wheel variants will change it
- Accelerate documentation for FSDP, DeepSpeed integration, and multi-node training
- PyTorch Distributed Overview for the lower-level APIs that Accelerate wraps
- Set Up a GPU Data Science Project with pixi for a conda-forge approach to GPU projects