How to Install Hugging Face Transformers with uv

The Transformers library from Hugging Face provides access to thousands of pretrained models for text, image, audio, and multimodal tasks. Installing it with uv is straightforward for CPU inference, but GPU acceleration requires PyTorch index configuration. This is the same CUDA routing pattern covered in How to Install PyTorch with uv.

This guide covers three install paths: CPU-only inference, GPU training and inference with CUDA, and quantized model loading with accelerate and bitsandbytes.

Install for CPU inference

For tasks that run on CPU (sentiment analysis, text generation with small models, embeddings), add Transformers to your project:

uv add transformers

This installs Transformers and its core dependencies (huggingface-hub, tokenizers, safetensors) but not PyTorch. Most inference and training features require a deep learning backend, so install PyTorch alongside it:

uv add "transformers[torch]"

The [torch] extra pulls in a CPU-compatible PyTorch build from PyPI. Verify the install:

uv run python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('uv is fast'))"

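The first run downloads a default sentiment model from the Hugging Face Hub, so it needs network access. The output should show a label and confidence score (the exact score may vary):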

[{'label': 'POSITIVE', 'score': 0.9789}]
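To see which of the packages mentioned above actually landed in the environment, a short script along these lines can help. It is a sketch that uses only the standard library's importlib.metadata; the helper name installed_versions and the package list are illustrative, not part of uv or Transformers.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the text above; adjust the list as needed.
PACKAGES = ["transformers", "huggingface-hub", "tokenizers", "safetensors", "torch"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```

Saved under a name of your choice and run with uv run python, it reports torch as not installed when only the bare transformers package was added.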

Install with GPU support

GPU acceleration requires PyTorch built against the right CUDA version. PyPI’s default PyTorch wheels are CPU-only on Windows and macOS. On Linux, PyPI carries CUDA 12.8 wheels as of PyTorch 2.9.1, but your system may need a different CUDA version.

Configure CUDA in pyproject.toml

Add a PyTorch CUDA index and route GPU packages to it. This example uses CUDA 12.8:

pyproject.toml
[project]
name = "my-ml-project"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "transformers[torch]",
]

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true

[tool.uv.sources]
torch = [
  { index = "pytorch-cu128", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
]
torchvision = [
  { index = "pytorch-cu128", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
]

Then lock and sync:

uv lock
uv sync

The platform markers restrict CUDA builds to Linux and Windows. macOS falls back to PyPI’s CPU wheels because CUDA builds are not available for macOS. See How to Install PyTorch with uv for the full configuration reference, including multi-backend extras and ROCm support.
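The marker logic can be illustrated in a few lines of Python. This is only a sketch of what the expression means: uv evaluates markers itself during resolution, and the function name uses_cuda_index is hypothetical.

```python
import sys


def uses_cuda_index(platform: str) -> bool:
    """Mirror the marker: sys_platform == 'linux' or sys_platform == 'win32'."""
    return platform in ("linux", "win32")


if __name__ == "__main__":
    # The sys_platform marker corresponds to Python's sys.platform value.
    for platform in ("linux", "win32", "darwin"):
        source = "pytorch-cu128 index" if uses_cuda_index(platform) else "PyPI (default)"
        print(f"{platform}: torch resolves from {source}")
    print(f"this machine ({sys.platform}): CUDA index = {uses_cuda_index(sys.platform)}")
```

On macOS, sys.platform is darwin, so the marker is false and the torch source entry does not apply.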

Verify GPU access

uv run python -c "import torch; ok = torch.cuda.is_available(); print(ok, torch.cuda.get_device_name(0) if ok else 'no CUDA device')"

This should print True followed by your GPU name. If it prints False, check that your NVIDIA driver is installed (nvidia-smi) and that the driver supports the CUDA version your PyTorch build targets.
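A slightly longer script can also report which CUDA version the installed PyTorch build was compiled against, which helps with the driver mismatch case mentioned above. This is a minimal sketch; describe_device is a hypothetical helper name, and it degrades gracefully when torch is missing.

```python
def describe_device() -> str:
    """Return a one-line summary of the PyTorch build and visible GPU, if any."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"

    # torch.version.cuda is None when the build was not compiled against CUDA.
    build = torch.version.cuda or "none"
    if torch.cuda.is_available():
        return f"CUDA build: {build}, GPU: {torch.cuda.get_device_name(0)}"
    return f"CUDA build: {build}, no usable GPU detected"


if __name__ == "__main__":
    print(describe_device())
```

A result of "CUDA build: none" suggests the CPU wheel from PyPI was resolved instead of the CUDA index, while a CUDA build with no usable GPU points toward the driver.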

Install GPU support without a project

For one-off experimentation without a pyproject.toml, use uv pip with --torch-backend:

uv venv --python 3.12 --seed --managed-python
source .venv/bin/activate
uv pip install "transformers[torch]" --torch-backend=auto

The --torch-backend=auto flag detects your GPU hardware and selects the matching PyTorch CUDA index. Valid values include auto, cpu, cu118, cu126, cu128, cu130, rocm6, and xpu.

Important

--torch-backend only works with uv pip commands. It does not work with uv lock, uv sync, or uv run. For project-level workflows, configure the PyTorch index in pyproject.toml as shown above.

Install extras for quantization and distributed training

Loading large models (7B+ parameters) on consumer GPUs usually requires quantization, because the full-precision weights exceed the memory of most consumer cards. The transformers[torch] extra already installs accelerate, so adding bitsandbytes is the only extra step:

uv add bitsandbytes

With these installed, load a quantized model:

quantized_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Gated model: accept the license on the Hugging Face Hub and authenticate first.
model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Quantize weights to 4-bit at load time (requires bitsandbytes).
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",  # let accelerate place layers on the available devices
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Move inputs to the model's device instead of hard-coding "cuda".
inputs = tokenizer("Explain virtual environments in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
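To see why 4-bit loading matters for a model of this size, a back-of-the-envelope estimate helps. The sketch below counts weight storage only and assumes a round 8 billion parameters; it ignores activations, the KV cache, and quantization metadata, so real usage is higher. The function name weight_memory_gb is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 8e9  # assumed round figure for an "8B" model
    for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
        print(f"{label}: ~{weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions the weights take roughly 16 GB at 16-bit precision and roughly 4 GB at 4-bit.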

Note

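bitsandbytes is built primarily for NVIDIA GPUs on Linux. Support on other platforms (Windows, macOS) depends on the bitsandbytes release, so check its installation documentation before relying on it there.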

accelerate enables multi-GPU and distributed training with Transformers’ Trainer class, even without quantization.

Choose the right extras

Transformers ships several optional dependency groups that pull in libraries for specific use cases:

| Extra | What it adds | When to use it |
| --- | --- | --- |
| transformers[torch] | PyTorch | Most NLP, vision, and generative tasks |
| transformers[vision] | Pillow | Image classification, object detection, image generation |
| transformers[audio] | librosa, soundfile | Speech recognition, audio classification |
| transformers[sentencepiece] | sentencepiece | Multilingual models (mBART, XLM-RoBERTa) |
| transformers[video] | av, decord | Video understanding models |

Extras can be combined: uv add "transformers[torch,vision]" installs both PyTorch and Pillow.
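After adding extras, you can check whether the libraries behind them are importable. The sketch below uses the standard library's importlib.util.find_spec; the mapping from extras to import names is an assumption based on the table above (for example, Pillow is imported as PIL), and missing_modules is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names assumed for the libraries listed in the table above.
EXTRA_MODULES = {
    "torch": ["torch"],
    "vision": ["PIL"],
    "audio": ["librosa", "soundfile"],
    "sentencepiece": ["sentencepiece"],
    "video": ["av", "decord"],
}


def missing_modules(extra, modules=EXTRA_MODULES):
    """Return the import names for an extra that cannot be found."""
    return [name for name in modules[extra] if find_spec(name) is None]


if __name__ == "__main__":
    for extra in EXTRA_MODULES:
        missing = missing_modules(extra)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"transformers[{extra}]: {status}")
```

An extra reported as missing can then be added with uv add, for example uv add "transformers[vision]".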
