
Your Python Wheels Still Target 2009 CPUs

April 15, 2026 · Tim Hopper

Intel shipped AVX2 in 2013. AMD reached it with Excavator-era parts around 2015. Eleven years later, the default NumPy wheel pip downloads on an x86_64 Linux box is still compiled to run on processors from roughly 2003, the year AMD launched the Opteron and defined the AMD64 baseline. Every SIMD instruction that has shipped since is off-limits to the compiler that produced that wheel.

The reason is structural. A wheel filename encodes three pieces of compatibility metadata: Python version, ABI, and platform. “Platform” for mainstream Linux wheels is typically manylinux_2_17_x86_64, which says “works on any glibc 2.17+ x86-64 distro” and nothing about which CPU instruction sets the machine supports. When a project publishes one binary for all of those users, it has to target the lowest common denominator.
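You can see everything the format knows about compatibility by pulling a wheel filename apart. The sketch below handles only the simple five-field case (real filenames may carry an optional build tag and compound platform tags like `manylinux_2_17_x86_64.manylinux2014_x86_64`), and the example filename is illustrative:

```python
def parse_wheel_filename(filename: str) -> dict:
    """Split name-version-pythontag-abitag-platformtag.whl into its fields."""
    stem = filename.removesuffix(".whl")
    name, version, py_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": py_tag,          # e.g. cp312 = CPython 3.12
        "abi": abi_tag,            # which C ABI the extension was built against
        "platform": platform_tag,  # OS + libc floor + architecture, nothing more
    }

meta = parse_wheel_filename("numpy-2.2.0-cp312-cp312-manylinux_2_17_x86_64.whl")
print(meta["platform"])  # manylinux_2_17_x86_64 -- no field for CPU features
```

The platform tag is the only slot that describes the machine, and it bottoms out at "glibc 2.17 on x86-64." There is nowhere to write "requires AVX2."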

Why NumPy feels fast and SciPy does not

NumPy looks like a counterexample, and for good reason. The project has engineered its way past the wheel format. It compiles its SIMD-heavy source multiple times for different CPU families (Haswell, Skylake, and so on), bundles every build into a single extension module, and runs a CPU-feature check at import time to dispatch to the right variant. Sustained engineering contributions from Intel and ARM keep the dispatcher current on each side.

SciPy, scikit-learn, pandas, and Pillow have not followed. On Talk Python #544, Ralf Gommers (co-CEO of Quansight and a NumPy-SciPy maintainer) singled out SciPy as the canonical case: AVX2 and ARM NEON code already lives in the SciPy source tree, but the project does not build or ship SIMD wheels because there is no clean way to do it. The fat-binary dispatcher that makes NumPy work is a specialist project that takes perpetual maintenance, and most scientific libraries cannot take on that cost.

The cost of being a museum

Gommers put the performance ceiling bluntly later in the same episode:

The difference between 2009 hardware features and 2019 or 2023 ones could be a factor of 10x, 20x in performance.

That is the gap a default x86_64 wheel leaves on the table on modern hardware, on any workload where SIMD matters. Scientific computing is the obvious case. So is anything that spends real time in NumPy, in image processing, in cryptographic primitives, or in JSON and compression libraries. It is also the quietest kind of regression: nothing is broken, the benchmark just never gets to run.

The same problem extends past CPU microarchitecture. PyTorch ships separate 900 MB wheels for CUDA 12.6, 12.8, 13.0, ROCm, and CPU-only, hosted on custom index URLs because the wheel format has no way to declare “needs CUDA 12.8.” RAPIDS encodes the CUDA version in the package name (cudf-cu12, cudf-cu13) to work around the same gap. Every CUDA-dependent project has invented its own distribution scheme.

Two draft PEPs move the selection into the installer

PEP 817 and PEP 825, both in Draft status as of April 2026, propose an extension called wheel variants. A package ships several builds under one name, each tagged with structured properties: x86_64 :: level :: v3, nvidia :: cuda_version_lower_bound :: 12.8. At install time, a provider plugin for each namespace queries the host machine and reports what it supports. The resolver scores each candidate wheel against those reports and installs the best match.
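The selection logic can be sketched roughly as follows. The property keys match the examples above, but the scoring rule and data shapes here are illustrative, not the PEPs' exact specification:

```python
# Candidate wheels for one package, each tagged with variant properties.
candidates = [
    {"file": "numpy-x86_64_v3.whl", "props": {("x86_64", "level"): "v3"}},
    {"file": "numpy-x86_64_v2.whl", "props": {("x86_64", "level"): "v2"}},
    {"file": "numpy-baseline.whl",  "props": {}},  # plain wheel, always valid
]

# What a (hypothetical) x86_64 provider plugin reports for this host,
# ordered best-first: the machine supports v3, v2, and v1 builds.
supported = {("x86_64", "level"): ["v3", "v2", "v1"]}

def score(candidate):
    """Lower is better; None means the host cannot run this wheel."""
    total = 0
    for key, value in candidate["props"].items():
        if key not in supported or value not in supported[key]:
            return None  # property unsatisfied: reject this candidate
        total += supported[key].index(value)
    # An untagged baseline wheel is always valid but ranks last.
    return total if candidate["props"] else len(supported) * 100

viable = [c for c in candidates if score(c) is not None]
best = min(viable, key=score)
print(best["file"])  # numpy-x86_64_v3.whl
```

On an older machine whose provider reports only `["v1"]`, both tagged candidates score `None` and the resolver falls back to the baseline wheel, which is exactly the compatibility story the current single-wheel world provides.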

The upshot for a typical user: pip install numpy on a recent laptop installs an AVX2 build automatically. Users on older machines still get a working baseline wheel. pip install torch or uv add torch picks the right CUDA version with no index URL and no package-name suffix.

The design has a cost. Provider plugins execute at resolve time, which is in tension with reproducible installs driven by a lockfile. PEP 825 handles that by adding a [packages.variant_json] field to the pylock.toml format, so a lockfile can record the selected variant alongside the pinned version and hashes. A reinstall on a different machine then deterministically fetches the same wheel.

What this actually unblocks

The near-term win is that SciPy, scikit-learn, pandas, and Pillow can ship the SIMD builds their source already supports without building a NumPy-grade dispatcher. Baseline performance of the scientific Python stack on a 2023 laptop stops being bottlenecked by a compiler setting from the George W. Bush administration.

The medium-term win is that the index URL puzzle for CUDA, ROCm, and the rest of the GPU ecosystem goes away. The package index stops being a second configuration dimension. The install command stops being platform-specific.

Astral published a variant-enabled build of uv on August 13, 2025. Forks of pip, Warehouse (the software behind PyPI), setuptools, scikit-build-core, and the packaging library exist to demo the full flow end to end. The PEPs are drafts, which means they can still change substantially or be withdrawn, and full ecosystem adoption extends past 2026. The interesting part is that the work to ship a working prototype across the whole toolchain has already happened.
