Why is Python Slow?
A Python program that does the same arithmetic as a C or Rust program runs slower for reasons baked into the language, not just into the implementation. Every attribute access and every integer addition involves work that a static compiler can erase at build time but that a Python interpreter has to do at runtime. The dynamism that forces this work is also what makes Python pleasant to use, which is why faster Python is hard rather than impossible.
This page explains the structural reasons. It is not a profiling guide. If a specific program is slow and the question is “what do I do about it,” that is a different problem with different tools.
Counting what Python does for one expression
Consider p.x * 2, where p is a small object holding a number. A C or Rust compiler turns this into roughly three CPU instructions: load x from a fixed offset in the struct, multiply by two, write the result somewhere. The struct’s layout is known at compile time, the integer type is known, and the multiply is a single machine instruction.
In Python the same expression triggers a sequence of dynamic operations. The interpreter has to find the type of p, walk the type’s MRO looking for a data descriptor for x, fall back to p.__dict__ if there isn’t one, fall back again to a non-data descriptor on the type if the dict misses, find the resulting Python integer object, unbox its underlying machine value, multiply, and box the product back into a Python integer object. Each step exists because Python promises behavior a static compiler cannot predict: p could be any type, and x could be a property, a slot, or a fallback computed by __getattr__.
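The lookup order described above can be sketched in pure Python. This is a simplified model of the default attribute lookup, not CPython's actual C implementation. `generic_getattr` and `_find_in_mro` are hypothetical helper names, and the sketch ignores `__getattr__` fallbacks, custom `__getattribute__` overrides, and metaclass lookups.

```python
_MISSING = object()


def _find_in_mro(cls, name):
    """Return the raw class-level attribute without invoking descriptors."""
    for base in cls.__mro__:
        if name in vars(base):
            return vars(base)[name]
    return _MISSING


def generic_getattr(obj, name):
    """Simplified model of the default lookup for `obj.name`."""
    cls = type(obj)
    cls_attr = _find_in_mro(cls, name)
    attr_type = type(cls_attr)
    getter = None
    if cls_attr is not _MISSING:
        getter = getattr(attr_type, "__get__", None)

    # 1. A data descriptor on the type (for example a property) wins.
    is_data = hasattr(attr_type, "__set__") or hasattr(attr_type, "__delete__")
    if getter is not None and is_data:
        return getter(cls_attr, obj, cls)

    # 2. Otherwise the instance dictionary is consulted.
    inst_dict = getattr(obj, "__dict__", None)
    if inst_dict is not None and name in inst_dict:
        return inst_dict[name]

    # 3. Then a non-data descriptor (a function binds into a method here).
    if getter is not None:
        return getter(cls_attr, obj, cls)

    # 4. Then a plain class attribute.
    if cls_attr is not _MISSING:
        return cls_attr

    raise AttributeError(name)


class Point:
    def __init__(self, x):
        self.x = x

    @property
    def doubled(self):
        return self.x * 2


p = Point(21)
print(generic_getattr(p, "x"))        # found in p.__dict__
print(generic_getattr(p, "doubled"))  # found via the property on the type
```

Every branch in this sketch is a question the interpreter has to answer at runtime, whereas a C compiler answers the equivalent question once, at build time.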
Python 3.11 added a specializing adaptive interpreter that watches the types flowing through hot bytecode and swaps generic operations for type-specific ones, which is why CPython has gotten 10-25% faster across recent releases without any JIT. But specialization only removes the work that would have been redundant. It cannot remove the work the language semantics actually require, because someone, somewhere, is using a class that overrides __getattribute__.
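One way to look at specialization, assuming CPython 3.11 or newer, is the `adaptive` flag of the standard `dis` module, which reports bytecode after the interpreter has had a chance to rewrite it. `scale` is a hypothetical function used only for illustration. The specialized opcode names differ between releases, so no expected output is shown here.

```python
import dis
import sys


def scale(x):
    return x * 2


# Warm the function up so the interpreter has a chance to specialize it.
for _ in range(10_000):
    scale(3)

# `adaptive=True` exists on CPython 3.11+; older versions only expose
# the generic bytecode, so the flag is left out there.
kwargs = {"adaptive": True} if sys.version_info >= (3, 11) else {}
opnames = [ins.opname for ins in dis.get_instructions(scale, **kwargs)]
print(opnames)
```

On a recent CPython the multiply typically shows up under a more specific opcode name than the generic one, which is the swap described above.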
Pointer chasing destroys memory locality
The slow part of one expression is multiplied by another structural cost: Python objects are scattered across memory. In CPython, almost all Python objects live on the heap, and even integers and floats are objects with headers rather than raw machine words. Every reference between ordinary Python objects is a pointer. A Rect containing two Point objects, each with x and y floats, is a chain of pointers that have to be followed one at a time to reach the actual numbers.
Modern CPUs are fast when the data they need is already in cache. An L1 cache hit returns in a few cycles, an L3 hit in a few dozen, and a fetch from main memory takes hundreds of cycles. A pointer dereference that misses the cache can cost the CPU more than every other instruction in the expression combined. CPython’s “everything is an object, every reference is a pointer” model means that a simple loop over a thousand Point objects might issue thousands of cache-miss-prone loads, where the equivalent C struct array would stream through one contiguous block of memory (about 16 KB, assuming two 8-byte doubles per point) that the hardware prefetcher handles well.
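The difference between boxed objects and packed values can be seen from inside Python with the standard library alone. This is a rough illustration and not a benchmark; the exact object size depends on the CPython build, so it is printed instead of assumed.

```python
import sys
from array import array

values = [float(i) for i in range(1000)]

# A list stores pointers; each float is a separate heap object with a header.
boxed_float_size = sys.getsizeof(values[0])

# array("d") stores raw 8-byte doubles back to back, like a C double[].
packed = array("d", values)

print("bytes per boxed float object:", boxed_float_size)
print("bytes per packed double:     ", packed.itemsize)
```

The list version also pays for one pointer per element on top of each boxed object, and nothing guarantees that those objects sit next to each other in memory.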
A compiler in principle could detect that a Rect always contains exactly two Point objects with two floats each, lay them out contiguously, and skip the pointer chasing. But that optimization would change observable behavior unless the implementation could prove the optimized path preserves object identity and mutation semantics: code that compares id(r.a) across two reads, or that mutates r.a in place, must still behave like Python.
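A minimal sketch of the behavior such an optimization would have to preserve. The `Point` and `Rect` classes here are hypothetical stand-ins for the ones described in the text.

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


class Rect:
    def __init__(self, a, b):
        self.a, self.b = a, b


r = Rect(Point(0.0, 0.0), Point(3.0, 4.0))

corner = r.a                 # an alias to the same Point object, not a copy
same_object = corner is r.a  # identity must be stable across reads
corner.x = 10.0              # mutation through the alias...
seen_through_rect = r.a.x    # ...must be visible through the Rect

print(same_object, seen_through_rect)
```

If the floats were stored inline inside the Rect, the implementation would still need to hand out something that behaves like a shared, identifiable Point whenever `r.a` escapes.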
Dynamic features carry a cost even when they are not used
The dynamism is not theoretical. In Python a class can grow new methods at runtime, an instance can have its __class__ reassigned, modules can be reloaded, and an import statement inside a function is re-executed on every call, consulting sys.modules each time. Most production code never uses these features. But the interpreter and any compiler targeting Python have to assume they might be used. A function that does obj.method() cannot be compiled to a direct call, because the function’s caller might have just rebound obj.method to something else. A loop reading obj.x cannot cache the value, because reading the attribute could have side effects.
Language designers face a trilemma here. Of three properties (dynamic, fast, and simple to implement), a language gets to pick two. Python chose dynamic and simple, which is part of why CPython is small enough that one person can read it and learn how it works. The cost is that pure-Python performance has a ceiling no incremental engineering can break through. Removing the cost requires either accepting a much more complex implementation (the JIT path) or removing some of the dynamism (the variant-language path).
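The features listed above are all ordinary, legal Python. The sketch below uses hypothetical class names (`Greeter`, `Shouter`, `Counter`) to show a method being rebound at runtime, an instance changing class, and an attribute read with a side effect.

```python
class Greeter:
    def greet(self):
        return "hello"


class Shouter:
    def greet(self):
        return "HELLO"


g = Greeter()
before = g.greet()

# Rebind the method on the class at runtime; existing instances see the change.
Greeter.greet = lambda self: "hi"
after_patch = g.greet()

# Reassign the instance's class; the same object now dispatches elsewhere.
g.__class__ = Shouter
after_reclass = g.greet()


class Counter:
    reads = 0

    @property
    def x(self):
        # Reading the attribute has a side effect, so a compiler cannot
        # silently replace two reads with one cached value.
        Counter.reads += 1
        return 1


c = Counter()
c.x
c.x

print(before, after_patch, after_reclass, Counter.reads)
```

None of these lines changes the call site `g.greet()` or the expression `c.x`, which is why a compiler looking only at the call site cannot know what it will do.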
This trilemma is not a complaint. The dynamism is what enables pytest collecting tests through introspection and SQLAlchemy mapping classes to database tables with metaclasses, and those are the patterns people show up for.
Type hints do not make Python faster
A reasonable question: if every variable is annotated, can a compiler use the annotations to generate fast code? For standard Python the answer is no, and the reason is that type annotations are not enforced at runtime. The interpreter records them but never checks values against them.
Consider a class with annotated fields:
class Point:
    x: int
    y: int

p = Point()
print(p.x)

This program type-checks cleanly under mypy. It also raises AttributeError at runtime, because the annotations declare types but never assign values. A type checker can flag some of these cases and miss others, and either way the runtime makes no guarantees. A compiler that trusted the annotation x: int and emitted code assuming p.x is a machine integer would be unsound, because nothing in the language stops Point.__init__ from setting self.x = "hello" or another piece of code from doing p.x = None later.
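The unsoundness is easy to reproduce. In the sketch below, a variant of the class above gets an `__init__` that contradicts its own annotation. A type checker would be expected to flag the assignment, but the runtime accepts it without complaint.

```python
class Point:
    x: int
    y: int

    def __init__(self):
        self.x = "hello"  # contradicts the annotation; the runtime does not object


p = Point()
p.y = None  # also accepted at runtime

# The annotations are stored as metadata on the class, nothing more.
print(Point.__annotations__)
print(type(p.x).__name__, type(p.y).__name__)
```

The annotations survive only as a dictionary on the class, and the values actually stored in the instance are a `str` and `None`.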
A few projects bridge this gap by treating annotations as a contract the user opts into. mypyc compiles type-annotated Python to C extensions, with the trade-off that not all dynamic features are supported and code that mypy would accept might still fail to compile. Cython does the same with its own type syntax. Both projects ship working compilers, and both are subset compilers rather than full-Python compilers, which is exactly the choice the trilemma forces.
JIT compilers solve the dynamism problem when nothing surprising happens
PyPy demonstrated, over more than a decade, that a JIT can recover much of the performance Python’s semantics throw away. PyPy reports roughly a 3x average speedup over CPython 3.11 across its benchmark suite, with larger gains on some long-running, loop-heavy pure-Python workloads and smaller gains or slowdowns when the JIT cannot help.
The technique works by speculation. The JIT watches a hot loop, observes that an instance’s class has not changed and that the same operator keeps dispatching to the same C handler, and emits machine code that assumes those conditions hold. It also inserts a small number of guards that check the assumptions on every iteration. If a guard fires because someone monkey-patched a method or a subclass slipped in, the JIT bails out and falls back to interpreting the bytecode.
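The guard-and-fallback pattern can be imitated in plain Python. This is a toy model of speculation, not how PyPy or any real JIT is implemented: real JITs emit machine code and guard on internal type versions. `generic_double`, `make_specialized`, and `stats` are hypothetical names, and the fast path assumes the class defines no descriptor named `x`.

```python
stats = {"fast": 0, "fallback": 0}


def generic_double(obj):
    """Slow path: full dynamic attribute lookup and dynamic multiply."""
    stats["fallback"] += 1
    return obj.x * 2


def make_specialized(expected_cls):
    """Build a fast path that is only valid while its guards hold."""

    def specialized(obj):
        # Guard 1: the object is exactly an instance of the class seen so far.
        if type(obj) is expected_cls:
            x = obj.__dict__.get("x")
            # Guard 2: the attribute is a float, as observed during warmup.
            if type(x) is float:
                stats["fast"] += 1
                return x * 2.0
        # A guard failed: abandon the speculation and use the generic path.
        return generic_double(obj)

    return specialized


class Point:
    def __init__(self, x):
        self.x = x


class Sub(Point):
    pass


fast_double = make_specialized(Point)
results = [
    fast_double(Point(1.5)),  # both guards hold: fast path
    fast_double(Sub(1.5)),    # a subclass slipped in: fallback
    fast_double(Point(2)),    # an int instead of a float: fallback
]
print(results, stats)
```

The answers are the same on both paths. Only the amount of work differs, which is the property a real JIT has to maintain when a guard fails.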
The CPython team is now doing something similar, with a different mechanism. Python 3.13 shipped an experimental copy-and-patch JIT that stitches pre-built machine-code stencils together at runtime instead of running a full optimizing compiler. The 3.15 alpha measures 5-6% faster than the standard interpreter on x86_64 Linux and 11-12% faster on macOS AArch64 across the pyperformance benchmark suite, with the team’s roadmap targeting 10% by 3.16.
JITs work, with caveats. A program that looks slightly different in a way the JIT did not anticipate can suddenly run several times slower, and the JIT typically cannot tell you why. Short-lived programs and CLIs also pay a warmup cost: they finish before the JIT has paid for itself. Long-running services with stable hot loops are where JITs earn their keep, and other workloads see more uneven results.
The C extension boundary is a wall the JIT cannot see through
Python’s other performance escape hatch is the C API, which lets NumPy, PyTorch, pandas, and most of the scientific stack do their actual computation in C, C++, or Rust. The trade-off is that the C API is opaque to any JIT. When PyPy compiles a hot loop and the loop calls into a C extension, the JIT has to assume the extension can do anything: rebind globals, call back into Python, allocate, deallocate, mutate state. The optimizer’s invariants stop at the function call.
The practical effect is that PyPy is fastest in pure Python and pays a marshalling cost at every C-extension call. Cross the boundary often enough and that cost erases most of the JIT’s gains, because PyPy’s internal object representation does not match CPython’s PyObject * shape and has to be translated on each crossing. This is the main reason PyPy lost ground as the ecosystem moved toward NumPy-shaped scientific computing: the more time a program spends in C extensions, the less the JIT can do. The HPy project is an attempt to design a faster, JIT-friendlier replacement for the C API, but adoption is slow because every existing extension is written against the old one.
Subset compilers buy speed by giving up Python features
The other response to the trilemma is to compile a smaller, statically analyzable language that happens to look like Python. Cython is the most widely deployed example: NumPy, SciPy, scikit-learn, and pandas all rely on it for hot inner loops. A .pyx file with cdef int x declarations compiles to a C extension that runs at C speed, at the cost of a separate build step and a dialect that is no longer quite Python. Numba is similar in spirit but JIT-compiles annotated Python functions on first call, specifically for numerical code that operates on NumPy arrays. mypyc takes another angle, using mypy’s existing type annotations to compile to C without a separate language.
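As a rough illustration of the mypyc angle, the function below is ordinary annotated Python of the kind a subset compiler can handle: fixed argument types, a simple loop, no metaprogramming. `dot` is a hypothetical example, and no speedup figure is implied; it also runs unchanged under the regular interpreter.

```python
def dot(xs: list[float], ys: list[float]) -> float:
    """Dot product written in a statically analyzable style."""
    total: float = 0.0
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

The same source serves as both the interpreted development version and the input to the compiler, which is the main difference from Cython's separate .pyx dialect.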
These tools work, and most of Python’s scientific and data ecosystem is built on them. The cost is that the code you write for them is a subset of Python: dynamic features, late binding, and most of the metaprogramming patterns are off the table. The further you push for performance, the more your code looks like Java or C with Python syntax.
This is the gap that experimental projects like SPy are trying to close. SPy is a statically compiled variant of Python that ships an interpreter for the development experience and a compiler for production, with semantics designed from the start to be analyzable. Its author, Antonio Cuni (a long-time PyPy core developer), argues that most working Python programmers already write in an informal subset of the language, and that a formal, fast subset could give them most of what dynamism buys without paying for the parts they never use. SPy is research-stage. Whether the bet pays off depends on whether the language stays Pythonic enough to feel like Python instead of a different language with similar punctuation.
Use the escape hatches the ecosystem already built
Python is slow on purpose, in the sense that the slowness is the price of features the language was designed around. The ecosystem has already arranged itself around this fact. Pure Python handles code where readability and iteration speed matter more than throughput. NumPy and similar libraries handle numeric kernels. Cython, Numba, and mypyc handle inner loops that pure NumPy cannot express. The C API handles libraries that need to be written in another language entirely.
For most application code this is the right trade. A Django request handler or a data-cleaning script spends almost all of its time in database drivers and NumPy-style libraries, almost none of which are written in Python. Making the surrounding Python glue ten times faster would not make the program meaningfully faster. The cases where pure-Python speed actually matters are narrower than they look, and most of them are already served by an existing escape hatch.
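The arithmetic behind this claim is Amdahl's law. The sketch below assumes a hypothetical profile in which 5% of wall time is spent in Python glue; the figure is an illustration, not a measurement.

```python
def overall_speedup(fraction_in_python: float, python_speedup: float) -> float:
    """Amdahl's law: only the Python fraction of the runtime gets faster."""
    remaining = (1.0 - fraction_in_python) + fraction_in_python / python_speedup
    return 1.0 / remaining


# Hypothetical profile: 5% of wall time in Python glue,
# the rest in database drivers and compiled libraries.
speedup = overall_speedup(fraction_in_python=0.05, python_speedup=10.0)
print(f"{speedup:.3f}x")
```

Under that assumption, a tenfold faster interpreter makes the whole program less than 5% faster.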
For the cases where pure-Python speed does matter and no escape hatch fits, the realistic options today are PyPy (if the program lives mostly in pure Python) or one of the subset compilers (if a hot loop can be isolated and rewritten with type declarations). A future in which SPy or something like it provides a fast, Pythonic, statically compiled language is plausible but not yet ready for production code.
Python is slow because of its semantics, and the ecosystem has spent thirty years choosing between a more complex implementation (PyPy’s path) and a smaller language (Cython’s path). Whether a third option exists, a language that stays Pythonic while compiling to fast code, is still an open research question.
Learn More
- Python performance myths and fairy tales is LWN’s writeup of Antonio Cuni’s EuroPython 2025 talk on these topics
- Inside SPy, part 1: Motivations and Goals lays out the trilemma and the subset-vs-variant taxonomy in detail
- What is CPython’s JIT compiler? covers the copy-and-patch JIT that ships in Python 3.13+
- What is the GIL? explains the parallelism constraint, a separate axis from single-threaded speed
- What is Python? summarizes CPython and the alternative implementations referenced above
- PyPy is the long-running tracing JIT for full Python
- Cython, Numba, and mypyc are the compilers used most heavily in the scientific Python stack
- SPy is the statically compiled Pythonic variant under active research