
Sampling vs deterministic profilers: which should I use?

Two different designs dominate Python profiling, and they produce two different kinds of lies about your program. A deterministic profiler intercepts every function call, records the entry and exit, and tallies the total. A sampling profiler peeks at the call stack on a timer and counts how often each function shows up. Both give you a list of hot functions, and the two lists will not always agree.

The difference matters because the choice changes what you can trust about the output.

How a Deterministic Profiler Works

Deterministic profilers (also called tracing or instrumenting profilers) register a callback on every Python function entry and exit. The stdlib’s cProfile does this using sys.setprofile. For each call the profiler records the function identity and a timestamp from a single configurable timer (the default is time.perf_counter), so it can report exact call counts, total time, and cumulative time per function.
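A minimal run makes the mechanics concrete (the function names here are invented for illustration):

```python
import cProfile
import io
import pstats

def parse_row(row):
    # Hypothetical hot function: split a CSV line into integers.
    return [int(x) for x in row.split(",")]

def load(rows):
    return [parse_row(r) for r in rows]

profiler = cProfile.Profile()  # timer defaults to time.perf_counter
profiler.enable()
load(["1,2,3"] * 1000)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats("parse_row")  # the ncalls column reports exactly 1000
report = stream.getvalue()
print(report)
```

Because every entry and exit was intercepted, the ncalls column is a fact, not an estimate.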

This design has one strong property: the numbers are exact, modulo measurement overhead. If the profile says parse_row was called 12,847 times, that is the actual call count. Call-count accuracy is how you catch unexpected recursion, repeated work, or a cache that never warms up.

The weakness is the overhead. Every single Python call, including the ones inside hot loops that each take nanoseconds, pays the cost of two profiler callbacks. Short, numerous functions get systematically over-taxed. A program that spends most of its wall time inside cheap calls can easily run 2–5x slower under cProfile, and the per-function cost distribution the profiler reports no longer matches what the program does in the wild.
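The slowdown is easy to demonstrate on a call-heavy toy loop (a sketch; the exact ratio varies by machine and Python version):

```python
import cProfile
import time

def cheap(x):
    # A tiny function: its real work is dwarfed by the profiler callbacks.
    return x + 1

def run(n):
    total = 0
    for _ in range(n):
        total = cheap(total)
    return total

N = 300_000

start = time.perf_counter()
run(N)
plain = time.perf_counter() - start

profiler = cProfile.Profile()
start = time.perf_counter()
profiler.runcall(run, N)
profiled = time.perf_counter() - start

print(f"slowdown under cProfile: {profiled / plain:.1f}x")
```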

How a Sampling Profiler Works

A sampling profiler does not hook anything. It wakes up on a timer (usually 100 times per second), looks at the current Python call stack, and writes it to a buffer. After N seconds the profiler aggregates the samples and reports: “90% of samples had slow_hash on the stack, so slow_hash is where the CPU time goes.”
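The whole mechanism fits in a few lines. This toy in-process sampler (CPython-specific, via the undocumented-but-stable sys._current_frames; slow_hash is an invented workload) wakes on a timer in a background thread and tallies which functions appear on the main thread's stack:

```python
import sys
import threading
import time
from collections import Counter

samples = Counter()
stop = threading.Event()

def sampler(target_thread_id, interval=0.005):
    # Wake up on a timer, snapshot the target thread's stack, tally names.
    while not stop.is_set():
        frame = sys._current_frames().get(target_thread_id)  # CPython-specific
        names = set()
        while frame is not None:
            names.add(frame.f_code.co_name)
            frame = frame.f_back
        for name in names:  # count each function at most once per sample
            samples[name] += 1
        time.sleep(interval)

def slow_hash():
    # Invented CPU-bound workload for the sampler to catch.
    total = 0
    for i in range(200_000):
        total = (total * 31 + i) % 1_000_003
    return total

main_id = threading.get_ident()
t = threading.Thread(target=sampler, args=(main_id,), daemon=True)
t.start()
for _ in range(25):
    slow_hash()
stop.set()
t.join()
print(samples.most_common(3))  # slow_hash dominates the tally
```

Real samplers differ in where they run (py-spy samples from outside the process entirely), but the aggregation step is the same: count stacks, report shares.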

Tools like py-spy, pyinstrument, Scalene, and austin all use this model. py-spy goes one step further and runs in a separate process, reading the Python program’s stacks via OS-level process inspection. That decouples the sampler’s own CPU cost from the profiled program’s event loop and GIL.

The strong property here is statistical: with enough samples, each function’s share of samples converges to its share of time on the call stack. Whether that share represents CPU time or wall-clock time depends on the sampler: py-spy (which skips idle threads unless run with --idle) and Scalene report on-CPU time, while pyinstrument and austin default to wall-clock sampling, which also counts time spent blocked on I/O or await. Either way, the measurement itself is cheap, and the overhead stays flat regardless of how many fast calls the program makes. A py-spy run against a production web worker typically costs a few percent of CPU.

The weakness is exactly call counts. A sampling profiler cannot tell you how many times parse_row was called. It can tell you that the program was inside parse_row for 34% of its samples. For very short functions that the sampler might never catch mid-execution, attribution gets noisy until you collect more samples.

When Each Is Right

A sampling profiler is the right default for most performance work. The reasons stack up:

  • It can attach to a running program without a restart, which matters for production debugging.
  • Its overhead is bounded by the sample rate, not the program’s call frequency, so it does not distort I/O-bound or microservice-heavy workloads.
  • Its output already looks like a flame graph, which is the representation most humans read.
  • It works on multi-process workloads via subprocess tracking.

A deterministic profiler earns its overhead when the questions are different:

  • “How many times does this function get called?” Sampling cannot answer this; cProfile can.
  • “Inside a single, short, reproducible benchmark, which of these two implementations has fewer calls to hash()?” cProfile answers this cleanly; sampling gives a fuzzy answer until you run the benchmark much longer.
  • “Which line of this hot function dominates the cost?” line_profiler attributes time to specific source lines; sampling profilers bottom out at the function level.

For timing a single small function down to microseconds, neither profiler is the right tool. cProfile’s per-call callback adds hundreds of nanoseconds, which swamps anything faster than roughly a microsecond per call; use timeit or pyperf for that.
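For that microbenchmark case, timeit has the right shape: it runs the statement many times and reports the aggregate, so per-call costs below a microsecond are still resolvable (the two snippets below are arbitrary stand-ins):

```python
import timeit

setup = "data = list(range(100))"
n = 50_000

t_comp = timeit.timeit("[x * 2 for x in data]", setup=setup, number=n)
t_map = timeit.timeit("list(map(lambda x: x * 2, data))", setup=setup, number=n)

print(f"list comp: {t_comp / n * 1e6:.2f} µs/call")
print(f"map:       {t_map / n * 1e6:.2f} µs/call")
```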

For Python specifically, a reasonable workflow is: start with a sampling profiler to find the hot region, then if the hot region is a few specific functions, switch to line_profiler (deterministic, line-level) to see exactly where inside them the time goes.

What the Flame Graph Actually Shows

The flame graph a sampling profiler produces is not a call graph. The x-axis is not time. Each bar is a function, stacked vertically above its caller, and the width of the bar is the fraction of samples that saw that function on the stack. Two equally wide bars side by side do not mean “first one ran, then the other”. They mean “each accounted for this share of the total sampled time, in some order”.
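Concretely, a sampler's raw output is a bag of stacks, and the flame graph is just that bag aggregated, for example into the semicolon-joined "folded" format many flame-graph tools consume (the stack names here are made up):

```python
from collections import Counter

# Four raw samples; each tuple is one stack, root (outermost caller) first.
sampled_stacks = [
    ("main", "load", "parse_row"),
    ("main", "load", "parse_row"),
    ("main", "load"),
    ("main", "slow_hash"),
]

# Folded format: one line per distinct stack, plus its sample count.
folded = Counter(";".join(stack) for stack in sampled_stacks)
for line, count in folded.most_common():
    print(line, count)

# A bar's width is count / total samples: a share of sampled time,
# not a position on a timeline.
total = sum(folded.values())
print(f"load's share of samples: {folded['main;load;parse_row'] + folded['main;load']}/{total}")
```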

That ordering difference is why a sampling profile is genuinely cheaper to collect but harder to reason about for “why was this specific request slow” questions. A flame graph tells you where time goes; a distributed trace tells you when. The two tools answer different questions and do not substitute for each other.

Default to Sampling; Reach for Tracing Narrowly

Use a sampling profiler by default. For most Python work the question is “where does this program spend its time”, and a sampling profiler answers it with bounded overhead, production safety, and flame-graph output. py-spy is the handbook’s default choice for CPU-bound work because it runs out of process and does not require code changes or a restart.

Reach for cProfile or line_profiler when the question is specifically about call counts or fine-grained per-line attribution inside a known-short benchmark. Reach for a memory profiler like memray when the problem is allocations or leaks rather than CPU time. These are complementary tools, not substitutes.
