
gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only) #146307

Open
eendebakpt wants to merge 3 commits into python:main from eendebakpt:jit_float_inplace_rebased

Conversation

eendebakpt (Contributor) commented on Mar 22, 2026:

We can add the following tier 2 micro-ops that mutate the uniquely-referenced operand:

  • _BINARY_OP_ADD_FLOAT_INPLACE / _INPLACE_RIGHT — unique LHS / RHS
  • _BINARY_OP_SUBTRACT_FLOAT_INPLACE / _INPLACE_RIGHT — unique LHS / RHS
  • _BINARY_OP_MULTIPLY_FLOAT_INPLACE / _INPLACE_RIGHT — unique LHS / RHS
  • _UNARY_NEGATIVE_FLOAT_INPLACE — unique operand

The _RIGHT variants handle commutative ops (add, multiply) plus subtract when only the RHS is unique. The optimizer emits these in optimizer_bytecodes.c when PyJitRef_IsUnique(left) or PyJitRef_IsUnique(right) is true and the operand is a known float. The mutated operand is marked as borrowed so the following _POP_TOP becomes _POP_TOP_NOP.

Micro-benchmarks:

| Expression | main | optimized | Speedup |
|---|---|---|---|
| `total += a*b + c` | 24.0 ns/iter | 11.5 ns/iter | 2.1x |
| `total += a + b` | 16.9 ns/iter | 11.0 ns/iter | 1.5x |
| `total += a*b + c*d` | 28.5 ns/iter | 18.3 ns/iter | 1.6x |

pyperformance nbody (20k iterations):

| Benchmark | main | optimized | Speedup |
|---|---|---|---|
| nbody | 60.6 ms | 49.0 ms | 1.19x (19% faster) |

Follow-up

Some operations that could be added in follow-up PRs: float division, mixed float/int operations where the float is uniquely referenced, integer operations (more involved because of the small-int cache and the variable number of digits), and operations on complex numbers.
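The integer case is harder for the reason noted above. A short illustration of the two obstacles, using observable (CPython-specific) behavior; the exact cached range and object sizes are implementation details:

```python
import sys

# Small ints (CPython caches roughly -5..256) are shared singletons, so
# an "intermediate" in that range is never uniquely referenced and must
# never be mutated in place.
x = 100 + 156
y = 256
assert x is y  # both names refer to the single cached 256 object

# Int storage also grows with magnitude: a larger value needs more
# internal digits, so in-place reuse would have to check that the
# operand's allocation is big enough for the result.
print(sys.getsizeof(1 << 70), ">", sys.getsizeof(1))
```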

Script
"""Demo script for the inplace float mutation optimization in the tier 2 optimizer.

./configure --enable-experimental-jit=interpreter --with-pydebug
./configure --enable-experimental-jit=yes --with-pydebug

Usage:
    ./python jit_float_demo.py           # filtered trace (float-related ops)
    ./python jit_float_demo.py --all     # full trace (all ops)
"""

import sys
import timeit

SHOW_ALL = "--all" in sys.argv

# --- System info ---
print("=" * 60)
print("CPython JIT / Tier 2 Demo")
print("=" * 60)
print(f"Python version:  {sys.version}")
print(f"Debug build:     {hasattr(sys, 'gettotalrefcount')}")
print(f"Free-threaded:   {not sys._is_gil_enabled()}")
jit_mod = getattr(sys, "_jit", None)
if jit_mod is not None:
    print(f"JIT available:   {jit_mod.is_available()}")
    print(f"JIT enabled:     {jit_mod.is_enabled()}")
else:
    print("JIT:             not compiled in")

tier2 = False
try:
    from _testinternalcapi import TIER2_THRESHOLD
    tier2 = True
    print(f"Tier 2:          enabled (threshold={TIER2_THRESHOLD})")
except (ImportError, AttributeError):
    print("Tier 2:          disabled (build without _Py_TIER2)")

print()

# --- Example functions ---

def f_adds(n, a, b, c):
    """a*b + c per iteration — the multiply result is unique and reused."""
    total = 0.0
    for i in range(n):
        total = a + b + c
    return total

def f_chain(n, a, b, c):
    """a*b + c per iteration — the multiply result is unique and reused."""
    total = 0.0
    for i in range(n):
        total += a * b + c
    return total


def f_simple_add(n, a, b):
    """a + b per iteration — no unique intermediate."""
    total = 0.0
    for i in range(n):
        total += a + b
    return total


def f_long_chain(n, a, b, c, d):
    """a*b + c*d per iteration — two unique intermediates."""
    total = 0.0
    for i in range(n):
        total += a * b + c * d
    return total

def f_negate(n, a, b):
    """a*b + c*d per iteration — two unique intermediates."""
    total = 0.0
    for i in range(n):
        total = - (a + b)
    return total


# --- Warm up to trigger tier 2 ---
LOOP = 10_000
f_adds(LOOP, 2.0, 3.0, 4.0)
f_chain(LOOP, 2.0, 3.0, 4.0)
f_simple_add(LOOP, 2.0, 3.0)
f_long_chain(LOOP, 2.0, 3.0, 4.0, 5.0)
f_negate(LOOP, 2.0, 3.0)


# --- Op annotation ---

def annotate_op(name, oparg, func):
    """Return a human-readable annotation for a uop."""
    varnames = func.__code__.co_varnames
    consts = func.__code__.co_consts

    # _LOAD_FAST_BORROW_3 → local index 3
    for prefix in ("_LOAD_FAST_BORROW_", "_LOAD_FAST_", "_SWAP_FAST_"):
        if name.startswith(prefix):
            idx = int(name[len(prefix):])
            local = varnames[idx] if idx < len(varnames) else f"local{idx}"
            return local

    # _LOAD_CONST_INLINE_BORROW etc — operand is a pointer, not useful
    # but if oparg is a small index into consts, show it
    if "LOAD_CONST" in name and oparg < len(consts):
        return repr(consts[oparg])

    # Binary ops
    if "MULTIPLY" in name:
        return "*"
    if "SUBTRACT" in name:
        return "-"
    if "ADD" in name and "UNICODE" not in name:
        return "+"

    # Guards
    if name == "_GUARD_TOS_FLOAT":
        return "top is float?"
    if name == "_GUARD_NOS_FLOAT":
        return "2nd is float?"
    if name == "_GUARD_TOS_INT":
        return "top is int?"
    if name == "_GUARD_NOS_INT":
        return "2nd is int?"
    if "NOT_EXHAUSTED" in name:
        return "iter not done?"

    # Pop / cleanup
    if name == "_POP_TOP_NOP":
        return "skip (borrowed/null)"
    if name == "_POP_TOP_FLOAT":
        return "decref float"
    if name == "_POP_TOP_INT":
        return "decref int"

    # Control flow
    if name == "_JUMP_TO_TOP":
        return "loop"
    if name == "_EXIT_TRACE":
        return "exit"
    if name == "_DEOPT":
        return "deoptimize"
    if name == "_ERROR_POP_N":
        return "error handler"
    if name == "_START_EXECUTOR":
        return "trace entry"
    if name == "_MAKE_WARM":
        return "warmup counter"

    return ""


# --- Show traces ---
FILTER_KEYWORDS = (
    "FLOAT", "INPLACE", "BINARY_OP", "NOP",
    "LOAD_FAST", "LOAD_CONST", "GUARD",
)

has_get_executor = False
if tier2:
    try:
        from _opcode import get_executor
        has_get_executor = True
    except ImportError:
        pass

if has_get_executor:
    mode = "all ops" if SHOW_ALL else "float-related ops only"
    print("-" * 60)
    print(f"Tier 2 traces ({mode})")
    print("-" * 60)

    for label, func in [
        ("f_adds: total = a + b + c", f_adds),
        ("f_chain: total += a * b + c", f_chain),
        ("f_simple_add: total += a + b", f_simple_add),
        ("f_long_chain: total += a * b + c * d", f_long_chain),
        ("f_negate: total = - (a + b)", f_negate),
    ]:
        code = func.__code__
        found = False
        for i in range(len(code.co_code) // 2):
            try:
                ex = get_executor(code, i * 2)
            except (ValueError, TypeError, RuntimeError):
                continue
            if ex is None:
                continue

            print(f"\n  {label}")
            for j, op in enumerate(ex):
                name, oparg = op[0], op[1]

                if not SHOW_ALL:
                    if not any(k in name for k in FILTER_KEYWORDS):
                        continue

                annotation = annotate_op(name, oparg, func)
                marker = " <<<" if "INPLACE" in name else ""

                if annotation:
                    print(f"    {j:3d}: {name:45s} # {annotation}{marker}")
                else:
                    print(f"    {j:3d}: {name}{marker}")
            found = True
            break

        if not found:
            print(f"\n  {label}: (no executor found)")

    print()
else:
    print("-" * 60)
    print("Tier 2 traces: skipped (tier 2 not available)")
    print("-" * 60)
    print()

# --- Benchmark ---
print("-" * 60)
print("Benchmark")
print("-" * 60)

N = 2_000_000
INNER = 1000

benchmarks = [
    ("total = a + b + c", lambda: f_adds(INNER, 2.0, 3.0, 4.0)),
    ("total += a*b + c  ", lambda: f_chain(INNER, 2.0, 3.0, 4.0)),
    ("total += a + b    ", lambda: f_simple_add(INNER, 2.0, 3.0)),
    ("total += a*b + c*d", lambda: f_long_chain(INNER, 2.0, 3.0, 4.0, 5.0)),
    ("total = - (a + b)    ", lambda: f_negate(INNER, 2.0, 3.0)),
]

for label, fn in benchmarks:
    iters = N // INNER
    t = timeit.timeit(fn, number=iters)
    ns_per = t / N * 1e9
    print(f"  {label}:  {t:.3f}s  ({ns_per:.0f} ns/iter)")

print()
print("The 'a*b + c' case benefits from _BINARY_OP_ADD_FLOAT_INPLACE:")
print("the result of a*b is uniquely referenced, so the addition")
print("mutates it in place instead of allocating a new float.")

# --- N-body benchmark from pyperformance ---
print()
print("-" * 60)
print("N-body benchmark (from pyperformance)")
print("-" * 60)

PI = 3.14159265358979323
SOLAR_MASS = 4 * PI * PI
DAYS_PER_YEAR = 365.24

def _nbody_make_system():
    bodies = [
        # sun
        ([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], SOLAR_MASS),
        # jupiter
        ([4.84143144246472090e+00, -1.16032004402742839e+00, -1.03622044471123109e-01],
         [1.66007664274403694e-03*DAYS_PER_YEAR, 7.69901118419740425e-03*DAYS_PER_YEAR, -6.90460016972063023e-05*DAYS_PER_YEAR],
         9.54791938424326609e-04*SOLAR_MASS),
        # saturn
        ([8.34336671824457987e+00, 4.12479856412430479e+00, -4.03523417114321381e-01],
         [-2.76742510726862411e-03*DAYS_PER_YEAR, 4.99852801234917238e-03*DAYS_PER_YEAR, 2.30417297573763929e-05*DAYS_PER_YEAR],
         2.85885980666130812e-04*SOLAR_MASS),
        # uranus
        ([1.28943695621391310e+01, -1.51111514016986312e+01, -2.23307578892655734e-01],
         [2.96460137564761618e-03*DAYS_PER_YEAR, 2.37847173959480950e-03*DAYS_PER_YEAR, -2.96589568540237556e-05*DAYS_PER_YEAR],
         4.36624404335156298e-05*SOLAR_MASS),
        # neptune
        ([1.53796971148509165e+01, -2.59193146099879641e+01, 1.79258772950371181e-01],
         [2.68067772490389322e-03*DAYS_PER_YEAR, 1.62824170038242295e-03*DAYS_PER_YEAR, -9.51592254519715870e-05*DAYS_PER_YEAR],
         5.15138902046611451e-05*SOLAR_MASS),
    ]
    pairs = []
    for x in range(len(bodies) - 1):
        for y in bodies[x + 1:]:
            pairs.append((bodies[x], y))
    return bodies, pairs

def _nbody_advance(dt, n, bodies, pairs):
    for i in range(n):
        for (([x1, y1, z1], v1, m1), ([x2, y2, z2], v2, m2)) in pairs:
            dx = x1 - x2
            dy = y1 - y2
            dz = z1 - z2
            mag = dt * ((dx * dx + dy * dy + dz * dz) ** (-1.5))
            b1m = m1 * mag
            b2m = m2 * mag
            v1[0] -= dx * b2m
            v1[1] -= dy * b2m
            v1[2] -= dz * b2m
            v2[0] += dx * b1m
            v2[1] += dy * b1m
            v2[2] += dz * b1m
        for (r, [vx, vy, vz], m) in bodies:
            r[0] += dt * vx
            r[1] += dt * vy
            r[2] += dt * vz

def _nbody_report_energy(bodies, pairs, e=0.0):
    for (((x1, y1, z1), v1, m1), ((x2, y2, z2), v2, m2)) in pairs:
        dx = x1 - x2
        dy = y1 - y2
        dz = z1 - z2
        e -= (m1 * m2) / ((dx * dx + dy * dy + dz * dz) ** 0.5)
    for (r, [vx, vy, vz], m) in bodies:
        e += m * (vx * vx + vy * vy + vz * vz) / 2.
    return e

def _nbody_offset_momentum(ref, bodies, px=0.0, py=0.0, pz=0.0):
    for (r, [vx, vy, vz], m) in bodies:
        px -= vx * m
        py -= vy * m
        pz -= vz * m
    (r, v, m) = ref
    v[0] = px / m
    v[1] = py / m
    v[2] = pz / m

def bench_nbody(iterations=20000):
    bodies, pairs = _nbody_make_system()
    _nbody_offset_momentum(bodies[0], bodies)
    _nbody_report_energy(bodies, pairs)
    _nbody_advance(0.01, iterations, bodies, pairs)
    _nbody_report_energy(bodies, pairs)

# Warmup
bench_nbody(1000)

NBODY_RUNS = 5
t = timeit.timeit(lambda: bench_nbody(20000), number=NBODY_RUNS)
print(f"  nbody (20k iterations, {NBODY_RUNS} runs): {t/NBODY_RUNS*1000:.1f} ms/run")

…y-referenced operands in place

When the tier 2 optimizer can prove that an operand to a float
operation is uniquely referenced (refcount 1), mutate it in place
instead of allocating a new PyFloatObject.

New tier 2 micro-ops:
- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_FLOAT_INPLACE (unique LHS)
- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_FLOAT_INPLACE_RIGHT (unique RHS)
- _UNARY_NEGATIVE_FLOAT_INPLACE (unique operand)

Speeds up the pyperformance nbody benchmark by ~19%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Avoid compound assignment (+=, -=, *=) directly on ob_fval in the
in-place float ops. On 32-bit Windows, compound assignment generates
JIT stencils with _xmm register references that MSVC cannot parse.
Instead, read into a local double, compute, and write back.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add -mno-sse to the clang args for the i686-pc-windows-msvc target. The
COFF32 stencil converter cannot handle the _xmm register references that
clang emits for inline float arithmetic; using x87 FPU instructions
avoids this. SSE is optional on 32-bit x86, while x87 is the baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eendebakpt changed the title on Mar 22, 2026, removing the "Draft:" prefix.
eendebakpt (Author) commented:
Selected pyperformance benchmarks:

### chaos ###
Mean +- std dev: 60.7 ms +- 1.5 ms -> 59.5 ms +- 1.5 ms: 1.02x faster
Not significant

### fannkuch ###
Mean +- std dev: 376 ms +- 4 ms -> 375 ms +- 3 ms: 1.00x faster
Not significant

### float ###
Mean +- std dev: 60.9 ms +- 3.0 ms -> 58.8 ms +- 2.7 ms: 1.04x faster
Significant (t=4.00)

### nbody ###
Mean +- std dev: 87.2 ms +- 1.1 ms -> 72.3 ms +- 0.4 ms: 1.21x faster
Significant (t=98.34)

### raytrace ###
Mean +- std dev: 304 ms +- 5 ms -> 299 ms +- 5 ms: 1.02x faster
Not significant

### scimark_fft ###
Mean +- std dev: 305 ms +- 1 ms -> 297 ms +- 3 ms: 1.03x faster
Significant (t=19.01)

### scimark_lu ###
Mean +- std dev: 86.5 ms +- 0.4 ms -> 84.8 ms +- 0.4 ms: 1.02x faster
Not significant

### scimark_monte_carlo ###
Mean +- std dev: 60.7 ms +- 0.7 ms -> 59.8 ms +- 0.3 ms: 1.02x faster
Not significant

### scimark_sor ###
Mean +- std dev: 101 ms +- 1 ms -> 99 ms +- 3 ms: 1.03x faster
Significant (t=7.38)

### scimark_sparse_mat_mult ###
Mean +- std dev: 5.53 ms +- 0.02 ms -> 5.14 ms +- 0.02 ms: 1.08x faster
Significant (t=130.74)

### spectral_norm ###
Mean +- std dev: 84.0 ms +- 0.6 ms -> 79.0 ms +- 0.4 ms: 1.06x faster
Significant (t=53.73)

Review comment from a Member, on this excerpt from optimizer_bytecodes.c:

```c
r = right;
if (PyJitRef_IsUnique(left)) {
    ADD_OP(_BINARY_OP_SUBTRACT_FLOAT_INPLACE, 0, 0);
    l = PyJitRef_Borrow(left);
```

Isn't it more correct to say `l = sym_new_null(ctx);`? Same for below?

Review comment from a Member, on lines +804 to +806:

```c
res = left;
l = PyStackRef_NULL;
r = right;
```

This part is a little confusing. Could you please add a comment explaining it? Just this one is enough; the rest can be explained the same way.

Suggested change:

```c
// Transfer ownership of left to res.
// Original left is now dead.
res = left;
l = PyStackRef_NULL;
r = right;
```
