
gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only) #146307

Open
eendebakpt wants to merge 3 commits into python:main from eendebakpt:jit_float_inplace_rebased

Conversation

eendebakpt (Contributor) commented on Mar 22, 2026:

We can add the following tier 2 micro-ops that mutate the uniquely-referenced operand:

  • _BINARY_OP_ADD_FLOAT_INPLACE / _INPLACE_RIGHT — unique LHS / RHS
  • _BINARY_OP_SUBTRACT_FLOAT_INPLACE / _INPLACE_RIGHT — unique LHS / RHS
  • _BINARY_OP_MULTIPLY_FLOAT_INPLACE / _INPLACE_RIGHT — unique LHS / RHS
  • _UNARY_NEGATIVE_FLOAT_INPLACE — unique operand

The _RIGHT variants handle commutative ops (add, multiply) plus subtract when only the RHS is unique. The optimizer emits these in optimizer_bytecodes.c when PyJitRef_IsUnique(left) or PyJitRef_IsUnique(right) is true and the operand is a known float. The mutated operand is marked as borrowed so the following _POP_TOP becomes _POP_TOP_NOP.

Micro-benchmarks:

| Expression | main | optimized | Speedup |
|---|---|---|---|
| `total += a*b + c` | 24.0 ns/iter | 11.5 ns/iter | 2.1x |
| `total += a + b` | 16.9 ns/iter | 11.0 ns/iter | 1.5x |
| `total += a*b + c*d` | 28.5 ns/iter | 18.3 ns/iter | 1.6x |

pyperformance nbody (20k iterations):

| Benchmark | main | optimized | Speedup |
|---|---|---|---|
| nbody | 60.6 ms | 49.0 ms | 1.19x (19% faster) |

Follow-up

Some operations that could be added in follow-up PRs: float division, mixed float/int operations where the float is uniquely referenced, integer operations (more involved because of the small-int cache and the variable number of digits), and operations on complex numbers.
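The integer case is harder for the reason noted above. A short illustration of the two obstacles, using observable (CPython-specific) behavior; the exact cached range and object sizes are implementation details:

```python
import sys

# Small ints (CPython caches roughly -5..256) are shared singletons, so
# an "intermediate" in that range is never uniquely referenced and must
# never be mutated in place.
x = 100 + 156
y = 256
assert x is y  # both names refer to the single cached 256 object

# Int storage also grows with magnitude: a larger value needs more
# internal digits, so in-place reuse would have to check that the
# operand's allocation is big enough for the result.
print(sys.getsizeof(1 << 70), ">", sys.getsizeof(1))
```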

Script
"""Demo script for the inplace float mutation optimization in the tier 2 optimizer.

./configure --enable-experimental-jit=interpreter --with-pydebug
./configure --enable-experimental-jit=yes --with-pydebug

Usage:
    ./python jit_float_demo.py           # filtered trace (float-related ops)
    ./python jit_float_demo.py --all     # full trace (all ops)
"""

import sys
import timeit

SHOW_ALL = "--all" in sys.argv

# --- System info ---
print("=" * 60)
print("CPython JIT / Tier 2 Demo")
print("=" * 60)
print(f"Python version:  {sys.version}")
print(f"Debug build:     {hasattr(sys, 'gettotalrefcount')}")
print(f"Free-threaded:   {not sys._is_gil_enabled()}")
jit_mod = getattr(sys, "_jit", None)
if jit_mod is not None:
    print(f"JIT available:   {jit_mod.is_available()}")
    print(f"JIT enabled:     {jit_mod.is_enabled()}")
else:
    print("JIT:             not compiled in")

tier2 = False
try:
    from _testinternalcapi import TIER2_THRESHOLD
    tier2 = True
    print(f"Tier 2:          enabled (threshold={TIER2_THRESHOLD})")
except (ImportError, AttributeError):
    print("Tier 2:          disabled (build without _Py_TIER2)")

print()

# --- Example functions ---

def f_adds(n, a, b, c):
    """a*b + c per iteration — the multiply result is unique and reused."""
    total = 0.0
    for i in range(n):
        total = a + b + c
    return total

def f_chain(n, a, b, c):
    """a*b + c per iteration — the multiply result is unique and reused."""
    total = 0.0
    for i in range(n):
        total += a * b + c
    return total


def f_simple_add(n, a, b):
    """a + b per iteration — no unique intermediate."""
    total = 0.0
    for i in range(n):
        total += a + b
    return total


def f_long_chain(n, a, b, c, d):
    """a*b + c*d per iteration — two unique intermediates."""
    total = 0.0
    for i in range(n):
        total += a * b + c * d
    return total

def f_negate(n, a, b):
    """a*b + c*d per iteration — two unique intermediates."""
    total = 0.0
    for i in range(n):
        total = - (a + b)
    return total


# --- Warm up to trigger tier 2 ---
LOOP = 10_000
f_adds(LOOP, 2.0, 3.0, 4.0)
f_chain(LOOP, 2.0, 3.0, 4.0)
f_simple_add(LOOP, 2.0, 3.0)
f_long_chain(LOOP, 2.0, 3.0, 4.0, 5.0)
f_negate(LOOP, 2.0, 3.0)


# --- Op annotation ---

def annotate_op(name, oparg, func):
    """Return a human-readable annotation for a uop."""
    varnames = func.__code__.co_varnames
    consts = func.__code__.co_consts

    # _LOAD_FAST_BORROW_3 → local index 3
    for prefix in ("_LOAD_FAST_BORROW_", "_LOAD_FAST_", "_SWAP_FAST_"):
        if name.startswith(prefix):
            idx = int(name[len(prefix):])
            local = varnames[idx] if idx < len(varnames) else f"local{idx}"
            return local

    # _LOAD_CONST_INLINE_BORROW etc — operand is a pointer, not useful
    # but if oparg is a small index into consts, show it
    if "LOAD_CONST" in name and oparg < len(consts):
        return repr(consts[oparg])

    # Binary ops
    if "MULTIPLY" in name:
        return "*"
    if "SUBTRACT" in name:
        return "-"
    if "ADD" in name and "UNICODE" not in name:
        return "+"

    # Guards
    if name == "_GUARD_TOS_FLOAT":
        return "top is float?"
    if name == "_GUARD_NOS_FLOAT":
        return "2nd is float?"
    if name == "_GUARD_TOS_INT":
        return "top is int?"
    if name == "_GUARD_NOS_INT":
        return "2nd is int?"
    if "NOT_EXHAUSTED" in name:
        return "iter not done?"

    # Pop / cleanup
    if name == "_POP_TOP_NOP":
        return "skip (borrowed/null)"
    if name == "_POP_TOP_FLOAT":
        return "decref float"
    if name == "_POP_TOP_INT":
        return "decref int"

    # Control flow
    if name == "_JUMP_TO_TOP":
        return "loop"
    if name == "_EXIT_TRACE":
        return "exit"
    if name == "_DEOPT":
        return "deoptimize"
    if name == "_ERROR_POP_N":
        return "error handler"
    if name == "_START_EXECUTOR":
        return "trace entry"
    if name == "_MAKE_WARM":
        return "warmup counter"

    return ""


# --- Show traces ---
FILTER_KEYWORDS = (
    "FLOAT", "INPLACE", "BINARY_OP", "NOP",
    "LOAD_FAST", "LOAD_CONST", "GUARD",
)

has_get_executor = False
if tier2:
    try:
        from _opcode import get_executor
        has_get_executor = True
    except ImportError:
        pass

if has_get_executor:
    mode = "all ops" if SHOW_ALL else "float-related ops only"
    print("-" * 60)
    print(f"Tier 2 traces ({mode})")
    print("-" * 60)

    for label, func in [
        ("f_adds: total = a + b + c", f_adds),
        ("f_chain: total += a * b + c", f_chain),
        ("f_simple_add: total += a + b", f_simple_add),
        ("f_long_chain: total += a * b + c * d", f_long_chain),
        ("f_negate: total = - (a + b)", f_negate),
    ]:
        code = func.__code__
        found = False
        for i in range(len(code.co_code) // 2):
            try:
                ex = get_executor(code, i * 2)
            except (ValueError, TypeError, RuntimeError):
                continue
            if ex is None:
                continue

            print(f"\n  {label}")
            for j, op in enumerate(ex):
                name, oparg = op[0], op[1]

                if not SHOW_ALL:
                    if not any(k in name for k in FILTER_KEYWORDS):
                        continue

                annotation = annotate_op(name, oparg, func)
                marker = " <<<" if "INPLACE" in name else ""

                if annotation:
                    print(f"    {j:3d}: {name:45s} # {annotation}{marker}")
                else:
                    print(f"    {j:3d}: {name}{marker}")
            found = True
            break

        if not found:
            print(f"\n  {label}: (no executor found)")

    print()
else:
    print("-" * 60)
    print("Tier 2 traces: skipped (tier 2 not available)")
    print("-" * 60)
    print()

# --- Benchmark ---
print("-" * 60)
print("Benchmark")
print("-" * 60)

N = 2_000_000
INNER = 1000

benchmarks = [
    ("total = a + b + c", lambda: f_adds(INNER, 2.0, 3.0, 4.0)),
    ("total += a*b + c  ", lambda: f_chain(INNER, 2.0, 3.0, 4.0)),
    ("total += a + b    ", lambda: f_simple_add(INNER, 2.0, 3.0)),
    ("total += a*b + c*d", lambda: f_long_chain(INNER, 2.0, 3.0, 4.0, 5.0)),
    ("total = - (a + b)    ", lambda: f_negate(INNER, 2.0, 3.0)),
]

for label, fn in benchmarks:
    iters = N // INNER
    t = timeit.timeit(fn, number=iters)
    ns_per = t / N * 1e9
    print(f"  {label}:  {t:.3f}s  ({ns_per:.0f} ns/iter)")

print()
print("The 'a*b + c' case benefits from _BINARY_OP_ADD_FLOAT_INPLACE:")
print("the result of a*b is uniquely referenced, so the addition")
print("mutates it in place instead of allocating a new float.")

# --- N-body benchmark from pyperformance ---
print()
print("-" * 60)
print("N-body benchmark (from pyperformance)")
print("-" * 60)

PI = 3.14159265358979323
SOLAR_MASS = 4 * PI * PI
DAYS_PER_YEAR = 365.24

def _nbody_make_system():
    bodies = [
        # sun
        ([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], SOLAR_MASS),
        # jupiter
        ([4.84143144246472090e+00, -1.16032004402742839e+00, -1.03622044471123109e-01],
         [1.66007664274403694e-03*DAYS_PER_YEAR, 7.69901118419740425e-03*DAYS_PER_YEAR, -6.90460016972063023e-05*DAYS_PER_YEAR],
         9.54791938424326609e-04*SOLAR_MASS),
        # saturn
        ([8.34336671824457987e+00, 4.12479856412430479e+00, -4.03523417114321381e-01],
         [-2.76742510726862411e-03*DAYS_PER_YEAR, 4.99852801234917238e-03*DAYS_PER_YEAR, 2.30417297573763929e-05*DAYS_PER_YEAR],
         2.85885980666130812e-04*SOLAR_MASS),
        # uranus
        ([1.28943695621391310e+01, -1.51111514016986312e+01, -2.23307578892655734e-01],
         [2.96460137564761618e-03*DAYS_PER_YEAR, 2.37847173959480950e-03*DAYS_PER_YEAR, -2.96589568540237556e-05*DAYS_PER_YEAR],
         4.36624404335156298e-05*SOLAR_MASS),
        # neptune
        ([1.53796971148509165e+01, -2.59193146099879641e+01, 1.79258772950371181e-01],
         [2.68067772490389322e-03*DAYS_PER_YEAR, 1.62824170038242295e-03*DAYS_PER_YEAR, -9.51592254519715870e-05*DAYS_PER_YEAR],
         5.15138902046611451e-05*SOLAR_MASS),
    ]
    pairs = []
    for x in range(len(bodies) - 1):
        for y in bodies[x + 1:]:
            pairs.append((bodies[x], y))
    return bodies, pairs

def _nbody_advance(dt, n, bodies, pairs):
    for i in range(n):
        for (([x1, y1, z1], v1, m1), ([x2, y2, z2], v2, m2)) in pairs:
            dx = x1 - x2
            dy = y1 - y2
            dz = z1 - z2
            mag = dt * ((dx * dx + dy * dy + dz * dz) ** (-1.5))
            b1m = m1 * mag
            b2m = m2 * mag
            v1[0] -= dx * b2m
            v1[1] -= dy * b2m
            v1[2] -= dz * b2m
            v2[0] += dx * b1m
            v2[1] += dy * b1m
            v2[2] += dz * b1m
        for (r, [vx, vy, vz], m) in bodies:
            r[0] += dt * vx
            r[1] += dt * vy
            r[2] += dt * vz

def _nbody_report_energy(bodies, pairs, e=0.0):
    for (((x1, y1, z1), v1, m1), ((x2, y2, z2), v2, m2)) in pairs:
        dx = x1 - x2
        dy = y1 - y2
        dz = z1 - z2
        e -= (m1 * m2) / ((dx * dx + dy * dy + dz * dz) ** 0.5)
    for (r, [vx, vy, vz], m) in bodies:
        e += m * (vx * vx + vy * vy + vz * vz) / 2.
    return e

def _nbody_offset_momentum(ref, bodies, px=0.0, py=0.0, pz=0.0):
    for (r, [vx, vy, vz], m) in bodies:
        px -= vx * m
        py -= vy * m
        pz -= vz * m
    (r, v, m) = ref
    v[0] = px / m
    v[1] = py / m
    v[2] = pz / m

def bench_nbody(iterations=20000):
    bodies, pairs = _nbody_make_system()
    _nbody_offset_momentum(bodies[0], bodies)
    _nbody_report_energy(bodies, pairs)
    _nbody_advance(0.01, iterations, bodies, pairs)
    _nbody_report_energy(bodies, pairs)

# Warmup
bench_nbody(1000)

NBODY_RUNS = 5
t = timeit.timeit(lambda: bench_nbody(20000), number=NBODY_RUNS)
print(f"  nbody (20k iterations, {NBODY_RUNS} runs): {t/NBODY_RUNS*1000:.1f} ms/run")

…y-referenced operands in place

When the tier 2 optimizer can prove that an operand to a float
operation is uniquely referenced (refcount 1), mutate it in place
instead of allocating a new PyFloatObject.

New tier 2 micro-ops:
- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_FLOAT_INPLACE (unique LHS)
- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_FLOAT_INPLACE_RIGHT (unique RHS)
- _UNARY_NEGATIVE_FLOAT_INPLACE (unique operand)

Speeds up the pyperformance nbody benchmark by ~19%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Avoid compound assignment (+=, -=, *=) directly on ob_fval in the
in-place float ops. On 32-bit Windows, compound assignment generates
JIT stencils with _xmm register references that MSVC cannot parse.
Instead, read into a local double, compute, and write back.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add -mno-sse to the clang args for the i686-pc-windows-msvc target. The
COFF32 stencil converter cannot handle the _xmm register references that
clang emits for inline float arithmetic; using x87 FPU instructions
avoids this. SSE is optional on 32-bit x86, while x87 is the baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eendebakpt changed the title on Mar 22, 2026, removing the "Draft:" prefix.
eendebakpt (Author) commented:
Selected pyperformance benchmarks:

### chaos ###
Mean +- std dev: 60.7 ms +- 1.5 ms -> 59.5 ms +- 1.5 ms: 1.02x faster
Not significant

### fannkuch ###
Mean +- std dev: 376 ms +- 4 ms -> 375 ms +- 3 ms: 1.00x faster
Not significant

### float ###
Mean +- std dev: 60.9 ms +- 3.0 ms -> 58.8 ms +- 2.7 ms: 1.04x faster
Significant (t=4.00)

### nbody ###
Mean +- std dev: 87.2 ms +- 1.1 ms -> 72.3 ms +- 0.4 ms: 1.21x faster
Significant (t=98.34)

### raytrace ###
Mean +- std dev: 304 ms +- 5 ms -> 299 ms +- 5 ms: 1.02x faster
Not significant

### scimark_fft ###
Mean +- std dev: 305 ms +- 1 ms -> 297 ms +- 3 ms: 1.03x faster
Significant (t=19.01)

### scimark_lu ###
Mean +- std dev: 86.5 ms +- 0.4 ms -> 84.8 ms +- 0.4 ms: 1.02x faster
Not significant

### scimark_monte_carlo ###
Mean +- std dev: 60.7 ms +- 0.7 ms -> 59.8 ms +- 0.3 ms: 1.02x faster
Not significant

### scimark_sor ###
Mean +- std dev: 101 ms +- 1 ms -> 99 ms +- 3 ms: 1.03x faster
Significant (t=7.38)

### scimark_sparse_mat_mult ###
Mean +- std dev: 5.53 ms +- 0.02 ms -> 5.14 ms +- 0.02 ms: 1.08x faster
Significant (t=130.74)

### spectral_norm ###
Mean +- std dev: 84.0 ms +- 0.6 ms -> 79.0 ms +- 0.4 ms: 1.06x faster
Significant (t=53.73)

Review comment from a Member, on this excerpt from optimizer_bytecodes.c:

```c
r = right;
if (PyJitRef_IsUnique(left)) {
    ADD_OP(_BINARY_OP_SUBTRACT_FLOAT_INPLACE, 0, 0);
    l = PyJitRef_Borrow(left);
```

Isn't it more correct to say `l = sym_new_null(ctx);`? Same for below?

Review comment from a Member, on lines +804 to +806:

```c
res = left;
l = PyStackRef_NULL;
r = right;
```

This part is a little confusing. Could you please add a comment explaining it? Just this one is enough; the rest can be explained the same way.

Suggested change:

```c
// Transfer ownership of left to res.
// Original left is now dead.
res = left;
l = PyStackRef_NULL;
r = right;
```
