buffer: improve performance of multiple Buffer operations#61871
buffer: improve performance of multiple Buffer operations#61871thisalihassan wants to merge 10 commits intonodejs:mainfrom
Conversation
|
Review requested:
|
d2ba38f to
495feb5
Compare
| void FastSwap16(Local<Value> receiver, | ||
| Local<Value> buffer_obj, | ||
| // NOLINTNEXTLINE(runtime/references) | ||
| FastApiCallbackOptions& options) { | ||
| HandleScope scope(options.isolate); | ||
| ArrayBufferViewContents<char> buffer(buffer_obj); | ||
| CHECK(nbytes::SwapBytes16(const_cast<char*>(buffer.data()), buffer.length())); | ||
| } |
There was a problem hiding this comment.
These fast callbacks are non-identical to the conventional callbacks they shadow.
- The existing callbacks validate their argument and throws to JS if invalid, whereas your fast callbacks hard-crash the process. It might be better to validate in the JS layer, then use the same unwrapping logic on both sides.
- Your fast callback cannot have a different return convention to the conventional callback. You will need to remove the return value from the conventional callback.
There was a problem hiding this comment.
Was removing the validation instead of moving it to js intentional?
Also fast paths can keep the validation and fall back to slow i think, so we can still validate on the src side?
There was a problem hiding this comment.
Also fast paths can keep the validation and fall back to slow i think, so we can still validate on the src side?
This is outdated, the fast API doesn't use fallback any more (since Node.js v23.x).
However, any validation in a JS wrapper should be shadowed by a CHECK or DCHECK in the C++ binding.
There was a problem hiding this comment.
SPREAD_BUFFER_ARG already contains CHECK((val)->IsArrayBufferView()) on all Slow & Fast swap methods
| void FastSwap16(Local<Value> receiver, | ||
| Local<Value> buffer_obj, | ||
| // NOLINTNEXTLINE(runtime/references) | ||
| FastApiCallbackOptions& options) { | ||
| HandleScope scope(options.isolate); | ||
| ArrayBufferViewContents<char> buffer(buffer_obj); | ||
| CHECK(nbytes::SwapBytes16(const_cast<char*>(buffer.data()), buffer.length())); | ||
| } |
There was a problem hiding this comment.
Fast callbacks should include debug tracking and call tests.
There was a problem hiding this comment.
Thanks for sharing these I will update the code
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #61871 +/- ##
==========================================
- Coverage 91.60% 89.68% -1.92%
==========================================
Files 337 676 +339
Lines 140745 206716 +65971
Branches 21802 39585 +17783
==========================================
+ Hits 128925 185400 +56475
- Misses 11595 13449 +1854
- Partials 225 7867 +7642
🚀 New features to boost your workflow:
|
1395d2f to
01ba74f
Compare
src/node_buffer.cc
Outdated
| FastApiCallbackOptions& options) { | ||
| TRACK_V8_FAST_API_CALL("buffer.swap16"); | ||
| HandleScope scope(options.isolate); | ||
| ArrayBufferViewContents<char> buffer(buffer_obj); |
There was a problem hiding this comment.
ArrayBufferViewContents is wrong here, as buffer.data() may be a stack-allocated copy of the byte data rather than the data itself. SPREAD_BUFFER_ARG is the correct macro to use here, as per the conventional callback.
There was a problem hiding this comment.
@Renegade334 replaced ArrayBufferViewContents with SPREAD_BUFFER_ARG in all three fast swap callbacks
There was a problem hiding this comment.
For toHex, wait until #61609, which improves native perf significantly (more than Uint8Array.prototype.toHex)
See also #60249 (comment)
| } else if (value.length === 1) { | ||
| // Fast path: If `value` fits into a single byte, use that numeric value. | ||
| if (normalizedEncoding === 'utf8') { | ||
| if (normalizedEncoding === 'utf8' || normalizedEncoding === 'ascii') { |
There was a problem hiding this comment.
Currently, ascii behaves exactly like latin1
Unsure if by design or accidentally
There was a problem hiding this comment.
yes, this is safe by design I am just extending the existing single byte numeric optimization to cover ASCII, since the guard already constrains it to the valid ASCII range.
|
Note on toBase64 / toBase64url: I also tried replacing the C++ base64Slice/base64urlSlice bindings with V8's Uint8Array.prototype.toBase64() (similar to the toHex change) but it caused a 35-54% regression across all buffer sizes so I reverted base64/base64url and kept only the toHex optimization which showed a clear +26-37% win. |
|
@thisalihassan toHex doesn't show a win anymore with Instead, it's ~3x slower. See nodejs/nbytes#12 |
|
Hi @ChALkeR thanks for flagging I was not aware. I benchmarked the nibble approach locally and it's indeed a much bigger win (~3x vs my ~30% with toHex). Reverted the toHex path entirely the other changes in this PR are unaffected. Should I include the nbytes nibble HexEncode optimization in this PR or keep them as separate PRs?
|
Hi @ChALkeR I see that your changes landed, congratualtions on that huge win. Is there benchmark-ci that we can run on this PR? I can rebase it |
|
I re-ran the benchmark against the latest main (which contains the nibble improvement) and found out there are few regressions caused by my code:
I tried fixing this regression but unavoidable Attaching Details below: buffer-benchmark-all-rebased.csv -- raw benchmark CSV |
|
compare-R-output.txt-- full output from Rscript benchmark/compare.R (shared as a file to avoid cluttering the PR comment). |
59f5d09 to
48abfc5
Compare
|
PS: I resolved some conflicts |
|
Hi @ChALkeR @anonrig @Renegade334 is this PR ready to land? can you also please check the latest benchmarks i have posted, I have compiled PDF for this benchmark in one my comments above for more detail
|
Qard
left a comment
There was a problem hiding this comment.
Generally LGTM, but one small nit.
Failed to start CI⚠ Commits were pushed since the last approving review: ⚠ - buffer: improve performance of multiple Buffer operations ⚠ - buffer: address PR feedback ⚠ - buffer: revert toHex in favour of nbytes HexEncode update ⚠ - resolve feedback ⚠ - resolve feedback on fast callbacks ⚠ - added comments and updated tests ⚠ - resolve feedback ✘ Refusing to run CI on potentially unsafe PRhttps://github.com/nodejs/node/actions/runs/22977319444 |
|
CI: https://ci.nodejs.org/job/node-test-pull-request/71839/ Results |
lib/buffer.js
Outdated
| if (length === undefined && typeof offset === 'string') { | ||
| encoding = offset; | ||
| length = this.length; | ||
| length = len; |
There was a problem hiding this comment.
This is extremely confusing. Please use more readable variable names.
|
|
||
| Buffer.prototype.swap16 = function swap16() { | ||
| // For Buffer.length < 128, it's generally faster to | ||
| // For Buffer.length <= 32, it's generally faster to |
There was a problem hiding this comment.
Can you add a reference link to this pull-request above this line
| void Swap16(const FunctionCallbackInfo<Value>& args) { | ||
| Environment* env = Environment::GetCurrent(args); | ||
| THROW_AND_RETURN_UNLESS_BUFFER(env, args[0]); | ||
| DCHECK(args[0]->IsArrayBufferView()); |
There was a problem hiding this comment.
Although I agree that DCHECK is sufficient here, we always use non-debug CHECK's. If we want to replace this, we should do it in all of the codebase.
| DCHECK(args[0]->IsArrayBufferView()); | |
| CHECK(args[0]->IsArrayBufferView()); |
There was a problem hiding this comment.
we already haveSPREAD_BUFFER_ARG for Check thenDCheck looks redudand want me to remove this?
| FastApiCallbackOptions& options) { | ||
| TRACK_V8_FAST_API_CALL("buffer.swap16"); | ||
| HandleScope scope(options.isolate); | ||
| SPREAD_BUFFER_ARG(buffer_obj, ts_obj); |
There was a problem hiding this comment.
This fast api call is missing DCHECK(buffer_obj->IsArrayBufferView());
There was a problem hiding this comment.
SPREAD_BUFFER_ARG already do CHECK((val)->IsArrayBufferView()); which is why I didn't add DCheck here
| void Swap32(const FunctionCallbackInfo<Value>& args) { | ||
| Environment* env = Environment::GetCurrent(args); | ||
| THROW_AND_RETURN_UNLESS_BUFFER(env, args[0]); | ||
| DCHECK(args[0]->IsArrayBufferView()); |
There was a problem hiding this comment.
| DCHECK(args[0]->IsArrayBufferView()); | |
| CHECK(args[0]->IsArrayBufferView()); |
| FastApiCallbackOptions& options) { | ||
| TRACK_V8_FAST_API_CALL("buffer.swap32"); | ||
| HandleScope scope(options.isolate); | ||
| SPREAD_BUFFER_ARG(buffer_obj, ts_obj); |
There was a problem hiding this comment.
check is arraybufferview is missing.
There was a problem hiding this comment.
SPREAD_BUFFER_ARG already do CHECK((val)->IsArrayBufferView()); which is why I didn't add DCheck here
| FastApiCallbackOptions& options) { | ||
| TRACK_V8_FAST_API_CALL("buffer.swap64"); | ||
| HandleScope scope(options.isolate); | ||
| SPREAD_BUFFER_ARG(buffer_obj, ts_obj); |
There was a problem hiding this comment.
check is arraybufferview is missing
There was a problem hiding this comment.
same as above
| void Swap64(const FunctionCallbackInfo<Value>& args) { | ||
| Environment* env = Environment::GetCurrent(args); | ||
| THROW_AND_RETURN_UNLESS_BUFFER(env, args[0]); | ||
| DCHECK(args[0]->IsArrayBufferView()); |
There was a problem hiding this comment.
| DCHECK(args[0]->IsArrayBufferView()); | |
| CHECK(args[0]->IsArrayBufferView()); |
- copyBytesFrom: calculate byte offsets directly instead of
slicing into an intermediate typed array
- toString('hex'): use V8 Uint8Array.prototype.toHex() builtin
- fill: add single-char ASCII fast path
- indexOf: use indexOfString directly for ASCII encoding
- swap16/32/64: add V8 Fast API functions
- Guard ensureUint8ArrayToHex against --no-js-base-64 flag by falling back to C++ hexSlice when toHex is unavailable - Remove THROW_AND_RETURN_UNLESS_BUFFER and return value from slow Swap16/32/64 to match fast path conventions (JS validates) - Add TRACK_V8_FAST_API_CALL to FastSwap16/32/64 - Add test/parallel/test-buffer-swap-fast.js for fast API verification
Remove V8 Uint8Array.prototype.toHex() path for Buffer.toString('hex')
in favour of the upcoming nbytes HexEncode improvement (nodejs/nbytes#12)
which is ~3x faster through the existing C++ hexSlice path.
Refs: nodejs/nbytes#12
Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com>
1d16ad4 to
f1d938b
Compare
| let end = viewLength; | ||
|
|
||
| view = TypedArrayPrototypeSlice(view, offset, end); | ||
| if (offset !== undefined) { |
There was a problem hiding this comment.
Would switching from TypedArrayPrototypeSlice to TypedArrayPrototypeSubarray be easier here? It avoids the extraneous copy possibly eliminates the more complicated manual processing here.
There was a problem hiding this comment.
Ok @jasnell I tried as you suggested, here's my implementation did bench with the current implementation and there was consistant 14-20% of regresssion with TypedArrayPrototypeSubarray although I like the code, it's much cleaner
Buffer.copyBytesFrom = function copyBytesFrom(view, offset, length) {
if (!isTypedArray(view)) {
throw new ERR_INVALID_ARG_TYPE('view', [ 'TypedArray' ], view);
}
const viewLength = TypedArrayPrototypeGetLength(view);
if (viewLength === 0) {
return new FastBuffer();
}
let start = 0;
let end = viewLength;
if (offset !== undefined) {
validateInteger(offset, 'offset', 0);
if (offset >= viewLength) return new FastBuffer();
start = offset;
}
if (length !== undefined) {
validateInteger(length, 'length', 0);
end = start + length;
}
view = TypedArrayPrototypeSubarray(view, start, end);
return fromArrayLike(new Uint8Array(
TypedArrayPrototypeGetBuffer(view),
TypedArrayPrototypeGetByteOffset(view),
TypedArrayPrototypeGetByteLength(view)));
};There was a problem hiding this comment.
Here's the bench between current imlementation vs your suggestion:
thisalihassan@Alis-MacBook-Pro node % cat local/benchmarks/buffers/jsnell-copybytes-subarray.csv | Rscript benchmark/compare.R
confidence improvement accuracy (*) (**) (***)
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=2048 type='Float64Array' -0.38 % ±3.41% ±4.53% ±5.90%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=2048 type='Uint16Array' -2.45 % ±2.93% ±3.90% ±5.07%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=2048 type='Uint32Array' * 4.21 % ±3.69% ±4.92% ±6.41%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=2048 type='Uint8Array' *** -8.81 % ±2.34% ±3.11% ±4.06%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=256 type='Float64Array' *** -9.65 % ±2.32% ±3.08% ±4.02%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=256 type='Uint16Array' *** -14.88 % ±1.09% ±1.45% ±1.89%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=256 type='Uint32Array' *** -12.90 % ±2.06% ±2.75% ±3.62%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=256 type='Uint8Array' *** -18.38 % ±2.80% ±3.76% ±4.96%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=64 type='Float64Array' *** -14.73 % ±1.08% ±1.44% ±1.88%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=64 type='Uint16Array' *** -21.16 % ±0.91% ±1.21% ±1.58%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=64 type='Uint32Array' *** -19.46 % ±2.12% ±2.84% ±3.74%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=64 type='Uint8Array' *** -21.50 % ±0.94% ±1.25% ±1.64%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=2048 type='Float64Array' 1.55 % ±2.88% ±3.83% ±4.99%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=2048 type='Uint16Array' *** -9.22 % ±1.52% ±2.03% ±2.64%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=2048 type='Uint32Array' -0.56 % ±2.80% ±3.72% ±4.84%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=2048 type='Uint8Array' *** -13.41 % ±1.26% ±1.68% ±2.19%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=256 type='Float64Array' *** -14.31 % ±1.74% ±2.32% ±3.04%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=256 type='Uint16Array' *** -18.77 % ±0.86% ±1.15% ±1.49%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=256 type='Uint32Array' *** -14.93 % ±2.59% ±3.48% ±4.57%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=256 type='Uint8Array' *** -19.45 % ±0.81% ±1.08% ±1.41%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=64 type='Float64Array' *** -16.75 % ±2.95% ±3.94% ±5.14%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=64 type='Uint16Array' *** -20.57 % ±0.92% ±1.23% ±1.61%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=64 type='Uint32Array' *** -19.43 % ±0.74% ±0.98% ±1.28%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=64 type='Uint8Array' *** -20.58 % ±1.11% ±1.48% ±1.92%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=2048 type='Float64Array' *** 13.69 % ±4.00% ±5.32% ±6.93%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=2048 type='Uint16Array' *** -8.51 % ±2.33% ±3.10% ±4.03%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=2048 type='Uint32Array' -2.65 % ±2.74% ±3.65% ±4.77%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=2048 type='Uint8Array' *** -10.05 % ±1.26% ±1.68% ±2.19%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=256 type='Float64Array' *** -10.48 % ±1.17% ±1.55% ±2.02%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=256 type='Uint16Array' *** -15.20 % ±1.14% ±1.52% ±1.98%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=256 type='Uint32Array' *** -13.07 % ±1.19% ±1.59% ±2.07%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=256 type='Uint8Array' *** -19.86 % ±1.38% ±1.84% ±2.42%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=64 type='Float64Array' *** -14.71 % ±0.92% ±1.23% ±1.60%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=64 type='Uint16Array' *** -21.59 % ±1.08% ±1.44% ±1.88%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=64 type='Uint32Array' *** -19.28 % ±1.95% ±2.59% ±3.38%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=64 type='Uint8Array' *** -22.33 % ±1.03% ±1.37% ±1.79%
Be aware that when doing many comparisons the risk of a false-positive
result increases. In this case, there are 36 comparisons, you can thus
expect the following amount of false-positive results:
1.80 false positives, when considering a 5% risk acceptance (*, **, ***),
0.36 false positives, when considering a 1% risk acceptance (**, ***),
0.04 false positives, when considering a 0.1% risk acceptance (***)
There was a problem hiding this comment.
hmm... that would still make it a net improvement over the current implementation. I'll leave it up to you, I'm just not a big fan of the manual handling.
There was a problem hiding this comment.
Since I was aiming to maxing perf, I will keep it if the regression was less I would have kept your solution
| // do the swap in javascript. For larger buffers, | ||
| // dropping down to the native code is faster. | ||
| const len = this.length; | ||
| const len = TypedArrayPrototypeGetLength(this); |
There was a problem hiding this comment.
While this should provide a viable defense against crashing if this isn't a typed array, the error is going to be somewhat non-obvious...
Welcome to Node.js v26.0.0-pre.
Type ".help" for more information.
> const { swap16 } = Buffer.prototype
undefined
> swap16(1)
Uncaught:
TypeError: Method get TypedArray.prototype.length called on incompatible receiver undefined
at get length (<anonymous>)
at swap16 (node:buffer:1265:15)
> Having this instead be a ERR_INVALID_THIS error with a more specific error message would be nicer.
There was a problem hiding this comment.
This would have a performance impact wouldn't it?
There was a problem hiding this comment.
These are used so infrequently that I'm not concerned about the performance impact of an additional isArrayBufferView() type check
There was a problem hiding this comment.
Added isArrayBufferView guard with ERR_INVALID_THIS to all three swap methods. Here's the behavior comparison:
swap16
| Receiver | main branch | This PR |
|---|---|---|
| Buffer(4) | OK | OK |
| Uint8Array(4) | OK | OK |
| Float32Array(4) | OK | OK |
| Int16Array(4) | OK | OK |
| Array(4) | OK | ERR_INVALID_THIS |
| Array(200) | ERR_INVALID_ARG_TYPE | ERR_INVALID_THIS |
| number(1) | ERR_INVALID_BUFFER_SIZE | ERR_INVALID_THIS |
| string("ab") | TypeError (read-only property) | ERR_INVALID_THIS |
| plain object | OK (silent garbage) | ERR_INVALID_THIS |
| null | TypeError | ERR_INVALID_THIS |
| undefined | TypeError | ERR_INVALID_THIS |
I don't think any of this is now considered breaking changes, some used to work but that shouldn't have worked realistically @jasnell
cc: @aduh95
There was a problem hiding this comment.
These are used so infrequently that I'm not concerned about the error being not as helpful as it could be, especially since we don't support passing an arbitrary this. If V8 can get away with it, we can too.
| if (len % 8 !== 0) | ||
| throw new ERR_INVALID_BUFFER_SIZE('64-bits'); | ||
| if (len < 192) { | ||
| if (len < 48) { |
There was a problem hiding this comment.
In the swap16 and swap32, the threshold check is len <=, while this is len < ... is there a particular reason for the difference?
There was a problem hiding this comment.
Yes intentional the crossover point for swap64 is between 40 and 48, check the bench here
There was a problem hiding this comment.
A comment here would be helpful then.




Multiple performance improvements to Buffer operations, verified with benchmarks (15-30 runs, comparing old vs new binaries built from same tree).
Buffer.copyBytesFrom()(+100-210%)Avoid intermediate
TypedArrayPrototypeSliceallocation by calculating byte offsets directly into the source TypedArray's underlying ArrayBuffer.Buffer.prototype.fill("t", "ascii")(+26-37%)ASCII
indexOf(+14-46%)Call
indexOfStringdirectly for ASCII encoding instead of first converting the search value to a Buffer viafromStringFastand then callingindexOfBuffer. ASCII and Latin-1 share the same byte values for characters 0-127.swap16/32/64(+3-38%)Add V8 Fast API C++ functions (
FastSwap16/32/64) alongside the existing slow path. Largest gains at len=256 (+35%).Benchmark results
Key results (15-30 runs,
***= p < 0.001):Attaching Details below:
detail.pdf -- visual breakdown
buffer-benchmark-all-rebased.csv -- raw benchmark CSV
compare-R-output.txt-- full output