GH-50280: [C++] Implement VisitTwoBitRuns and VisitTwoSetBitRuns methods#50281
GH-50280: [C++] Implement VisitTwoBitRuns and VisitTwoSetBitRuns methods#50281HuaHuaY wants to merge 1 commit into
Conversation
|
|
|
We have toy element-wise summation benchmarks in |
|
Also, |
36c7481 to
51ee9fa
Compare
|
I have renamed the benchmark file and added some cases use VisitXXXBitRuns. And I added two new methods BTW I refactored the existed method |
|
Now benchmarks about new method |
|
@pitrou How to run benchmark in |
|
@ursabot please benchmark lang=C++ |
|
Benchmark runs are scheduled for commit c0d6bf7. Watch https://buildkite.com/apache-arrow and https://conbench.arrow-dev.org for updates. A comment will be posted here when the runs are complete. |
|
Thanks for your patience. Conbench analyzed the 3 benchmarking runs that have been run so far on PR commit c0d6bf7. There were 5 benchmark results indicating a performance regression:
The full Conbench report has more details. |
|
Interesting that benchmark was run of behalf of @rok |
|
@wgtmac That's a current workaround for not having a conbench user. :) |
|
I've compiled the test data into three tables. However, I was somewhat surprised that in x86 tests, There were also some unexpected findings. Currently, Based on these test results, I plan to revert the changes to 4ab7d672 / amd64-c6a-4xlarge-linux / c0d6bf7
fd5a984c / amd64-m5-4xlarge-linux / c0d6bf7
457fe991 / test-mac-arm / c0d6bf7
|
Because it's simpler :) |
|
With the help of AI, I investigated why I concluded that the primary reason lies in the absence of the Additionally, |
|
I think NOINLINE was added at some point because it actually made some benchmarks better (on some compiler :-)). So let's be careful and check for regressions if we undo it. Note that, ultimately, what matters is actual users of this function (for example in the compute library), not the bitmap traversal micro-benchmarks :-) |
|
Yeah, I was trying to figure out why I submitted a new commit that made some optimizations to
As you said, user performance is more important than benchmark performance. These changes not only improved my local benchmark but also, I believe, semantically optimized the execution process. And I don't add |
|
@ursabot please benchmark lang=C++ |
|
Benchmark runs are scheduled for commit 3b5b2d0. Watch https://buildkite.com/apache-arrow and https://conbench.arrow-dev.org for updates. A comment will be posted here when the runs are complete. |
|
I added a new benchmark to count the number of valid int8 values in a random
|
|
Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 3b5b2d0. There were 3 benchmark results indicating a performance regression:
The full Conbench report has more details. |
|
These are the results from the last benchmark and this benchmark, grouped in pairs, with the latest results following. I think that after optimizing BitRunReader, the performance gap between VisitBitRuns and VisitSetBitRuns has indeed narrowed in most cases. @pitrou Please take a look. 4ab7d672 / amd64-c6a-4xlarge-linux / c0d6bf7
d164bbbe / amd64-c6a-4xlarge-linux / 3b5b2d0
fd5a984c / amd64-m5-4xlarge-linux / c0d6bf7
e74c6927 / amd64-m5-4xlarge-linux / 3b5b2d0
457fe991 / test-mac-arm / c0d6bf7
6701543e / test-mac-arm / 3b5b2d0
|
|
The benchmarks above don't seem to show significant wins from the BitRunReader improvements. Actually, there is a number of small-ish regressions. |
|
@ursabot please benchmark lang=C++ |
|
Benchmark runs are scheduled for commit b026a7d. Watch https://buildkite.com/apache-arrow and https://conbench.arrow-dev.org for updates. A comment will be posted here when the runs are complete. |
|
I've only modified However, even if the bitmap benchmark results improve, it doesn't mean it won't negatively impact other benchmark cases. I looked at the three test cases in the benchmark report above. Group 1A: 4ab7d672 / amd64-c6a-4xlarge-linux / c0d6bf7
Group 2A: fd5a984c / amd64-m5-4xlarge-linux / c0d6bf7
Group 3A: 457fe991 / test-mac-arm / c0d6bf7
|
|
Thanks for your patience. Conbench analyzed the 3 benchmarking runs that have been run so far on PR commit b026a7d. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
|
@ursabot please benchmark lang=C++ |
|
Commit b026a7d already has scheduled benchmark runs. |
b026a7d to
41af9ca
Compare
|
@ursabot please benchmark lang=C++ |
|
Benchmark runs are scheduled for commit 41af9ca. Watch https://buildkite.com/apache-arrow and https://conbench.arrow-dev.org for updates. A comment will be posted here when the runs are complete. |
|
A performance regression has indeed occurred on |
41af9ca to
4f4b336
Compare
|
Hi @pitrou, I have reverted all code changes on existing paths; the rest are new code and should not affect existing benchmarks. Please take a look again. |
|
Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 41af9ca. There were 18 benchmark results indicating a performance regression:
The full Conbench report has more details. |
|
|
||
| namespace arrow::internal { | ||
|
|
||
| struct NestedBitmapTraversalBenchmark { |
There was a problem hiding this comment.
I don't think this benchmark is very realistic, because if we were implementing something like this we'd be doing it for an arbitrary nesting level (not 3), and we would probably not use the bitmap visitors for it.
(perhaps we'd BitmapAnd the validity bitmaps in fixed-size chunks and call CountSetBits on the latter?)
There was a problem hiding this comment.
The use case I provided here isn't quite accurate. I think a more accurate use case would be using the type struct<a: list<int8>>. I encountered this during a recursive type parsing process. When encountering list<XXX>, it combines the accumulated bitmap passed from the outer layer with the bitmap of the list to determine the valid rows of the XXX array. Its use case should be similar to VisitTwoBitBlocks, the key point being that sometimes we don't need BitmapAnd to generate a temporary bitmap. Compared to VisitTwoBitBlocks, it simply changes the data input from row-by-row to a batch interface. I plan to modify this benchmark later. If you have any better suggestions, please let me know.
| thread_pool_test.cc) | ||
|
|
||
| add_arrow_benchmark(bit_block_counter_benchmark) | ||
| add_arrow_benchmark(bitmap_traversal_benchmark) |
There was a problem hiding this comment.
I understand why you're renaming the benchmark executable, but I'm afraid it's gonna break the benchmark history by making it unrelated in the benchmark database. @rok Am I right?
There was a problem hiding this comment.
I believe so. If you have a good reason to and you don't care about breaking history that's ok. If not I'd suggest to keep the current name.
There was a problem hiding this comment.
Get it. I will revert the name change.
| ASSERT_THAT(runs, ElementsAreArray({SetBitRun{0, 2}, SetBitRun{3, 2}})); | ||
| } | ||
|
|
||
| TEST_F(TestSetBitRunReader, VisitTwoBitRunsVoid) { |
There was a problem hiding this comment.
Can the VisitTwoBitRunsVoid tests be folded into the VisitTwoBitRuns tests as they use the exact same data? It will make the whole test file more readable and maintainable.
Rationale for this change
We already have
VisitTwoBitBlockswhich can iterate over two bitmaps simultaneously. But there is no mirror code aboutBitRunReaderandSetBitRunReader.What changes are included in this PR?
Implement
VisitTwoBitRunsandVisitTwoSetBitRunsmethods and refactorVisitTwoBitBlocksinGroupedPivotAccumulatoras examples.Are these changes tested?
Yes.
Are there any user-facing changes?
No.
VisitTwoBitRunsandVisitTwoSetBitRuns#50280