perf: Rewrite scheduling of string merging tasks #1240
On my laptop at least, this is relatively performance-neutral. On benchmarks with very few strings to merge, it seems to be an improvement. It no longer seems to slow down when more threads are added. It doesn't support separately limiting the number of threads used for string merging. There are a few parameters that can, at least for the moment, be experimentally tweaked by passing three numeric parameters to
I don't have time to dive into it, so I'm just posting a quick run on Clang without debug info (results in the collapsed details).
    }
    });
    }
    let num_threads = rayon::current_num_threads();
Leaving it to the default number of threads seems to be a rather big hit beyond 20 threads.
wild-pwmu is this PR and wild-pwmu2 is this PR with a hack:
    - let num_threads = rayon::current_num_threads();
    + let num_threads = args.numeric_experiment(Experiment::MergeStringCapMultiplier, 80) as usize
    +     / 10;
https://gist.github.com/mati865/e90546d1e9d31b880db57a92e0de84ce
Some more benchmarks using the Clang binary. Using multiple groups per thread seems to hurt (at least for Clang), and the slowdown scales: https://gist.github.com/mati865/8a85880857049f7b4f9ad5ce2843d22d
The old do_splitting_work, if there were lots of threads and insufficient work to do, could end up effectively busy-waiting, which was terrible for performance. We now spawn a separate task for each bit of work. We also now constrain the number of split input section Vecs that we'll keep in memory. Issue #1085
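As a rough illustration of the approach described in this commit message (not the code from this PR), the sketch below runs one task per section and caps how many split Vecs can be alive at once. It assumes plain `std` scoped threads rather than rayon, and the names `Permits` and `count_strings` are hypothetical. Waiting for a free slot sleeps on a condition variable, so idle threads never spin.

```rust
use std::sync::{Condvar, Mutex};
use std::thread;

/// Hypothetical counting semaphore built from std primitives. Waiters sleep
/// on the condition variable instead of busy-waiting.
struct Permits {
    available: Mutex<usize>,
    freed: Condvar,
}

impl Permits {
    fn new(count: usize) -> Self {
        Self { available: Mutex::new(count), freed: Condvar::new() }
    }

    fn acquire(&self) {
        let mut available = self.available.lock().unwrap();
        while *available == 0 {
            available = self.freed.wait(available).unwrap();
        }
        *available -= 1;
    }

    fn release(&self) {
        *self.available.lock().unwrap() += 1;
        self.freed.notify_one();
    }
}

/// Splits each NUL-terminated string section in its own task and returns the
/// total number of non-empty strings. At most `max_in_memory` split Vecs
/// exist at any one time.
fn count_strings(sections: &[&[u8]], max_in_memory: usize) -> usize {
    assert!(max_in_memory > 0, "a limit of zero would never make progress");
    let permits = Permits::new(max_in_memory);
    let total = Mutex::new(0usize);
    thread::scope(|scope| {
        for section in sections {
            // Sleep until there is room for one more split Vec.
            permits.acquire();
            let (permits, total) = (&permits, &total);
            scope.spawn(move || {
                let split: Vec<&[u8]> = section
                    .split(|&byte| byte == 0)
                    .filter(|string| !string.is_empty())
                    .collect();
                *total.lock().unwrap() += split.len();
                // Free the split Vec before handing the slot back.
                drop(split);
                permits.release();
            });
        }
    });
    total.into_inner().unwrap()
}
```

Because the permit is taken before the task is spawned, the limit also bounds the number of tasks in flight. A real implementation on a thread pool would presumably queue the work rather than block the spawning thread.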
Thanks for running those benchmarks! I ended up getting rid of the code that was targeting a particular number of groups. We now just have a constant byte size after which we'll split to a new group. This and various other tweaks that I made gave some further improvements to some benchmarks. I mostly benchmarked Clang with debug info, which with 32 threads now shows about a 20% improvement for me. Other benchmarks, especially with a smaller number of threads, showed smaller improvements or in some cases no real change.
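For readers following along, splitting on a constant byte size could look roughly like the sketch below. It is only an illustration under assumptions: the function name `group_by_bytes` and the `limit` parameter are hypothetical, and the actual constant used in this PR isn't stated here.

```rust
/// Partitions sections (given by their sizes in bytes) into consecutive
/// groups of section indices. A group is closed as soon as its accumulated
/// size reaches `limit`, so the group count follows from the input size
/// rather than from the number of threads.
fn group_by_bytes(section_sizes: &[usize], limit: usize) -> Vec<Vec<usize>> {
    let mut groups = Vec::new();
    let mut current = Vec::new();
    let mut current_bytes = 0usize;
    for (index, &size) in section_sizes.iter().enumerate() {
        current.push(index);
        current_bytes += size;
        if current_bytes >= limit {
            // Close this group and start a fresh one.
            groups.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
    }
    // Keep whatever is left over as a final, smaller group.
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}
```

With this shape, an input with few strings yields a single small group, while a large input yields many groups of similar size regardless of how many threads are available.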
Impressive!