perf: Rewrite scheduling of string merging tasks by davidlattimore · Pull Request #1240 · wild-linker/wild

davidlattimore · 2025-10-25T04:18:37Z

The old do_splitting_work, if there were lots of threads and insufficient work to do, could end up effectively busy-waiting, which was terrible for performance.

We now spawn a separate task for each bit of work. We also now constrain the number of split input section Vecs that we'll keep in memory.

Issue #1085
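
To illustrate the scheduling idea described above (one task per piece of work, a blocking wait instead of a spin, and a cap on how many split results are alive at once), here is a minimal sketch using only the standard library. It is not the code from this PR: `split_on_nul`, `count_strings` and `max_buffers` are hypothetical names, and scoped OS threads stand in for tasks on wild's rayon thread pool.

```rust
use std::sync::mpsc;
use std::thread;

/// Records the start offset of every NUL-terminated string in `input`.
fn split_on_nul(input: &[u8], offsets: &mut Vec<usize>) {
    let mut start = 0;
    for (pos, byte) in input.iter().enumerate() {
        if *byte == 0 {
            offsets.push(start);
            start = pos + 1;
        }
    }
}

/// Counts the strings in each input, running one task per input while
/// keeping at most `max_buffers` offset Vecs allocated (assumed policy).
fn count_strings(inputs: &[&[u8]], max_buffers: usize) -> Vec<usize> {
    assert!(max_buffers > 0, "need at least one buffer");
    let mut free: Vec<Vec<usize>> = vec![Vec::new(); max_buffers];
    let mut counts = vec![0usize; inputs.len()];
    let (done_tx, done_rx) = mpsc::channel::<(usize, Vec<usize>)>();

    thread::scope(|scope| {
        let mut in_flight = 0usize;
        for (index, input) in inputs.iter().copied().enumerate() {
            if free.is_empty() {
                // Every buffer is in use: sleep in recv() until a task
                // finishes, rather than polling in a loop.
                let (done_index, mut offsets) = done_rx.recv().expect("worker hung up");
                counts[done_index] = offsets.len();
                offsets.clear();
                free.push(offsets);
                in_flight -= 1;
            }
            let mut offsets = free.pop().expect("a buffer is free here");
            let done_tx = done_tx.clone();
            // One task per piece of work.
            scope.spawn(move || {
                split_on_nul(input, &mut offsets);
                done_tx.send((index, offsets)).expect("receiver is alive");
            });
            in_flight += 1;
        }
        // Collect the results of the tasks that are still running.
        for _ in 0..in_flight {
            let (done_index, offsets) = done_rx.recv().expect("worker hung up");
            counts[done_index] = offsets.len();
        }
    });
    counts
}
```

Because the buffers are recycled through the `free` list, memory use is bounded by `max_buffers` no matter how many inputs there are, and an idle producer sleeps in `recv()` instead of consuming a core.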

davidlattimore · 2025-10-25T09:24:58Z

On my laptop at least, this is relatively performance-neutral. On benchmarks with very few strings to merge, it seems to be an improvement, and it no longer seems to slow down as more threads are added. It doesn't support separately limiting the number of threads used for string merging.

A few parameters can, at least for the moment, be tweaked experimentally by passing three numeric values via --wild-experiments=80,16,1024 (the values here are the current defaults). The three values are, respectively, a multiplier that adjusts how many buffers are available for intermediate work, how much to split input sections, and the minimum size in bytes for a section to be split. It's a bit of guesswork as to what good values would be. At least for me, adjusting these values doesn't make much difference, but I also haven't tried anything especially scientific yet. I'll probably do more experiments before I merge. I'm interested in how this looks performance-wise for others.
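
As a rough illustration of how three comma-separated values like these could map onto named settings, here is a small sketch. The struct, its field names and `parse_experiments` are hypothetical and are not wild's actual argument handling. Only the defaults (80, 16, 1024) come from the comment above, and falling back to the default for a missing value is an assumption.

```rust
/// Hypothetical container for the three experiment values.
#[derive(Debug, PartialEq, Eq)]
struct MergeExperiments {
    /// Adjusts how many buffers are available for intermediate work.
    buffer_multiplier: u64,
    /// How much to split input sections.
    split_factor: u64,
    /// Minimum size in bytes for a section to be split.
    min_split_bytes: u64,
}

impl Default for MergeExperiments {
    fn default() -> Self {
        // Defaults quoted in the comment above.
        Self { buffer_multiplier: 80, split_factor: 16, min_split_bytes: 1024 }
    }
}

/// Parses a value such as "80,16,1024". Missing or empty positions keep
/// their defaults (an assumption of this sketch).
fn parse_experiments(value: &str) -> Result<MergeExperiments, std::num::ParseIntError> {
    let mut parsed = MergeExperiments::default();
    for (index, field) in value.split(',').enumerate() {
        let field = field.trim();
        if field.is_empty() {
            continue;
        }
        let number: u64 = field.parse()?;
        match index {
            0 => parsed.buffer_multiplier = number,
            1 => parsed.split_factor = number,
            2 => parsed.min_split_bytes = number,
            _ => {} // Extra values are ignored in this sketch.
        }
    }
    Ok(parsed)
}
```

With this sketch, `parse_experiments("40")` would change only the buffer multiplier and leave the other two values at 16 and 1024.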

mati865 · 2025-10-25T16:20:45Z

I don't have time to dive into it, so I'm just posting a quick run on Clang without debug info:

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./run-with ~/Projects/wild/target/release/wild-base --threads=1`	410.8 ± 3.2	407.2	416.3	5.53 ± 0.11
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=1`	413.0 ± 4.6	406.1	420.6	5.56 ± 0.12
`./run-with ~/Projects/wild/target/release/wild-base --threads=2`	223.4 ± 2.4	220.7	229.3	3.01 ± 0.06
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=2`	223.2 ± 4.8	213.7	234.6	3.00 ± 0.08
`./run-with ~/Projects/wild/target/release/wild-base --threads=3`	158.8 ± 1.5	156.0	161.1	2.14 ± 0.04
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=3`	159.7 ± 1.8	157.1	162.9	2.15 ± 0.05
`./run-with ~/Projects/wild/target/release/wild-base --threads=4`	126.5 ± 3.7	117.8	132.7	1.70 ± 0.06
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=4`	129.6 ± 4.3	118.4	133.9	1.74 ± 0.07
`./run-with ~/Projects/wild/target/release/wild-base --threads=5`	105.2 ± 5.4	98.2	114.0	1.42 ± 0.08
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=5`	108.8 ± 4.0	99.9	114.1	1.46 ± 0.06
`./run-with ~/Projects/wild/target/release/wild-base --threads=6`	96.6 ± 4.9	87.7	102.5	1.30 ± 0.07
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=6`	100.0 ± 1.5	96.1	102.6	1.35 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=7`	91.4 ± 3.0	82.6	95.2	1.23 ± 0.05
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=7`	90.1 ± 2.3	84.2	93.7	1.21 ± 0.04
`./run-with ~/Projects/wild/target/release/wild-base --threads=8`	86.9 ± 1.7	83.1	90.6	1.17 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=8`	86.3 ± 1.1	84.0	88.4	1.16 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=9`	82.4 ± 1.4	77.6	84.7	1.11 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=9`	81.5 ± 1.4	78.3	84.3	1.10 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=10`	79.1 ± 1.2	76.7	83.0	1.06 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=10`	79.4 ± 1.1	77.1	82.7	1.07 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-base --threads=11`	77.2 ± 1.2	75.1	80.3	1.04 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=11`	78.0 ± 1.7	75.5	82.7	1.05 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=12`	76.0 ± 1.0	74.7	78.7	1.02 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=12`	76.1 ± 0.9	74.4	77.9	1.02 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-base --threads=13`	75.6 ± 1.0	73.2	78.4	1.02 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=13`	75.3 ± 1.1	73.2	77.8	1.01 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-base --threads=14`	74.7 ± 0.9	72.8	78.1	1.01 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=14`	74.9 ± 1.2	72.7	77.8	1.01 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-base --threads=15`	75.0 ± 1.7	71.9	78.3	1.01 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=15`	75.3 ± 1.1	73.5	79.2	1.01 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-base --threads=16`	75.0 ± 1.4	72.2	80.4	1.01 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=16`	75.5 ± 1.4	72.9	79.6	1.02 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=17`	74.8 ± 1.1	72.7	77.4	1.01 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=17`	76.0 ± 1.1	73.6	78.7	1.02 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-base --threads=18`	74.6 ± 1.1	72.4	77.0	1.00 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=18`	76.4 ± 1.5	73.5	80.1	1.03 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=19`	74.3 ± 1.4	71.5	78.7	1.00
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=19`	75.9 ± 1.2	73.9	79.1	1.02 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-base --threads=20`	75.3 ± 1.5	72.4	78.1	1.01 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=20`	75.8 ± 1.3	73.2	79.8	1.02 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=21`	75.1 ± 1.3	72.7	78.2	1.01 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=21`	77.0 ± 1.1	74.9	80.2	1.04 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-base --threads=22`	75.1 ± 1.1	73.2	77.4	1.01 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=22`	77.0 ± 1.5	74.4	79.9	1.04 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=23`	75.0 ± 1.3	72.3	78.8	1.01 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=23`	78.2 ± 1.2	75.7	81.2	1.05 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-base --threads=24`	75.7 ± 1.1	73.2	77.7	1.02 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=24`	78.9 ± 1.3	75.9	81.2	1.06 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=25`	76.1 ± 1.3	73.2	78.8	1.02 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=25`	78.3 ± 1.3	75.5	81.8	1.05 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=26`	76.1 ± 1.5	72.4	79.3	1.02 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=26`	79.6 ± 1.1	77.2	82.7	1.07 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-base --threads=27`	76.4 ± 1.4	73.6	78.9	1.03 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=27`	78.1 ± 1.3	76.1	82.8	1.05 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=28`	76.9 ± 1.3	73.0	79.4	1.03 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=28`	79.7 ± 1.6	77.5	84.4	1.07 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=29`	77.1 ± 1.3	75.0	79.9	1.04 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=29`	81.7 ± 1.1	79.4	84.7	1.10 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=30`	77.7 ± 1.2	75.7	80.6	1.05 ± 0.02
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=30`	83.1 ± 1.3	81.1	85.9	1.12 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=31`	77.6 ± 1.6	75.2	81.2	1.04 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=31`	81.5 ± 1.4	78.6	85.7	1.10 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-base --threads=32`	79.0 ± 1.9	75.9	83.6	1.06 ± 0.03
`./run-with ~/Projects/wild/target/release/wild-pwmu --threads=32`	84.5 ± 1.4	82.3	88.1	1.14 ± 0.03

mati865 · 2025-10-26T14:08:28Z

-                    }
-                });
-            }
+        let num_threads = rayon::current_num_threads();


Leaving it to the default number of threads seems to be a rather big hit beyond 20 threads.
wild-pwmu is this PR and wild-pwmu2 is this PR with a hack:

Suggested change

-        let num_threads = rayon::current_num_threads();
+        let num_threads = args.numeric_experiment(Experiment::MergeStringCapMultiplier, 80) as usize / 10;

https://gist.github.com/mati865/e90546d1e9d31b880db57a92e0de84ce

mati865 · 2025-10-26T14:33:41Z

Some more benchmarks using Clang binary. Using multiple groups per thread seems to hurt (at least for Clang) and the slowdown scales: https://gist.github.com/mati865/8a85880857049f7b4f9ad5ce2843d22d

davidlattimore · 2025-10-29T10:23:03Z

Thanks for running those benchmarks! I ended up getting rid of the code that was targeting a particular number of groups. We now just have a constant byte size after which we'll split to a new group. This and various other tweaks that I made gave some further improvements to some benchmarks. I mostly benchmarked clang with debug info, which with 32 threads now shows about a 20% improvement for me. Other benchmarks, especially with a smaller number of threads, showed smaller improvements or in some cases no real change.
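
The "constant byte size after which we'll split to a new group" policy can be sketched as follows. This is an illustration, not the merged code: `group_by_size` and `max_group_bytes` are hypothetical names, and the threshold wild actually uses is not stated in this thread.

```rust
/// Assigns section indices to groups, starting a new group once the
/// current one has accumulated at least `max_group_bytes` bytes.
fn group_by_size(section_sizes: &[usize], max_group_bytes: usize) -> Vec<Vec<usize>> {
    let mut groups = Vec::new();
    let mut current = Vec::new();
    let mut current_bytes = 0usize;
    for (index, &size) in section_sizes.iter().enumerate() {
        current.push(index);
        current_bytes += size;
        if current_bytes >= max_group_bytes {
            // Threshold reached: close this group and start a fresh one.
            groups.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
    }
    // Keep whatever is left over as a final, smaller group.
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}
```

Under this scheme the number of groups depends only on the input size and the threshold, not on the thread count, which is the difference from targeting a particular number of groups.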

mati865 · 2025-10-29T15:30:57Z

Impressive!
Using Clang without debuginfo it's not only faster than the base, but also doesn't suffer from the scaling issues: https://gist.github.com/mati865/9e2ea19483981ad4d976f0a9ab147326

davidlattimore force-pushed the push-pwmuvnpvxwyw branch from fe9d258 to 976c7eb Compare October 25, 2025 07:18

mati865 reviewed Oct 26, 2025

davidlattimore force-pushed the push-pwmuvnpvxwyw branch from 976c7eb to 04802c0 Compare October 29, 2025 10:18

davidlattimore merged commit ddf5b3c into main Oct 30, 2025
20 checks passed

davidlattimore deleted the push-pwmuvnpvxwyw branch October 30, 2025 23:12
