perf: Change release profile to use codegen-units=1#1500
Conversation
|
I don't see large differences in either runtime performance or in build times with or without thin LTO. So this is more about consistency than anything else. Open to any thoughts people have as to whether this is a good idea or not. |
lapla-cogito
left a comment
There was a problem hiding this comment.
How about also setting codegen-units to 1 (or lower than the default of 16)?
|
Based on some preliminary benchmarks, setting Another possibility is to try to identify which bits are making a difference. e.g. do some perf runs then use perf-diff to try to identify which functions benefited the most from the change. If we can identify that, then we might be able to get similar speedups with a sprinkling of |
|
A second motivation for this change is, or at least was, that a lot of my previous benchmarking has been done by comparing release builds, but I had been thinking that I should really compare dist builds, since that's what we distribute. By making them basically the same, I figured I'd be less likely to accidentally benchmark a release without thin-lto. However, today while I've been running various benchmarks of my linker-plugin PR, I've noticed some odd results. The oddest was that the linker-plugin PR with the feature disabled at build time was showing a 9% slowdown on the bevy-dylib benchmark relative to a build without the PR. However, with a slightly different build configuration (strip disabled), the same benchmark showed a 6% speedup. All these benchmarks were with thin-lto enabled. My theory is that thin-lto is pretty inconsistent with which functions it inlines and which it doesn't. So changes in one part of the codebase can have significant performance effects in unrelated parts of the codebase. This is not a great property and makes me wonder if we should actually be doing the opposite of this change and turning off thin-lto for the dist profile. Basically I'm wondering if the pretty small performance boost that we maybe get from thin-lto is worth it given how unreliable it makes benchmarking. To that end, I've just done a bunch of benchmark runs with various different build configurations. The results are here. I did a second run with exactly the same configurations. The results look pretty similar. It looks like LTO (thin or fat) results in significant variation of timings. Combining |
|
Strip difference is confusing to me since it's done by the linker: https://github.com/rust-lang/rust/blob/1e9be1b77fe89d9757d6179973b2fc970c6e83b7/compiler/rustc_codegen_ssa/src/back/linker.rs#L744 |
|
I agree it's weird. I think something else must be being affected by the strip setting. I just did some more tests. I captured save-dirs for how the linker was invoked with and without |
|
It's possible it's just that different tracked flags cause a different SVH (strict version hash) which means the names of things passed to LLD are slightly different. Those differences in names might cause LLD to make different decisions. That's just a guess though. |
|
I tracked down the main performance loss for the linker-plugin change (with plugins disabled)... it turned out to be a bit of code that got deleted. The code wasn't even being run by the benchmark since it was in an error path. I then tried repeating the benchmark with codegen-units=1 and the performance loss went away. I think the way rust merges codegen units, although presumably deterministic, is not at all stable in the presence of code changes. So a small code change could affect which initial codegen units get merged with which others resulting in significant changes in performance in parts of the code unrelated to where the code was changed. So I'm leaning more towards the suggestion of setting codegen-units=1 for release builds. |
This gives more stable performance. i.e. it reduces the extent to which small changes in code affect the performance of unrelated bits of code. This also makes the release be more or less what we release, which is useful if people `cargo install wild-linker. Faster (to build) optimised builds are now available with the opt profile.
6ce73e6 to
3e3aa79
Compare
| that enables compiler's internal ThinLTO and strips the binaries. The benefit from ThinLTO | ||
| is very mild in Wild's case, so it's up to you whether to use it. Musl releases also enable | ||
| `--feature mimalloc`, see below for the explanation. |
There was a problem hiding this comment.
Since ThinLTO isn't specified for any builds now, I think there's no need to remove the reason why it isn't enabled here (though I am curious about how much the benefits would change if FatLTO were used instead).
There was a problem hiding this comment.
If I don't remove the bit about ThinLTO, then this would be saying that the dist profile uses it, which isn't true anymore.
This gives more stable performance. i.e. it reduces the extent to which
small changes in code affect the performance of unrelated bits of code.
This also makes the release be more or less what we release, which is
useful if people `cargo install wild-linker.
Faster (to build) optimised builds are now available with the opt
profile.