Skip to content

perf: Change release profile to use codegen-units=1#1500

Merged
davidlattimore merged 1 commit into
mainfrom
push-qourowxvuxzu
Feb 23, 2026
Merged

perf: Change release profile to use codegen-units=1#1500
davidlattimore merged 1 commit into
mainfrom
push-qourowxvuxzu

Conversation

@davidlattimore

@davidlattimore davidlattimore commented Jan 31, 2026

Copy link
Copy Markdown
Member

This gives more stable performance. i.e. it reduces the extent to which
small changes in code affect the performance of unrelated bits of code.

This also makes the release be more or less what we release, which is
useful if people `cargo install wild-linker.

Faster (to build) optimised builds are now available with the opt
profile.

@davidlattimore

Copy link
Copy Markdown
Member Author

I don't see large differences in either runtime performance or in build times with or without thin LTO. So this is more about consistency than anything else. Open to any thoughts people have as to whether this is a good idea or not.

@lapla-cogito lapla-cogito left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about also setting codegen-units to 1 (or lower than the default of 16)?

@davidlattimore

Copy link
Copy Markdown
Member Author

Based on some preliminary benchmarks, setting codegen-units=1 does have some benefit. I probably need to do some more benchmarking though. The setting does really hurt build times though. A warm release build goes from 16 seconds to 54 seconds. So if we decide to set that for our release builds, then I suspect we'll want a separate profile that's optimised, but not quite as much.

Another possibility is to try to identify which bits are making a difference. e.g. do some perf runs then use perf-diff to try to identify which functions benefited the most from the change. If we can identify that, then we might be able to get similar speedups with a sprinkling of inline annotations, although I guess those are slightly harder with third-party crates.

@davidlattimore

Copy link
Copy Markdown
Member Author

A second motivation for this change is, or at least was, that a lot of my previous benchmarking has been done by comparing release builds, but I had been thinking that I should really compare dist builds, since that's what we distribute. By making them basically the same, I figured I'd be less likely to accidentally benchmark a release without thin-lto.

However, today while I've been running various benchmarks of my linker-plugin PR, I've noticed some odd results. The oddest was that the linker-plugin PR with the feature disabled at build time was showing a 9% slowdown on the bevy-dylib benchmark relative to a build without the PR. However, with a slightly different build configuration (strip disabled), the same benchmark showed a 6% speedup. All these benchmarks were with thin-lto enabled. My theory is that thin-lto is pretty inconsistent with which functions it inlines and which it doesn't. So changes in one part of the codebase can have significant performance effects in unrelated parts of the codebase.

This is not a great property and makes me wonder if we should actually be doing the opposite of this change and turning off thin-lto for the dist profile. Basically I'm wondering if the pretty small performance boost that we maybe get from thin-lto is worth it given how unreliable it makes benchmarking.

To that end, I've just done a bunch of benchmark runs with various different build configurations. The results are here. I did a second run with exactly the same configurations. The results look pretty similar.

It looks like LTO (thin or fat) results in significant variation of timings. Combining strip=true with either fat or thin LTO affects the result. Regular release builds, builds with codegen-units=1 and stripped builds (with or without codegen-units=1) look pretty stable.

@mati865

mati865 commented Jan 31, 2026

Copy link
Copy Markdown
Member

Strip difference is confusing to me since it's done by the linker: https://github.com/rust-lang/rust/blob/1e9be1b77fe89d9757d6179973b2fc970c6e83b7/compiler/rustc_codegen_ssa/src/back/linker.rs#L744
So, I wouldn't expect any differences to originate from it if debuginfo is disabled.

@davidlattimore

Copy link
Copy Markdown
Member Author

I agree it's weird. I think something else must be being affected by the strip setting. I just did some more tests. I captured save-dirs for how the linker was invoked with and without strip=true then changed them so that they both only did --strip-debug not --strip-all. The result was two binaries that still had a significant performance difference. So it seems that setting strip=true is affecting more than just the flags passed to the linker.

@davidlattimore

Copy link
Copy Markdown
Member Author

It's possible it's just that different tracked flags cause a different SVH (strict version hash) which means the names of things passed to LLD are slightly different. Those differences in names might cause LLD to make different decisions. That's just a guess though.

@davidlattimore davidlattimore marked this pull request as draft February 8, 2026 04:07
@davidlattimore

Copy link
Copy Markdown
Member Author

I tracked down the main performance loss for the linker-plugin change (with plugins disabled)... it turned out to be a bit of code that got deleted. The code wasn't even being run by the benchmark since it was in an error path. I then tried repeating the benchmark with codegen-units=1 and the performance loss went away. I think the way rust merges codegen units, although presumably deterministic, is not at all stable in the presence of code changes. So a small code change could affect which initial codegen units get merged with which others resulting in significant changes in performance in parts of the code unrelated to where the code was changed. So I'm leaning more towards the suggestion of setting codegen-units=1 for release builds.

This gives more stable performance. i.e. it reduces the extent to which
small changes in code affect the performance of unrelated bits of code.

This also makes the release be more or less what we release, which is
useful if people `cargo install wild-linker.

Faster (to build) optimised builds are now available with the opt
profile.
@davidlattimore davidlattimore changed the title perf: Change release profile to use thin LTO perf: Change release profile to use codegen-units=1 Feb 13, 2026
@davidlattimore davidlattimore marked this pull request as ready for review February 13, 2026 10:55
Comment thread PACKAGING.md
Comment on lines -12 to -14
that enables compiler's internal ThinLTO and strips the binaries. The benefit from ThinLTO
is very mild in Wild's case, so it's up to you whether to use it. Musl releases also enable
`--feature mimalloc`, see below for the explanation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since ThinLTO isn't specified for any builds now, I think there's no need to remove the reason why it isn't enabled here (though I am curious about how much the benefits would change if FatLTO were used instead).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I don't remove the bit about ThinLTO, then this would be saying that the dist profile uses it, which isn't true anymore.

@davidlattimore davidlattimore merged commit f5b25f1 into main Feb 23, 2026
20 checks passed
@davidlattimore davidlattimore deleted the push-qourowxvuxzu branch February 23, 2026 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants