Skip to content

[Optimization] Zero padding in typed copies to help LLVM merge stores#157690

Open
ChuanqiXu9 wants to merge 1 commit into
rust-lang:mainfrom
ChuanqiXu9:PaddingZero
Open

[Optimization] Zero padding in typed copies to help LLVM merge stores#157690
ChuanqiXu9 wants to merge 1 commit into
rust-lang:mainfrom
ChuanqiXu9:PaddingZero

Conversation

@ChuanqiXu9

Copy link
Copy Markdown

Close #157373

For a repr(C) struct with inner padding like:

struct S { a: u16, b: u8, /* 1 byte pad */ c: u32 }

ptr::write generates suboptimal code (3 stores) while MaybeUninit::write generates optimal code (1 wide store).

Root cause: ptr::write lowers to a single alloca + memcpy. LLVM's SROA decomposes the memcpy into per-field stores but skips padding bytes (emitting store undef, which DSE removes), leaving a gap:

store i16 0, ptr %dest        ; a [0,2)
store i8  0, ptr %dest+2      ; b [2,3)
                               ;   [3,4) gap — no store
store i32 0, ptr %dest+4      ; c [4,8)

The backend's store merging cannot combine stores across this gap.

MaybeUninit::write avoids this because its deeper inlining chain produces multiple chained memcpys, which triggers SROA's integer promotion path (treating the whole alloca as i64) instead of per-field forwarding.

Fix: after each typed memcpy, emit explicit memset(0) for padding gaps. This fills the hole so SROA + store merging produce:

store i64 0, ptr %dest        ; single wide store

Constraints to avoid unnecessary overhead or miscompilation:

  • Only FieldsShape::Arbitrary (structs, not arrays/unions/primitives)
  • Only Variants::Single (not multi-variant enums, where gaps may hold other variants' data)
  • Only structs ≤ 16 bytes (larger structs cannot merge into a single wide store anyway)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 10, 2026
@rust-log-analyzer

This comment has been minimized.

Close rust-lang#157373

For a `repr(C)` struct with inner padding like:

    struct S { a: u16, b: u8, /* 1 byte pad */ c: u32 }

`ptr::write` generates suboptimal code (3 stores) while
`MaybeUninit::write` generates optimal code (1 wide store).

Root cause: `ptr::write` lowers to a single alloca + memcpy. LLVM's
SROA decomposes the memcpy into per-field stores but skips padding
bytes (emitting `store undef`, which DSE removes), leaving a gap:

    store i16 0, ptr %dest        ; a [0,2)
    store i8  0, ptr %dest+2      ; b [2,3)
                                   ;   [3,4) gap — no store
    store i32 0, ptr %dest+4      ; c [4,8)

The backend's store merging cannot combine stores across this gap.

`MaybeUninit::write` avoids this because its deeper inlining chain
produces multiple chained memcpys, which triggers SROA's integer
promotion path (treating the whole alloca as i64) instead of per-field
forwarding.

Fix: after each typed memcpy, emit explicit `memset(0)` for padding
gaps. This fills the hole so SROA + store merging produce:

    store i64 0, ptr %dest        ; single wide store

Constraints to avoid unnecessary overhead or miscompilation:
- Only `FieldsShape::Arbitrary` (structs, not arrays/unions/primitives)
- Only `Variants::Single` (not multi-variant enums, where gaps may
  hold other variants' data)
- Only structs ≤ 16 bytes (larger structs cannot merge into a single
  wide store anyway)
@rust-log-analyzer

Copy link
Copy Markdown
Collaborator

The job tidy failed! Check out the build log: (web) (plain enhanced) (plain)

Click to see the possible cause of the failure (guessed by this bot)
[TIMING:end] builder::Libdir { compiler: Compiler { stage: 0, host: x86_64-unknown-linux-gnu, forced_compiler: false }, target: x86_64-unknown-linux-gnu } -- 0.001
##[group]Building stage1 tidy (stage0 -> stage1, x86_64-unknown-linux-gnu)
error: process didn't exit successfully: `sccache /checkout/obj/build/bootstrap/debug/rustc -vV` (exit status: 2)
--- stderr
sccache: error: Timed out waiting for server startup. Maybe the remote service is unreachable?
Run with SCCACHE_LOG=debug SCCACHE_NO_DAEMON=1 to get more information

Bootstrap failed while executing `test src/tools/tidy tidyselftest --extra-checks=py,cpp,js,spellcheck`
Build completed unsuccessfully in 0:00:31
  local time: Wed Jun 10 05:46:29 UTC 2026
  network time: Wed, 10 Jun 2026 05:46:29 GMT

@ChuanqiXu9

Copy link
Copy Markdown
Author

r? rust-lang/codegen

@dianqk

dianqk commented Jun 12, 2026

Copy link
Copy Markdown
Member

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 12, 2026
@rust-bors

This comment has been minimized.

rust-bors Bot pushed a commit that referenced this pull request Jun 12, 2026
[Optimization] Zero padding in typed copies to help LLVM merge stores

@dianqk dianqk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should leave this to LLVM. Maybe we can emit store undef in SROA and let instcombine or memcpyopt consume it.

View changes since this review

// CHECK-LABEL: @via_ptr_write(
#[no_mangle]
pub fn via_ptr_write(dest: &mut MaybeUninit<InnerPadded>) {
let val = InnerPadded { a: 0, b: 0, c: 0 };

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let val = InnerPadded { a: 0, b: 0, c: 0 };
let val = InnerPadded { a: 0, b: 1, c: 0 };

Could the new test case be merged into one store with your PR?

@rust-bors

rust-bors Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

☀️ Try build successful (CI)
Build commit: 68b77d3 (68b77d32f1786d41b84fa6dfba36f5c3809b1f9f, parent: b30f3df3ba3c4c9de2f58f1a75dd9500b79b3f8d)

@rust-timer

This comment has been minimized.

@rust-timer

Copy link
Copy Markdown
Collaborator

Finished benchmarking commit (68b77d3): comparison URL.

Overall result: ❌ regressions - please read:

Benchmarking means the PR may be perf-sensitive. It's automatically marked not fit for rolling up. Overriding is possible but disadvised: it risks changing compiler perf.

Next, please: If you can, justify the regressions found in this try perf run in writing along with @rustbot label: +perf-regression-triaged. If not, fix the regressions and do another perf run. Neutral or positive results will clear the label automatically.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

mean range count
Regressions ❌
(primary)
0.4% [0.2%, 0.6%] 16
Regressions ❌
(secondary)
0.4% [0.2%, 0.6%] 10
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 0.4% [0.2%, 0.6%] 16

Max RSS (memory usage)

Results (secondary 2.3%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
2.3% [2.3%, 2.3%] 1
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) - - 0

Cycles

This perf run didn't have relevant results for this metric.

Binary size

Results (primary 0.1%, secondary 0.3%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

mean range count
Regressions ❌
(primary)
0.1% [0.0%, 0.2%] 22
Regressions ❌
(secondary)
0.3% [0.0%, 0.4%] 8
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 0.1% [0.0%, 0.2%] 22

Bootstrap: 518.98s -> 524.616s (1.09%)
Artifact size: 400.92 MiB -> 400.88 MiB (-0.01%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

missed optimization when writing a value with inner padding.

5 participants