Redesigned Components iterator to use front and back indexing instead mutating and subslicing path field#156496
Redesigned Components iterator to use front and back indexing instead mutating and subslicing path field#156496asder8215 wants to merge 8 commits into
Components iterator to use front and back indexing instead mutating and subslicing path field#156496Conversation
|
rustbot has assigned @Mark-Simulacrum. Use Why was this reviewer chosen?The reviewer was selected based on:
|
This comment has been minimized.
This comment has been minimized.
1627e2f to
33e69e1
Compare
|
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers. |
This comment has been minimized.
This comment has been minimized.
33e69e1 to
ed9d33d
Compare
This comment has been minimized.
This comment has been minimized.
ed9d33d to
0b0f84c
Compare
This comment has been minimized.
This comment has been minimized.
… of mutating and subslicing path field; as a result, Components iterator memory size goes from 64 bytes to 40 bytes and as_path does not use cloning at all
0b0f84c to
8ed33ea
Compare
This comment was marked as outdated.
This comment was marked as outdated.
2151b8f to
83cdbed
Compare
This comment was marked as outdated.
This comment was marked as outdated.
…ity, added safety comments, and check for root dir after Prefix component (e.g., '\\?\checkout\src\tools' should produce Prefix, RootDir, Normal, Normal, None, ...) in Components::parse_single_component
83cdbed to
3921fff
Compare
…ng iter().position()/rposition()
…here to use iter().position()/.iter().rposition(), refactored code in compare_components, and removed stale comments
0a25dda to
92e0132
Compare
|
New benchmarking results. You can see what the benchmark code looks like here and run it yourself to see if there are any difference in measurements on your end: This is the measurement of the current implementation of This is the measurement of the new implementation of Edit: Updated |
|
Here are the benchmark results with black box: From current From this Edit: Updated Edit 2: Took off Path ordering benchmark here since it was incorrect see below to see corrected path ordering benchmarks. |
92e0132 to
574d7f2
Compare
|
Yeah, I suspect that probably the best solution would be to do something similar to what Python does and offer some sort of I'll take any wins if the code ends up working better. |
|
I realized my benchmarking for path ordering is incorrect; I thought the |
|
New benchmarks for path ordering comparisons (BB abbrev for Black Box): Performance is nearly the same using what the current implementation of Edit: Had to correct the fast path |
f119f48 to
cb82f61
Compare
…e in previous implementation, but making it work with Components<'_> front index
cb82f61 to
1a25002
Compare
| // causes this function to run slower than using a variable that stores | ||
| // the `left.back` and `right.back` information (which `back` field | ||
| // encodes the length of the `Components<'_>` unconsumed path) | ||
| let left_back = left.back; |
There was a problem hiding this comment.
@clarfonthey I had to make an update to the None matching condition code here because the length that needed to be compared between the left and right Components<'_> are actually the back field not path.len() (could possibly make a mistake in the fast path on an existing Components<'_> that used Components::next_back). I updated the benchmarking code to reflect this change as well
However, I noticed a strange thing while benchmarking in that if I do:
None if left.back == right.back => { ... },
None => left.back.min(right.back),This runs two times slower than me storing left.back and right.back in separate variables and using that in the default None condition. Alternatively, I use left_back and right_back variables in both None matching conditions, it also causes a 2x performance degradation. Does this performance degradation occur on your end if you use left.back and right.back in both None match (or left_back and right_back)? If so, do you happen to know why this occurs?
I have the godbolt link here, but I couldn't figure out what changed.
There was a problem hiding this comment.
So this appears way more pronounced if you look at the generated MIR (post-opt):
For the "faster" case:
scope 1 {
debug left_back => _7;
let _8: usize;
scope 2 {
debug right_back => _8;
let _9: usize;
let _24: usize;
scope 3 {
debug first_difference => _9;
scope 5 {
debug previous_sep => _32;
let _32: usize;
}
}
scope 4 {
debug diff => _24;
}
}
}For the "slower" case:
scope 1 {
debug first_difference => _17;
scope 3 {
debug previous_sep => _22;
let _22: usize;
scope 35 (inlined #[track_caller] core::slice::index::<impl Index<RangeTo<usize>> for [u8]>::index) {
debug self => _31;
debug ((index: RangeTo<usize>).0: usize) => _17;
scope 36 (inlined #[track_caller] <RangeTo<usize> as SliceIndex<[u8]>>::index) {
debug ((self: RangeTo<usize>).0: usize) => _17;
debug slice => _31;
scope 37 (inlined #[track_caller] <std::ops::Range<usize> as SliceIndex<[u8]>>::index) {
debug ((self: std::ops::Range<usize>).0: usize) => const 0_usize;
debug ((self: std::ops::Range<usize>).1: usize) => _17;
debug slice => _31;
debug new_len => _17;
let mut _62: bool;
let mut _63: usize;
let _64: *const [u8];
let mut _65: *const [u8];
let mut _66: !;
let mut _67: usize;
scope 38 (inlined core::num::<impl usize>::checked_sub) {
debug self => _17;
debug rhs => const 0_usize;
let mut _68: bool;
}
scope 39 (inlined core::slice::index::get_offset_len_noubcheck::<u8>) {
debug ptr => _31;
debug offset => const 0_usize;
debug len => _17;
let mut _69: *const u8;
scope 40 {
scope 41 {
}
}
}
}
}
}
scope 42 (inlined core::slice::<impl [u8]>::iter) {
debug self => _64;
scope 43 (inlined std::slice::Iter::<'_, u8>::new) {
debug slice => _64;
let mut _71: std::ptr::NonNull<[u8]>;
let mut _73: *mut u8;
let mut _74: *mut u8;
scope 44 {
debug len => _17;
let _70: std::ptr::NonNull<u8>;
scope 45 {
debug ptr => _70;
let _72: *const u8;
scope 46 {
debug end_or_len => _72;
}
scope 50 (inlined std::ptr::without_provenance::<u8>) {
debug addr => _17;
scope 51 (inlined without_provenance_mut::<u8>) {
}
}
scope 52 (inlined NonNull::<u8>::as_ptr) {
debug self => _70;
}
scope 53 (inlined #[track_caller] std::ptr::mut_ptr::<impl *mut u8>::add) {
debug self => _74;
debug count => _17;
}
}
scope 47 (inlined NonNull::<[u8]>::from_ref) {
debug r => _64;
let mut _75: *const [u8];
}
scope 48 (inlined NonNull::<[u8]>::cast::<u8>) {
debug self => _71;
let mut _76: *mut u8;
let mut _77: *mut [u8];
scope 49 (inlined NonNull::<[u8]>::as_ptr) {
}
}
}
}
}
}
}I have a feeling that this might have something to do with how match guards are generated, although this is genuinely very weird. Will bring up to some compiler folks on Zulip and see if they have any insights.
There was a problem hiding this comment.
Also, the Zulip thread in case you want to participate: #t-compiler/performance > Bindings change dramatically affecting generated MIR
For now, I would say to obviously go with whichever one gets better optimised, but it would be really interesting to figure out why this is being compiled so differently.
There was a problem hiding this comment.
The slower case looks so horrendous; it really is strange to see how reusing the Components<'_> iterator field leads to this mess of generated code when it's just due to how you extract the back indices of each Components<'_> (via variables vs from the struct directly). I'm curious what goes on in match guard code generation.
I'll definitely keep my eyes on the Zulip thread, appreciate you linking it here!
|
Might as well: r? @clarfonthey For now since I have more or less agreed to review this. Will hand over to someone else if there are any additional things that need to be resolved that I can't/shouldn't handle. |
| fn prefix_verbatim(&self) -> bool { | ||
| if !HAS_PREFIXES { | ||
| return false; | ||
| fn consume_first_component(&mut self, dir_front: bool) -> Option<Component<'a>> { |
There was a problem hiding this comment.
So, specifically since the first component appears to already be eagerly evaluated, I do wonder if this method is really necessary or if we should simply make the first component store Option<Component<'a>> directly. It does feel like a bit of extra work that could be cut out for simplicity, but I might be misreading.
There was a problem hiding this comment.
Note that my general opinion on eager evaluation for this kind of iterator is that if eager evaluation dramatically simplifies a majority of the cases that use this code, we can afford a little bit of eager evaluation as a treat, even if there are a few cases where something might be done that is ultimately discarded later.
There was a problem hiding this comment.
We could also evaluate this function inside the Components::next and Components::next_back cases (it'll also get rid of the dir_front argument). I think I just did it earlier to write it somewhat of shared code in a neater way (although, they are not exactly shared since certain match conditions have different effects whether dir_front is true of false).
I'll change this and put the the code directly inside Components::next and Components::next_back and benchmark again to see if that affects anything.
There was a problem hiding this comment.
With storing an Option<Component<'a>>, my concern was on increasing the memory size of Components<'_> and whether that would be worth it. I'm pretty sure one of the Component enum member takes in a Prefix, which that enum takes up 40 bytes. I also wanted to reduce the size of Components<'_> with this front and back index approach since I know cloning occurs in Components<'_> comparison (for equality or ordering).
I think that's one of the benefits I was trying to make with this front and back index approach. That this approach compresses the size penalty we take with storing Prefix enum into Components<'_> (it was like why use a Prefix enum that takes up 40 bytes, when we could use a usize that serves as an index marker on where our Prefix length ends?).
The other thing is that while FirstComponent::Absolute and FirstComponent::Prefix is evaluated already via has_root or parse_prefix, the relative path first component is not eagerly evaluated. I wasn't too worried about FirstComponent::Absolute, though it does suck to see FirstComponent::Prefix get evaluated again (especially if you have PrefixVerbatim and it's a pretty big component).
I'm okay with storing a Option<Component<'a>> here, but do we find the increase on Components<'_> iterator size acceptable?
There was a problem hiding this comment.
I actually hadn't fully taken in how big the Prefix enum is, but… yeah.
That said, given the complexity of the operation, I still think that it might be better to still do everything ahead of time, minus maybe parsing a Prefix. This basically means that I think it might be better to effectively convert the iterator into a monomorphised version of Chain<MaybePrefix, Back>, where even if MaybePrefix does some extra parsing at runtime to fully expand the prefix, the iterator starts out in a form where you have definitively separated out the prefix and don't need any special casing besides either (doing whatever logic is required to output the prefix) or (doing whatever logic is required for everything else). Right now, because the front index is doing double-duty for both keeping track of the location of the prefix, and keeping track of the iteration position in the rest of the path, it might be better to instead duplicate the extra index if needed in the Option being stored just to simplify the logic.
Also, as far as size goes… unless it actively messes with codegen, I would say we should basically be assuming in all cases that people will be storing a Path, not a Components iterator, at least modulo any iterator adapters. If we still end up in the case where the iteration can't be inlined, it makes sense to optimise for this size, but hopefully these changes fix that.
There was a problem hiding this comment.
That said, given the complexity of the operation, I still think that it might be better to still do everything ahead of time, minus maybe parsing a
Prefix.
I agree with you there.
This basically means that I think it might be better to effectively convert the iterator into a monomorphised version of
Chain<MaybePrefix, Back>, where even ifMaybePrefixdoes some extra parsing at runtime to fully expand the prefix, the iterator starts out in a form where you have definitively separated out the prefix and don't need any special casing besides either.
I feel like I might need a bit of clarity on what you mean here.
However, I was thinking about something a bit simpler. I was thinking back to your point on storing an Option<Component<'a> into Components<'_>, and instead of an Option<Component<'a>, I was thinking we can add the PrefixComponent<'a> from Component<'a> as field of FirstComponent::Prefix. What's eagerly evaluated when creating a Components<'_> iterator is Prefix and whether we have an absolute path or not (which the latter should be trivial to compute, and in both cases, this would be trivial to compute on unix platforms since Prefix doesn't exist). The first component of a relative path is not eagerly evaluated and I don't think we need to do that if it's not necessary.
The benefit of just adding PrefixComponent<'a> into FirstComponent::Prefix instead of using Option<Component<'a>> is just that the size of the FirstComponent enum will be smaller as it doesn't add the Normal(&'a OsStr) enum member (and, a lesser issue, all the other enum members) that Component<'a> has (note: size argument isn't true, it's just less convoluted). This would mitigate the issue of re-parsing the Prefix occurring in this function (and elsewhere) as we can just take and move it out from Option<FirstComponent::Prefix> into Component::Prefix.
Right now, because the front index is doing double-duty for both keeping track of the location of the prefix, and keeping track of the iteration position in the rest of the path, it might be better to instead duplicate the extra index if needed in the
Optionbeing stored just to simplify the logic.
I think the front index is fine doing double-duty. By my previous suggestion, after parsing the Prefix and storing it inside FirstComponent::Prefix, we can still have the front index start at the length of the Prefix for the next component(s) it needs to parse.
|
@clarfonthey I tried out incorporating the I didn't see much of difference between the MIR code without storing With Without Does this have to do with how the size of the Godbolt links attached: Also note: I'm running on Fedora Linux. I have not benchmarked the code on Windows. |
|
@clarfonthey any note on the performance discrepancy with using |
|
@clarfonthey I was thinking about this for a while, but do you think it'd be possible to add in an internal unsafe I think having that function would allow me to insert in one more enum PrefixTag {
Verbatim,
VerbatimUNC,
VerbatimDisk,
DeviceNS,
UNC,
Disk
}
enum FirstComponent {
AbsolutePath,
RelativePath,
Prefix(PrefixTag),
}
pub struct Components<'a> {
path: &'a [u8],
// Probably a better name for this field, but this is only
// used for Prefix matching cases to construct a Prefix
// without the parsing process (only used if our Prefix
// is VerbatimUNC/UNC)
first_ind: usize,
front: usize,
back: usize,
has_physical_root: bool,
first_comp: Option<FirstComponent>,
} |
|
So, just adding some follow-ups from the libs meeting + my general thoughts on this. Been meaning to get to this review with a bit of a finer comb but haven't yet. In general, the team is in favour of improving I personally think there are lots of potential wins here that can be done without any performance regressions. So, I think probably a hard line on accepting this PR is to ensure that there is a unilateral improvement across the board, which might limit what you're able to do from an API perspective. (side note: the perf runs are testing windows too, right?) Additionally, any local testing or benchmarking you've managed to do are extremely good to document here, and can be at least brought to the table for future ACPs/perf improvements. Folks on the libs team were particularly interested in cargo runtime benchmarks, since we know cargo does a lot of operations on paths and can build up a hefty amount of overhead, and right now I believe that perf is mostly measuring crate performance, not the performance of cargo itself. Also got some additional clarification from @the8472 that we shouldn't be running any branches for Windows prefixes on non-Windows platforms, but that I don't actually think this is the case. Going to poke around the docs later to see if we formally guarantee this, since I think we should, but don't actually think we do. |
I'm down to create an ACP for this. It may take me some time to write up a clean ACP draft, but I think the general idea from this is to switch strategy from mutating/subslicing path field directly to using an indexing strategy to grab a subslice of path.
I'll see what else I can do to get rid of perf regressions particularly on impl<'a> DoubleEndedIterator for Components<'a> {
fn next_back(&mut self) -> Option<Component<'a>> {
// We reach here when we no longer have anymore paths
// to consume, or we need to output Prefix component
// (anything else falls through this conditional)
if self.back <= self.front {
return self.consume_first_component(false);
}
self.parse_next_back_component()
}
}
impl<'a> Iterator for Components<'a> {
type Item = Component<'a>;
fn next(&mut self) -> Option<Component<'a>> {
// We reach this case when we no longer have anymore paths
// to consume (return `None`), or if our front idx was initially
// equal to back idx (e.g. if we had `C:`, `.`, `/`), or if we
// had a front component initially
if self.front >= self.back || self.first_comp.is_some() {
return self.consume_first_component(true);
}
self.parse_next_component()
}
}There's an extra check in In an earlier comment, I mentioned that I ran the benchmarks on Fedora Linux. I could try to run the same benchmark tests on a Windows VM; however, I haven't tested how it works with paths with prefixes. I could also logically copy what
Noted. I could write that down in the ACP for sure.
I can try to run cargo runtime benchmarks, but how do I go about doing that? Are there benchmark tests within the rust repo that I can run, or should I be cloning cargo and running benchmarks there?
I agree with @the8472, and I've been thinking about whether cfg gating the |
|
So to be clear, we genuinely have no benchmarks that just run cargo, so, this is mostly just a "this is what we've been thinking about" thing. If you want to try building cargo and running it on a large project before/after your change as a sniff test that could be maybe useful, but I'm expecting that to not really be very fruitful. Not sure how rustc-timing works but maybe that helps. Also to be clear on the ACP thing: specifically, ACPs are for API changes, so the idea would be to potentially propose some of the internal types you've been using like " I want to propose |
Got you. I'm pretty unfamiliar with accurately benchmarking a whole build, but I'll see what I can do. On an adjacent note, I've been thinking about the performance degradation thing on
That sounds good to me; in that case, I'll leave those ACP proposals to you because I think this PR isn't really introducing any new API changes and just modifying the implementation of an existing API. |
|
It's entirely possible that your latest version of the code got rid of the regressions, and based upon your work, I'm pretty hopeful it'll be possible to do the refactor without any regressions, although I do want to take a deeper look at the code and compare to the original before requesting another perf run. I also was going to take a look at some of the documented guarantees on Windows paths being parsed on Linux before I can fully clarify the status on the cfg'd code. My understanding is that There is technically the chance for a brute-force approach, where you create a separate version of |
|
All good, take your time in looking at this and finding out what else could be done here. I pushed the changes I made from components_redesign with separating |
…nt_front and consume_first_component_back. Also introduced aggressive inlining.
8585ad0 to
8111c2c
Compare
…comparison (normalizing paths if needed) and return Ordering Equal/Greater/Less if possible before needing to fall back on Iterator::cmp
|
Pushed another commit to optimize Benchmarks as follow: This PR implementation: The current std implementation: Do note that running |
|
@clarfonthey I tried building cargo via The thing that concerns me is that when I use my native cargo binary from my nightly rust toolchain, it builds tokio in 6-7s. Is there a certain optimized config to build cargo that's not made in --stage 2 builds? |
|
The idea is to mostly treat |
|
I built the library components with release flag turned on, and it decreased the cargo build time speed on tokio from 17-18s -> 10-11s. However, I'm also not sure if this perfectly replicates the optimizations that my Are there any differences to optimization settings between rustup cargo binaries and cargo binary built from the rust repo? If not, then I think a lot of concern might be towards |
|
I'll build the rust repo main branch (without my |
|
@clarfonthey Yeah, I feel like I'm missing something. I just built the main branch of the rust repo on stage 2 with release profile (same thing with cargo) and it gives me the same/similar 10-11s timing results with cargo build with that toolchain. |
View all comments
This PR entirely changes how
Components<'_>is implemented. Currently, theComponents<'_>iterator 'consumes' components through mutating its path field to a subslice that presents the left over unconsumed path components (this consumed path component is what's returned inComponents::nextorComponents::next_back). However, this PR keeps the path field alive/unmodified and uses front and back indexing strategy to extract consumed/unconsumed components.This PR benefits implementations like
Components::as_path, which is pretty used is multiple areas of the standard library. Previously,Components<'_>iterator was required to clone inside the function to present the unconsumed path because our originalComponent<'_>consuming behavior on path will not allow the returned&'a PathfromComponents::as_pathto last after aComponents::nextorComponents::next_backcall. Due to the current implementation ofComponentsiterator has a size of 64 bytes, if you're usingComponents::as_pathafter eachComponents::next/Components::next_back, then it's pretty unfortunate to be cloning 64 bytes again and again, especially if each of your path components are a few bytes (e.g., "foo/bar/baz").On the point of size, with the indexing strategy, this PR has further optimized the size of
Components<'_>from 64 bytes -> 40 bytes since a large chunk of theComponents<'_>was taken up by theOption<Prefix>(this takes up 40 bytes). Instead of holding a prefix field inComponents<'_>, we can encode the length of thePrefixwithin ourfrontfield index and use another enum calledFirstComponentto check whether our first component of the given path isPrefix(or something else). If it's aPrefix, we can useparse_prefixon the subsliceself.path[..self.front]since we know our front index encodes thePrefixlength.Due to not having the prefix
Option<Prefix>field insideComponents<'_>anymore, all the prefix functions inComponents<'_>have been removed in favor of callingparse_prefix,Prefix::is_verbatim,Prefix::is_drive, etc.I'm curious if this redesign of
Components<'_>improves Path equality as pointed out by @clarfonthey in #154521 with Path equality (not to be confused with Path ordering as mentioned in the issue, since that usesComponents:::compare_componentsand the example code shows equality) being slow.I haven't benchmarked this though.I have benchmarked the result and I can say that currently this implementation improves Path equality due toComponents::next_backrunning faster with this implementation than the current mutating path with a subslice implementation. However, Path ordering runs slightly slower. You can check the benchmark code I used here, and play around with the number of bytes in a component, the number of components, etc..Right now, when I tested it locally on my PC (Fedora OS), it passed all the standard library tests and rust analyzer didn't crash on me (had a few crash reports coming from rust analyzer early on when I messed around with
Components<'_>dealing something with threads usingPath::components, but now that's all resolved).I have not tested this on Windows yet, and I would probably need someone to help me test on this platform as my Windows VM is not working properly to run the standard library test suite.There's a lot of things being done here, and possibly there may be better approaches or ways I could improve this implementation or write the code in a neater way here. I am open to any advice or feedback on this approach.
Update: I got to testing some things out with Prefixes on my Windows VM manually, so the prefix component index encoded into the
Components<'_>front field seems to work out nicely. I've also accounted for root directory being able to exists after a Prefix component like "\?\checkout\src\tools" having the following components:PrefixVerbatim->RootDir->Normal->Normal->None(learnt this from the fail that occurred in miri tests, which is nice to see thisComponents<'_>implementation works on the Windows tests in CI).