Commit ad5256b

docs: update rebalance.md (add cleanup mode; examples)
Signed-off-by: Alex Aizman <alex.aizman@gmail.com>
1 parent 0f7d73c commit ad5256b

3 files changed

Lines changed: 211 additions & 43 deletions

docs/rebalance.md

Lines changed: 151 additions & 32 deletions
Original file line number | Diff line number | Diff line change
@@ -8,25 +8,26 @@ Global rebalance is one of the core AIS mechanisms that makes it possible to gro
88

99
**Table of Contents**
1010

11-
- [Placement: HRW and the three inputs](#placement-hrw-and-the-three-inputs)
12-
- [What triggers global rebalance](#what-triggers-global-rebalance)
13-
- [How global rebalance works](#how-global-rebalance-works)
14-
- [Serving reads during migration](#serving-reads-during-migration)
15-
- [Control and monitoring](#control-and-monitoring)
16-
- [CLI: usage examples](#cli-usage-examples)
17-
- [Starting rebalance administratively](#starting-rebalance-administratively)
18-
- [Rebalance vs. resilver](#rebalance-vs-resilver)
19-
- [Performance considerations](#performance-considerations)
11+
* [Placement: HRW and the three inputs](#placement-hrw-and-the-three-inputs)
12+
* [What triggers global rebalance](#what-triggers-global-rebalance)
13+
* [How global rebalance works](#how-global-rebalance-works)
14+
* [Serving reads during migration](#serving-reads-during-migration)
15+
* [Control and monitoring](#control-and-monitoring)
16+
* [CLI: usage examples](#cli-usage-examples)
17+
* [Starting rebalance administratively](#starting-rebalance-administratively)
18+
* [Cleanup mode](#cleanup-mode)
19+
* [Rebalance vs. resilver](#rebalance-vs-resilver)
20+
* [Performance considerations](#performance-considerations)
2021

2122
## Placement: HRW and the three inputs
2223

2324
AIStore uses a variant of **highest random weight (HRW)**, also known as rendezvous hashing, to determine object placement.
2425

2526
At the cluster level, the destination target for an object is uniquely determined by the following three inputs:
2627

27-
- the current **cluster map**
28-
- the fully qualified **bucket**, including provider and namespace
29-
- the **object name**
28+
* the current **cluster map**
29+
* the fully qualified **bucket**, including provider and namespace
30+
* the **object name**
3031

3132
This is the key to understanding rebalance.
3233

@@ -40,15 +41,15 @@ Global rebalance is triggered by changes that affect cluster-wide target placeme
4041

4142
Typical examples include:
4243

43-
- a storage target joining the cluster
44-
- a storage target leaving the cluster
45-
- putting a target into maintenance
46-
- decommissioning a target
47-
- bringing a target back into active service
44+
* a storage target joining the cluster
45+
* a storage target leaving the cluster
46+
* putting a target into maintenance
47+
* decommissioning a target
48+
* bringing a target back into active service
4849

4950
A useful rule of thumb is:
5051

51-
> if a topology change can alter the HRW destination for stored objects, it can trigger global rebalance.
52+
> if a topology change can alter the destination for stored objects, it can trigger global rebalance.
5253
5354
When a single target is added to or removed from a cluster of `N` targets, the fraction of objects that move is typically on the order of `1/N`, though the exact amount always depends on the topology and the current object distribution.
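
The `1/N` rule of thumb can be checked with a small rendezvous-hashing sketch. This is an illustration only: the `hrw_owner` and `moved_fraction` helpers, the BLAKE2b-based weight, and the target and bucket names are assumptions made for the example, not the hash or identifiers AIS actually uses.

```python
import hashlib


def hrw_owner(targets, bucket, obj_name):
    """Return the target with the highest hash weight for this object."""
    def weight(target):
        # Hypothetical weight function: hash of (target, bucket, object name).
        key = f"{target}|{bucket}|{obj_name}".encode()
        return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
    return max(targets, key=weight)


def moved_fraction(old_targets, new_targets, bucket, names):
    """Fraction of objects whose owner differs between two cluster maps."""
    moved = sum(
        hrw_owner(old_targets, bucket, n) != hrw_owner(new_targets, bucket, n)
        for n in names
    )
    return moved / len(names)


names = [f"obj-{i}" for i in range(10_000)]
old = ["t1", "t2", "t3", "t4"]
new = old + ["t5"]  # one target joins the cluster
frac = moved_fraction(old, new, "ais://demo", names)
print(f"fraction moved: {frac:.3f}")
```

With four targets growing to five, the printed fraction should land near one fifth, and every object that moves should move to the newly added target, since the relative weights of the existing targets do not change.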
5455

@@ -67,12 +68,14 @@ At a high level:
6768
5. If the local target is no longer the correct owner, the object is sent directly to the proper target.
6869
6. The process completes when all participating targets finish migrating the objects that no longer belong locally.
6970

71+
The steps above describe regular, data-moving rebalance. Cleanup mode, described below, reuses the rebalance lifecycle but does not migrate object payloads.
72+
7073
A few points are worth emphasizing:
7174

72-
- there is no central data-movement coordinator that decides ownership object by object
73-
- each target independently evaluates the objects it currently stores
74-
- object migration is performed via AIS intra-cluster transfers
75-
- when provisioned, rebalance traffic can use separate intra-cluster networking
75+
* there is no central data-movement coordinator that decides ownership object by object
76+
* each target independently evaluates the objects it currently stores
77+
* object migration is performed via AIS intra-cluster transfers
78+
* when provisioned, rebalance traffic can use separate intra-cluster networking
7679

7780
This design keeps rebalancing scalable and avoids turning the primary into a data path bottleneck.
7881

@@ -86,24 +89,24 @@ In particular, the target that must own an object according to the **new** clust
8689

8790
As a result:
8891

89-
- applications do not need to stop I/O while rebalance is running
90-
- object movement remains transparent to clients
91-
- the cluster can continue converging toward the new placement while still serving reads
92+
* applications do not need to stop I/O while rebalance is running
93+
* object movement remains transparent to clients
94+
* the cluster can continue converging toward the new placement while still serving reads
9295

9396
## Control and monitoring
9497

9598
Global rebalance is controlled and monitored via:
9699

97-
- native HTTP-based APIs
98-
- the [Go API](https://github.com/NVIDIA/aistore/tree/main/api)
99-
- the [Python SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore/sdk)
100-
- the [Command Line Interface (CLI)](/docs/cli.md)
100+
* native HTTP-based APIs
101+
* the [Go API](https://github.com/NVIDIA/aistore/tree/main/api)
102+
* the [Python SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore/sdk)
103+
* the [Command Line Interface (CLI)](/docs/cli.md)
101104

102105
Operationally, administrators typically care about three things:
103106

104-
- whether automated rebalance is enabled
105-
- whether a rebalance is currently running
106-
- how much data each target has sent and received for the current rebalance
107+
* whether automated rebalance is enabled
108+
* whether a rebalance is currently running
109+
* how much data each target has sent and received, or removed in cleanup mode
107110

108111
Like other long-running AIS activities, rebalance is tracked as a cluster job and can be inspected while in progress.
109112

@@ -226,6 +229,11 @@ USAGE:
226229
ais start rebalance [BUCKET[/PREFIX]] [command options]
227230

228231
OPTIONS:
232+
cleanup Remove local copies of misplaced objects - monolithic and chunked (non-EC);
233+
fails if rebalance is running; incompatible with '--latest' and '--sync'
234+
force,f With '--cleanup': also remove local misplaced copies that fail the safe identity check against copies
235+
at their expected locations; will not run concurrently with active rebalance/resilver
236+
(caution: advanced usage only)
229237
latest Check in-cluster metadata and, possibly, GET, download, prefetch, or otherwise copy the latest object version
230238
from the associated remote bucket;
231239
the option provides operation-level control over object versioning (and version synchronization)
@@ -253,6 +261,115 @@ OPTIONS:
253261
help, h Show help
254262
```
255263

264+
For cleanup mode, see the next section.
265+
266+
## Cleanup mode
267+
268+
Rebalance can also run in **cleanup mode**:
269+
270+
```console
271+
$ ais start rebalance --cleanup
272+
```
273+
274+
Cleanup mode is an administrative maintenance operation. It reuses the rebalance lifecycle and monitoring machinery, but it does **not** migrate object payloads between targets.
275+
276+
### Why cleanup mode was introduced
277+
278+
Versions 4.4 and earlier tracked every migrated object with per-object acknowledgments from destination to source, and used those acknowledgments to delete the source copy once placement was confirmed at the destination. That implicit reclamation mechanism did not scale to clusters and buckets with billions of objects and was removed.
279+
280+
As a result, regular rebalance no longer reclaims source-side copies implicitly. After a topology change converges, the cluster may continue to hold local copies of objects whose proper owner is now a different target. These copies are not lost data and they do not affect correctness, but they do consume local capacity until something reclaims them.
281+
282+
Cleanup mode is the explicit, operator-driven replacement: a separate verified pass - rebalance retracing its own steps - that discovers misplaced local copies and removes only those whose proper owner already has the object.
283+
284+
> For broader local-storage hygiene, AIS also provides `ais space-cleanup`. That tool can remove several classes of local garbage, including corrupted metadata files, zero-size objects when requested, extra local copies, misplaced EC artifacts, local mountpath orphans, and verified migrated-away leftovers. Rebalance's cleanup mode is narrower: it is the placement-specific, rebalance-lifecycle mode intended for cleaning up source-side copies left after topology changes and regular data-moving rebalance.
285+
286+
### How cleanup mode operates
287+
288+
Each target walks its local mountpaths and looks for object copies that no longer belong on that target according to the current cluster map. For every local object, AIS recomputes the expected location. If the local target is already the expected owner, the object is skipped.
289+
290+
For a misplaced local copy, AIS contacts the expected owner and requests object properties used to establish identity - size, checksum, version, custom metadata,
291+
and ETag. Different byte content means a different version: two copies with the same name but divergent metadata are not the same object.
292+
293+
The local copy is removed only when AIS can verify that the expected owner holds the same version.
294+
295+
In other words, regular rebalance converges placement by moving objects to their proper targets. Cleanup mode converges local storage by removing misplaced copies that are already present at their proper targets.
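
The identity check described above can be sketched as a simple predicate. This is a minimal illustration under stated assumptions: `ObjProps`, `IDENTITY_FIELDS`, and `same_version` are hypothetical names, and strict equality on all five properties is a simplification rather than the exact comparison AIS performs.

```python
from dataclasses import dataclass, field

# Properties named in the text as establishing object identity.
IDENTITY_FIELDS = ("size", "checksum", "version", "etag", "custom")


@dataclass
class ObjProps:
    size: int
    checksum: str
    version: str
    etag: str
    custom: dict = field(default_factory=dict)


def same_version(local, remote):
    """True only if the expected owner reported properties and all of them match."""
    if remote is None:  # expected owner does not have the object
        return False
    return all(getattr(local, f) == getattr(remote, f) for f in IDENTITY_FIELDS)


local = ObjProps(1024, "xxh:ab12", "3", "etag-1", {"src": "s3"})
owner = ObjProps(1024, "xxh:ab12", "3", "etag-1", {"src": "s3"})
stale = ObjProps(1024, "xxh:ffff", "2", "etag-0", {"src": "s3"})

print(same_version(local, owner))  # identical metadata
print(same_version(local, stale))  # divergent metadata
print(same_version(local, None))   # owner has no copy
```

Only the first case would make the local copy eligible for removal by default; the other two leave it in place.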
296+
297+
Cleanup mode is intentionally out-of-band. Regular data-moving rebalance can temporarily create extra local copies while the cluster converges, but tracking every migrated object at runtime would not scale for large clusters and buckets with millions or billions of objects. Cleanup mode therefore performs a separate verified pass: it discovers misplaced local copies from the current on-disk namespace and safely removes only those that are already present at their expected locations.
298+
299+
### Default behavior is conservative
300+
301+
By default, cleanup mode:
302+
303+
* removes only local misplaced copies
304+
* skips objects that already belong on the local target
305+
* skips EC buckets entirely
306+
* skips objects with local mirror copies (`mirror.enabled=true`)
307+
* keeps objects that cannot be verified at their expected location
308+
* keeps objects whose local metadata differs from the expected owner's metadata
309+
* does not run concurrently with active rebalance or resilver
310+
311+
Cleanup mode is useful after operational workflows such as maintenance, rolling upgrades, or recovery procedures where misplaced local copies may remain and an administrator wants to reclaim local capacity without running a full data-moving rebalance.
312+
313+
Cleanup mode can be bucket-scoped and prefix-scoped, similarly to administrative rebalance. It is incompatible with `--latest` and `--sync`.
314+
315+
### Monitoring
316+
317+
Cleanup mode can be monitored with the usual rebalance commands:
318+
319+
```console
320+
$ ais show rebalance
321+
$ ais show job
322+
```
323+
324+
When cleanup mode is running or has completed, reported counters describe objects removed and bytes reclaimed rather than objects sent and received.
325+
326+
### Example: cleanup after returning a target to service
327+
328+
The following abbreviated example shows a three-target cluster where `t[VCft8081]` has been returned from maintenance. Regular rebalance `g23` first moves objects according to the updated cluster map. Cleanup rebalance `g24` then removes leftover misplaced local copies from the other targets.
329+
330+
```console
331+
$ ais show job
332+
rebalance[g23] (ctl: t[gsCt8083]:<fin-streams> trav:2s post-trav:2s fin:1m2s fin-streams:5s)
333+
NODE ID KIND TX OBJECTS TX BYTES RX OBJECTS RX BYTES START END STATE
334+
gsCt8083 g23 rebalance 107663 107.61MiB - - 17:13:55 - Running
335+
------------------------------------------------------------------------
336+
rebalance[g24] (ctl:
337+
flags:cleanup; t[gsCt8083]:cleanup visits=12678 loads=12678 removed=12678
338+
flags:cleanup; t[vTnt8082]:cleanup visits=12605 loads=12605 removed=12605
339+
)
340+
```
341+
342+
Note that `t[VCft8081]` does not appear in the cleanup output. Having just returned to service under the current cluster map, it holds no misplaced copies - every local object is HRW-correct from its perspective. Only the targets that had to send data during `g23` carry misplaced leftovers.
343+
344+
Cleanup-specific rebalance output reports objects removed and bytes reclaimed:
345+
346+
```console
347+
$ ais show rebalance
348+
REB ID NODE REMOVED OBJECTS REMOVED BYTES START END STATE
349+
g24 gsCt8083 12678 12.38MiB 17:15:16 - Running
350+
g24 vTnt8082 12605 12.31MiB 17:15:16 - Running
351+
g24: 25283 objects removed (total size 24.7MiB)
352+
```
353+
354+
### Forced cleanup
355+
356+
The `--force` option is valid only with cleanup mode:
357+
358+
```console
359+
$ ais start rebalance --cleanup --force
360+
```
361+
362+
Forced cleanup is advanced usage. To explain what it does, recall the default identity check: cleanup removes a misplaced local copy only when the expected owner reports identical metadata (size, checksum, version, ETag, custom metadata).
363+
364+
When local metadata diverges from the expected owner's metadata, the two copies are not byte-identical - same name, different content. Concretely, this can happen with a raced write, an out-of-band update at the remote backend, or a stale pre-overwrite leftover. By default, cleanup *keeps* such divergent local copies, because removing one of two non-identical copies is data loss for whoever happens to hold the version that gets deleted.
365+
366+
`--force` removes them anyway, treating the HRW owner's copy as authoritative. Use forced cleanup only when you have established that the divergent local copy is the one to discard.
367+
368+
Two things `--force` does **not** do:
369+
370+
* it does not skip the HRW verification step - AIS still confirms the expected owner has *some* copy of the object before removing the local one;
371+
* it does not override safety windows (such as `dont_cleanup_time`) and does not allow cleanup to run concurrently with active rebalance or resilver.
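
The difference between default and forced cleanup can be summarized as a small decision function. This is a sketch of the policy as described in this section, not AIS code: `cleanup_action` and its parameters are hypothetical, and it deliberately leaves out safety windows and the concurrency checks.

```python
def cleanup_action(owner_has_copy: bool, identical: bool, force: bool = False) -> str:
    """Decide what to do with one misplaced local copy."""
    if not owner_has_copy:
        # Never remove a copy that cannot be found at its expected location,
        # with or without force.
        return "keep"
    if identical:
        # Expected owner holds the same version: safe to remove.
        return "remove"
    # Metadata diverges: kept by default, removed only when forced.
    return "remove" if force else "keep"


for owner_has_copy, identical in [(False, False), (True, True), (True, False)]:
    default = cleanup_action(owner_has_copy, identical)
    forced = cleanup_action(owner_has_copy, identical, force=True)
    print(f"owner_has_copy={owner_has_copy} identical={identical} "
          f"default={default} forced={forced}")
```

The only row where the two modes differ is the divergent-metadata case, which is why forced cleanup should be used only after establishing that the local copy is the one to discard.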
372+
256373
## Rebalance vs. resilver
257374

258375
Both rebalance and resilver restore HRW-based placement, but they do so at different scopes.
@@ -287,3 +404,5 @@ The exact overhead depends on multiple factors, including:
287404
* whether separate intra-cluster networking is provisioned
288405

289406
For this reason, administrators often tune rebalance behavior and may temporarily disable automated rebalance while performing planned maintenance or staged upgrades.
407+
408+
Cleanup mode has a different resource profile: it does not migrate object payloads, but it still walks local namespace entries, loads object metadata, performs intra-cluster verification, and removes local files. It should therefore still be treated as a background maintenance operation.

reb/README.md

Lines changed: 28 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -5,12 +5,22 @@ streaming. Because the trname is fixed, at most one DM may be registered against
55
the transport at any moment. This document describes how that constraint is
66
maintained across rebalance generations.
77

8+
> Cleanup mode is a separate rebalance generation mode; it reuses the `*Reb`
9+
lifecycle but does not open data streams. See [Cleanup mode](#cleanup-mode).
10+
Unless explicitly stated otherwise, the lifecycle and invariants below describe
11+
regular, data-moving rebalance generations that use DM/transport streaming.
12+
813
## Lifecycle
914

1015
The `*Reb` service is constructed once at target startup and reused across all
11-
rebalance generations. The DM, however, is **per-generation**: a new DM is
12-
constructed at the start of each `Run()` that needs streams (single-node cluster
13-
doesn't) - and torn down before `Run()` returns.
16+
rebalance generations.
17+
18+
The DM, however, is **per-streaming generation**: a new DM is
19+
constructed at the start of each `Run()` that needs streams, and torn down before
20+
`Run()` returns.
21+
22+
> When rebalance runs in cleanup mode, it does not open streams.
23+
1424

1525
```
1626
New()
@@ -74,3 +84,18 @@ Under degraded disks or heavy load, abort propagation alone can approach this bo
7484
In the end, the timeout value is a compromise: long enough to cover
7585
typical cleanup, short enough that Smap flicker (when nodes keep leaving and (re)joining)
7686
doesn't stack waiters.
87+
88+
## Cleanup mode
89+
90+
Cleanup mode is a rebalance generation that reuses the `*Reb` lifecycle but does
91+
not open data streams and does not migrate object payloads.
92+
93+
The motivation is scalability. A regular data-moving rebalance may temporarily
94+
leave extra local copies while the cluster converges. Tracking every migrated
95+
object at runtime, only to remove the old copy later, would not scale for large
96+
clusters and buckets with millions or billions of objects.
97+
98+
Cleanup mode is therefore out-of-band. It performs a separate local walk,
99+
recomputes the expected HRW owner for each object, verifies the object at that
100+
expected location, and removes the local misplaced copy only when it is safe to
101+
do so.

space/README.md

Lines changed: 32 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -10,10 +10,11 @@ it does not coordinate across targets.
1010
**Table of Contents**
1111

1212
1. [Overview](#1-overview)
13-
2. [Cleanup Policies](#2-cleanup-policies)
14-
3. [Implementation Details](#3-implementation-details)
15-
4. [Corner Cases & Constraints](#4-corner-cases--constraints)
16-
5. [Future Enhancements](#5-future-enhancements)
13+
2. [Relation to rebalance cleanup mode](#2-relation-to-rebalance-cleanup-mode)
14+
3. [Cleanup Policies](#3-cleanup-policies)
15+
4. [Implementation Details](#4-implementation-details)
16+
5. [Corner Cases & Constraints](#5-corner-cases--constraints)
17+
6. [Future Enhancements](#6-future-enhancements)
1718

1819
## 1. Overview
1920

@@ -36,7 +37,30 @@ Any file with `mtime + dont_cleanup_time > now` is skipped to avoid racing again
3637

3738
Invalid entries (malformed FQNs, bucket mismatches) are logged and removed.
3839

39-
## 2. Cleanup Policies
40+
## 2. Relation to rebalance cleanup mode
41+
42+
`ais space-cleanup` is a general local-storage cleanup tool. It walks local
43+
mountpaths and removes several classes of safely reclaimable files, including
44+
objects with corrupted or missing local metadata, zero-size objects when
45+
configured, extra local copies, misplaced EC artifacts, local mountpath orphans,
46+
and verified migrated-away leftovers.
47+
48+
AIStore also provides:
49+
50+
```console
51+
$ ais start rebalance --cleanup
52+
```
53+
54+
Rebalance cleanup is narrower and more explicit. It reuses the global rebalance lifecycle and monitoring machinery, does not migrate object payloads, and is
55+
intended specifically for reclaiming source-side copies left after topology changes and regular data-moving rebalance.
56+
57+
In short:
58+
59+
* use `ais start rebalance --cleanup` when the goal is post-rebalance,
60+
placement-specific cleanup after maintenance, decommission, scale-out, scale-in, or node return (from maintenance);
61+
* use `ais space-cleanup` for broader local-storage hygiene and capacity reclamation.
62+
63+
## 3. Cleanup Policies
4064

4165
### Work Files (`fs.WorkCT`)
4266

@@ -79,7 +103,7 @@ Behavior depends on whether EC is enabled for the bucket:
79103
- Handled in `visitObj()`
80104
- For EC-enabled buckets: objects missing corresponding metafiles flagged as *misplaced EC*
81105

82-
## 3. Implementation Details
106+
## 4. Implementation Details
83107

84108
### Throttling
85109

@@ -92,14 +116,14 @@ Space cleanup uses the unified `cmn/load` throttling (`load.Advice`) to avoid I/
92116
### Time Dependencies
93117
Relies on filesystem mtimes. Clock changes on the operator may influence cleanup decisions.
94118

95-
## 4. Corner Cases & Constraints
119+
## 5. Corner Cases & Constraints
96120

97121
- **Race Protection**: Slice => Meta and Replica => Meta sequences covered by global recency guard
98122
- **Local Scope**: Does not consult cluster maps; global orphan detection is out of scope
99123
- **Encoding Requirements**: `fs.WorkCT` tags, chunk uploadIDs, and chunk numbers must never be empty
100124
- **Legacy State**: Partial manifests treated as invalid and always removed
101125

102-
## 5. Future Enhancements
126+
## 6. Future Enhancements
103127

104128
### Generation-Aware EC Cleanup
105129
Delay removal when conflicting generations exist; prefer newest metadata.
