AIStore uses a variant of **highest random weight (HRW)**, also known as rendezvous hashing, to determine object placement.

At the cluster level, the destination target for an object is uniquely determined by the following three inputs:

-- the current **cluster map**
-- the fully qualified **bucket**, including provider and namespace
-- the **object name**
+* the current **cluster map**
+* the fully qualified **bucket**, including provider and namespace
+* the **object name**

This is the key to understanding rebalance.
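To make the three inputs concrete, here is a minimal sketch of rendezvous (HRW) placement. It is an illustration only: the hash function (`blake2b`), the way bucket and object name are joined into one key, and the target IDs are assumptions, not AIStore's actual implementation.

```python
import hashlib


def hrw_weight(target_id: str, uname: str) -> int:
    """Pseudo-random weight for one (target, object) pair (hash choice is an assumption)."""
    digest = hashlib.blake2b(f"{target_id}|{uname}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def hrw_target(cluster_map: list[str], bucket: str, object_name: str) -> str:
    """Return the target with the highest weight for the given bucket and object name."""
    if not cluster_map:
        raise ValueError("cluster map has no targets")
    uname = f"{bucket}/{object_name}"  # hypothetical key format
    return max(cluster_map, key=lambda tid: hrw_weight(tid, uname))


targets = ["t1", "t2", "t3"]  # hypothetical target IDs
owner = hrw_target(targets, "ais://mybucket", "images/cat.jpg")
print(owner)
```

Because the result depends only on the set of targets, the bucket, and the object name, any node holding the same cluster map computes the same owner without coordination.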

@@ -40,15 +41,15 @@ Global rebalance is triggered by changes that affect cluster-wide target placeme

Typical examples include:

-- a storage target joining the cluster
-- a storage target leaving the cluster
-- putting a target into maintenance
-- decommissioning a target
-- bringing a target back into active service
+* a storage target joining the cluster
+* a storage target leaving the cluster
+* putting a target into maintenance
+* decommissioning a target
+* bringing a target back into active service

A useful rule of thumb is:

-> if a topology change can alter the HRW destination for stored objects, it can trigger global rebalance.
+> if a topology change can alter the destination for stored objects, it can trigger global rebalance.

When a single target is added to or removed from a cluster of `N` targets, the fraction of objects that move is typically on the order of `1/N`, though the exact amount always depends on the topology and the current object distribution.
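The `1/N` estimate can be checked with a small simulation. This is a sketch under stated assumptions: SHA-256 stands in for the real hash, and target and object names are made up. It adds one target to a four-target cluster and counts how many objects change owner.

```python
import hashlib


def owner(targets, name):
    # Stand-in HRW: the target with the highest hash of (target, name) wins.
    return max(targets, key=lambda t: hashlib.sha256(f"{t}:{name}".encode()).digest())


def moved_fraction(n_targets: int, n_objects: int = 10_000) -> float:
    before = [f"t{i}" for i in range(n_targets)]
    after = before + [f"t{n_targets}"]  # one target joins
    moved = 0
    for i in range(n_objects):
        name = f"obj-{i}"
        old, new = owner(before, name), owner(after, name)
        if old != new:
            # With HRW, an object can only move to the newly added target.
            assert new == after[-1]
            moved += 1
    return moved / n_objects


frac = moved_fraction(4)
print(f"moved {frac:.3f}, expected about {1 / 5:.3f}")
```

With uniformly distributed hashes, the new target wins roughly one object in five here; real clusters deviate from this depending on topology and object distribution, as noted above.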

@@ -67,12 +68,14 @@ At a high level:
5. If the local target is no longer the correct owner, the object is sent directly to the proper target.
6. The process completes when all participating targets finish migrating the objects that no longer belong locally.

+The steps above describe regular, data-moving rebalance. Cleanup mode, described below, reuses the rebalance lifecycle but does not migrate object payloads.
+
A few points are worth emphasizing:

-- there is no central data-movement coordinator that decides ownership object by object
-- each target independently evaluates the objects it currently stores
-- object migration is performed via AIS intra-cluster transfers
-- when provisioned, rebalance traffic can use separate intra-cluster networking
+* there is no central data-movement coordinator that decides ownership object by object
+* each target independently evaluates the objects it currently stores
+* object migration is performed via AIS intra-cluster transfers
+* when provisioned, rebalance traffic can use separate intra-cluster networking

This design keeps rebalancing scalable and avoids turning the primary into a data path bottleneck.

@@ -86,24 +89,24 @@ In particular, the target that must own an object according to the **new** clust

As a result:

-- applications do not need to stop I/O while rebalance is running
-- object movement remains transparent to clients
-- the cluster can continue converging toward the new placement while still serving reads
+* applications do not need to stop I/O while rebalance is running
+* object movement remains transparent to clients
+* the cluster can continue converging toward the new placement while still serving reads

## Control and monitoring

Global rebalance is controlled and monitored via:

-- native HTTP-based APIs
-- the [Go API](https://github.com/NVIDIA/aistore/tree/main/api)
-- the [Python SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore/sdk)
-- the [Command Line Interface (CLI)](/docs/cli.md)
+* native HTTP-based APIs
+* the [Go API](https://github.com/NVIDIA/aistore/tree/main/api)
+* the [Python SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore/sdk)
+* the [Command Line Interface (CLI)](/docs/cli.md)

Operationally, administrators typically care about three things:

-- whether automated rebalance is enabled
-- whether a rebalance is currently running
-- how much data each target has sent and received for the current rebalance
+* whether automated rebalance is enabled
+* whether a rebalance is currently running
+* how much data each target has sent and received, or removed in cleanup mode

Like other long-running AIS activities, rebalance is tracked as a cluster job and can be inspected while in progress.
+cleanup Remove local copies of misplaced objects - monolithic and chunked (non-EC);
+fails if rebalance is running; incompatible with '--latest' and '--sync'
+force,f With '--cleanup': also remove local misplaced copies that fail the safe identity check against copies
+at their expected locations; will not run concurrently with active rebalance/resilver
+(caution: advanced usage only)
latest Check in-cluster metadata and, possibly, GET, download, prefetch, or otherwise copy the latest object version
from the associated remote bucket;
the option provides operation-level control over object versioning (and version synchronization)
@@ -253,6 +261,115 @@ OPTIONS:
help, h Show help
```

+For cleanup mode, see the next section.
+
+## Cleanup mode
+
+Rebalance can also run in **cleanup mode**:
+
+```console
+$ ais start rebalance --cleanup
+```
+
+Cleanup mode is an administrative maintenance operation. It reuses the rebalance lifecycle and monitoring machinery, but it does **not** migrate object payloads between targets.
+
+### Why cleanup mode was introduced
+
+Versions 4.4 and earlier tracked every migrated object with per-object acknowledgments from destination to source, and used those acknowledgments to delete the source copy once placement was confirmed at the destination. That implicit reclamation mechanism did not scale to clusters and buckets with billions of objects and was removed.
+
+As a result, regular rebalance no longer reclaims source-side copies implicitly. After a topology change converges, the cluster may continue to hold local copies of objects whose proper owner is now a different target. These copies are not lost data and they do not affect correctness, but they do consume local capacity until something reclaims them.
+
+Cleanup mode is the explicit, operator-driven replacement: a separate verified pass - rebalance retracing its own steps - that discovers misplaced local copies and removes only those whose proper owner already has the object.
+
+> For broader local-storage hygiene, AIS also provides `ais space-cleanup`. That tool can remove several classes of local garbage, including corrupted metadata files, zero-size objects when requested, extra local copies, misplaced EC artifacts, local mountpath orphans, and verified migrated-away leftovers. Rebalance's cleanup mode is narrower: it is the placement-specific, rebalance-lifecycle mode intended for cleaning up source-side copies left after topology changes and regular data-moving rebalance.
+
+### How cleanup mode operates
+
+Each target walks its local mountpaths and looks for object copies that no longer belong on that target according to the current cluster map. For every local object, AIS recomputes the expected location. If the local target is already the expected owner, the object is skipped.
+
+For a misplaced local copy, AIS contacts the expected owner and requests object properties used to establish identity - size, checksum, version, custom metadata,
+and ETag. Different byte content means a different version: two copies with the same name but divergent metadata are not the same object.
+
+The local copy is removed only when AIS can verify that the expected owner holds the same version.
+
+In other words, regular rebalance converges placement by moving objects to their proper targets. Cleanup mode converges local storage by removing misplaced copies that are already present at their proper targets.
+
+Cleanup mode is intentionally out-of-band. Regular data-moving rebalance can temporarily create extra local copies while the cluster converges, but tracking every migrated object at runtime would not scale for large clusters and buckets with millions or billions of objects. Cleanup mode therefore performs a separate verified pass: it discovers misplaced local copies from the current on-disk namespace and safely removes only those that are already present at their expected locations.
+
+### Default behavior is conservative
+
+By default, cleanup mode:
+
+* removes only local misplaced copies
+* skips objects that already belong on the local target
+* skips EC buckets entirely
+* skips objects with local mirror copies (`mirror.enabled=true`)
+* keeps objects that cannot be verified at their expected location
+* keeps objects whose local metadata differs from the expected owner's metadata
+* does not run concurrently with active rebalance or resilver
+
+Cleanup mode is useful after operational workflows such as maintenance, rolling upgrades, or recovery procedures where misplaced local copies may remain and an administrator wants to reclaim local capacity without running a full data-moving rebalance.
+
+Cleanup mode can be bucket-scoped and prefix-scoped, similarly to administrative rebalance. It is incompatible with `--latest` and `--sync`.
+
+### Monitoring
+
+Cleanup mode can be monitored with the usual rebalance commands:
+
+```console
+$ ais show rebalance
+$ ais show job
+```
+
+When cleanup mode is running or has completed, reported counters describe objects removed and bytes reclaimed rather than objects sent and received.
+
+### Example: cleanup after returning a target to service
+
+The following abbreviated example shows a three-target cluster where `t[VCft8081]` has been returned from maintenance. Regular rebalance `g23` first moves objects according to the updated cluster map. Cleanup rebalance `g24` then removes leftover misplaced local copies from the other targets.
+Note that `t[VCft8081]` does not appear in the cleanup output. Having just returned to service under the current cluster map, it holds no misplaced copies - every local object is HRW-correct from its perspective. Only the targets that had to send data during `g23` carry misplaced leftovers.
+
+Cleanup-specific rebalance output reports objects removed and bytes reclaimed:
+
+```console
+$ ais show rebalance
+REB ID   NODE       REMOVED OBJECTS   REMOVED BYTES   START      END   STATE
+g24      gsCt8083   12678             12.38MiB        17:15:16   -     Running
+g24      vTnt8082   12605             12.31MiB        17:15:16   -     Running
+g24: 25283 objects removed (total size 24.7MiB)
+```
+
+### Forced cleanup
+
+The `--force` option is valid only with cleanup mode:
+
+```console
+$ ais start rebalance --cleanup --force
+```
+
+Forced cleanup is advanced usage. To explain what it does, recall the default identity check: cleanup removes a misplaced local copy only when the expected owner reports identical metadata (size, checksum, version, ETag, custom metadata).
+
+When local metadata diverges from the expected owner's metadata, the two copies are not byte-identical - same name, different content. Concretely, this can happen with a raced write, an out-of-band update at the remote backend, or a stale pre-overwrite leftover. By default, cleanup *keeps* such divergent local copies, because removing one of two non-identical copies is data loss for whoever happens to hold the version that gets deleted.
+
+`--force` removes them anyway, treating the HRW owner's copy as authoritative. Use forced cleanup only when you have established that the divergent local copy is the one to discard.
+
+Two things `--force` does **not** do:
+
+* it does not skip the HRW verification step - AIS still confirms the expected owner has *some* copy of the object before removing the local one;
+* it does not override safety windows (such as `dont_cleanup_time`) and does not allow cleanup to run concurrently with active rebalance or resilver.
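The default and forced rules described in this diff can be summarized as a per-object decision function. The sketch below is a reading aid, not AIStore code: `Identity`, `Action`, and `decide` are hypothetical names, the fields mirror the identity properties listed above, and it assumes `--force` changes only the divergent-metadata branch. Safety windows such as `dont_cleanup_time` are not modeled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    SKIP = "skip"      # not a cleanup candidate at all
    KEEP = "keep"      # misplaced, but not safe to remove
    REMOVE = "remove"  # misplaced and removable


@dataclass(frozen=True)
class Identity:
    """Hypothetical bundle of the properties used to establish object identity."""
    size: int
    checksum: str
    version: str
    etag: str
    custom: tuple = ()


def decide(local_tid: str, expected_tid: str, local: Identity,
           remote: Optional[Identity], *, ec_bucket: bool = False,
           has_mirror_copies: bool = False, force: bool = False) -> Action:
    # Objects that belong here, EC buckets, and mirrored objects are skipped.
    if local_tid == expected_tid or ec_bucket or has_mirror_copies:
        return Action.SKIP
    # The expected owner must hold some copy; --force does not bypass this.
    if remote is None:
        return Action.KEEP
    # Same identity at the expected owner: safe to reclaim the local copy.
    if local == remote:
        return Action.REMOVE
    # Divergent metadata: kept by default, removed only with --force.
    return Action.REMOVE if force else Action.KEEP


a = Identity(1024, "abc", "1", "e1")
b = Identity(1024, "xyz", "2", "e2")
print(decide("t1", "t2", a, a))              # Action.REMOVE
print(decide("t1", "t2", a, b))              # Action.KEEP
print(decide("t1", "t2", a, b, force=True))  # Action.REMOVE
```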
+
## Rebalance vs. resilver

Both rebalance and resilver restore HRW-based placement, but they do so at different scopes.
@@ -287,3 +404,5 @@ The exact overhead depends on multiple factors, including:
* whether separate intra-cluster networking is provisioned

For this reason, administrators often tune rebalance behavior and may temporarily disable automated rebalance while performing planned maintenance or staged upgrades.
+
+Cleanup mode has a different resource profile: it does not migrate object payloads, but it still walks local namespace entries, loads object metadata, performs intra-cluster verification, and removes local files. It should therefore still be treated as a background maintenance operation.
@@ -36,7 +37,30 @@ Any file with `mtime + dont_cleanup_time > now` is skipped to avoid racing again
Invalid entries (malformed FQNs, bucket mismatches) are logged and removed.
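The recency rule quoted in the hunk header above (`mtime + dont_cleanup_time > now`) can be sketched as a small predicate. This is an illustration only: the function name and the use of seconds are assumptions, and the real check lives in AIStore's Go code.

```python
import os
import tempfile
import time
from typing import Optional


def is_too_recent(path: str, dont_cleanup_time_s: float, now: Optional[float] = None) -> bool:
    """True if the file was modified within the protection window and must be skipped."""
    now = time.time() if now is None else now
    return os.stat(path).st_mtime + dont_cleanup_time_s > now


# A freshly written file falls inside a one-hour window and would be skipped.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"payload")
    demo_path = f.name

print(is_too_recent(demo_path, 3600))
```

Because the check relies on filesystem mtimes, clock adjustments can shift which files fall inside the window.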

-## 2. Cleanup Policies
+## 2. Relation to rebalance cleanup mode
+
+`ais space-cleanup` is a general local-storage cleanup tool. It walks local
+mountpaths and removes several classes of safely reclaimable files, including
+objects with corrupted or missing local metadata, zero-size objects when
+configured, extra local copies, misplaced EC artifacts, local mountpath orphans,
+and verified migrated-away leftovers.
+
+AIStore also provides:
+
+```console
+$ ais start rebalance --cleanup
+```
+
+Rebalance cleanup is narrower and more explicit. It reuses the global rebalance lifecycle and monitoring machinery, does not migrate object payloads, and is
+intended specifically for reclaiming source-side copies left after topology changes and regular data-moving rebalance.
+
+In short:
+
+* use `ais start rebalance --cleanup` when the goal is post-rebalance,
+  placement-specific cleanup after maintenance, decommission, scale-out, scale-in, or node return (from maintenance);
+* use `ais space-cleanup` for broader local-storage hygiene and capacity reclamation.
+
+## 3. Cleanup Policies

### Work Files (`fs.WorkCT`)

@@ -79,7 +103,7 @@ Behavior depends on whether EC is enabled for the bucket:
- Handled in `visitObj()`
- For EC-enabled buckets: objects missing corresponding metafiles flagged as *misplaced EC*

-## 3. Implementation Details
+## 4. Implementation Details

### Throttling

@@ -92,14 +116,14 @@ Space cleanup uses the unified `cmn/load` throttling (`load.Advice`) to avoid I/
### Time Dependencies
Relies on filesystem mtimes. Clock changes on the operator may influence cleanup decisions.

-## 4. Corner Cases & Constraints
+## 5. Corner Cases & Constraints

- **Race Protection**: Slice => Meta and Replica => Meta sequences covered by global recency guard
- **Local Scope**: Does not consult cluster maps; global orphan detection is out of scope
- **Encoding Requirements**: `fs.WorkCT` tags, chunk uploadIDs, and chunk numbers must never be empty
- **Legacy State**: Partial manifests treated as invalid and always removed

-## 5. Future Enhancements
+## 6. Future Enhancements

### Generation-Aware EC Cleanup
Delay removal when conflicting generations exist; prefer newest metadata.