[lake/tiering] Add per-table monitoring metrics for Lake Tiering #2454
Thanks for pushing this forward, @beryllw!
From my perspective, the key thing to observe for lake tiering is the tiering progress: whether it is catching up with the latest data written to lake storage. This would help us decide whether to scale the tiering service Flink job up or down.
cc @luoyuxia @wuchong Please let me know if I’m missing anything.
Maybe I can build the above metric on top of your current framework.
Pull request overview
Adds coordinator-side monitoring for the Lake Tiering subsystem by introducing new metric groups/metrics and wiring end-to-end propagation of per-table lake stats (file size / record count) from lake committers → Flink tiering job → coordinator heartbeat handling.
Changes:
- Introduce LakeTieringMetricGroup and register coordinator-level + table-level lake tiering gauges in LakeTableTieringManager.
- Extend tiering heartbeat payloads to include PbLakeTieringStats and propagate stats from lake committers through the Flink tiering pipeline.
- Update documentation and unit tests to reflect the new metrics and the updated tiering manager APIs.
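
For readers unfamiliar with the pattern, the per-table gauges described above can be pictured roughly as follows. This is a minimal sketch and not the PR's code: `TableTieringGaugeSketch` and its method names are hypothetical, and starting `failuresTotal` at 0 is an assumption. The only behaviour taken from the PR description is that some gauges report -1 until the first tiering round completes.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical per-table holder for tiering gauge values.
 * Not the PR's LakeTieringMetricGroup; names are illustrative only.
 */
final class TableTieringGaugeSketch {
    /** Sentinel reported before any tiering round has completed. */
    static final long UNKNOWN = -1L;

    private final AtomicLong lastTierDurationMs = new AtomicLong(UNKNOWN);
    // Assumption: the failure counter starts at 0 rather than at the sentinel.
    private final AtomicLong failuresTotal = new AtomicLong(0L);

    /** Called when a tiering round for this table finishes successfully. */
    void onRoundFinished(long durationMs) {
        lastTierDurationMs.set(durationMs);
    }

    /** Called when a tiering round for this table fails. */
    void onRoundFailed() {
        failuresTotal.incrementAndGet();
    }

    /** Value a tierDuration-style gauge would expose. */
    long tierDuration() {
        return lastTierDurationMs.get();
    }

    /** Value a failuresTotal-style gauge would expose. */
    long failuresTotal() {
        return failuresTotal.get();
    }
}
```

A metrics reporter would poll `tierDuration()` and `failuresTotal()` on its own schedule, which is why the sketch keeps the values in atomics instead of pushing them on each update.
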
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| website/docs/maintenance/observability/monitor-metrics.md | Documents new coordinator lake tiering metrics (global + per-table). |
| fluss-server/src/test/java/org/apache/fluss/server/metrics/group/TestingMetricGroups.java | Adds a test LakeTieringMetricGroup for coordinator tests. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/statemachine/TableBucketStateMachineTest.java | Updates tiering manager construction to pass metric group. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/rebalance/RebalanceManagerTest.java | Updates tiering manager construction to pass metric group. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/LakeTableTieringManagerTest.java | Adapts to new APIs; adds assertions for new metrics behavior. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessorTest.java | Updates tiering manager construction to pass metric group. |
| fluss-server/src/main/java/org/apache/fluss/server/metrics/group/LakeTieringMetricGroup.java | New metric group for lake tiering + per-table subgroups. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/LakeTableTieringManager.java | Registers/updates lake tiering metrics; extends finish API to accept lake stats. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorService.java | Reads optional tiering stats from finished-table heartbeats and forwards to manager. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorServer.java | Constructs tiering manager with a real LakeTieringMetricGroup. |
| fluss-rpc/src/main/proto/FlussApi.proto | Adds PbLakeTieringStats and optional inclusion in heartbeat table requests. |
| fluss-lake/fluss-lake-paimon/src/main/java/org/apache/fluss/lake/paimon/tiering/PaimonLakeCommitter.java | Computes and returns cumulative table stats for a committed snapshot (best-effort). |
| fluss-lake/fluss-lake-lance/src/main/java/org/apache/fluss/lake/lance/tiering/LanceLakeCommitter.java | Notes stats aren’t available yet; leaves values unknown. |
| fluss-lake/fluss-lake-iceberg/src/main/java/org/apache/fluss/lake/iceberg/tiering/IcebergLakeCommitter.java | Extracts cumulative table stats from snapshot summary and returns them. |
| fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/source/enumerator/TieringSourceEnumerator.java | Sends per-finished-table stats in heartbeat finished table entries. |
| fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/event/TieringStats.java | New immutable stats container with -1 unknown sentinels. |
| fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/event/FinishedTieringEvent.java | Extends finished event to include tiering stats. |
| fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/committer/TieringCommitOperator.java | Collects LakeCommitResult stats and emits them via FinishedTieringEvent. |
| fluss-common/src/main/java/org/apache/fluss/metrics/MetricNames.java | Adds metric name constants for lake tiering gauges. |
| fluss-common/src/main/java/org/apache/fluss/lake/committer/LakeCommitResult.java | Extends commit result to carry cumulative lake file size / record count. |
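
The table describes TieringStats as an immutable container with -1 "unknown" sentinels. A minimal sketch of that idea, assuming two long fields for file size and record count, is shown below. `TieringStatsSketch` and its accessors are hypothetical and may differ from the actual class in the PR.

```java
/**
 * Hypothetical immutable stats container using -1 as the "unknown" sentinel.
 * Illustrative only; not the PR's TieringStats class.
 */
final class TieringStatsSketch {
    static final long UNKNOWN = -1L;

    /** Shared instance for committers that cannot report stats. */
    static final TieringStatsSketch UNKNOWN_STATS =
            new TieringStatsSketch(UNKNOWN, UNKNOWN);

    private final long fileSize;
    private final long recordCount;

    TieringStatsSketch(long fileSize, long recordCount) {
        this.fileSize = fileSize;
        this.recordCount = recordCount;
    }

    long fileSize() {
        return fileSize;
    }

    long recordCount() {
        return recordCount;
    }

    /** True only when the committer supplied a real (non-negative) size. */
    boolean isFileSizeKnown() {
        return fileSize >= 0;
    }

    /** True only when the committer supplied a real (non-negative) count. */
    boolean isRecordCountKnown() {
        return recordCount >= 0;
    }
}
```

With a sentinel like this, a committer that has no stats (as the table notes for Lance) can return the unknown instance, and downstream code can check `isFileSizeKnown()` before overwriting a previously reported value.
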
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Purpose
Linked issue: close #2440
This PR adds table-level and global monitoring metrics for the lake tiering service managed by CoordinatorServer.
Brief change log
Scope: coordinator > lakeTiering_table (per-table, tagged with database, table, tableId)
- tierLag
- tierDuration: -1 until the first round completes
- failuresTotal
- fileSize: -1 until the first round completes
- recordCount: -1 until the first round completes

Tests
API and Format
Documentation