
[lake/tiering] Add per-table monitoring metrics for Lake Tiering #2454

Merged
luoyuxia merged 5 commits into apache:main from beryllw:lake-metrics-poc on Mar 13, 2026

Conversation

@beryllw beryllw commented Jan 23, 2026

Contributor

Purpose

Linked issue: close #2440
This PR adds table-level and global monitoring metrics for the lake tiering service managed by CoordinatorServer.

Brief change log

Scope: coordinator > lakeTiering_table (per-table, tagged with database, table, tableId)

| Metric | Type | Description |
|---|---|---|
| tierLag | Gauge | Milliseconds since the last successful tiering. For newly registered tables, measured from registration time |
| tierDuration | Gauge | Wall-clock duration (ms) of the last completed tiering round. Returns -1 until the first round completes |
| failuresTotal | Gauge | Cumulative tiering failure count for this table |
| fileSize | Gauge | Cumulative total file size (bytes) of the lake table after the last tiering. Returns -1 until the first round completes |
| recordCount | Gauge | Cumulative total record count after the last tiering. Returns -1 until the first round completes |
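The gauge semantics listed above can be illustrated with a minimal sketch. This is not the code from the PR: the class `TableTieringMetricsSketch`, its method names, and the choice to pass timestamps in explicitly are all assumptions made for illustration. It only shows how the -1 sentinel and the fallback to registration time for `tierLag` might behave.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-table tiering state; names are illustrative, not Fluss APIs. */
final class TableTieringMetricsSketch {
    /** Sentinel reported until the first tiering round completes. */
    static final long UNKNOWN = -1L;

    private final long registeredAtMs;
    private final AtomicLong failures = new AtomicLong();
    private volatile long lastSuccessAtMs = UNKNOWN;
    private volatile long lastDurationMs = UNKNOWN;
    private volatile long fileSizeBytes = UNKNOWN;
    private volatile long recordCount = UNKNOWN;

    TableTieringMetricsSketch(long registeredAtMs) {
        this.registeredAtMs = registeredAtMs;
    }

    /** Records a successful round together with the cumulative lake stats it reported. */
    void onTieringFinished(long startedAtMs, long finishedAtMs, long fileSizeBytes, long recordCount) {
        this.lastDurationMs = finishedAtMs - startedAtMs;
        this.lastSuccessAtMs = finishedAtMs;
        this.fileSizeBytes = fileSizeBytes;
        this.recordCount = recordCount;
    }

    /** Records one failed round. */
    void onTieringFailed() {
        failures.incrementAndGet();
    }

    /** tierLag: time since the last success, or since registration if there is none yet. */
    long tierLag(long nowMs) {
        long last = lastSuccessAtMs;
        return nowMs - (last == UNKNOWN ? registeredAtMs : last);
    }

    long tierDuration() { return lastDurationMs; }

    long failuresTotal() { return failures.get(); }

    long fileSize() { return fileSizeBytes; }

    long recordCount() { return recordCount; }
}
```

In a real metric group each getter would presumably be registered as a gauge and `tierLag` would read the current clock itself; the explicit `nowMs` parameter here only keeps the sketch deterministic.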

Tests

API and Format

Documentation

@beryllw beryllw marked this pull request as draft January 23, 2026 07:50

@zuston zuston left a comment

Member

Thanks for pushing this forward, @beryllw!

From my perspective, the key thing lake tiering should expose is tiering progress, i.e. whether it is catching up with the latest data written to lake storage. This would help us decide whether to scale the tiering service Flink job up or down.
cc @luoyuxia @wuchong: please let me know if I'm missing anything.

Maybe I can build the above metric on top of your current framework.

Copilot AI left a comment

Contributor

Pull request overview

Adds coordinator-side monitoring for the Lake Tiering subsystem by introducing new metric groups/metrics and wiring end-to-end propagation of per-table lake stats (file size / record count) from lake committers → Flink tiering job → coordinator heartbeat handling.

Changes:

  • Introduce LakeTieringMetricGroup and register coordinator-level + table-level lake tiering gauges in LakeTableTieringManager.
  • Extend tiering heartbeat payloads to include PbLakeTieringStats and propagate stats from lake committers through the Flink tiering pipeline.
  • Update documentation and unit tests to reflect the new metrics and the updated tiering manager APIs.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 11 comments.

Show a summary per file
| File | Description |
|---|---|
| website/docs/maintenance/observability/monitor-metrics.md | Documents new coordinator lake tiering metrics (global + per-table). |
| fluss-server/src/test/java/org/apache/fluss/server/metrics/group/TestingMetricGroups.java | Adds a test LakeTieringMetricGroup for coordinator tests. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/statemachine/TableBucketStateMachineTest.java | Updates tiering manager construction to pass metric group. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/rebalance/RebalanceManagerTest.java | Updates tiering manager construction to pass metric group. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/LakeTableTieringManagerTest.java | Adapts to new APIs; adds assertions for new metrics behavior. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessorTest.java | Updates tiering manager construction to pass metric group. |
| fluss-server/src/main/java/org/apache/fluss/server/metrics/group/LakeTieringMetricGroup.java | New metric group for lake tiering + per-table subgroups. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/LakeTableTieringManager.java | Registers/updates lake tiering metrics; extends finish API to accept lake stats. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorService.java | Reads optional tiering stats from finished-table heartbeats and forwards to manager. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorServer.java | Constructs tiering manager with a real LakeTieringMetricGroup. |
| fluss-rpc/src/main/proto/FlussApi.proto | Adds PbLakeTieringStats and optional inclusion in heartbeat table requests. |
| fluss-lake/fluss-lake-paimon/src/main/java/org/apache/fluss/lake/paimon/tiering/PaimonLakeCommitter.java | Computes and returns cumulative table stats for a committed snapshot (best-effort). |
| fluss-lake/fluss-lake-lance/src/main/java/org/apache/fluss/lake/lance/tiering/LanceLakeCommitter.java | Notes stats aren’t available yet; leaves values unknown. |
| fluss-lake/fluss-lake-iceberg/src/main/java/org/apache/fluss/lake/iceberg/tiering/IcebergLakeCommitter.java | Extracts cumulative table stats from snapshot summary and returns them. |
| fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/source/enumerator/TieringSourceEnumerator.java | Sends per-finished-table stats in heartbeat finished table entries. |
| fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/event/TieringStats.java | New immutable stats container with -1 unknown sentinels. |
| fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/event/FinishedTieringEvent.java | Extends finished event to include tiering stats. |
| fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/committer/TieringCommitOperator.java | Collects LakeCommitResult stats and emits them via FinishedTieringEvent. |
| fluss-common/src/main/java/org/apache/fluss/metrics/MetricNames.java | Adds metric name constants for lake tiering gauges. |
| fluss-common/src/main/java/org/apache/fluss/lake/committer/LakeCommitResult.java | Extends commit result to carry cumulative lake file size / record count. |
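The summary describes `TieringStats` as an immutable container with -1 "unknown" sentinels, and says the coordinator reads the stats as optional heartbeat fields. A rough sketch of that pattern follows. The class `TieringStatsSketch` and its methods are hypothetical and are not taken from the PR. Optional fields are modelled here as nullable `Long` values, which is an assumption about how absent values arrive.

```java
import java.util.Objects;

/** Hypothetical immutable stats holder; not the actual TieringStats class. */
final class TieringStatsSketch {
    /** Sentinel for a value the lake committer could not provide. */
    static final long UNKNOWN = -1L;

    private final long fileSizeBytes;
    private final long recordCount;

    private TieringStatsSketch(long fileSizeBytes, long recordCount) {
        this.fileSizeBytes = fileSizeBytes;
        this.recordCount = recordCount;
    }

    /** Stats for a committer that reports nothing yet. */
    static TieringStatsSketch unknown() {
        return new TieringStatsSketch(UNKNOWN, UNKNOWN);
    }

    /** Maps absent (null) optional values to the UNKNOWN sentinel. */
    static TieringStatsSketch of(Long fileSizeBytes, Long recordCount) {
        return new TieringStatsSketch(
                fileSizeBytes == null ? UNKNOWN : fileSizeBytes,
                recordCount == null ? UNKNOWN : recordCount);
    }

    long fileSizeBytes() { return fileSizeBytes; }

    long recordCount() { return recordCount; }

    /** True only when both values were actually reported. */
    boolean isFullyKnown() {
        return fileSizeBytes != UNKNOWN && recordCount != UNKNOWN;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof TieringStatsSketch)) return false;
        TieringStatsSketch that = (TieringStatsSketch) o;
        return fileSizeBytes == that.fileSizeBytes && recordCount == that.recordCount;
    }

    @Override
    public int hashCode() {
        return Objects.hash(fileSizeBytes, recordCount);
    }
}
```

Using a sentinel instead of a nullable or `Optional` field keeps the value directly usable as a gauge reading, at the cost that consumers must treat -1 as "no data" and not as a measurement.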

Comment thread website/docs/maintenance/observability/monitor-metrics.md Outdated
@beryllw beryllw force-pushed the lake-metrics-poc branch from d15732e to 35c6003 Compare March 9, 2026 02:01
@beryllw beryllw marked this pull request as ready for review March 9, 2026 02:03
@beryllw beryllw force-pushed the lake-metrics-poc branch 4 times, most recently from 7f66f0b to 56814b5 Compare March 10, 2026 07:26
@beryllw beryllw changed the title from "[lake/tiering] Add CoordinatorServer Monitoring Metrics for Lake Tiering" to "[lake/tiering] Add per-table monitoring metrics for Lake Tiering" Mar 10, 2026

@luoyuxia luoyuxia left a comment

Contributor

@beryllw Thanks for the PR. I left some comments, PTAL.

@beryllw beryllw requested a review from luoyuxia March 12, 2026 09:13
beryllw commented Mar 12, 2026

Contributor Author
image

Tested the metric names and label values in Docker as shown above.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.


@luoyuxia luoyuxia left a comment

Contributor

@beryllw Thanks for the PR. I left a few minor comments, PTAL.

Comment thread website/docs/maintenance/observability/monitor-metrics.md Outdated
Comment thread fluss-common/src/main/java/org/apache/fluss/lake/committer/LakeCommitResult.java Outdated
@beryllw beryllw requested a review from luoyuxia March 13, 2026 06:35

@luoyuxia luoyuxia left a comment

Contributor

+1

@luoyuxia luoyuxia merged commit f7fa971 into apache:main Mar 13, 2026
7 checks passed
hemanthsavasere pushed a commit to hemanthsavasere/fluss that referenced this pull request Mar 14, 2026
wxplovecc pushed a commit to tongcheng-elong/fluss that referenced this pull request Apr 17, 2026
wxplovecc pushed a commit to tongcheng-elong/fluss that referenced this pull request Apr 20, 2026
Development

Successfully merging this pull request may close these issues.

[lake/tiering] Add Server-Level Monitoring Metrics for Lake Tiering

4 participants