ggml-cpu: support K tails in power10 Q8/Q4 MMA matmul #24753
Merged
ggml-cpu: support K tails in Power10 MMA Q8/Q4 matmul
@taronaeo @ggerganov can you please help review this PR?
ggerganov approved these changes on Jun 18, 2026
taronaeo reviewed on Jun 18, 2026
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
taronaeo approved these changes on Jun 18, 2026
Overview
This patch removes the requirement that K be divisible by kc in the tinyBlas_Q0_PPC tiled matmul path. The final K panel is now processed using its actual depth, and the reduced panel size is passed through packing and kernel execution. This lets more workloads use the MMA kernel and reduces fallback to mnpack.
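For readers unfamiliar with K-panel tiling, the idea can be sketched in plain scalar C++. This is not the code from the patch: the function names `pack_panel` and `matmul_k_tail`, the float data type, and the layout (both operands stored with K contiguous) are assumptions made for illustration, and the real path works on quantized Q8_0/Q4_0 blocks with MMA instructions. The point it shows is only that the last panel depth is computed as `min(kc, K - k0)` and that the same depth is handed to both packing and the inner kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: copy a rows x kdepth slice of a row-major matrix
// (leading dimension ld), starting at column k0, into a dense buffer.
static void pack_panel(const float * src, int64_t rows, int64_t ld,
                       int64_t k0, int64_t kdepth, std::vector<float> & dst) {
    dst.resize(static_cast<size_t>(rows * kdepth));
    for (int64_t r = 0; r < rows; ++r) {
        for (int64_t k = 0; k < kdepth; ++k) {
            dst[r * kdepth + k] = src[r * ld + k0 + k];
        }
    }
}

// Hypothetical scalar stand-in for the tiled path.
// A  is M x K (row-major), Bt is N x K (row-major, i.e. B transposed),
// C  is M x N (row-major) and receives A * B.
// kc is the nominal K panel depth; K does not have to be a multiple of kc.
void matmul_k_tail(const float * A, const float * Bt, float * C,
                   int64_t M, int64_t N, int64_t K, int64_t kc) {
    assert(kc > 0);
    std::fill(C, C + M * N, 0.0f);

    std::vector<float> a_pack;
    std::vector<float> b_pack;

    for (int64_t k0 = 0; k0 < K; k0 += kc) {
        // Full panels use kc; the final panel uses whatever depth is left.
        const int64_t kdepth = std::min(kc, K - k0);

        // Packing sees the actual panel depth, not the nominal one.
        pack_panel(A,  M, K, k0, kdepth, a_pack);
        pack_panel(Bt, N, K, k0, kdepth, b_pack);

        // The "kernel" also iterates over the actual depth only.
        for (int64_t i = 0; i < M; ++i) {
            for (int64_t j = 0; j < N; ++j) {
                float acc = 0.0f;
                for (int64_t k = 0; k < kdepth; ++k) {
                    acc += a_pack[i * kdepth + k] * b_pack[j * kdepth + k];
                }
                C[i * N + j] += acc;
            }
        }
    }
}
```

With K = 10 and kc = 4, for example, this loop runs two panels of depth 4 followed by one tail panel of depth 2, instead of rejecting the shape because 10 is not a multiple of 4.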
Performance Impact:
~60% gain in prompt-processing (PP) speed with granite-3.38b-instruct Q8_0 and Q4_0 models, tested with llama-bench -p 512 -n 1 on a Power10 ppc64le machine.