Releases · ggml-org/llama.cpp

19 Jun 09:26

github-actions

b9722

159d093

b9722 Latest

server: fix non-bound n_discard value (ctx shifting) (#24786)

server: fix non-bound n_discard value
Update tools/server/server-context.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
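
The release note does not show the patch itself. As a rough illustration of what bounding a discard count during context shifting can look like, here is a minimal sketch. The function names, the default of discarding half of the non-kept tokens, and the clamping range are assumptions made for this example, not the contents of tools/server/server-context.cpp.

```python
def bounded_n_discard(n_past, n_keep, requested=None):
    """Return a discard count clamped to the tokens that may be dropped.

    n_past    -- tokens currently in the context (hypothetical name)
    n_keep    -- leading tokens that must never be discarded
    requested -- caller-supplied discard count, or None for the default
    """
    n_left = n_past - n_keep          # tokens eligible for discarding
    if n_left <= 0:
        return 0                      # nothing can be discarded
    # Assumed default: drop half of the eligible tokens.
    n_discard = n_left // 2 if requested is None else requested
    # Clamp so the value is at least 1 and never exceeds n_left.
    return max(1, min(n_discard, n_left))


def shift_context(tokens, n_keep, requested=None):
    """Drop n_discard tokens right after the kept prefix."""
    n_discard = bounded_n_discard(len(tokens), n_keep, requested)
    return tokens[:n_keep] + tokens[n_keep + n_discard:]
```

With an out-of-range request such as `bounded_n_discard(10, 2, 100)`, the sketch falls back to the largest legal value instead of indexing past the end of the context.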

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

Assets 27

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB, 2026-06-19T09:26:02Z

cudart-llama-bin-win-cuda-13.3-x64.zip
sha256:1462a050eb4c684921ba51dcc4cc488a036674c3e73e9945ee705b854808d03e
373 MB, 2026-06-19T09:26:13Z

llama-b9722-bin-android-arm64.tar.gz
sha256:5a64c69346c9ef0058ecfd8ba4c0a32a35df2a0a19dae0a531c51ffe020a545e
73.2 MB, 2026-06-19T09:26:24Z

llama-b9722-bin-macos-arm64.tar.gz
sha256:c81a8f2dbe45947c3511b4fe35802bd56b56592abfe1d1a3aa76ec7328efbbc3
10.4 MB, 2026-06-19T09:26:27Z

llama-b9722-bin-macos-x64.tar.gz
sha256:2528df3eb60ceeaadcbf62175590d9e6c83713824dd347a820c18c05f2ff3f11
10.7 MB, 2026-06-19T09:26:28Z

llama-b9722-bin-ubuntu-arm64.tar.gz
sha256:f0da6842d501166896cf98ba6b73e38e53195b50a98e951c313c43409b1d5462
12 MB, 2026-06-19T09:26:29Z

llama-b9722-bin-ubuntu-openvino-2026.2-x64.tar.gz
sha256:e0daa8453252c5e40b120dda1bf4d4bd8bb4c99ae07bc4dc2b3b277e9370de97
13.5 MB, 2026-06-19T09:26:30Z

llama-b9722-bin-ubuntu-rocm-7.2-x64.tar.gz
sha256:4744192eba055640722ee97f4b34dee23948c475c2d61253c682c731adf6059a
125 MB, 2026-06-19T09:26:31Z

llama-b9722-bin-ubuntu-s390x.tar.gz
sha256:4233eeab199202bf0f7dd9bec6a2f14f237fcc2f9ef0923c46b23307cb4c8b3c
14 MB, 2026-06-19T09:26:35Z

llama-b9722-bin-ubuntu-sycl-fp16-x64.tar.gz
sha256:ded88a14391b0e57c3471b4767e56af1a3cf10741f02850b77b9aff1bbf90d9f
45.5 MB, 2026-06-19T09:26:36Z

Source code (zip)
2026-06-19T08:53:44Z

Source code (tar.gz)
2026-06-19T08:53:44Z
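
Each binary asset is published with a `sha256:` digest. A downloaded archive can be checked against that digest before unpacking. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical and the file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published):
    """Compare a file against a digest written as 'sha256:<hex>' or '<hex>'."""
    expected = published.split(":", 1)[-1].strip().lower()
    return sha256_of(path) == expected
```

For example, `matches_published("llama-b9722-bin-macos-arm64.tar.gz", "sha256:c81a8f2d...")` (with the full digest from the list) should return `True` for an intact download.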

19 Jun 08:09

github-actions

b9721

5fd2dc2

b9721

sync : ggml

Assets 27

19 Jun 07:33

github-actions

b9718

80452d6

b9718

server : consolidate slot selection into get_available_slot (#24755)

Absorb get_slot_by_id logic into get_available_slot so slot selection
is handled by a single function call. When a specific slot id is
requested, the LCP similarity check still runs to enable proper
prompt cache updates.

Assisted-by: pi:llama.cpp/Qwen3.6-27B
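
To make the described behavior concrete, here is a small sketch of slot selection driven by longest-common-prefix (LCP) similarity, where the similarity is still computed when a specific slot id is requested. The `Slot` class, the similarity formula (LCP length divided by prompt length), and the return shape are assumptions for illustration; the server's actual `get_available_slot` is C++ and may differ in all of these.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    id: int
    cached_tokens: list = field(default_factory=list)
    busy: bool = False


def lcp_len(a, b):
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def get_available_slot(slots, prompt, requested_id=-1):
    """Pick one slot in a single call; returns (slot, similarity).

    If requested_id >= 0, only that slot is eligible, but its LCP
    similarity is still computed so the caller can decide how to
    update the prompt cache.
    """
    best, best_sim = None, -1.0
    for slot in slots:
        if slot.busy:
            continue
        sim = lcp_len(slot.cached_tokens, prompt) / max(len(prompt), 1)
        if requested_id >= 0:
            if slot.id == requested_id:
                return slot, sim
            continue
        if sim > best_sim:
            best, best_sim = slot, sim
    return best, (best_sim if best is not None else 0.0)
```

Folding the by-id lookup into the same loop means both paths report a similarity, which is the point the commit message makes about prompt cache updates.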

Assets 27

19 Jun 06:32

github-actions

b9717

8141e73

b9717

ggml-cpu: support K tails in power10 Q8/Q4 MMA matmul (#24753)

ggml-cpu: support K tails in Power10 MMA Q8/Q4 matmul

This patch removes the requirement that K be divisible by kc in the tinyBlas_Q0_PPC tiled matmul path. The final K panel is processed using its actual depth, and the reduced panel size is passed through packing and kernel execution. This allows more workloads to use the MMA kernel and reduces fallback to mnpack.

Apply suggestion from @taronaeo

Co-authored-by: Aaron Teo <taronaeo@gmail.com>
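
The idea of a K tail can be shown without any Power10 specifics. The sketch below tiles a plain matrix multiply along K in panels of depth `kc` and lets the last panel be shorter when K is not a multiple of `kc`. It uses ordinary Python lists and floats rather than quantized blocks or MMA instructions, and the function name is hypothetical.

```python
def matmul_k_tiled(A, B, kc):
    """Multiply A (M x K) by B (K x N), walking K in panels of depth kc.

    The final panel uses its actual depth, min(kc, K - k0), so K does
    not need to be divisible by kc.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for k0 in range(0, K, kc):
        depth = min(kc, K - k0)                       # reduced for the tail
        a_panel = [row[k0:k0 + depth] for row in A]   # "pack" A panel
        b_panel = B[k0:k0 + depth]                    # "pack" B panel
        for i in range(M):
            for j in range(N):
                acc = 0
                for p in range(depth):                # kernel sees real depth
                    acc += a_panel[i][p] * b_panel[p][j]
                C[i][j] += acc
    return C
```

With K = 5 and kc = 4, the loop runs one full panel of depth 4 and one tail panel of depth 1, instead of rejecting the shape and falling back to a different code path.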

Contributors

taronaeo

Assets 27

19 Jun 03:55

github-actions

b9716

db52540

b9716

mtmd: add batching support for internvl (#24775)

Assets 27

19 Jun 03:25

github-actions

b9715

3a3edc9

b9715

Ggml/cuda col2im 1d (#24417)

cuda: add GGML_OP_COL2IM_1D, follow-up to the CPU op
cuda: col2im_1d use fast_div_modulo for the index decomposition
cuda: col2im_1d tighten supports_op, type match and contiguous dst
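
For readers unfamiliar with the operation, a 1D col2im scatters (or, equivalently, gathers) unfolded columns back into a signal, summing overlapping contributions. The sketch below is a generic reference version in Python, written in the gather form that a per-output-element GPU kernel would use: each flat output index is decomposed into a channel and a position with `divmod`. The column layout (`cols[c * kernel + k][l]`), the parameter names, and the output-length formula are assumptions for this example and are not taken from the GGML op.

```python
def col2im_1d(cols, n_channels, kernel, length_in, stride=1, pad=0):
    """Reference 1D col2im (gather form).

    cols has n_channels * kernel rows and length_in columns, with row
    index c * kernel + k (assumed layout). Returns a flat list of
    n_channels * length_out values.
    """
    length_out = (length_in - 1) * stride + kernel - 2 * pad
    out = [0.0] * (n_channels * length_out)
    for idx in range(len(out)):
        # Index decomposition: flat index -> (channel, output position).
        c, t = divmod(idx, length_out)
        acc = 0.0
        for k in range(kernel):
            num = t + pad - k
            if num < 0 or num % stride != 0:
                continue                  # no input column maps here
            l = num // stride
            if l < length_in:
                acc += cols[c * kernel + k][l]
        out[idx] = acc
    return out
```

On a GPU, the `divmod` in the outer loop is the part that a precomputed fast division helper would replace, since integer division is comparatively expensive there.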

Assets 27

19 Jun 02:57

github-actions

b9714

40f3aaf

b9714

server: add "X-Accel-Buffering": "no" header to streaming endpoints (#24774)

server: add "X-Accel-Buffering": "no" header to streaming endpoints

This header tells Nginx, when it is used as a reverse proxy, not to buffer responses. It is only added on streaming endpoints.
Without it, Nginx will break streaming with certain applications (notably the Pi coding harness).
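
As an illustration of the header in context, here is a minimal sketch of a streaming handler built on Python's standard `http.server` that sends `X-Accel-Buffering: no` alongside typical server-sent-events headers. The helper and handler names are hypothetical, and the other headers are common choices for streaming rather than a description of what llama.cpp's server emits.

```python
from http.server import BaseHTTPRequestHandler


def streaming_headers(content_type="text/event-stream"):
    """Headers for a streamed response that a proxy should not buffer."""
    return {
        "Content-Type": content_type,
        "Cache-Control": "no-cache",
        # Asks Nginx (as a reverse proxy) to pass chunks through unbuffered.
        "X-Accel-Buffering": "no",
    }


class StreamHandler(BaseHTTPRequestHandler):
    """Toy handler that streams three events, flushing after each one."""

    def do_GET(self):
        self.send_response(200)
        for name, value in streaming_headers().items():
            self.send_header(name, value)
        self.end_headers()
        for i in range(3):
            self.wfile.write(f"data: chunk {i}\n\n".encode())
            self.wfile.flush()
```

The handler can be served with `http.server.HTTPServer(("127.0.0.1", 8080), StreamHandler).serve_forever()` and placed behind Nginx to observe the difference with and without the header.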

macOS/iOS: