Releases: ggml-org/llama.cpp

b9722

19 Jun 09:26
159d093

server: fix non-bound n_discard value (ctx shifting) (#24786)

  • server: fix non-bound n_discard value

  • Update tools/server/server-context.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
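The release note does not show the patch itself. As a rough illustration of what bounding n_discard during a context shift can look like, here is a minimal sketch. The helper name `bounded_n_discard` and its parameters are hypothetical and are not taken from tools/server/server-context.cpp; the assumption is that a shift keeps the first n_keep tokens and may discard at most the tokens that follow them.

```cpp
#include <algorithm>

// Hypothetical helper, not the actual llama.cpp code.
// n_past:              tokens currently in the context
// n_keep:              tokens at the start that must survive the shift
// requested_n_discard: caller-supplied value; <= 0 means "use the default"
int bounded_n_discard(int n_past, int n_keep, int requested_n_discard) {
    // Tokens that are eligible for discarding (never negative).
    const int n_left = std::max(0, n_past - n_keep);

    // Assumed default: drop half of the eligible tokens.
    const int n_discard = requested_n_discard > 0 ? requested_n_discard : n_left / 2;

    // The bound: never discard more than what is actually available.
    return std::min(n_discard, n_left);
}
```

With this bound, an oversized request such as `bounded_n_discard(100, 10, 1000)` is reduced to the 90 tokens that can actually be dropped.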

macOS/iOS:

Linux:

Android:

Windows:

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)

UI:

b9721

19 Jun 08:09

b9718

19 Jun 07:33
80452d6

server : consolidate slot selection into get_available_slot (#24755)

Absorb get_slot_by_id logic into get_available_slot so slot selection
is handled by a single function call. When a specific slot id is
requested, the LCP similarity check still runs to enable proper
prompt cache updates.

Assisted-by: pi:llama.cpp/Qwen3.6-27B
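To illustrate the idea of a single selection function that still computes the longest-common-prefix (LCP) similarity when a specific slot id is requested, here is a minimal sketch. The types `Slot` and `SlotChoice`, the function `pick_slot`, and the definition of similarity as LCP length divided by prompt length are all assumptions made for illustration; the real get_available_slot in the server may differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical slot description: id, busy flag, and the tokens it has cached.
struct Slot {
    int              id;
    bool             busy;
    std::vector<int> cached_tokens;
};

// Result of the selection: slot_id is -1 when no slot qualifies.
struct SlotChoice {
    int   slot_id;
    float similarity;
};

// Length of the longest common prefix of two token sequences.
size_t common_prefix_len(const std::vector<int> & a, const std::vector<int> & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        n++;
    }
    return n;
}

// One entry point for both cases: requested_id < 0 means "any free slot".
SlotChoice pick_slot(const std::vector<Slot> & slots,
                     const std::vector<int>  & prompt,
                     int requested_id = -1) {
    SlotChoice best = { -1, 0.0f };
    for (const Slot & s : slots) {
        if (s.busy) {
            continue;
        }
        if (requested_id >= 0 && s.id != requested_id) {
            continue;
        }
        // The similarity is computed even for an explicitly requested slot,
        // so the caller can still use it to decide on prompt cache updates.
        const float sim = prompt.empty()
            ? 0.0f
            : float(common_prefix_len(s.cached_tokens, prompt)) / float(prompt.size());
        if (best.slot_id < 0 || sim > best.similarity) {
            best = { s.id, sim };
        }
    }
    return best;
}
```

Calling `pick_slot(slots, prompt)` returns the free slot with the best prefix match, while `pick_slot(slots, prompt, 1)` restricts the search to slot 1 and reports its similarity.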

b9717

19 Jun 06:32
8141e73

ggml-cpu: support K tails in power10 Q8/Q4 MMA matmul (#24753)

  • ggml-cpu: support K tails in Power10 MMA Q8/Q4 matmul

This patch removes the requirement that K be divisible by kc in the tinyBlas_Q0_PPC tiled matmul path. The final K panel is processed using its actual depth, and the reduced panel size is passed through packing and kernel execution. This allows more workloads to use the MMA kernel and reduces fallback to mnpack.

Co-authored-by: Aaron Teo <taronaeo@gmail.com>
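The K-tail handling described above can be sketched in plain scalar C++. This is not the Power10 MMA kernel and does not use quantized types; the function name `matmul_k_panels` and the row-major float layout are assumptions chosen only to show how the last K panel can run with its actual depth instead of requiring K to be a multiple of kc.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// C (M x N) = A (M x K) * B (K x N), all row-major.
// K is walked in panels of at most kc (kc must be > 0).
void matmul_k_panels(const std::vector<float> & A,
                     const std::vector<float> & B,
                     std::vector<float>       & C,
                     size_t M, size_t N, size_t K, size_t kc) {
    std::fill(C.begin(), C.end(), 0.0f);
    for (size_t k0 = 0; k0 < K; k0 += kc) {
        // The final panel may be shorter than kc: use its real depth.
        const size_t depth = std::min(kc, K - k0);
        for (size_t i = 0; i < M; ++i) {
            for (size_t j = 0; j < N; ++j) {
                float acc = 0.0f;
                for (size_t k = 0; k < depth; ++k) {
                    acc += A[i * K + (k0 + k)] * B[(k0 + k) * N + j];
                }
                C[i * N + j] += acc;
            }
        }
    }
}
```

With K = 5 and kc = 4, the loop runs one full panel of depth 4 followed by a tail panel of depth 1, so no separate fallback path is needed for the remainder.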

b9716

19 Jun 03:55
db52540

b9715

19 Jun 03:25
3a3edc9

Ggml/cuda col2im 1d (#24417)

  • cuda: add GGML_OP_COL2IM_1D, follow-up to the CPU op

  • cuda: col2im_1d use fast_div_modulo for the index decomposition

  • cuda: col2im_1d tighten supports_op, type match and contiguous dst
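For readers unfamiliar with the operation, here is a generic CPU sketch of a 1D col2im written in the "one output element per iteration" style that a GPU kernel would typically use. The tensor layout, the parameter names, and the function `col2im_1d` are assumptions for illustration and do not claim to match the layout of GGML_OP_COL2IM_1D. The division and modulo that split the flat index into channel and position correspond to the kind of index decomposition that a fast division helper is meant to speed up.

```cpp
#include <cstddef>
#include <vector>

// Assumed layout (not necessarily ggml's):
//   col: [C * K, L_out] row-major, where L_out = (L_in + 2*pad - K) / stride + 1
//   out: [C, L_in] row-major
std::vector<float> col2im_1d(const std::vector<float> & col,
                             int C, int K, int L_in, int stride, int pad) {
    const int L_out = (L_in + 2 * pad - K) / stride + 1;
    std::vector<float> out(static_cast<size_t>(C) * L_in, 0.0f);

    for (int idx = 0; idx < C * L_in; ++idx) {
        // Index decomposition: flat output index -> (channel, position).
        const int c   = idx / L_in;
        const int pos = idx % L_in;

        // Gather every column entry that was produced from this position.
        float sum = 0.0f;
        for (int k = 0; k < K; ++k) {
            const int t = pos + pad - k;
            if (t < 0 || t % stride != 0) {
                continue;
            }
            const int l = t / stride;
            if (l >= L_out) {
                continue;
            }
            sum += col[(static_cast<size_t>(c) * K + k) * L_out + l];
        }
        out[idx] = sum;
    }
    return out;
}
```

Because each output element is computed independently, this form needs no atomic additions, which is one reason a gather formulation is a common choice on GPUs.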

b9714

19 Jun 02:57
40f3aaf

server: add "X-Accel-Buffering": "no" header to streaming endpoints (#24774)

  • server: add "X-Accel-Buffering": "no" header to streaming endpoints

This header tells Nginx, when it acts as a reverse proxy, not to buffer responses. It is only added on streaming endpoints.
Without it, Nginx buffering breaks streaming for certain applications (notably the Pi coding harness).
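As a minimal sketch of the behavior described above, the following shows a header set that includes "X-Accel-Buffering: no" only for streaming responses. The function `make_response_headers`, the `Headers` alias, and the Content-Type values are assumptions for illustration; the server's actual code and HTTP library calls are not shown in the release note.

```cpp
#include <map>
#include <string>

using Headers = std::map<std::string, std::string>;

// Hypothetical helper: builds response headers for a streaming
// or a non-streaming endpoint.
Headers make_response_headers(bool streaming) {
    Headers h;
    if (streaming) {
        // Assumed content type for a server-sent-events style stream.
        h["Content-Type"] = "text/event-stream";
        // Ask an Nginx reverse proxy not to buffer this response.
        h["X-Accel-Buffering"] = "no";
    } else {
        h["Content-Type"] = "application/json";
    }
    return h;
}
```

Non-streaming responses are left untouched, so the proxy can keep buffering them as usual.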

b9713

19 Jun 02:26
a6b3260

b9712

19 Jun 01:59
32eddaf

b9711

19 Jun 01:33
060ce1b

mtmd: refactor llava-uhd overview image handling (always use ov_img_first) (#24769)

  • add dedicated "overview" for mtmd_image_preproc_out

  • corrections

  • correct (again)

  • nits

  • nits (2)
