server : implement prompt processing progress report in stream mode by ngxson · Pull Request #15827 · ggml-org/llama.cpp

ngxson · 2025-09-06T04:27:06Z

Supersedes #14731

When a request includes "return_progress": true and "stream": true, the server returns a prompt progress object in the stream.

The progress object will look like this:

"prompt_progress":{"total":237,"cache":0,"processed":128,"time_ms":181}

If part of the message is cached:

"prompt_progress":{"total":237,"cache":230,"processed":237,"time_ms":30}
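A client could consume these objects roughly as follows. This is a minimal sketch, not the server's reference client: it assumes the stream is delivered as server-sent-event lines of the form `data: {...}`, that `prompt_progress` sits at the top level of each chunk, and that `processed` includes the cached tokens (as the second sample above suggests). The helper names `iter_prompt_progress` and `progress_fractions` are hypothetical, and the sample lines reuse the two objects above rather than a captured response.

```python
import json
from typing import Iterable, Iterator


def iter_prompt_progress(sse_lines: Iterable[str]) -> Iterator[dict]:
    """Yield each prompt_progress object found in a sequence of SSE lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(payload)
        progress = chunk.get("prompt_progress")  # assumed top-level key
        if progress is not None:
            yield progress


def progress_fractions(p: dict) -> tuple[float, float]:
    """Return (overall, uncached) progress, each in the range 0..1.

    overall  = processed / total
    uncached = (processed - cache) / (total - cache), i.e. progress over the
               tokens that actually had to be evaluated for this request.
    """
    total, cache, processed = p["total"], p["cache"], p["processed"]
    overall = processed / total if total else 1.0
    remaining = total - cache
    uncached = (processed - cache) / remaining if remaining else 1.0
    return overall, uncached


# Illustrative input built from the two sample objects in the description.
sample_stream = [
    'data: {"choices": [], "prompt_progress": {"total": 237, "cache": 0, "processed": 128, "time_ms": 181}}',
    "",
    'data: {"choices": [], "prompt_progress": {"total": 237, "cache": 230, "processed": 237, "time_ms": 30}}',
    "",
    "data: [DONE]",
]

for p in iter_prompt_progress(sample_stream):
    overall, uncached = progress_fractions(p)
    print(f"{p['processed']}/{p['total']} tokens, overall {overall:.0%}, uncached {uncached:.0%}")
```

Reporting the uncached fraction separately keeps a progress bar from jumping straight to a high value when most of the prompt was served from cache.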

For convenience, the number of cached tokens is also added to the timings object. This is useful for calculating context usage after a message is generated: the number of used context tokens equals prompt_n + cache_n + predicted_n.

{
  "choices": [],
  "created": 1757141666,
  "id": "chatcmpl-ecQULm0WqPrftUqjPZO1CFYeDjGZNbDu",
  ...
  "timings": {
    "cache_n": 236,
    "prompt_n": 1,
    "prompt_ms": 30.958,
    "prompt_per_token_ms": 30.958,
    "prompt_per_second": 32.301828283480845,
    "predicted_n": 35,
    "predicted_ms": 661.064,
    "predicted_per_token_ms": 18.887542857142858,
    "predicted_per_second": 52.94494935437416
  }
}
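As a quick check of the formula, the counts in the sample above add up as shown below. This is only an illustrative sketch; `context_tokens_used` is a hypothetical helper name, and the dictionary copies just the three token counts from the sample response.

```python
def context_tokens_used(timings: dict) -> int:
    """Sum the token counts that occupy the context after generation."""
    return timings["prompt_n"] + timings["cache_n"] + timings["predicted_n"]


# Token counts copied from the sample timings object above.
timings = {"cache_n": 236, "prompt_n": 1, "predicted_n": 35}

print(context_tokens_used(timings))  # 1 + 236 + 35 = 272
```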

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ExtReMLapin · 2025-09-06T11:36:09Z

Well that was fast

BradHutchings · 2025-09-06T17:59:28Z

Wow, thank you! I look forward to trying this out!

BradHutchings · 2025-09-07T01:33:08Z

This works great. It was easy enough to update my client code from how the previous PR worked. Thank you again @ngxson!

…gml-org#15827)
* server : implement `return_progress`
* add timings.cache_n
* add progress.time_ms
* add test
* fix test for chat/completions
* readme: add docs on timings
* use ggml_time_us
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

narendrachaudhary51 · 2025-09-08T04:37:24Z

One issue: I can no longer see real-time tokens/s in the server's web UI. After this change, tokens/s appears only at the end of generation. The earlier behavior was particularly useful.

ngxson · 2025-09-08T14:14:52Z

> One issue: I can no longer see real-time tokens/s in the server's web UI. After this change, tokens/s appears only at the end of generation. The earlier behavior was particularly useful.

Hmm, OK, I accidentally removed one line of code. Fixing it now.

server : implement return_progress

29f1d50

ngxson mentioned this pull request Sep 6, 2025

feat: Add optional prompt processing progress streaming #14731

Closed

github-actions Bot added examples server labels Sep 6, 2025

ngxson added 3 commits September 6, 2025 13:49

add timings.cache_n

4404ad8

add progress.time_ms

e166a55

add test

f4213cc

github-actions Bot added the python python script changes label Sep 6, 2025

fix test for chat/completions

ebcef91

ngxson marked this pull request as ready for review September 6, 2025 10:28

readme: add docs on timings

b6ac24c

ngxson requested review from allozaur and ggerganov September 6, 2025 10:36

ggerganov approved these changes Sep 6, 2025

ggerganov reviewed Sep 6, 2025

Comment thread tools/server/server.cpp Outdated

use ggml_time_us

053dc6b

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ngxson merged commit 61bdfd5 into master Sep 6, 2025
49 of 50 checks passed

ngxson mentioned this pull request Aug 28, 2025

changelog : llama-server REST API #9291

Open

mostlygeek mentioned this pull request Sep 6, 2025

Support llama.cpp's cache_n in timings info mostlygeek/llama-swap#287

Merged

ngxson mentioned this pull request Sep 8, 2025

server : bring back timings_per_token #15879

Merged

ngxson deleted the xsn/server_progress_api branch October 5, 2025 11:28

jakexcosme mentioned this pull request Oct 22, 2025

changelog : llama-server REST API COG-GTM/llama.cpp#245

Open

ngxson merged 7 commits into master from xsn/server_progress_api

5 participants
