Skip to content

server : implement prompt processing progress report in stream mode#15827

Merged
ngxson merged 7 commits into
masterfrom
xsn/server_progress_api
Sep 6, 2025
Merged

server : implement prompt processing progress report in stream mode#15827
ngxson merged 7 commits into
masterfrom
xsn/server_progress_api

Conversation

@ngxson

@ngxson ngxson commented Sep 6, 2025

Copy link
Copy Markdown
Collaborator

Supersede #14731

By including "return_progress": true and "stream": true in the request, the server will return prompt progress object in the stream.

The progress object will look like this:

"prompt_progress":{"total":237,"cache":0,"processed":128,"time_ms":181}

If part of the message is cached:

"prompt_progress":{"total":237,"cache":230,"processed":237,"time_ms":30}

For convenient, the number of cached tokens is also added to the timings object. This is useful for calculating the context usage after a message is generated. The number of used context tokens is equal to prompt_n + cache_n + predicted_n

{
  "choices": [],
  "created": 1757141666,
  "id": "chatcmpl-ecQULm0WqPrftUqjPZO1CFYeDjGZNbDu",
  ...
  "timings": {
    "cache_n": 236,
    "prompt_n": 1,
    "prompt_ms": 30.958,
    "prompt_per_token_ms": 30.958,
    "prompt_per_second": 32.301828283480845,
    "predicted_n": 35,
    "predicted_ms": 661.064,
    "predicted_per_token_ms": 18.887542857142858,
    "predicted_per_second": 52.94494935437416
  }
}

@github-actions github-actions Bot added the python python script changes label Sep 6, 2025
@ngxson ngxson marked this pull request as ready for review September 6, 2025 10:28
@ngxson ngxson requested review from allozaur and ggerganov September 6, 2025 10:36
Comment thread tools/server/server.cpp Outdated
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@ngxson ngxson merged commit 61bdfd5 into master Sep 6, 2025
49 of 50 checks passed
@ExtReMLapin

Copy link
Copy Markdown
Contributor

Well that was fast

@BradHutchings

Copy link
Copy Markdown

Wow, thank you! I look forward to trying this out!

@BradHutchings

Copy link
Copy Markdown

This works great. It was easy enough to update my client code from how the previous PR worked. Thank you again @ngxson!

walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
…gml-org#15827)

* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@narendrachaudhary51

Copy link
Copy Markdown

One issue is that I cannot see real time tokens/s for a user on webui of server. After this change, I only see token/s at the end of generation. Earlier feature was particularly useful.

@ngxson

ngxson commented Sep 8, 2025

Copy link
Copy Markdown
Collaborator Author

One issue is that I cannot see real time tokens/s for a user on webui of server. After this change, I only see token/s at the end of generation. Earlier feature was particularly useful.

Hmm ok I accidentally remove one line of code, fixing it now

@ngxson ngxson deleted the xsn/server_progress_api branch October 5, 2025 11:28
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
…#15827)

* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
…gml-org#15827)

* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
…gml-org#15827)

* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
…gml-org#15827)

* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
…gml-org#15827)

* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
phibya pushed a commit to ziee-ai/llama.cpp that referenced this pull request May 29, 2026
…gml-org#15827)

* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
fewtarius pushed a commit to fewtarius/CachyLLama that referenced this pull request May 30, 2026
…gml-org#15827)

* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

examples python python script changes server

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants