server : (refactor) no more json in server_task input#10691

Merged
ngxson merged 8 commits into ggml-org:master from ngxson:xsn/refactor_server_struct_input
Dec 7, 2024

ngxson commented Dec 6, 2024

Collaborator

Continue #10643

server_task_result is already split into multiple derived classes (polymorphism). This reduces code complexity because each result type differs from the others.
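To make the polymorphism point concrete, here is a minimal sketch of the pattern. All names (`result_base`, `completion_result`, `embedding_result`, `describe`) are hypothetical stand-ins chosen for illustration; the actual server_task_result hierarchy in server.cpp has different members.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a result base class (not the real server_task_result).
struct result_base {
    virtual ~result_base() = default;
    // Each derived result type knows how to format itself.
    virtual std::string to_text() const = 0;
};

// A completion result carries generated text.
struct completion_result : result_base {
    std::string content;
    explicit completion_result(std::string c) : content(std::move(c)) {}
    std::string to_text() const override { return "completion:" + content; }
};

// An embedding result carries a vector of floats instead.
struct embedding_result : result_base {
    std::vector<float> values;
    explicit embedding_result(std::vector<float> v) : values(std::move(v)) {}
    std::string to_text() const override {
        return "embedding:" + std::to_string(values.size());
    }
};

// The caller works through the base reference, with no per-type branching.
inline std::string describe(const result_base & r) {
    return r.to_text();
}
```

Because the result types have little in common, giving each its own class keeps unrelated fields apart.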

However, server_task can't benefit from the same approach, because most request types share the same parameters.

The solution introduced by this PR is to put everything into server_task. JSON parsing is also now done on the HTTP thread: upon receiving a request, the HTTP thread parses the JSON into one or more server_task objects and pushes them to server_queue.
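The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `decoded_request` stands in for a request body that has already been parsed from JSON, and `sketch_task`, `sketch_queue`, and `submit_request` are hypothetical names that do not match the real server_task and server_queue definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a request body already decoded from JSON.
struct decoded_request {
    std::vector<std::string> prompts;
    int  n_predict = -1;
    bool stream    = false;
};

enum class sketch_task_type { completion, embedding };

// One flat struct for every task type: typed fields, no JSON blob.
struct sketch_task {
    int              id        = -1;
    sketch_task_type type      = sketch_task_type::completion;
    std::string      prompt;
    int              n_predict = -1;
    bool             stream    = false;
};

// Minimal thread-safe queue (assumption: the real queue does much more).
class sketch_queue {
public:
    void post(sketch_task task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(task));
    }
    bool pop(sketch_task & out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) {
            return false;
        }
        out = std::move(tasks_.front());
        tasks_.pop_front();
        return true;
    }
    std::size_t size() {
        std::lock_guard<std::mutex> lock(mutex_);
        return tasks_.size();
    }
private:
    std::mutex              mutex_;
    std::deque<sketch_task> tasks_;
};

// Runs on the HTTP thread: builds one task per prompt and posts it.
// Returns the next unused task id.
inline int submit_request(const decoded_request & req, sketch_task_type type,
                          sketch_queue & queue, int next_id) {
    for (const auto & prompt : req.prompts) {
        sketch_task task;
        task.id        = next_id++;
        task.type      = type;
        task.prompt    = prompt;
        task.n_predict = req.n_predict;
        task.stream    = req.stream;
        queue.post(std::move(task));
    }
    return next_id;
}
```

In this sketch, the consumer of the queue only ever sees typed fields, so malformed input would have to be rejected on the HTTP thread before a task is created.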


Example of /slots response:

[
  {
    "id": 0,
    "id_task": -1,
    "n_ctx": 1024,
    "speculative": false,
    "is_processing": false,
    "params": {
      "n_predict": -1,
      "seed": 4294967295,
      "temperature": 0.800000011920929,
      "dynatemp_range": 0.0,
      "dynatemp_exponent": 1.0,
      "top_k": 40,
      "top_p": 0.949999988079071,
      "min_p": 0.05000000074505806,
      "xtc_probability": 0.0,
      "xtc_threshold": 0.10000000149011612,
      "typical_p": 1.0,
      "repeat_last_n": 64,
      "repeat_penalty": 1.0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0,
      "dry_multiplier": 0.0,
      "dry_base": 1.75,
      "dry_allowed_length": 2,
      "dry_penalty_last_n": -1,
      "dry_sequence_breakers": [
        "\n",
        ":",
        "\"",
        "*"
      ],
      "mirostat": 0,
      "mirostat_tau": 5.0,
      "mirostat_eta": 0.10000000149011612,
      "penalize_nl": false,
      "stop": [],
      "max_tokens": -1,
      "n_keep": 0,
      "n_discard": 0,
      "ignore_eos": false,
      "stream": true,
      "n_probs": 0,
      "min_keep": 0,
      "grammar": "",
      "samplers": [
        "dry",
        "top_k",
        "typ_p",
        "top_p",
        "min_p",
        "xtc",
        "temperature"
      ],
      "speculative.n_max": 16,
      "speculative.n_min": 5,
      "speculative.p_min": 0.8999999761581421,
      "timings_per_token": false
    },
    "prompt": "",
    "next_token": {
      "has_next_token": true,
      "has_new_line": false,
      "n_remain": -1,
      "n_decoded": 0,
      "stopping_word": ""
    }
  }
]

Example of /props response:

{
  "default_generation_settings": {
    "id": 0,
    "id_task": -1,
    "n_ctx": 1024,
    "speculative": false,
    "is_processing": false,
    "params": {
      "n_predict": -1,
      "seed": 4294967295,
      "temperature": 0.800000011920929,
      "dynatemp_range": 0.0,
      "dynatemp_exponent": 1.0,
      "top_k": 40,
      "top_p": 0.949999988079071,
      "min_p": 0.05000000074505806,
      "xtc_probability": 0.0,
      "xtc_threshold": 0.10000000149011612,
      "typical_p": 1.0,
      "repeat_last_n": 64,
      "repeat_penalty": 1.0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0,
      "dry_multiplier": 0.0,
      "dry_base": 1.75,
      "dry_allowed_length": 2,
      "dry_penalty_last_n": -1,
      "dry_sequence_breakers": [
        "\n",
        ":",
        "\"",
        "*"
      ],
      "mirostat": 0,
      "mirostat_tau": 5.0,
      "mirostat_eta": 0.10000000149011612,
      "penalize_nl": false,
      "stop": [],
      "max_tokens": -1,
      "n_keep": 0,
      "n_discard": 0,
      "ignore_eos": false,
      "stream": true,
      "n_probs": 0,
      "min_keep": 0,
      "grammar": "",
      "samplers": [
        "dry",
        "top_k",
        "typ_p",
        "top_p",
        "min_p",
        "xtc",
        "temperature"
      ],
      "speculative.n_max": 16,
      "speculative.n_min": 5,
      "speculative.p_min": 0.8999999761581421,
      "timings_per_token": false
    },
    "prompt": "",
    "next_token": {
      "has_next_token": true,
      "has_new_line": false,
      "n_remain": -1,
      "n_decoded": 0,
      "stopping_word": ""
    }
  },
  "total_slots": 1,
  "model_path": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
  "chat_template": "..."
}

ngxson requested a review from ggerganov on December 6, 2024 14:12
github-actions bot added the examples, python (python script changes), and server labels Dec 6, 2024
Comment thread examples/server/server.cpp Outdated

std::vector<std::string> antiprompt;
bool timings_per_token = false;
bool ignore_eos = false;

Member

With new models the ignore_eos functionality is losing relevance. There are now many different "end-of-generation" tokens and it's not just a single EOS token anymore. We should remove this logic and only support logit biases, which is more general. Just a note, no need to do it in this PR.
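As a rough illustration of why logit biases are the more general mechanism, the sketch below bans a whole set of end-of-generation tokens by giving each a bias of negative infinity. The names (`token_bias`, `ban_tokens`, `apply_bias`) and the plain `std::vector<float>` of logits are assumptions made for this example, not the server's actual types.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using token_id = int;

// Hypothetical pair of a token and the value added to its logit.
struct token_bias {
    token_id token;
    float    bias;
};

// Build a bias list that bans every given end-of-generation token.
// A single-EOS "ignore_eos" flag is the special case of a one-element list.
inline std::vector<token_bias> ban_tokens(const std::vector<token_id> & eog_tokens) {
    std::vector<token_bias> out;
    out.reserve(eog_tokens.size());
    for (token_id t : eog_tokens) {
        out.push_back({t, -std::numeric_limits<float>::infinity()});
    }
    return out;
}

// Add each bias to the matching logit; out-of-range tokens are skipped.
inline void apply_bias(std::vector<float> & logits, const std::vector<token_bias> & biases) {
    for (const auto & b : biases) {
        if (b.token >= 0 && static_cast<std::size_t>(b.token) < logits.size()) {
            logits[static_cast<std::size_t>(b.token)] += b.bias;
        }
    }
}
```

The same `apply_bias` call would also handle finite biases, which a boolean flag cannot express.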

ngxson merged commit 3573fa8 into ggml-org:master Dec 7, 2024
@ggerganov

Member

This change breaks the infill endpoint - it produces mostly garbage.

ngxson commented Dec 8, 2024

Collaborator Author

Hmm, OK, it could be because the infill "template" is not being applied correctly. I'll add a test with a Qwen model (run locally, not on CI).

ngxson commented Dec 8, 2024

Collaborator Author

I'm on it, will make a PR

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* server : (refactor) no more json in server_task input

* add test for slots endpoint

* add tests for /props and /slots

* remove task inf_type

* fix CI by adding safe_json_to_str

* add "model_path" to /props

* update readme
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* server : (refactor) no more json in server_task input

* add test for slots endpoint

* add tests for /props and /slots

* remove task inf_type

* fix CI by adding safe_json_to_str

* add "model_path" to /props

* update readme
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
* server : (refactor) no more json in server_task input

* add test for slots endpoint

* add tests for /props and /slots

* remove task inf_type

* fix CI by adding safe_json_to_str

* add "model_path" to /props

* update readme
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
* server : (refactor) no more json in server_task input

* add test for slots endpoint

* add tests for /props and /slots

* remove task inf_type

* fix CI by adding safe_json_to_str

* add "model_path" to /props

* update readme
phibya pushed a commit to ziee-ai/llama.cpp that referenced this pull request May 29, 2026
* server : (refactor) no more json in server_task input

* add test for slots endpoint

* add tests for /props and /slots

* remove task inf_type

* fix CI by adding safe_json_to_str

* add "model_path" to /props

* update readme
AlexiAlp pushed a commit to minghaop/llama.cpp that referenced this pull request Jun 2, 2026
* server : (refactor) no more json in server_task input

* add test for slots endpoint

* add tests for /props and /slots

* remove task inf_type

* fix CI by adding safe_json_to_str

* add "model_path" to /props

* update readme