server : (refactor) no more json in server_task input by ngxson · Pull Request #10691 · ggml-org/llama.cpp

ngxson · 2024-12-06T14:12:59Z

Continue #10643

server_task_result is already split into multiple derived classes (polymorphism). This reduces code complexity because each result type is different from the others.

However, server_task can't benefit from the same approach, because most requests share the same parameters.

The solution introduced by this PR is to put everything into server_task. JSON parsing is also now done on the HTTP thread: upon receiving a request, the HTTP thread parses the JSON into one or more server_task objects and pushes them to server_queue.
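As a rough illustration of the flow described above, the sketch below parses a request into plain task structs on the calling (HTTP) thread and then posts them to a mutex-protected queue for a worker to consume. It is not the PR's code: `task`, `task_type`, `task_queue`, `parse_request`, and the string-map `request_body` are hypothetical stand-ins, and a real server would use a JSON library and carry many more fields.

```cpp
#include <cassert>
#include <condition_variable>
#include <map>
#include <mutex>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an already-decoded JSON request body.
using request_body = std::map<std::string, std::string>;

// Hypothetical task kinds; one flat struct carries the fields for all of them.
enum class task_type { completion, embedding };

struct task {
    int         id        = -1;
    task_type   type      = task_type::completion;
    std::string prompt;            // used by every kind in this sketch
    int         n_predict = -1;    // only meaningful for completion
    bool        stream    = false; // only meaningful for completion
};

// Runs on the HTTP thread: builds one typed task per prompt.
// std::stoi throws on a malformed number, so bad input is rejected
// before anything reaches the queue.
std::vector<task> parse_request(task_type type, const request_body & body,
                                const std::vector<std::string> & prompts, int first_id) {
    task base;
    base.type = type;
    auto it_n = body.find("n_predict");
    if (it_n != body.end()) {
        base.n_predict = std::stoi(it_n->second);
    }
    auto it_s = body.find("stream");
    if (it_s != body.end()) {
        base.stream = (it_s->second == "true");
    }
    std::vector<task> tasks;
    tasks.reserve(prompts.size());
    for (const std::string & p : prompts) {
        task t = base;
        t.id     = first_id++;
        t.prompt = p;
        tasks.push_back(std::move(t));
    }
    return tasks;
}

// Minimal thread-safe FIFO: the HTTP thread posts, a worker thread pops.
class task_queue {
public:
    void post(task t) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(t));
        }
        cond_.notify_one();
    }
    // Blocks until a task is available.
    task pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return !tasks_.empty(); });
        assert(!tasks_.empty());
        task t = std::move(tasks_.front());
        tasks_.pop();
        return t;
    }
private:
    std::mutex              mutex_;
    std::condition_variable cond_;
    std::queue<task>        tasks_;
};
```

With this split, the worker only ever sees fully typed structs, so a malformed request fails on the thread that can still answer the client.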

Example of /slots response:

[
  {
    "id": 0,
    "id_task": -1,
    "n_ctx": 1024,
    "speculative": false,
    "is_processing": false,
    "params": {
      "n_predict": -1,
      "seed": 4294967295,
      "temperature": 0.800000011920929,
      "dynatemp_range": 0.0,
      "dynatemp_exponent": 1.0,
      "top_k": 40,
      "top_p": 0.949999988079071,
      "min_p": 0.05000000074505806,
      "xtc_probability": 0.0,
      "xtc_threshold": 0.10000000149011612,
      "typical_p": 1.0,
      "repeat_last_n": 64,
      "repeat_penalty": 1.0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0,
      "dry_multiplier": 0.0,
      "dry_base": 1.75,
      "dry_allowed_length": 2,
      "dry_penalty_last_n": -1,
      "dry_sequence_breakers": [
        "\n",
        ":",
        "\"",
        "*"
      ],
      "mirostat": 0,
      "mirostat_tau": 5.0,
      "mirostat_eta": 0.10000000149011612,
      "penalize_nl": false,
      "stop": [],
      "max_tokens": -1,
      "n_keep": 0,
      "n_discard": 0,
      "ignore_eos": false,
      "stream": true,
      "n_probs": 0,
      "min_keep": 0,
      "grammar": "",
      "samplers": [
        "dry",
        "top_k",
        "typ_p",
        "top_p",
        "min_p",
        "xtc",
        "temperature"
      ],
      "speculative.n_max": 16,
      "speculative.n_min": 5,
      "speculative.p_min": 0.8999999761581421,
      "timings_per_token": false
    },
    "prompt": "",
    "next_token": {
      "has_next_token": true,
      "has_new_line": false,
      "n_remain": -1,
      "n_decoded": 0,
      "stopping_word": ""
    }
  }
]

Example of /props response:

{
  "default_generation_settings": {
    "id": 0,
    "id_task": -1,
    "n_ctx": 1024,
    "speculative": false,
    "is_processing": false,
    "params": {
      "n_predict": -1,
      "seed": 4294967295,
      "temperature": 0.800000011920929,
      "dynatemp_range": 0.0,
      "dynatemp_exponent": 1.0,
      "top_k": 40,
      "top_p": 0.949999988079071,
      "min_p": 0.05000000074505806,
      "xtc_probability": 0.0,
      "xtc_threshold": 0.10000000149011612,
      "typical_p": 1.0,
      "repeat_last_n": 64,
      "repeat_penalty": 1.0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0,
      "dry_multiplier": 0.0,
      "dry_base": 1.75,
      "dry_allowed_length": 2,
      "dry_penalty_last_n": -1,
      "dry_sequence_breakers": [
        "\n",
        ":",
        "\"",
        "*"
      ],
      "mirostat": 0,
      "mirostat_tau": 5.0,
      "mirostat_eta": 0.10000000149011612,
      "penalize_nl": false,
      "stop": [],
      "max_tokens": -1,
      "n_keep": 0,
      "n_discard": 0,
      "ignore_eos": false,
      "stream": true,
      "n_probs": 0,
      "min_keep": 0,
      "grammar": "",
      "samplers": [
        "dry",
        "top_k",
        "typ_p",
        "top_p",
        "min_p",
        "xtc",
        "temperature"
      ],
      "speculative.n_max": 16,
      "speculative.n_min": 5,
      "speculative.p_min": 0.8999999761581421,
      "timings_per_token": false
    },
    "prompt": "",
    "next_token": {
      "has_next_token": true,
      "has_new_line": false,
      "n_remain": -1,
      "n_decoded": 0,
      "stopping_word": ""
    }
  },
  "total_slots": 1,
  "model_path": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
  "chat_template": "..."
}

ggerganov · 2024-12-07T08:09:57Z


    std::vector<std::string> antiprompt;
    bool timings_per_token = false;
+    bool ignore_eos = false;


With new models the ignore_eos functionality is losing relevance. There are now many different "end-of-generation" tokens and it's not just a single EOS token anymore. We should remove this logic and only support logit biases, which is more general. Just a note, no need to do it in this PR.
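To make the "logit biases are more general" point concrete, here is a minimal sketch, assuming the set of end-of-generation token ids is known: banning a token is just a bias of negative infinity on its logit, and the same table can hold any number of tokens. The names `ban_tokens`, `apply_logit_bias`, and `argmax` are hypothetical helpers for illustration, not llama.cpp functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <vector>

using token_id = std::int32_t;

// Build a bias table that bans every listed end-of-generation token.
std::map<token_id, float> ban_tokens(const std::vector<token_id> & eog_tokens) {
    std::map<token_id, float> bias;
    for (token_id tok : eog_tokens) {
        bias[tok] = -std::numeric_limits<float>::infinity();
    }
    return bias;
}

// Add each bias to the matching logit; ids outside the vocabulary are skipped.
void apply_logit_bias(std::vector<float> & logits, const std::map<token_id, float> & bias) {
    for (const auto & entry : bias) {
        if (entry.first >= 0 && static_cast<std::size_t>(entry.first) < logits.size()) {
            logits[static_cast<std::size_t>(entry.first)] += entry.second;
        }
    }
}

// Greedy pick, used here only to show that a banned token can never win.
std::size_t argmax(const std::vector<float> & logits) {
    assert(!logits.empty());
    return static_cast<std::size_t>(
        std::max_element(logits.begin(), logits.end()) - logits.begin());
}
```

A single-token `ignore_eos` flag is then the special case where the table holds one entry.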

ggerganov · 2024-12-08T19:54:52Z

This change breaks the infill endpoint - it produces mostly garbage.

ngxson · 2024-12-08T20:01:15Z

Hmm, OK. This could be because the infill "template" is not being applied correctly. I'll add a test with a Qwen model (run locally, not on CI).

ngxson · 2024-12-08T20:05:45Z

I'm on it, will make a PR

* server : (refactor) no more json in server_task input
* add test for slots endpoint
* add tests for /props and /slots
* remove task inf_type
* fix CI by adding safe_json_to_str
* add "model_path" to /props
* update readme

server : (refactor) no more json in server_task input

db97c8b

ngxson requested a review from ggerganov December 6, 2024 14:12

github-actions bot added the examples, python (python script changes), and server labels Dec 6, 2024

ggerganov approved these changes Dec 7, 2024

ngxson added 5 commits December 7, 2024 13:56

add test for slots endpoint

9bb1ae6

Merge branch 'master' into xsn/refactor_server_struct_input

6bf6e30

add tests for /props and /slots

e721f4c

remove task inf_type

090a113

fix CI by adding safe_json_to_str

65d2e6d

ggerganov approved these changes Dec 7, 2024

add "model_path" to /props

1949f68

ngxson mentioned this pull request Dec 7, 2024

Misc. bug: server - GET /props model value no longer works after commit 6c5bc06 #10705

Closed

update readme

89c2af9

ngxson mentioned this pull request Dec 7, 2024

changelog : llama-server REST API #9291

Open

ngxson merged commit 3573fa8 into ggml-org:master Dec 7, 2024

ngxson mentioned this pull request Dec 8, 2024

server : fix format_infill #10724

Merged

ggerganov mentioned this pull request Dec 8, 2024

server : fix infill prompt format #10725

Closed

ngxson mentioned this pull request Jan 4, 2025

Misc. bug: Server response "model" seems to be incorrect #11069

Closed

jakexcosme mentioned this pull request Oct 22, 2025

changelog : llama-server REST API COG-GTM/llama.cpp#245

Open

ngxson merged 8 commits into ggml-org:master from ngxson:xsn/refactor_server_struct_input

2 participants
