[Bug]: PromptCachingCache extract_cacheable_prefix broken when message.content is a string? #19228

Description

@nuernber

What happened?

PromptCachingCache's extract_cacheable_prefix function may return an empty prefix when message.content is a string, because the loop skips every message whose content is not a list:

    for msg_idx, message in enumerate(messages):
        content = message.get("content")
        if not isinstance(content, list):
            continue
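
To make the effect of that check concrete, here is a minimal sketch. `find_cacheable_messages` is a hypothetical stand-in that mimics only the quoted loop; it is not LiteLLM's actual `extract_cacheable_prefix`, and the rest of that function may behave differently.

```python
from typing import Any, Dict, List


def find_cacheable_messages(messages: List[Dict[str, Any]]) -> List[int]:
    """Hypothetical stand-in that mimics only the loop quoted above.

    Returns the indices of messages that carry a cache_control marker
    inside a list-typed content. String content is skipped.
    """
    found = []
    for msg_idx, message in enumerate(messages):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # string content never reaches the cache_control check
        if any(isinstance(b, dict) and "cache_control" in b for b in content):
            found.append(msg_idx)
    return found


# cache_control as a sibling of a string content
string_style = [
    {"role": "user", "content": "big prompt", "cache_control": {"type": "ephemeral"}},
]
# cache_control inside a content block
list_style = [
    {"role": "user", "content": [
        {"type": "text", "text": "big prompt", "cache_control": {"type": "ephemeral"}},
    ]},
]

print(find_cacheable_messages(string_style))  # []
print(find_cacheable_messages(list_style))    # [0]
```

Under this simplified model, the sibling-style message is never considered, which matches the empty prefix described above.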

This appears to break for request bodies like the following. In my testing with LiteLLM connected to AWS Bedrock, requests in this shape do create and read cache entries:

{
    "model": model_id,
    "stream": False,
    "max_tokens": 1024,
    "messages": [
        {"role": "system", "content": "You are an LLM named Prompt Cache Helper"},
        {
            "role": "user", 
            "content": large_message,
            "cache_control": {
                "type": "ephemeral",
                "ttl": "5m"
            }
        },
    ],
}

But this approach works:

{
    "model": model_id,
    "stream": False,
    "max_tokens": 1024,
    "messages": [
        {
            "role": "system",
            "content": "You are an LLM named Prompt Cache Helper"
        },
        {
            "role": "user", 
            "content": [
                {
                    "type": "text",
                    "text": large_message,
                    "cache_control": {
                        "type": "ephemeral",
                        "ttl": "5m"  # or "1h"
                    }
                },
            ]
        },
    ],
}

Note that the Claude and OpenAI API specs don't say that cache_control can be a sibling key of content, so I'm not sure whether this is actually a bug, or whether LiteLLM or Bedrock is just more flexible and accepts cache_control as a sibling key.
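
Until that question is settled, one possible client-side workaround is to rewrite the sibling form into the content-block form before sending the request. The sketch below assumes the two shapes are equivalent for the provider; `move_cache_control_into_content` is a hypothetical helper, not part of LiteLLM.

```python
import copy
from typing import Any, Dict, List


def move_cache_control_into_content(
    messages: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Hypothetical helper: rewrite messages that pair a string content
    with a sibling cache_control into the content-block form.

    The input list is not modified; a deep copy is returned.
    """
    normalized = []
    for message in messages:
        message = copy.deepcopy(message)
        content = message.get("content")
        if isinstance(content, str) and "cache_control" in message:
            message["content"] = [
                {
                    "type": "text",
                    "text": content,
                    "cache_control": message.pop("cache_control"),
                }
            ]
        normalized.append(message)
    return normalized


original = [
    {"role": "system", "content": "You are an LLM named Prompt Cache Helper"},
    {
        "role": "user",
        "content": "large message",
        "cache_control": {"type": "ephemeral", "ttl": "5m"},
    },
]
converted = move_cache_control_into_content(original)
```

Messages without a sibling cache_control, or whose content is already a list, pass through unchanged.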

Steps to Reproduce

  1. Configure the router to use the prompt caching pre-call check:
router_settings:
  enable_pre_call_checks: true
  optional_pre_call_checks: ["prompt_caching"]
  2. Send an API request with cache_control as a sibling of a string-valued content:
{
   "model": model_id,
   "stream": False,
   "max_tokens": 1024,
   "messages": [
       {"role": "system", "content": "You are an LLM named Prompt Cache Helper"},
       {
           "role": "user", 
           "content": large_message,
           "cache_control": {
               "type": "ephemeral",
               "ttl": "5m"
           }
       },
   ],
}
  3. Send a few of those API requests to see cache writes and reads. For example, in the response you might see something like:
  "usage": {
    "completion_tokens": 1024,
    "prompt_tokens": 5425,
    "total_tokens": 6449,
    "prompt_tokens_details": {
      "cached_tokens": 5425,
      "cache_creation_tokens": 0
    },
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 5425
  }
  4. If you step through the code in a debugger, you'll notice that the cache key is None, which is incorrect because the usage in step 3 shows that the cache is being used:
    # Generate cache key using cacheable prefix
    cache_key = PromptCachingCache.get_prompt_caching_cache_key(messages, tools)
    if cache_key is None:
        return None
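
For comparing the two observations in steps 3 and 4, a small check on the usage block can confirm a provider-side cache read. This is a sketch only: `cache_was_read` is a hypothetical helper, and it relies on the `cache_read_input_tokens` field shown in the example response above.

```python
from typing import Any, Dict


def cache_was_read(usage: Dict[str, Any]) -> bool:
    """Hypothetical helper: True if the usage block reports cached input tokens."""
    # Treat a missing or null field as zero cached tokens.
    return (usage.get("cache_read_input_tokens") or 0) > 0


# Usage block copied from the example response in step 3
usage = {
    "completion_tokens": 1024,
    "prompt_tokens": 5425,
    "total_tokens": 6449,
    "prompt_tokens_details": {"cached_tokens": 5425, "cache_creation_tokens": 0},
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 5425,
}

print(cache_was_read(usage))  # True
```

If this returns True while the cache key from step 4 is None, the provider is caching a prefix that the pre-call check did not detect.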

Relevant log output

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.80.11-stable

Twitter / LinkedIn details

No response
