
Streaming TTS crashes on MPS/CPU with SDPA attention (transformers 4.57) #312

@eddieran


Bug Report

Streaming TTS inference crashes immediately on any device using SDPA attention — this includes:

  • Apple Silicon (MPS) — all Mac users
  • CPU — any machine without a GPU
  • CUDA without flash_attn — e.g. fresh installs, Docker images without flash-attn

The model loads successfully but crashes on the first text-window forward pass during generation.
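A rough way to tell whether an environment falls into the affected group is sketched below. It assumes the demo falls back to SDPA whenever the `flash_attn` package is not importable, which is inferred from the list above rather than confirmed from the demo's code. The helper names `flash_attn_available` and `likely_affected` are hypothetical.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the flash_attn package can be found on this interpreter."""
    return importlib.util.find_spec("flash_attn") is not None


def likely_affected(device: str) -> bool:
    """Guess whether this setup would take the SDPA path (assumption, see above).

    MPS and CPU are treated as always affected; any other device (e.g. CUDA)
    is treated as affected only when flash_attn is missing.
    """
    name = device.lower()
    if name.startswith("mps") or name.startswith("cpu"):
        return True
    return not flash_attn_available()


for dev in ("mps", "cpu", "cuda"):
    print(dev, likely_affected(dev))
```

The result for `cuda` depends on whether `flash_attn` is installed locally, so no fixed output is shown for it.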

Error

RuntimeError: The expanded size of the tensor (228) must match the existing size (223)
at non-singleton dimension 3.
Target sizes: [1, 14, 5, 228].  Tensor sizes: [1, 1, 5, 223]

Steps to reproduce

# On any Mac, or any machine without flash_attn installed
python demo/vibevoice_realtime_demo.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --device mps  # or cpu

# Open browser → type any text → crash on first generation

Root cause

MockCacheLayer.get_mask_sizes() does not include query_length in the returned kv_length, violating the DynamicLayer contract in transformers 4.57.

The canonical implementation (transformers.cache_utils.DynamicLayer, L123-128):

def get_mask_sizes(self, cache_position):
    query_length = cache_position.shape[0]
    kv_length = self.get_seq_length() + query_length  # ← includes query_length
    return kv_length, 0

Current MockCacheLayer:

def get_mask_sizes(self, cache_position):
    kv_length = self.key_cache.shape[2]  # ← missing query_length
    return kv_length, 0

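This makes the causal mask query_length tokens too short. In the error above, the mask covers 223 key positions while the attention scores cover 228; the gap of 5 equals the query length (dimension 2 of both shapes). SDPA strictly requires the mask to match the KV tensor dimensions, so it crashes. Flash Attention 2 computes causality internally and never calls get_mask_sizes, which is why the bug stays latent on CUDA with flash_attn.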
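The arithmetic can be checked with a small sketch that mirrors the DynamicLayer logic quoted above. This is an illustration only, not necessarily the patch in #303. `MockCacheLayerSketch` is a hypothetical stand-in class, and `SimpleNamespace` objects replace real tensors so the example runs without PyTorch. The head count and head dimension are arbitrary; only the sequence length (223) and query length (5) come from the error message.

```python
from types import SimpleNamespace


class MockCacheLayerSketch:
    """Hypothetical stand-in for MockCacheLayer; only the shape logic is modeled."""

    def __init__(self, key_cache):
        # Assumed layout: [batch, heads, seq_len, head_dim]
        self.key_cache = key_cache

    def get_seq_length(self):
        # Number of key positions already stored in the cache
        return self.key_cache.shape[2]

    def get_mask_sizes_buggy(self, cache_position):
        # Behavior described in the issue: query_length is left out
        return self.key_cache.shape[2], 0

    def get_mask_sizes(self, cache_position):
        # Same rule as the DynamicLayer snippet: cached length plus new queries
        query_length = cache_position.shape[0]
        kv_length = self.get_seq_length() + query_length
        return kv_length, 0


# Stand-ins for tensors: 223 cached positions, 5 new query positions
cached = SimpleNamespace(shape=(1, 2, 223, 64))
positions = SimpleNamespace(shape=(5,))
layer = MockCacheLayerSketch(cached)

print(layer.get_mask_sizes_buggy(positions))  # (223, 0)
print(layer.get_mask_sizes(positions))        # (228, 0)
```

The corrected value, 228, matches the target size in the error, while the buggy value, 223, matches the undersized mask.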

Environment

Component     Version
macOS         26.4
Hardware      Apple M4 Max
Python        3.11.15
PyTorch       2.11.0
Transformers  4.57.6

Fix

A one-line fix is available in #303.
