The Humanness Index™
Sounding human is hard to measure, but it's what decides whether a call works. We clone one voice onto every model and play them blind against a real human, so you can hear which ones pass.
Read the whitepaper
Same voice, different models.
So I can see here that the package was marked as delivered on Tuesday, but if you're saying it never arrived then what we'll do is... let me just. Yeah, I'm going to open a lost package investigation for you. That usually takes about forty-eight hours to resolve.
Step 1
We clone one conversational voice onto every model, so you're judging the model, not its demo reel.
Step 2
Two voices, same line, no labels. Pick the one that sounds more human.
Step 3
Blind votes are fit into a rating, with a real human at 100. The higher the score, the more human the model sounds.
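As a rough illustration of Step 3, here is a minimal sketch of turning blind pairwise votes into a score with the human at 100. The page does not say which fitting method the Index uses, so this assumes a Bradley-Terry model fitted with a simple iterative update, and it assumes the score is each voice's fitted strength relative to the human baseline. The function names and the sample votes are hypothetical.

```python
from collections import defaultdict


def fit_bradley_terry(votes, iterations=500):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Assumption: every voice has at least one win; a voice with zero wins
    would be driven to a strength of 0 by this plain update.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    names = set()
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        names.update((winner, loser))

    strength = {name: 1.0 for name in names}
    for _ in range(iterations):
        updated = {}
        for i in names:
            # Sum over opponents: comparisons played / combined strength.
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in names
                if j != i and frozenset((i, j)) in pair_counts
            )
            updated[i] = wins[i] / denom if denom else strength[i]
        # Rescale so the strengths keep a fixed total between iterations.
        total = sum(updated.values())
        strength = {n: v * len(names) / total for n, v in updated.items()}
    return strength


def humanness_scores(votes, baseline="human"):
    """Scale fitted strengths so the baseline voice scores exactly 100 (assumed mapping)."""
    strength = fit_bradley_terry(votes)
    return {n: 100.0 * s / strength[baseline] for n, s in strength.items()}


# Hypothetical votes: each tuple is (voice picked as more human, voice not picked).
sample_votes = (
    [("human", "model_a")] * 6 + [("model_a", "human")] * 4
    + [("human", "model_b")] * 8 + [("model_b", "human")] * 2
    + [("model_a", "model_b")] * 6 + [("model_b", "model_a")] * 4
)
scores = humanness_scores(sample_votes)
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:.1f}")
```

Anchoring on the human voice rather than on the top model keeps the scale stable as models are added or removed, which matches the "real human at 100" framing above.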
21 models · 9 providers · 10,250 unique votes
Why latency matters. A voice that lags breaks the conversation, no matter how human it sounds.
| Likely Rank | Provider | Model | Score (human = 100) | Rating | Latency | Price | Votes |
|---|---|---|---|---|---|---|---|
| Baseline | Human | Homo Sapien | 100 | 1301 | — | — | 546 |
| #1–5 | xAI | Grok TTS | 94 | 1283 | 460 ms | $15 | 525 |
| #1–6 | MiniMax | Speech 2.5 | 93 | 1278 | 325 ms | $60 | 502 |
| #1–6 | ElevenLabs | Eleven v3 | 92 | 1276 | 758 ms | $100 | 507 |
| #1–6 | Canopy Labs | Orpheus | 91 | 1271 | — | Open source | 495 |
| #1–7 | MiniMax | Speech 2 HD | 88 | 1263 | 357 ms | $100 | 487 |
| #2–7 | xAI | Grok TTS (Streaming) | 88 | 1261 | 285 ms | $15 | 490 |
| #6–11 | Inworld | TTS-1.5-max | 78 | 1231 | 337 ms | $35 | 428 |
| #7–12 | ElevenLabs | Flash v2 | 76 | 1225 | 226 ms | $50 | 425 |
| #7–15 | ElevenLabs | Multilingual v2 | 75 | 1221 | 1006 ms | $100 | 423 |
| #7–15 | ElevenLabs | Turbo v2 | 73 | 1216 | 302 ms | $50 | 414 |
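Because a lagging voice breaks the conversation however human it sounds, one way to read the table is to set a latency budget first and then take the highest score that fits. The sketch below does that with the score and latency figures listed above. The `Entry` type, the `best_under_budget` helper, and the budget values are hypothetical, and Orpheus is stored with no latency because the table lists none for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    provider: str
    model: str
    score: int
    latency_ms: Optional[int]  # None where the table shows no latency


# Scores and latencies copied from the table above.
ENTRIES = [
    Entry("xAI", "Grok TTS", 94, 460),
    Entry("MiniMax", "Speech 2.5", 93, 325),
    Entry("ElevenLabs", "Eleven v3", 92, 758),
    Entry("Canopy Labs", "Orpheus", 91, None),
    Entry("MiniMax", "Speech 2 HD", 88, 357),
    Entry("xAI", "Grok TTS (Streaming)", 88, 285),
    Entry("Inworld", "TTS-1.5-max", 78, 337),
    Entry("ElevenLabs", "Flash v2", 76, 226),
    Entry("ElevenLabs", "Multilingual v2", 75, 1006),
    Entry("ElevenLabs", "Turbo v2", 73, 302),
]


def best_under_budget(entries, max_latency_ms):
    """Return the highest-scoring entry within the latency budget, or None."""
    eligible = [
        e for e in entries
        if e.latency_ms is not None and e.latency_ms <= max_latency_ms
    ]
    return max(eligible, key=lambda e: e.score, default=None)


# Hypothetical budgets, purely for illustration.
for budget in (300, 350, 500):
    pick = best_under_budget(ENTRIES, budget)
    print(budget, "ms ->", f"{pick.provider} {pick.model} ({pick.score})")
# 300 ms -> xAI Grok TTS (Streaming) (88)
# 350 ms -> MiniMax Speech 2.5 (93)
# 500 ms -> xAI Grok TTS (94)
```

The likely-rank ranges overlap heavily near the top, so small score gaps between the picks should be read as rough ties rather than firm orderings.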
The Index only includes models that support voice cloning: each battle plays the same cloned source voice through both models, so the comparison is head to head and fair. Don't see your model on this list? Contact us at humannessindex@vapi.ai.
What we listen for
Humanness doesn't break down into features. You either believe there's a person on the other end, or you don't. When that belief breaks, it's usually because of one of these.
Emotion and emphasis. Stressing the right words, sounding like it means what it says instead of reading text aloud.
The intonation, rhythm, and melody of speech. The natural rise and fall of how people actually talk.
The little human sounds: breaths, stutters, natural pauses. A voice with none of them sounds too clean to be real.
Any model can sound good on its own demo voice. The real test is how it handles your use case. We clone one voice across every model so the comparison is fair. Models that can't clone a voice can't be tested fairly, so they're not listed.
xAI's TTS is the voice to beat for naturalness. In blind, side-by-side comparisons, listeners pick it as the more human-sounding option more often than any other model on the Index, and it holds that edge at phone quality, where most voices start to sound synthetic. For teams where sounding human is the whole point, it's where we'd start.
Rankings are provisional
We're keeping the podium under wraps while the votes come in. Listen and vote above, and the most human models reveal once the standings settle.
Picking a TTS model for a voice agent comes down to one thing: does it sound human enough that people forget they're talking to software? You can't get that from demos or vendor claims. So we made it measurable and took the call out of our own hands: one voice cloned onto every model, played blind with no names attached, scored against a real human by the people who hear it.
Compare the models in blind tests, read the methodology, or get in touch.
Build a TTS model? Add yours to the Index.