Arena.ai (@arena) / X

Arena.ai

3,343 posts

@arena

Where AI meets the real world. Formerly LMArena. We measure and advance the frontier of AI through community-driven evaluation. We’re hiring → arena.ai/jobs

Joined March 2023

Pinned
Arena.ai
@arena
Jun 4
Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image
268K
Arena.ai
@arena
Jun 16
GLM-5.2 (Max) by @Zai_org ranks #10 on the new Agent Arena leaderboard, closely matching Claude-Opus-4.8 (non-thinking) and is the #1 open model by a wide margin! In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of
622K
Arena.ai
@arena
20h
Curious about GLM-5.2 and haven’t tested it yet? Check out first impressions with @petergostev
5.8K
Arena.ai
@arena
Jun 18
Agent Arena's causal tracing methodology lets us quantify the real value of humans working together with AI agents, and observe a huge range of model behaviors from the same traces. We started with 5 signals: confirmed success, praise vs. complaint, steerability, bash recovery,
Arena.ai
@arena
Jun 17
Agent Arena has been live for 2 weeks, with 10 more models now on the new leaderboard. Two highlights worth mentioning: - GLM-5.2 (Max) by @Zai_org enters the top 10. The strongest open-weight result we've measured, at +9.4% confirmed success and +14.9% praise-vs-complaint
9.8K
Arena.ai
@arena
Jun 18
Learn more about how we built the methodology behind Agent Arena:
Agent Arena: Causal Evaluation of Agents in the Real World
From arena.ai
3.3K
Arena.ai
@arena
Jun 4
Introducing Agent Arena: real-world agentic evals at scale. How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks. On Arena, models now get web search, filesystem, and terminal tools to complete complex
49K
Arena.ai
@arena
Jun 17
Learn more about the causal tracing methodology for Agent Arena on our blog:
Agent Arena: Causal Evaluation of Agents in the Real World
From arena.ai
5.2K
Arena.ai
@arena
Jun 17
Head over to the Agent Arena leaderboard to see the data in detail:
Agent Arena | AI Agent Performance Leaderboard
From arena.ai
4.1K
Arena.ai
@arena
Jun 17
Kimi K2.7 Code by @Kimi_Moonshot ranks #19 overall on the new Agent Arena leaderboard, and #6 among open models. In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of users. Models can access web search, filesystem,
Kimi.ai
@Kimi_Moonshot
Jun 12
🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced! 🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite. 🔷 Reasoning efficiency: Less overthinking, with 30% lower
31K
Arena.ai
@arena
Jun 17
Replying to @arena
Learn more about the causal tracing methodology for Agent Arena on our blog:
Agent Arena: Causal Evaluation of Agents in the Real World
From arena.ai
3.7K
Arena.ai
@arena
Jun 17
Head over to the Agent Arena leaderboard and filter by open models or view by lab:
Agent Arena | AI Agent Performance Leaderboard
From arena.ai
3.2K
Arena.ai reposted
Arena.ai
@arena
Jun 16
Replying to @arena
GLM-5.2 (Max) ranks #10 overall (+4.4%)
- Tied for #1 Tool Hallucination (+1.9%)
- #3 Confirmed Task Success (+9.4%)
- #3 Praise vs. Complaint (+14.9%)
- #16 Bash Recovery (+1.7%)
- #20 Steerability (-6.0%)
13K
Arena.ai reposted
Anastasios Nikolas Angelopoulos
@ml_angelopoulos
Jun 16
Just to be clear, if you remove Fable, which is unavailable, GLM-5.2 (Max) is the #1 model in the world for frontend coding. This is a huge moment. OSS has caught up with proprietary, and China has caught up with the US, in this very important domain.
Arena.ai
@arena
Jun 16
Exciting news: GLM-5.2 (Max) ranks #2 in Code Arena: Frontend, with +29pt over Claude Opus 4.7 (Thinking) and only behind Fable 5! GLM-5.2 is the best open model vs Kimi-K2.6 and Minimax-M3 by a large margin. - #2 React and #4 HTML sub-leaderboards - Ranks as the top model in
588K
Arena.ai reposted
jietang
@jietang
Jun 16
We're introducing GLM-5.2, our latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a solid 1M-token context. GLM-5.2's new capabilities include:
346K
Arena.ai reposted
Nathan Lambert
@natolambert
Jun 16
It's hard to pinpoint open-closed gap and so-on, but I trust the @arena team and just look where GLM 5.2 is on this. An MIT licensed, to be open weight model. At this point you could argue they have a better agent than Gemini does. That's a serious accomplishment.
45K
Arena.ai
@arena
Jun 16
GLM-5.2 (Max) by @Zai_org ranks #10 on the new Agent Arena leaderboard, closely matching Claude-Opus-4.8 (non-thinking) and is the #1 open model by a wide margin! In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of
1.3M
Arena.ai
@arena
Jun 16
Replying to @arena
Head over to look into all the Arena leaderboard details at: arena.ai/leaderboard
15K
Arena.ai
@arena
Jun 16
GLM-5.2 (Max) also moves the Pareto Frontier for Code Arena: Frontend.
14K
Arena.ai
@arena
Jun 16
Replying to @arena
Learn more about the causal tracing methodology for Agent Arena on our blog:
Agent Arena: Causal Evaluation of Agents in the Real World
From arena.ai
5.3K
Arena.ai
@arena
Jun 16
Head over to the Agent Arena leaderboard to dive into the details. You can also filter by open models or view by lab:
Agent Arena | AI Agent Performance Leaderboard
From arena.ai
4.9K