Arena.ai
3,343 posts
Arena.ai
@arena
Where AI meets the real world. Formerly LMArena. We measure and advance the frontier of AI through community-driven evaluation. We’re hiring → arena.ai/jobs
US
arena.ai
Joined March 2023
215 Following
170K Followers

  • Pinned
    Arena.ai
    @arena
    Jun 4
    Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image
    268K
  • Arena.ai
    @arena
    Jun 16
    GLM-5.2 (Max) by @Zai_org ranks #10 on the new Agent Arena leaderboard, closely matching Claude-Opus-4.8 (non-thinking) and is the #1 open model by a wide margin! In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of
    622K
    Arena.ai
    @arena
    20h
    Curious about GLM-5.2 and haven’t tested it yet? Check out first impressions with @petergostev
    5.8K
  • Arena.ai
    @arena
    Jun 18
    Agent Arena's causal tracing methodology lets us quantify the real value of humans working together with AI agents, and observe a huge range of model behaviors from the same traces. We started with 5 signals: confirmed success, praise vs. complaint, steerability, bash recovery,
    Arena.ai
    @arena
    Jun 17
    Agent Arena has been live for 2 weeks, with 10 more models now on the new leaderboard. Two highlights worth mentioning: - GLM-5.2 (Max) by @Zai_org enters the top 10. The strongest open-weight result we've measured, at +9.4% confirmed success and +14.9% praise-vs-complaint
    9.8K
    Arena.ai
    @arena
    Jun 18
    Learn more about how we built the methodology behind Agent Arena:
    Agent Arena: Causal Evaluation of Agents in the Real World
    From arena.ai
    3.3K
  • Arena.ai
    @arena
    Jun 17
    Agent Arena has been live for 2 weeks, with 10 more models now on the new leaderboard. Two highlights worth mentioning: - GLM-5.2 (Max) by @Zai_org enters the top 10. The strongest open-weight result we've measured, at +9.4% confirmed success and +14.9% praise-vs-complaint
    Arena.ai
    @arena
    Jun 4
    Introducing Agent Arena: real-world agentic evals at scale. How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks. On Arena, models now get web search, filesystem, and terminal tools to complete complex
    49K
    Arena.ai
    @arena
    Jun 17
    Learn more about the causal tracing methodology for Agent Arena on our blog:
    Agent Arena: Causal Evaluation of Agents in the Real World
    From arena.ai
    5.2K
    Arena.ai
    @arena
    Jun 17
    Head over to the Agent Arena leaderboard to see the data in detail:
    Agent Arena | AI Agent Performance Leaderboard
    From arena.ai
    4.1K
  • Arena.ai
    @arena
    Jun 17
    Kimi K2.7 Code by @Kimi_Moonshot ranks #19 overall on the new Agent Arena leaderboard, and #6 among open models. In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of users. Models can access web search, filesystem,
    Kimi.ai
    @Kimi_Moonshot
    Jun 12
    🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced! 🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite. 🔷 Reasoning efficiency: Less overthinking, with 30% lower
    31K
    Arena.ai
    @arena
    Jun 17
    Replying to @arena
    Learn more about the causal tracing methodology for Agent Arena on our blog:
    Agent Arena: Causal Evaluation of Agents in the Real World
    From arena.ai
    3.7K
    Arena.ai
    @arena
    Jun 17
    Head over to the Agent Arena leaderboard and filter by open models or view by lab:
    Agent Arena | AI Agent Performance Leaderboard
    From arena.ai
    3.2K
  • Arena.ai reposted
    Arena.ai
    @arena
    Jun 16
    Replying to @arena
    GLM-5.2 (Max) ranks #10 overall (+4.4%) - tied for #1 Tool Hallucination (+1.9%) - #3 Confirmed Task Success (+9.4%) - #3 Praise vs. Complaint (+14.9%) - #16 Bash Recovery (+1.7%) - #20 Steerability (-6.0%)
    13K
  • Arena.ai reposted
    Anastasios Nikolas Angelopoulos
    Arena.ai
    @ml_angelopoulos
    Jun 16
    Just to be clear, if you remove Fable, which is unavailable, GLM-5.2 (Max) is the #1 model in the world for frontend coding. This is a huge moment. OSS has caught up with proprietary, and China has caught up with the US, in this very important domain.
    Arena.ai
    @arena
    Jun 16
    Exciting news: GLM-5.2 (Max) ranks #2 in Code Arena: Frontend, with +29pt over Claude Opus 4.7 (Thinking) and only behind Fable 5! GLM-5.2 is the best open model vs Kimi-K2.6 and Minimax-M3 by a large margin. - #2 React and #4 HTML sub-leaderboards - Ranks as the top model in
    588K
  • Arena.ai reposted
    jietang
    Z.ai
    @jietang
    Jun 16
    We're introducing GLM-5.2, our latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a solid 1M-token context. GLM-5.2's new capabilities include:
    346K
  • Arena.ai reposted
    Nathan Lambert
    @natolambert
    Jun 16
    It's hard to pinpoint the open-closed gap and so on, but I trust the @arena team; just look where GLM 5.2 is on this. An MIT-licensed, to-be open-weight model. At this point you could argue they have a better agent than Gemini does. That's a serious accomplishment.
    45K
  • Arena.ai
    @arena
    Jun 16
    Exciting news: GLM-5.2 (Max) ranks #2 in Code Arena: Frontend, with +29pt over Claude Opus 4.7 (Thinking) and only behind Fable 5! GLM-5.2 is the best open model vs Kimi-K2.6 and Minimax-M3 by a large margin. - #2 React and #4 HTML sub-leaderboards - Ranks as the top model in
    Arena.ai
    @arena
    Jun 16
    GLM-5.2 (Max) by @Zai_org ranks #10 on the new Agent Arena leaderboard, closely matching Claude-Opus-4.8 (non-thinking) and is the #1 open model by a wide margin! In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of
    1.3M
    Arena.ai
    @arena
    Jun 16
    Replying to @arena
    Head over to look into all the Arena leaderboard details at: arena.ai/leaderboard
    15K
    Arena.ai
    @arena
    Jun 16
    GLM-5.2 (Max) also moves the Pareto Frontier for Code Arena: Frontend.
    14K
  • Arena.ai
    @arena
    Jun 16
    Replying to @arena
    Learn more about the causal tracing methodology for Agent Arena on our blog:
    Agent Arena: Causal Evaluation of Agents in the Real World
    From arena.ai
    5.3K
    Arena.ai
    @arena
    Jun 16
    Head over to the Agent Arena leaderboard to dive into the details. You can also filter by open models or view by lab:
    Agent Arena | AI Agent Performance Leaderboard
    From arena.ai
    4.9K