Scale Labs
139 posts
Scale Labs
@ScaleAILabs
welcome to the lab. from the researchers at @scale_AI
labs.scale.com
Joined October 2025
109
Following
2,046
Followers

  • Scale Labs
    @ScaleAILabs
    Jul 2
    ICML 2026 is almost here, and we're headed to Seoul with 7 accepted papers. 🇰🇷 Our research spans agent training, scientific reasoning, and evaluations advancing the next generation of AI systems. Here's an early look at the work we'll be presenting 🧵
    2.2K
    Scale Labs
    @ScaleAILabs
    Jul 2
    Replying to @ScaleAILabs
    RubricRobustness: Evaluating the Sensitivity of Rubrics-Based Benchmarks to Simple Perturbations openreview.net/forum?id=2Y3I3…
    190
    Scale Labs
    @ScaleAILabs
    Jul 2
    Connect with us at ICML: scale.com/events/icml-20…
    148
  • Scale Labs
    @ScaleAILabs
    Jul 1
    Less than a year ago, we launched the Remote Labor Index with @CAIS to measure how well AI agents can complete paid freelance work. Since then, the best automation rate has jumped from less than 3% to 16%. We've also added results for three frontier models to the leaderboard:
    2.2K
    Scale Labs
    @ScaleAILabs
    Jul 1
    Full leaderboard:
    Remote Labor Index (RLI)
    From labs.scale.com
    205
  • Scale Labs
    @ScaleAILabs
    Jun 30
    AI is transforming drug discovery, but until now there hasn't been an independent benchmark for the computational tasks behind early-stage research. Together with @phylo_bio, we built DrugDiscoveryBench to evaluate how today's leading AI models perform on the core tasks
    5.5K
    Scale Labs
    @ScaleAILabs
    Jun 30
    More on what we learned: scale.com/blog/drugdisco…
    392
  • Scale Labs reposted
    Mohit Raghavendra (@ ACL)
    @mohit_r9a
    Jun 30
    📝 New research from @scale_AI Frontier SWE benchmarks are usually single-turn, one-shot tasks: the agent gets a detailed spec upfront, then implements autonomously. That is not how most real coding-agent workflows feel. Introducing SWE-Interact. 🧵
    22K
  • Scale Labs
    @ScaleAILabs
    Jun 24
    GLM 5.2 is now live across all three SWE Atlas leaderboards: Codebase QnA, Test Writing, and Refactoring. When we launched SWE Atlas a few months ago, open models were significantly behind frontier closed models, clustering at the bottom. GLM 5.2 has leapfrogged this gap, and is
    30K
    Scale Labs
    @ScaleAILabs
    Jun 24
    More results across SWE Atlas leaderboards:
    AI Model Leaderboards & Benchmarks
    From labs.scale.com
    1.3K
  • Scale Labs reposted
    MohammadHossein Rezaei
    @mhrezaeics
    Jun 12
    Rubrics are becoming the standard way to train/evaluate LLMs on open-ended tasks. But rubric-RL has a bottleneck: every rollout needs to be graded by an LLM verifier. That’s expensive, slow, and prone to reward hacking. At the same time, the field is moving toward on-policy
    Rubrics have emerged as an alternative to RLVR in open-ended domains where a single groundtruth final answer is not available. Existing rubric-based training methods rely on an LLM verifier that scores each rollout against rubrics. This introduces substantial training-time overhead, exposes optimization to verifier-specific biases, and reduces rubric feedback to a sparse end-of-trajectory signal. We propose Rubric-Guided Self-Distillation (RGSD), a verifier-free training method in which the base policy, conditioned on the rubric, serves as the teacher for the unconditioned student. RGSD distills the rubric-conditioned teacher distribution into the student token-by-token, replacing sparse trajectory-level rewards with dense per-token learning signals and removing the LLM judge from the training loop entirely. Across Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models on medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO while using one on-poli
    29K
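    The token-by-token distillation described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the forward KL direction, the toy logits, and the helper names `log_softmax`, `token_kl`, and `distillation_loss` are all assumptions made for illustration. In a real setup the teacher logits would presumably come from the policy prompted with the rubric and the student logits from the same policy without it.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position (assumed direction)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def distillation_loss(teacher_seq, student_seq):
    """Return the per-token KL values and their mean over the sequence."""
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return per_token, sum(per_token) / len(per_token)

# Toy logits for a 2-token sequence over a 3-word vocabulary (hypothetical values).
teacher_seq = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # stand-in for rubric-conditioned policy
student_seq = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]   # stand-in for unconditioned policy

per_token, loss = distillation_loss(teacher_seq, student_seq)
print(per_token, loss)
```

    Every token position contributes its own loss term, which is the sense in which the signal is dense rather than a single end-of-trajectory reward. Positions where the two distributions already agree contribute zero.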
  • Scale Labs reposted
    Afra Feyza Akyürek
    @afeyzaakyurek
    Jun 5
    Excited to share new @ScaleAILabs research in collaboration with @phylo_bio on coding agents for drug-discovery research! 💊 We ran Claude Code, Codex, and Gemini on 60+ expert-curated drug-discovery tasks inside a shared Biomni-powered biomedical research environment and the
    12K
  • Scale Labs reposted
    Akshay
    @akshay_manglik
    Jun 2
    How do you turn agent traces into an improvement flywheel? Excited to share Insights Generator (IG) — new @scale_AI / @ScaleAILabs research that finds behavioral patterns and bugs in agent traces. Engineers & coding agents using IG achieved 30+% gains on agent benchmarks. 🧵
    883
  • Scale Labs
    @ScaleAILabs
    Jun 1
    Today we're releasing HiL-Dynamics, the first open-source tool that measures how production agents actually collaborate with humans under uncertainty. Not just whether they got the answer. Now you can measure exactly when your agent asks for help, when it makes assumptions, and
    5.9K
    Scale Labs
    @ScaleAILabs
    Jun 1
    Replying to @ScaleAILabs
    Selective escalation remains one of the biggest challenges for reliable human-in-the-loop AI. We hope HiL-Dynamics helps users find the right setup for their workflows and gives model builders clearer signals for building agents that collaborate with humans more effectively.
    396
    Scale Labs
    @ScaleAILabs
    Jun 1
    HiL-Dynamics: github.com/melfeki-11/HiL… Blog: labs.scale.com/blog/hil-dynam…
    GitHub - melfeki-11/HiL-Dynamics: Does your coding agent know when it doesn't know? HiL-Dynamics...
    From github.com
    389