Labelbox
279 posts
Labelbox
@labelbox
Frontier RL data for the world’s leading AI teams.
San Francisco, CA
labelbox.com
Joined January 2018
147 Following, 3,496 Followers, 6 Subscriptions
  • Labelbox
    @labelbox
    7h
    1/ Today, we’re introducing Recursion: the RL platform for building, evaluating, and deploying specialist agents. The next phase of AI won’t just be about smarter models. It will be about systems that learn from the unique expertise of every organization. 🧵
    Labelbox
    @labelbox
    7h
    Replying to @labelbox
    8/ Every execution becomes a learning signal. Each rollout generates graded trajectories that improve specialist models through fine-tuning and reinforcement learning, grounded in real enterprise outcomes.
    Every rollout produces graded trajectories that can feed fine-tuning and reinforcement learning. The result is a specialist model that improves from enterprise execution signals instead of synthetic benchmarks alone.
    Labelbox
    @labelbox
    7h
    9/ As agents take on more responsibility across the enterprise, the ability to evaluate, learn, and improve from real execution will become a competitive advantage. Reliable agents require more than a strong model. They require a system for continuous improvement. Learn how
    Introducing Recursion: The RL platform for enterprise specialist agents
    From labelbox.com
  • Labelbox
    @labelbox
    Jun 16
    Where do models change their minds? Natural Language Autoencoders (NLAs) offer a promising way to translate a model’s internal representations into natural language. But the harder question is: where do meaningful decisions actually happen? We tested a workflow for finding
    Labelbox
    @labelbox
    Jun 16
    Full post here:
    Where models change their minds: Identifying branchpoints for NLA training
    From labelbox.com
  • Labelbox
    @labelbox
    May 20
    When AI benchmarks saturate, what comes next? Historically, leaderboard saturation leads to two paths: hyper-specialized questions or increasingly abstract puzzles. A new paper from @Meta Superintelligence Labs introduces a third path: GIM (Grounded Integration Measure).
    When benchmarks saturate, what comes next? Meta’s GIM pushes AI evaluation toward integrated reasoning
    From labelbox.com
  • Labelbox
    @labelbox
    Apr 23
    This week, we had the pleasure of hosting 50+ researchers and builders from leading AI companies to meet, talk and socialize (MTS 😎) at Labelbox HQ. Huge thanks to @dwarkesh_sp, Sholto Douglas (Anthropic), Mo Bavarian (OpenAI), and Melvin Johnson (DeepMind) for leading our
  • Labelbox
    @labelbox
    Mar 5
    Interrupt a voice agent mid-sentence and most models struggle to stay aligned with the original objective. We built EchoChain 🔊, a benchmark for reasoning under interruption in full-duplex dialogue. Current pass rates: • Gemini Live: 16.5% • Nova Sonic 2: 26% •
    Labelbox
    @labelbox
    Mar 5
    @elonmusk 🔥
  • Labelbox
    @labelbox
    Mar 4
    Voice agents are moving beyond rigid turn-based systems toward real-time, natural conversation, streaming understanding and generation simultaneously. However, most existing benchmarks are either turn-based or latency-focused and do not directly test whether models can maintain
    Introducing EchoChain: An audio benchmark for reasoning under pressure in full-duplex dialogue
    From labelbox.com
    Labelbox
    @labelbox
    Mar 4
    Replying to @labelbox
    Common failure modes for leading audio models fall under three categories: (1) contextual inertia, (2) interruption amnesia, and (3) objective displacement. Read the full blog post for a deep dive into these types of failures.
    Labelbox
    @labelbox
    Mar 4
    Here’s a sample audio clip from EchoChain showing an objective displacement failure in OpenAI GPT-realtime-2025-08-28. The conversation first establishes baseline context, then introduces an interruption, and we check whether the model stays aligned with the original goal after
  • Labelbox
    @labelbox
    Feb 20
    AI safety is often judged by refusal rates on adversarial benchmarks. But what if we are measuring keyword sensitivity, not real robustness? In our latest research, we found that removing obvious trigger cues causes frontier models previously labeled as safe to fail, revealing a
    The AI safety illusion: why current safety datasets fool us on model safety
    From labelbox.com
    Labelbox
    @labelbox
    Feb 20
    Replying to @labelbox
    We extend intent laundering into a standalone jailbreaking method by adding an iterative revision–regeneration loop, where failed attempts are fed back into the model to produce increasingly refined rewrites. With only a few iterations, attack success rates rise to 90–98% across
    Labelbox
    @labelbox
    Feb 20
    Our research reveals a blind spot in AI safety evaluations. Current benchmarks rely too heavily on unrealistic trigger cues and fail to reflect real-world adversarial behavior. This creates a mismatch, testing models under conditions that rarely occur in practice. This does not
