Labelbox (@labelbox) / X

Frontier RL data for the world’s leading AI teams.

San Francisco, CA

Joined January 2018

Labelbox
@labelbox
Jun 25
1/ Today, we’re introducing Recursion: the RL platform for building, evaluating, and deploying specialist agents. The next phase of AI won’t just be about smarter models. It will be about systems that learn from the unique expertise of every organization. 🧵
Labelbox
@labelbox
Jun 25
Replying to @labelbox
8/ Every execution becomes a learning signal. Each rollout generates graded trajectories that improve specialist models through fine-tuning and reinforcement learning, grounded in real enterprise outcomes.
Labelbox
@labelbox
Jun 25
9/ As agents take on more responsibility across the enterprise, the ability to evaluate, learn, and improve from real execution will become a competitive advantage. Reliable agents require more than a strong model. They require a system for continuous improvement. Learn how
Introducing Recursion: The RL platform for enterprise specialist agents
From labelbox.com
Labelbox
@labelbox
Jun 16
Where do models change their minds? Natural Language Autoencoders (NLAs) offer a promising way to translate a model’s internal representations into natural language. But the harder question is: where do meaningful decisions actually happen? We tested a workflow for finding
Labelbox
@labelbox
Jun 16
Full post here:
Where models change their minds: Identifying branchpoints for NLA training
From labelbox.com
Labelbox
@labelbox
May 20
When AI benchmarks saturate, what comes next? Historically, leaderboard saturation leads to two paths: hyper-specialized questions or increasingly abstract puzzles. A new paper from @Meta Superintelligence Labs introduces a third path: GIM (Grounded Integration Measure).
When benchmarks saturate, what comes next? Meta’s GIM pushes AI evaluation toward integrated...
From labelbox.com
Labelbox
@labelbox
Apr 23
This week, we had the pleasure of hosting 50+ researchers and builders from leading AI companies to meet, talk and socialize (MTS 😎) at Labelbox HQ. Huge thanks to @dwarkesh_sp, Sholto Douglas (Anthropic), Mo Bavarian (OpenAI), and Melvin Johnson (DeepMind) for leading our
Labelbox
@labelbox
Mar 5
Interrupt a voice agent mid-sentence and most models struggle to stay aligned with the original objective. We built EchoChain 🔊, a benchmark for reasoning under interruption in full-duplex dialogue. Current pass rates:
• Gemini Live: 16.5%
• Nova Sonic 2: 26%
•
Labelbox
@labelbox
Mar 5
@elonmusk 🔥
Labelbox
@labelbox
Mar 4
Voice agents are moving beyond rigid turn-based systems toward real-time, natural conversation, with understanding and generation streaming simultaneously. However, most existing benchmarks are either turn-based or latency-focused and do not directly test whether models can maintain
Introducing EchoChain: An audio benchmark for reasoning under pressure in full-duplex dialogue
From labelbox.com
Labelbox
@labelbox
Mar 4
Replying to @labelbox
Common failure modes for leading audio models fall under three categories: (1) contextual inertia, (2) interruption amnesia, and (3) objective displacement. Read the full blog post for a deep dive into these types of failures.
Labelbox
@labelbox
Mar 4
Here’s a sample audio clip from EchoChain showing an objective displacement failure in OpenAI GPT-realtime-2025-08-28. The conversation first establishes baseline context, then introduces an interruption, and we check whether the model stays aligned with the original goal after
Labelbox
@labelbox
Feb 20
AI safety is often judged by refusal rates on adversarial benchmarks. But what if we are measuring keyword sensitivity, not real robustness? In our latest research, we found that removing obvious trigger cues causes frontier models previously labeled as safe to fail, revealing a
The AI safety illusion: why current safety datasets fool us on model safety
From labelbox.com
Labelbox
@labelbox
Feb 20
Replying to @labelbox
We extend intent laundering into a standalone jailbreaking method by adding an iterative revision–regeneration loop, where failed attempts are fed back into the model to produce increasingly refined rewrites. With only a few iterations, attack success rates rise to 90–98% across
Labelbox
@labelbox
Feb 20
Our research reveals a blind spot in AI safety evaluations. Current benchmarks rely too heavily on unrealistic trigger cues and fail to reflect real-world adversarial behavior. This creates a mismatch, testing models under conditions that rarely occur in practice. This does not