Labelbox
279 posts
Frontier RL data for the world’s leading AI teams.
- 9/ As agents take on more responsibility across the enterprise, the ability to evaluate, learn, and improve from real execution will become a competitive advantage. Reliable agents require more than a strong model. They require a system for continuous improvement. Learn how
- Where do models change their minds? Natural Language Autoencoders (NLAs) offer a promising way to translate a model’s internal representations into natural language. But the harder question is: where do meaningful decisions actually happen? We tested a workflow for finding
- When AI benchmarks saturate, what comes next? Historically, leaderboard saturation leads to two paths: hyper-specialized questions or increasingly abstract puzzles. A new paper from @Meta Superintelligence Labs introduces a third path: GIM (Grounded Integration Measure).
- This week, we had the pleasure of hosting 50+ researchers and builders from leading AI companies to meet, talk and socialize (MTS 😎) at Labelbox HQ. Huge thanks to @dwarkesh_sp, Sholto Douglas (Anthropic), Mo Bavarian (OpenAI), and Melvin Johnson (DeepMind) for leading our
- Voice agents are moving beyond rigid turn-based systems toward real-time, natural conversation, streaming understanding and generation simultaneously. However, most existing benchmarks are either turn-based or latency-focused and do not directly test whether models can maintain
- Here’s a sample audio clip from EchoChain showing an objective displacement failure in OpenAI GPT-realtime-2025-08-28. The conversation first establishes baseline context, then introduces an interruption, and we check whether the model stays aligned with the original goal after
- AI safety is often judged by refusal rates on adversarial benchmarks. But what if we are measuring keyword sensitivity, not real robustness? In our latest research, we found that removing obvious trigger cues causes frontier models previously labeled as safe to fail, revealing a
- We extend intent laundering into a standalone jailbreaking method by adding an iterative revision–regeneration loop, where failed attempts are fed back into the model to produce increasingly refined rewrites. With only a few iterations, attack success rates rise to 90–98% across
- Our research reveals a blind spot in AI safety evaluations. Current benchmarks rely too heavily on unrealistic trigger cues and fail to reflect real-world adversarial behavior. This creates a mismatch, testing models under conditions that rarely occur in practice. This does not









