Collinear AI (@CollinearAI) / X

Collinear AI

125 posts

@CollinearAI

The AI Simulation Lab

Joined October 2023

Collinear AI
@CollinearAI
22h
What does it take to identify tasks in the narrow gap between "trivial" and "impossible" for the world's strongest models? We're discussing this and more with some of the field's leading researchers at our office in Sunnyvale next week, before we go off to ACL and ICML in July.
Collinear AI
@CollinearAI
Jun 23
Most “hard” problems are useless for training a model. The useful ones sit in a narrow learnable region where the model fails sometimes and succeeds sometimes. Here are some early results where we took 5 trivial and 5 impossible cybersecurity environments and had a model
Samosas and the Race to AGI · Luma
From luma.com
246
Collinear AI
@CollinearAI
Jun 23
Most “hard” problems are useless for training a model. The useful ones sit in a narrow learnable region where the model fails sometimes and succeeds sometimes. Here are some early results where we took 5 trivial and 5 impossible cybersecurity environments and had a model
1.7K
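The "learnable region" described in this post can be sketched as a filter on a task's empirical pass rate. This is a minimal illustration, not Collinear's method: the function name `classify_task` and the thresholds `low` and `high` are hypothetical, and the post does not say how the region is actually measured.

```python
def classify_task(successes, attempts, low=0.0, high=1.0):
    """Label a task by how often the solver passed it.

    `low` and `high` are assumed cutoffs (not from the post): a pass rate
    at or below `low` counts as impossible, at or above `high` as trivial,
    and anything in between as learnable.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    rate = successes / attempts
    if rate <= low:
        return "impossible"
    if rate >= high:
        return "trivial"
    return "learnable"


# Hypothetical rollout counts for three tasks, 8 attempts each.
for name, wins in [("task_a", 0), ("task_b", 3), ("task_c", 8)]:
    print(name, classify_task(wins, 8))
```

With the default cutoffs, only tasks the solver passes sometimes but not always are labeled learnable; tightening `low` and `high` would narrow the region further.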
Collinear AI
@CollinearAI
Jun 23
Replying to @CollinearAI
6/7 In one initially hard example, a bug was buried behind confusing arithmetic and misleading “legacy” comments, so the solver never found it. The creator made it legible: a plain value instead of an obfuscated one, plus a hint pointing to the right file. Now the solver sometimes
81
Collinear AI
@CollinearAI
Jun 23
7/7 In summary, with a few rounds of feedback, Opus 4.8 can reshape a task to land at the edge of GPT-5.5’s ability. An open question remains: is this due to the iterations of feedback, or because the creator has capabilities that the solver lacks?
72
Collinear AI
@CollinearAI
Jun 21
What are the real bottlenecks to AGI? Let's debate! Collinear HQ, Sunnyvale, June 29th. Join researchers from xAI, Sierra, and Amazon AGI for some hot takes, moderated by our own @nazneenrajani. Invite-only for ACL/ICML authors.
380
Collinear AI
@CollinearAI
Jun 22
Request invite here -->
Samosas and the Race to AGI · Luma
From luma.com
71
Collinear AI reposted
Sachin
@sachpatro97
Jun 21
There’s been a lot of talk about the new models getting scary good. Mythos on cybersecurity. GPT-5.5 Codex on coding. GLM 5.1 as the all-around daily driver. But the biggest AI research question we have is much simpler: Can it do it on a rainy night in Stoke? Little sneak peek
worldcupbench.com
WorldCupBench — Coming soon
An agentic LLM benchmark where models manage national teams through a simulated FIFA World Cup 2026. From Collinear AI.
367
Collinear AI
@CollinearAI
Jun 17
1/n We benchmarked the top four models on solving real-life cybersecurity vulnerabilities, using a collection of 221 tasks covering 150+ CWEs, and found some interesting patterns.
2.4K
Collinear AI
@CollinearAI
Jun 17
Replying to @CollinearAI
3/n On the same set of 24 tasks, Opus and Fable have the same pass@1, but Fable has a higher pass@4. This shows that Fable has more variance in its rollouts, which potentially makes it an interesting candidate for scaling parallel test-time compute!
114
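To see how two models can share a pass@1 yet differ on pass@4, here is a minimal sketch. It assumes each task has a fixed per-rollout success probability and that rollouts are independent, so the expected pass@k for a task is 1 - (1 - p)^k. The two probability profiles are hypothetical and are not the thread's measurements.

```python
def pass_at_k(success_probs, k):
    """Expected pass@k, averaged over tasks.

    Assumes independent rollouts with a fixed success probability per task,
    so a task is passed with probability 1 - (1 - p) ** k.
    """
    return sum(1 - (1 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical profiles over 24 tasks (illustrative only).
consistent = [1.0] * 12 + [0.0] * 12  # always solves half, never solves the rest
varied = [0.5] * 24                   # solves any given task half the time

for name, probs in [("consistent", consistent), ("varied", varied)]:
    print(name, "pass@1:", pass_at_k(probs, 1), "pass@4:", pass_at_k(probs, 4))
```

Both profiles have the same pass@1, but only the varied one gains from extra samples, which is the pattern that makes parallel sampling worthwhile.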
Collinear AI
@CollinearAI
Jun 17
4/n On one of the tasks, involving a Java codebase of 100K+ lines with 4 vulnerabilities (CWE94, CWE503, CWE862 and CWE863), both models fixed the three obvious bugs, but the fourth, more subtle one (CWE863) was solved only by Fable.
98
Collinear AI
@CollinearAI
Jun 9
We are hiring MTS with backgrounds in research, ML, engineering, product, and customer-facing deployments. Come get your seat on the rocket ship 🚀
Nazneen Rajani
@nazneenrajani
Jun 9
In January of this year, the number of MTS with PhDs @CollinearAI was 2. Today it is 8, with 3 more joining next month. We believe we are building something special that lies on the critical path to AGI. We are not like the other RL env companies. Not every hill is worth
281
Collinear AI
@CollinearAI
Jun 4
We are pleased to see that the latest MAI-Thinking-1 model relies heavily on a synthetic pipeline for RL environments, primarily for agentic MCP tool-use scenarios. Curiously, they especially highlight the FunReason-MT pipeline by Ant Group, which contains a few interesting
5.3K
Collinear AI
@CollinearAI
Jun 1
It's important for the community to reflect on the areas in which open-source labs have closed the gap on frontier capabilities: (1) 1M context. The DeepSeek V4 tech report has shown how compression + sparse selection of keys/values in attention can enable 1M context at
MiniMax (official)
@MiniMax_AI
Jun 1
Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities - Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas - MiniMax Sparse Attention scales context to 1M -
4.2K
Collinear AI
@CollinearAI
May 31
As agentic RL becomes more important in the research community, the problem of token-vs-text mismatch is now actively studied. Some throwbacks to earlier efforts from our side & frontier labs: - Back in January, when building our on-policy distillation framework Spider, we
clem 🤗
@ClementDelangue
May 29
Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea. Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get
7.4K
Collinear AI
@CollinearAI
May 29
In light of Claude Code's Dynamic Workflow rollout, we are reviewing some solid multi-agent research by frontier labs, as it is very informative for the agentic research community. - Anthropic's early "open source recipe" for Workflow, where they use multi agents to build a
8.8K