Collinear AI
125 posts
Collinear AI
@CollinearAI
The AI Simulation Lab
collinear.ai
Joined October 2023
45
Following
531
Followers
  • Collinear AI
    @CollinearAI
    22h
    What does it take to identify tasks in the narrow gap between "trivial" and "impossible" for the world's strongest models? We're discussing this and more with some of the field's leading researchers at our office in Sunnyvale next week, before we go off to ACL and ICML in July.
    Collinear AI
    @CollinearAI
    Jun 23
    Most “hard” problems are useless for training a model. The useful ones sit in a narrow learnable region where the model fails sometimes and succeeds sometimes. Here are some early results where we took 5 trivial and 5 impossible cybersecurity environments and had a model
    Samosas and the Race to AGI · Luma
    From luma.com
    246
  • Collinear AI
    @CollinearAI
    Jun 23
    Most “hard” problems are useless for training a model. The useful ones sit in a narrow learnable region where the model fails sometimes and succeeds sometimes. Here are some early results where we took 5 trivial and 5 impossible cybersecurity environments and had a model
    1.7K
    Collinear AI
    @CollinearAI
    Jun 23
    Replying to @CollinearAI
    6/7 In one initially hard example, a bug was buried behind confusing arithmetic and misleading “legacy” comments, so the solver never found it. The creator made it legible: a plain value instead of an obfuscated one, plus a hint pointing at the right file. Now the solver sometimes
    81
    Collinear AI
    @CollinearAI
    Jun 23
    7/7 In summary, with a few rounds of feedback, Opus 4.8 can reshape a task to land at the edge of GPT-5.5’s ability. An open question remains: is this due to the repeated rounds of feedback, or to the creator having capabilities that the solver lacks?
    72
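The "learnable region" described in this thread can be sketched as a filter on empirical solve rates. This is a minimal illustration, not the method used in the thread: the function names, the environment names, the rollout outcomes, and the cutoffs (0.0 and 1.0) are all assumptions made for the example.

```python
from typing import Dict, List


def solve_rate(outcomes: List[bool]) -> float:
    """Fraction of rollouts in which the solver succeeded."""
    if not outcomes:
        raise ValueError("need at least one rollout")
    return sum(outcomes) / len(outcomes)


def classify_task(outcomes: List[bool], low: float = 0.0, high: float = 1.0) -> str:
    """Label a task by where its solve rate falls.

    `low` and `high` are hypothetical cutoffs: at or below `low` the task
    is treated as impossible, at or above `high` as trivial, and anything
    in between as learnable (the solver fails sometimes and succeeds
    sometimes).
    """
    rate = solve_rate(outcomes)
    if rate <= low:
        return "impossible"
    if rate >= high:
        return "trivial"
    return "learnable"


# Hypothetical rollout outcomes for three made-up environments.
rollouts: Dict[str, List[bool]] = {
    "env_a": [True, True, True, True],
    "env_b": [False, False, False, False],
    "env_c": [False, True, False, True],
}

labels = {name: classify_task(outcomes) for name, outcomes in rollouts.items()}
for name, label in labels.items():
    print(name, label)
```

With only a handful of rollouts per task the solve rate is a noisy estimate, so in practice the cutoffs would likely be moved inward (for example 0.1 and 0.9) or the rollout count increased.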
  • Collinear AI
    @CollinearAI
    Jun 21
    What are the real bottlenecks to AGI? Let's debate! Collinear HQ, Sunnyvale, June 29th. Join researchers from xAI, Sierra, and Amazon AGI for some hot takes, moderated by our own @nazneenrajani. Invite-only for ACL/ICML authors.
    380
    Collinear AI
    @CollinearAI
    Jun 22
    Request invite here -->
    Samosas and the Race to AGI · Luma
    From luma.com
    71
  • Collinear AI reposted
    Sachin
    Collinear AI
    @sachpatro97
    Jun 21
    There’s been a lot of talk about the new models getting scary good. Mythos on cybersecurity. GPT-5.5 Codex on coding. GLM 5.1 as the all-around daily driver. But the biggest AI research question we have is much simpler: Can it do it on a rainy night in Stoke? Little sneak peek
    worldcupbench.com
    WorldCupBench — Coming soon
    An agentic LLM benchmark where models manage national teams through a simulated FIFA World Cup 2026. From Collinear AI.
    367
  • Collinear AI
    @CollinearAI
    Jun 17
    1/n We benchmarked the top 4 models on solving real-life cybersecurity vulnerabilities, using a collection of 221 tasks covering more than 150 CWEs, and found some interesting patterns.
    2.4K
    Collinear AI
    @CollinearAI
    Jun 17
    Replying to @CollinearAI
    3/n On the same set of 24 tasks, Opus and Fable have the same pass@1, but Fable has a higher pass@4. This shows that Fable has more variance in its rollouts, potentially making it an interesting candidate for scaling parallel test-time compute!
    114
    Collinear AI
    @CollinearAI
    Jun 17
    4/n On one task, involving a 100K+ Java codebase with 4 vulnerabilities (CWE94, CWE503, CWE862 and CWE863), both models fixed the three obvious bugs, but only Fable solved the fourth, more subtle one (CWE863).
    98
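The pass@1 versus pass@4 comparison in 3/n can be made concrete with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n rollouts of which c are correct. The thread does not say how its numbers were computed, so this is only a sketch: the two "models" and their per-task correct counts below are hypothetical, chosen to show how equal pass@1 can coexist with different pass@4.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n rollouts with c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average pass@k over tasks, given correct counts out of n rollouts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: 4 tasks, 4 rollouts each.
steady = [4, 4, 0, 0]  # always solves two tasks, never solves the others
varied = [2, 2, 2, 2]  # solves every task about half the time

for name, counts in [("steady", steady), ("varied", varied)]:
    p1 = mean_pass_at_k(counts, n=4, k=1)
    p4 = mean_pass_at_k(counts, n=4, k=4)
    print(f"{name}: pass@1={p1:.2f} pass@4={p4:.2f}")
```

Both hypothetical models score 0.50 on pass@1, but the one whose successes are spread across tasks reaches 1.00 on pass@4 while the other stays at 0.50. That gap is what makes a higher-variance model attractive for parallel sampling.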
  • Collinear AI
    @CollinearAI
    Jun 9
    We are hiring MTS with backgrounds in research, ML, engineering, product, and customer-facing deployments. Come get your seat on the rocket ship 🚀
    Nazneen Rajani
    Collinear AI
    @nazneenrajani
    Jun 9
    In January of this year, the number of MTS with PhDs @CollinearAI was 2. Today it is 8, with 3 more joining next month. We believe we are building something special that lies on the critical path to AGI. We are not like the other RL env companies. Not every hill is worth
    281
  • Collinear AI
    @CollinearAI
    Jun 4
    We are pleased to see that the latest MAI-Thinking-1 model relies heavily on a synthetic pipeline for RL environments, primarily for agentic MCP tool-use scenarios. Curiously, they especially highlight the FunReason-MT pipeline by Ant Group, which contains a few interesting
    5.3K
  • Collinear AI
    @CollinearAI
    Jun 1
    It's important for the community to reflect on the areas in which open-source labs have closed the gap on frontier capabilities: (1) 1M context. DeepSeek V4 tech report has sufficiently shown how compression + sparse selection of keys/values in attention can enable 1M context at
    MiniMax (official)
    @MiniMax_AI
    Jun 1
    Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities - Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas - MiniMax Sparse Attention scales context to 1M -
    4.2K
  • Collinear AI
    @CollinearAI
    May 31
    As agentic RL becomes more important in the research community, the problem of token-vs-text mismatch is now actively studied. Some throwbacks to earlier efforts from our side & frontier labs: - Back in January, when building our on-policy distillation framework Spider, we
    clem 🤗
    @ClementDelangue
    May 29
    Most people training agentic LLMs with RL right now have a silently broken training loop and have no idea. Here's the trap: single-turn RL works beautifully. Clean curves, sane rewards, everything converges. Then you add tools so the model can act mid-rollout, and things get
    7.4K
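Both posts above are truncated, so the exact failure they describe is not recoverable here. One common form of token-vs-text mismatch, assumed for this sketch, is that the token IDs a model sampled are decoded to text and later re-tokenized, yielding different IDs than the ones the policy actually produced. The toy vocabulary and greedy tokenizer below are invented for illustration and do not correspond to any real tokenizer.

```python
from typing import Dict, List

# Toy vocabulary (hypothetical): "ab" exists as a single merged token.
VOCAB: Dict[str, int] = {"a": 0, "b": 1, "ab": 2}
ID_TO_TEXT: Dict[int, str] = {i: t for t, i in VOCAB.items()}


def decode(ids: List[int]) -> str:
    """Concatenate the text of each token ID."""
    return "".join(ID_TO_TEXT[i] for i in ids)


def encode(text: str) -> List[int]:
    """Greedy longest-match tokenization over VOCAB."""
    pieces = sorted(VOCAB, key=len, reverse=True)
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    return ids


# The policy sampled "a" then "b" as two separate tokens.
sampled = [0, 1]
text = decode(sampled)       # the text that gets passed between turns
retokenized = encode(text)   # what a trainer sees if it re-encodes the text

print("sampled:    ", sampled)
print("retokenized:", retokenized)
print("same text:  ", decode(retokenized) == text)
print("same tokens:", retokenized == sampled)
```

The text round-trips exactly, yet the token sequences differ, so log-probabilities computed on the re-tokenized sequence would not match the actions the policy actually took. Keeping the sampled token IDs alongside the text is one way to avoid this.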
  • Collinear AI
    @CollinearAI
    May 29
    In light of Claude Code's Dynamic Workflow rollout, we are reviewing some solid multi-agent research by frontier labs, since it is very informative for the agentic research community. - Anthropic's early "open source recipe" for Workflow, where they use multiple agents to build a
    8.8K
