I wasn't sure if we were going to share this, because knowing what doesn't work is often more valuable than seeing what worked.
That - and being nervous about sharing your failures.
Here's a technical retrospective on our 2025:
Kimi K2 is genuinely impressive. On the same tasks and the same agentic harness, it beats Grok 4 one-on-one.
It also looks like it does this without CoT or thinking tokens.
Kimi is the real deal. Unless it's really Sonnet in a trench coat, this is the best agentic open-source model I've tested - BY A MILE.
Here's a slice* of a 4 HOUR run (~1 second per minute) with not much more than 'keep going' from me every 90 minutes or so.
The task involved
A week ago one of our customers handed us 1000 pages of this (10,000 more to come), and asked us for a RAG solution.
We said yes - because we said yes before we saw the document. But we've solved it - and there's a chance it's a strong improvement on all RAG SoTA.
No more waiting - we finally have a demo for multi-modal, 'walking' RAG!
Still blows my mind - this is an AI that's reading complex diagrams in a document like a human, 'looking' at pages, then 'walking' to more relevant pages until it's found an answer.
More details below
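To picture the 'look, then walk' loop described above, here's a minimal sketch. It is not our actual implementation: `Look`, `walk`, `toy_look` and the toy page graph are all hypothetical names made up for illustration, and in a real system `look` would be a call to a vision-capable model that reads a page image.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Look:
    """Result of 'looking' at one page (hypothetical structure)."""
    answer: Optional[str]   # set when the page answers the question
    next_pages: List[int]   # pages the reader wants to visit next


def walk(
    question: str,
    start_pages: List[int],
    look: Callable[[str, int], Look],
    max_steps: int = 20,
) -> Tuple[Optional[str], List[int]]:
    """Visit pages one at a time, following suggestions until an answer is found."""
    queue = list(start_pages)
    visited: List[int] = []
    while queue and len(visited) < max_steps:
        page = queue.pop(0)
        if page in visited:
            continue
        visited.append(page)
        result = look(question, page)
        if result.answer is not None:
            return result.answer, visited
        # Follow the freshest suggestions first, skipping pages already seen.
        queue = [p for p in result.next_pages if p not in visited] + queue
    return None, visited


# Toy stand-in for the model: page 1 points to 7, 7 points to 3, 3 has the answer.
TOY = {
    1: Look(None, [7]),
    7: Look(None, [3]),
    3: Look("valve V-12", []),
}


def toy_look(question: str, page: int) -> Look:
    return TOY.get(page, Look(None, []))


answer, path = walk("Which valve feeds the tank?", [1], toy_look)
```

The `max_steps` cap is an assumption too: some budget is needed so a walk that never finds an answer still stops.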
How many years until open source models take over? Is it never?
I've been manually testing 20-30 different models all claiming impressive scores on benchmarks against OpenAI and Anthropic. What I've found:
* Super tiny models are becoming insanely good
* Medium models are
How exactly do Language Models perceive time?
This is one of the best papers I've read this year (from Kai Nylund, @ssgrn, @nlpnoah), and here's what it suggests (IMO) 👇
This is genuinely blowing my mind - four years of everything we've done at Greywing, finished in 60 seconds
The rest is just me fooling around.
Before you ask, it's not the Assistants API - that's why we have interactive charts, abort, <200ms latency.
I decompiled Claude Code from just the minified code. Took me 8-10 hours, multiple subagents, and every flagship model from every provider.
Holy shit there's a lot in there. Claude Code is NOT just Claude in a loop - there's so much to learn from.
This is scary - ETL pipelines and ORMs are likely going away - or at least I shouldn't be getting paid for doing them anymore.
This is AI generating thousands of lines of typespecs and DDLs (with no more context than the dataset), and somehow it's all 100% correct.
Rant?👇
Another way to make Claude Code a 10x engineer for a complex change:
1. Make a plan for the change (if you need it) with Gemini.
2. Open a new branch.
3. Ask Claude to implement the change and maintain a scratchpad.md that is an APPEND-ONLY log with gotchas, judgement
Mindblown - this is a 7b local model with 128K context combining Metamorphosis with The Last Question to write a new story using just 10 GB of RAM
Even two months ago this would have been unfathomable.
Next is to try 20k tokens of SQL DDLs, for complex data
(Model below)
Turns out I was wrong. Gemini is 30x cheaper for transcription (same quality) if you prompt right and segment to stay under 128k.
So how good is it? It's crazy for clean audio (source+code in 🧵)
AssemblyAI: 92.06% ($0.21)
Flash-002: 92.68% ($0.00679) 🤯
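A quick sanity check on those numbers, plus a rough sketch of what 'segment to stay under 128k' could look like. The two costs come from the results above; everything else is an assumption for illustration: `plan_segments` is a hypothetical helper, `tokens_per_second=32` is an assumed audio token rate, and `reserve` is an arbitrary headroom for the prompt and output.

```python
import math

ASSEMBLYAI_COST = 0.21   # USD, from the results above
FLASH_COST = 0.00679     # USD, from the results above


def cost_ratio(baseline: float, candidate: float) -> float:
    """How many times cheaper the candidate is than the baseline."""
    return baseline / candidate


def plan_segments(
    duration_s: float,
    tokens_per_second: int = 32,    # ASSUMPTION: audio token rate
    token_budget: int = 128_000,    # the 128k limit mentioned above
    reserve: int = 8_000,           # ASSUMPTION: headroom for prompt + output
):
    """Split audio into equal-length segments that each fit the token budget."""
    max_seconds = (token_budget - reserve) // tokens_per_second
    n = max(1, math.ceil(duration_s / max_seconds))
    seg = duration_s / n
    return [(i * seg, min((i + 1) * seg, duration_s)) for i in range(n)]


ratio = cost_ratio(ASSEMBLYAI_COST, FLASH_COST)
segments = plan_segments(4 * 3600)  # a 4 hour recording
```

With these numbers the ratio comes out at roughly 30.9, which is where the '30x' comes from. Equal-length segments are just the simplest choice; cutting on silences would avoid splitting words across segments.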
Let me say more 👇