Jonas Geiping (@jonasgeiping) / X

667 posts

Machine Learning Researcher in Tübingen at the ELLIS Institute & Max-Planck for Intelligent Systems // Working on Safety & Efficiency of modern ML

Tübingen AI Center, Germany

jonasgeiping.github.io

Joined September 2021

Pinned
Jonas Geiping
@jonasgeiping
May 13
We’re training models wrong and it’s due to ChatGPT. Even the modern coding agents used daily still use message-based exchanges: They send messages to users, to themselves (CoT) and to tools, and receive messages in turn. This bottlenecks even very intelligent agents to a single
Jonas Geiping
@jonasgeiping
Feb 10, 2025
Ok, so I can finally talk about this! We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale. The model has an internal latent space in which it can adaptively spend more compute to think longer. I think the tech report ...🐦‍⬛
Jonas Geiping
@jonasgeiping
Feb 10, 2025
Replying to @jonasgeiping
What is it doing when it thinks longer? We find evidence for pretty advanced structures in latent space, such as the tendency to use orbitals (see picture) to compute arithmetic tasks and to reason about sentence structure. So, this model really is a 🔄 shape-rotator 🔄
Jonas Geiping
@jonasgeiping
Feb 10, 2025
Replying to @jonasgeiping
You can find the model here: huggingface.co/tomg-group-umd… The code here: github.com/seal-rg/recurr… and the tech report here: arxiv.org/abs/2502.05171 All data is public, and intermediate checkpoints are available!
tomg-group-umd/huginn-0125 · Hugging Face
From huggingface.co
Jonas Geiping
@jonasgeiping
Jun 30, 2025
(Structured) Model pruning is a nice tool when you really need to deploy a model that is a *bit* smaller, but don't want to deploy a bigger hammer like quantization. We recently published an improved *automated* model pruning method, surprisingly based on model merging:
Jonas Geiping
@jonasgeiping
Feb 10, 2025
Replying to @jonasgeiping
has something for everyone: a new model architecture, optimizer details, AMD training (we trained on 4096 AMD GPUs), our data pipeline, pretraining details, and lots of analysis! Here are a few of my highlights:
Jonas Geiping
@jonasgeiping
Jan 6, 2023
Last week, @tomgoldsteincs and I finally put our paper on cramming BERT training into limited resources on arXiv: arxiv.org/abs/2212.14034. Here are some remaining thoughts from my side: 1/9
arxiv.org
Cramming: Training a Language Model on a Single GPU in One Day
Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers...
Jonas Geiping
@jonasgeiping
Feb 10, 2025
Replying to @jonasgeiping
First, the model (with 3.5B params), even though trained semi-optimally, and for 800B tokens, is competitive with 7B open-source models trained for 2-3T tokens (OLMo-v1) - but we can't beat the new OLMo data recipe (yet). This is pretty exciting for our first large-scale run.
Jonas Geiping
@jonasgeiping
Oct 19, 2023
Happy to announce that I've joined the ELLIS Institute and the Max-Planck for Intelligent Systems in Tübingen as a group leader! I'm excited to take a deep dive into the safety, security and efficiency of machine learning in the next years, working with both institutes 🦭.
Intelligent Systems
@MPI_IS
Oct 19, 2023
The first Group Leaders join our ELLIS Institute #Tübingen gGmbH as Hector Endowed Fellows: @orvieto_antonio, @wielandbr, @CeleMenDu, @jonasgeiping They will conduct cutting-edge #research in close collaboration with us and the Tübingen #AI Center: bit.ly/3PYCCSF
Jonas Geiping
@jonasgeiping
Feb 10, 2025
Replying to @jonasgeiping
What is pretty exciting is that simply by training with our arch and objective, a separation emerges from scale - the model's latents converge quicker for some tokens in a sentence than others. In this figure the model takes more time to think about the key parts of the text:
Jonas Geiping
@jonasgeiping
Jul 19, 2024
Modern LLMs have large vocab sizes and long seq lengths, which lead to an annoying peak in memory due to logit activations... so, I wasted some time last month writing a fused Triton kernel to do nn.Linear + nn.CrossEntropyLoss without a memory peak ⬇️
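The post does not describe the kernel's internals, so the following is only a plain-Python sketch of one common way to avoid the logit memory peak, and not necessarily what the Triton kernel does: compute the logits for a small chunk of tokens, reduce them to per-token losses right away, and discard them before moving to the next chunk. The function names, the `chunk_size` parameter, and the list-of-lists data layout are assumptions made for illustration.

```python
import math


def _dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def _token_loss(logits, target):
    """Cross-entropy for one token, using a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def full_linear_cross_entropy(hidden, weight, targets):
    """Reference version: materializes all N x V logits before reducing."""
    logits = [[_dot(w, h) for w in weight] for h in hidden]
    losses = [_token_loss(row, t) for row, t in zip(logits, targets)]
    return sum(losses) / len(losses)


def chunked_linear_cross_entropy(hidden, weight, targets, chunk_size=2):
    """Same loss, but at most chunk_size x V logits exist at any time.

    hidden:  N token vectors of length D
    weight:  V output rows of length D (no bias, for brevity)
    targets: N class indices in [0, V)
    """
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        h_chunk = hidden[start:start + chunk_size]
        t_chunk = targets[start:start + chunk_size]
        # Logits for this chunk only; they go out of scope after the reduction.
        chunk_logits = [[_dot(w, h) for w in weight] for h in h_chunk]
        total += sum(_token_loss(row, t) for row, t in zip(chunk_logits, t_chunk))
    return total / len(hidden)


# Tiny example: 5 tokens, hidden size 3, vocabulary of 4 classes.
hidden = [[0.1, -0.2, 0.3], [1.0, 0.5, -0.5], [0.0, 0.0, 1.0],
          [-1.0, 2.0, 0.5], [0.3, 0.3, 0.3]]
weight = [[0.2, 0.1, 0.0], [-0.3, 0.4, 0.5], [1.0, -1.0, 0.2], [0.0, 0.0, 0.0]]
targets = [0, 2, 1, 3, 2]

loss_full = full_linear_cross_entropy(hidden, weight, targets)
loss_chunked = chunked_linear_cross_entropy(hidden, weight, targets, chunk_size=2)
```

Both functions return the same mean loss; only the peak number of live logits differs. A real fused kernel would also have to handle the backward pass, which this sketch leaves out.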
Jonas Geiping
@jonasgeiping
Jun 1, 2023
How can you watermark the output of a diffusion model? Ideally, with a method that can be easily incorporated into existing pipelines, is invisible, and is robust to image manipulations? We look at this question in ***Tree-Ring Watermarks: Fingerprints for Diffusion Images***
Jonas Geiping
@jonasgeiping
Feb 10, 2025
Replying to @jonasgeiping
We had enough compute for only a single shot to train at scale (and that is the model we've published). On reasoning tasks like GSM8k, the model is pretty competitive, even compared to other pretrained open-source models, even though we have done no post/mid-training...
Jonas Geiping
@jonasgeiping
Feb 28, 2025
Modern LLMs are strong coders, as measured, for example, by their Codeforces rating. But are they also as capable of finding subtle bugs? We look at debugging Codeforces submissions, and find that finding errors (and falsifying) is still very challenging. More details here:
Shiven Sinha
@shiven_sinha
Feb 28, 2025
AI can generate correct-seeming hypotheses (and papers!). Brandolini's law states BS is harder to refute than generate. Can LMs falsify incorrect solutions? o3-mini (high) scores just 9% on our new benchmark REFUTE. Verification is not necessarily easier than generation 🧵