DatologyAI (@datologyai) / X

210 posts

DatologyAI

@datologyai

DatologyAI builds tools to automatically select and optimize the best data on which to train AI models, leading to better, smaller models which train faster.

Redwood City, CA

datologyai.com

Joined September 2023

3,049 Followers

DatologyAI
@datologyai
5h
The "you can only catch up by distilling from a frontier model" narrative is wrong. We curated the data for @Arceeai's Trinity Large entirely from public sources, zero closed-model APIs, and it's competitive with the open frontier. Better data does the work.
DatologyAI
@datologyai
5h
Full episode:
DatologyAI
@datologyai
Jun 19
Compute scarcity is about to force the reckoning the frontier labs have avoided: efficiency. You don't need trillion-parameter models for frontier-class capability. With better data, far smaller models match the best of a year or two ago, at a fraction of the cost to serve.
DatologyAI
@datologyai
Jun 19
Full episode:
DatologyAI
@datologyai
Jun 18
Replying to @datologyai
5/ Alexander Gurung from the University of Edinburgh presented his work on learning to reason for long-form generation. What does a reward signal look like when the goal is a good story? 📺
DatologyAI
@datologyai
Jun 18
Replying to @datologyai
14/ Sukjun Hwang (@sukjun_hwang) from CMU presented his work on H-Nets: Dynamic chunking for end-to-end hierarchical sequence modeling 📄
arxiv.org
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
Major progress on language models (LMs) in recent years has largely resulted from moving away from specialized models designed for specific tasks, to general models based on powerful architectures...
DatologyAI
@datologyai
Jun 18
15/ What a lineup, and that was only year one. Summer of Data is back for 2026 and we're just getting started. Keep an eye out for our lineup announcement and new talks every week. Want to present? DM us 👀 Stay data-obsessed 🤓
DatologyAI
@datologyai
Jun 18
Replying to @datologyai
8/ Xindi Wu (@cindy_x_wu) from Princeton presented her work on data efficiencies for multimodal ML (COMPACT). How do you teach a model to compose visual capabilities, atomic to complex? 📺
DatologyAI
@datologyai
Jun 18
Replying to @datologyai
7/ Suhas Kotha (@kothasuhas) from Stanford presented his work on why standard fine-tuning uses rare data inefficiently. How do you get a model to learn from the examples that matter most? 📺
DatologyAI
@datologyai
Jun 18
Replying to @datologyai
6/ Jacob Springer (@jacspringer) from CMU presented his work on echo embeddings and why overtrained language models are harder to fine-tune. Is more pretraining always a free lunch? 📺
DatologyAI
@datologyai
Jun 18
Replying to @datologyai
4/ Shizhe Diao (@shizhediao) from Thinking Machines presented his work on CLIMB, clustering-based iterative data selection for pretraining. Can a model find its own best data blend? 📺
DatologyAI
@datologyai
Jun 18
Replying to @datologyai
3/ Maximilian Böther (@MaxiBoether) from ETH Zurich presented his work on Mixtera, a data plane for foundation model training. How do you manage what your model eats at scale? He is now working at @datologyai on cool dataloader improvements 📺
DatologyAI
@datologyai
Jun 18
Replying to @datologyai
2/ Charlie Snell (@sea_snell) from UC Berkeley presented his work on scaling test-time compute and predicting emergent capabilities by finetuning. When does it pay to let a model think longer? 📺
DatologyAI
@datologyai
Jun 18
1/ 🌞 Our Summer of Data Seminar brought together some of the sharpest minds in data curation last year. We are bringing it back in 2026! Let's recap the great talks from 2025!
DatologyAI reposted
DatologyAI
@datologyai
Jun 15
A spicy take from @arimorcos on @jacobeffron's Unsupervised Learning: frontier APIs may not always be there. The teams that can build their own models won't be exposed when that happens.