Christina Baek (@_christinabaek) / X

Christina Baek

@_christinabaek

research @openai // previously phd @mldcmu

Joined June 2021

Pinned
Christina Baek
@_christinabaek
Mar 18
Models are typically specialized to new domains by finetuning on small, high-quality datasets. We find that repeating the same dataset 10–50× starting from pretraining leads to substantially better downstream performance, in some cases outperforming larger models. 🧵
Christina Baek
@_christinabaek
Jul 4, 2022
Estimating out-of-distribution (OOD) performance is hard because labeled data is expensive. Can we predict OOD performance w/ only _unlabeled data_? In our work (arxiv.org/pdf/2206.13089…), we show this can be done using models’ agreement. w/ @yidingjiang, Aditi R., @zicokolter
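A minimal sketch of how agreement between models might be turned into a label-free OOD accuracy estimate. It assumes the linear trend between probit-scaled in-distribution and OOD agreement carries over to accuracy; the helper names (`agreement`, `fit_line`, `estimate_ood_accuracy`) are hypothetical and are not taken from the paper, which should be consulted for the actual method.

```python
from statistics import NormalDist

_NORMAL = NormalDist()


def agreement(preds_a, preds_b):
    """Fraction of inputs on which two models predict the same label (no labels needed)."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_ood_accuracy(id_agreements, ood_agreements, id_accuracy):
    """Assumption: the probit-scale line fitted on agreement also maps ID accuracy to OOD accuracy.

    All rates must lie strictly between 0 and 1, since the probit is undefined at the endpoints.
    """
    slope, intercept = fit_line(
        [_NORMAL.inv_cdf(a) for a in id_agreements],
        [_NORMAL.inv_cdf(a) for a in ood_agreements],
    )
    return _NORMAL.cdf(slope * _NORMAL.inv_cdf(id_accuracy) + intercept)


# Toy predictions from two models on four unlabeled inputs.
pair_agreement = agreement([0, 1, 1, 2], [0, 1, 2, 2])

# Toy agreement rates for three model pairs, in distribution and under shift.
estimate = estimate_ood_accuracy([0.6, 0.7, 0.9], [0.4, 0.55, 0.8], id_accuracy=0.85)
```

Only the in-distribution accuracy needs labels here; everything measured on the shifted data is agreement between model predictions.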
Christina Baek
@_christinabaek
Apr 16, 2025
Are current reasoning models optimal for test-time scaling? 🌠 No! Models make the same incorrect guess over and over again. We show that you can fix this problem w/o any crazy tricks 💫 – just do weight ensembling (WiSE-FT) for big gains on math! 1/N
Christina Baek
@_christinabaek
May 7, 2024
Did you know that the optimizer Sharpness Aware Minimization (SAM) is very robust to heavy label noise, with gains tens of percent above SGD? In our new work, we deep dive into how SAM achieves these gains. As it turns out, it’s not at all about sharpness at convergence!
Christina Baek
@_christinabaek
Oct 25, 2024
Chatbots are often augmented w/ new facts by context from the user or retriever. Models must adapt instead of hallucinating outdated facts. In this work w/ @goyalsachin007, @zicokolter, @AdtRaghunathan, we show that instruction tuning fails to reliably improve this behavior! [1/n]
Christina Baek
@_christinabaek
Apr 24, 2025
When we train models to do QA, are we robustly improving context dependency? No! In our ICLR Oral (Fri 11 AM), we show that if the base model knows the facts already, it shortcuts and learns to ignore the context completely! Visit us to learn more about knowledge conflicts 😀
Christina Baek
@_christinabaek
Nov 28, 2022
Come visit Hall J #1037 on Tue Nov 29th, at 4 pm CST to hear about Agreement-on-the-Line (arxiv.org/abs/2206.13089). Attending NeurIPS all week, please feel free to reach out to talk about the work, generalization, etc. 🙂
Christina Baek
@_christinabaek
Dec 11, 2024
My coauthors and I will be presenting the following works on Agreement-on-the-Line _this Wednesday evening 4:30_ at #NeurIPS! If you're interested in unlabeled OOD performance estimation and understanding how deep models encode distribution shifts, come visit us! 🧵
Christina Baek
@_christinabaek
Apr 8, 2024
Check out this really cool work w/ @yidingjiang! We built a simple system (PCA + Clustering) for quantifying how "features" are distributed across models and data. Using this tool, we can mathematically understand the Generalization Disagreement Equality. 🤝
Yiding Jiang
@yidingjiang
Apr 8, 2024
Models with different randomness make different predictions at test time even if they are trained on the same data. In our latest ICLR paper (oral), we investigate how models learn different features, and the effect this has on agreement and (potentially) calibration. 1/
Christina Baek
@_christinabaek
Apr 16, 2025
Replying to @_christinabaek
What’s WiSE-FT? Simply take the earliest checkpoint and a later model and interpolate their weights. This can get you best-of-both-worlds gains. ▶️ Previously: get higher out-of-distribution acc _and_ in-distribution acc ▶️ Our paper: achieve higher Pass@1 _and_ Pass@K. 5/N
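The interpolation described in this post can be sketched as a per-parameter weighted average of two checkpoints. This is an illustrative sketch only: checkpoints are modeled as plain dicts of float lists, and the function name `interpolate_weights` and the mixing coefficient `alpha` are assumptions made for the example, not names from the paper.

```python
def interpolate_weights(early, late, alpha):
    """Return (1 - alpha) * early + alpha * late for every named parameter.

    `early` and `late` map parameter names to flat lists of floats
    (a stand-in for real checkpoint tensors).
    """
    if early.keys() != late.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name in early:
        if len(early[name]) != len(late[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1 - alpha) * e + alpha * l for e, l in zip(early[name], late[name])
        ]
    return merged


# Toy checkpoints: alpha=0 recovers the early model, alpha=1 the later one.
early = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
late = {"layer.weight": [4.0, 6.0], "layer.bias": [3.0]}
merged = interpolate_weights(early, late, alpha=0.5)
```

With real models the same averaging would be applied to each tensor in the two checkpoints, with `alpha` chosen on held-out data.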
Christina Baek
@_christinabaek
Apr 16, 2025
Replying to @_christinabaek
This work was completed in collaboration w/ @xingyudang, @wen_kaiyue, @AdtRaghunathan, @zicokolter. Xingyu and Kaiyue are brilliant incoming researchers, I can’t believe they’re just starting grad school. It was amazing to work with them! See more: arxiv.org/abs/2504.10478 10/N
Weight Ensembling Improves Reasoning in Language Models
We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the...
Christina Baek
@_christinabaek
Apr 7, 2022
The maximal coding rate reduction (MCR²) objective has many nice properties, but it's too expensive to feasibly train on large datasets: # of log det terms scales with # of classes. Our reformulation is not only scalable, but also seems to better control for noise.
Yi Ma
@YiMaTweets
Apr 4, 2022
A paper from recently graduated students: arxiv.org/abs/2204.00077 Two key messages: 1. the rate reduction objective can be computed more efficiently via its Variational Form; 2. the variational form can be naturally interpreted as Dictionary Learning. All things seem connected...
Christina Baek
@_christinabaek
Dec 9, 2024
This is a great work by @EungyeupK. It turns out that applying test-time adaptation improves Accuracy+Agreement-on-the-Line trends even in datasets like Camelyon17! A simple explanation: TTA collapses distribution shifts into “scale shifts” in the penultimate representation space
Eungyeup Kim
@EungyeupKim
Dec 9, 2024
Under distribution shifts, how can we evaluate model performance in OOD without labels? In our #NeurIPS2024 paper, we show how test-time adaptation (TTA) strengthens the fascinating "accuracy/agreement-on-the-line" trend—improving OOD predictability without labels! 🧵👇
Christina Baek
@_christinabaek
Apr 16, 2025
Replying to @_christinabaek
Finally, diverse decoding strategies like min-p + high temperature can't match WiSE-FT. Why? We show a bias-variance tradeoff for test Pass@K: it 🔼 w/ test Pass@1 and 🔽 w/ diversity collapse. Unlike WiSE-FT, decoding methods increase diversity at the cost of accuracy. 9/N
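A toy illustration of why Pass@K can stall even when Pass@1 is unchanged. It assumes each of the K samples for a problem succeeds independently with a fixed per-problem probability, so that Pass@K for that problem is 1 - (1 - p)^K; the function name and the two example models are hypothetical and not taken from the paper.

```python
def pass_at_k(per_problem_p, k):
    """Average over problems of 1 - (1 - p)^k, the chance that at least one of k samples is correct."""
    return sum(1 - (1 - p) ** k for p in per_problem_p) / len(per_problem_p)


# Two hypothetical models with the same average Pass@1 of 0.5.
collapsed = [1.0, 1.0, 0.0, 0.0]  # always right or always wrong: repeated samples add nothing
diverse = [0.5, 0.5, 0.5, 0.5]    # each sample is a fresh chance on every problem

collapsed_pass_at_4 = pass_at_k(collapsed, 4)
diverse_pass_at_4 = pass_at_k(diverse, 4)
```

Under this assumption the collapsed model stays at 0.5 for any K, while the diverse model's Pass@K grows with K, which is the sense in which spread in per-problem success rates at a fixed mean hurts test-time scaling.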