Christina Baek
173 posts
Christina Baek
@_christinabaek
research @openai // previously phd @mldcmu
kebaek.github.io
Joined June 2021
677 Following
2,099 Followers
  • Pinned
    Christina Baek
    @_christinabaek
    Mar 18
    Models are typically specialized to new domains by finetuning on small, high-quality datasets. We find that repeating the same dataset 10–50× starting from pretraining leads to substantially better downstream performance, in some cases outperforming larger models. 🧵
  • Christina Baek
    @_christinabaek
    Jul 4, 2022
    Estimating out-of-distribution (OOD) performance is hard because labeled data is expensive. Can we predict OOD performance w/ only _unlabeled data_? In our work (arxiv.org/pdf/2206.13089…), we show this can be done using models’ agreement. w/ @yidingjiang, Aditi R., @zicokolter
  • Christina Baek
    @_christinabaek
    Apr 16, 2025
    Are current reasoning models optimal for test-time scaling? 🌠 No! Models make the same incorrect guess over and over again. We show that you can fix this problem w/o any crazy tricks 💫 – just do weight ensembling (WiSE-FT) for big gains on math! 1/N
  • Christina Baek
    @_christinabaek
    May 7, 2024
    Did you know that the optimizer Sharpness Aware Minimization (SAM) is very robust to heavy label noise, with gains tens of percent above SGD? In our new work, we deep dive into how SAM achieves these gains. As it turns out, it’s not at all about sharpness at convergence!
  • Christina Baek
    @_christinabaek
    Oct 25, 2024
    Chatbots are often augmented w/ new facts by context from the user or retriever. Models must adapt instead of hallucinating outdated facts. In this work w/@goyalsachin007, @zicokolter, @AdtRaghunathan, we show that instruction tuning fails to reliably improve this behavior! [1/n]
  • Christina Baek
    @_christinabaek
    Apr 24, 2025
    When we train models to do QA, are we robustly improving context dependency? No! In our ICLR Oral (Fri 11 AM), we show that if the base model knows the facts already, it shortcuts and learns to ignore the context completely! Visit us to learn more about knowledge conflicts 😀
  • Christina Baek
    @_christinabaek
    Nov 28, 2022
    Come visit Hall J #1037 on Tue Nov 29th, at 4 pm CST to hear about Agreement-on-the-Line (arxiv.org/abs/2206.13089). Attending NeurIPS all week, please feel free to reach out to talk about the work, generalization, etc. 🙂
  • Christina Baek
    @_christinabaek
    Dec 11, 2024
    My coauthors and I will be presenting the following works on Agreement-on-the-Line _this Wednesday evening 4:30_ at #NeurIPS! If you're interested in unlabeled OOD performance estimation and understanding how deep models encode distribution shifts, come visit us! 🧵
  • Christina Baek
    @_christinabaek
    Apr 8, 2024
    Check out this really cool work w/@yidingjiang ! We built a simple system (PCA + Clustering) for quantifying how "features" are distributed across models and data. Using this tool, we can mathematically understand the Generalization Disagreement Equality. 🤝
    Yiding Jiang
    @yidingjiang
    Apr 8, 2024
    Models with different randomness make different predictions at test time even if they are trained on the same data. In our latest ICLR paper (oral), we investigate how models learn different features, and the effect this has on agreement and (potentially) calibration. 1/
  • Christina Baek
    @_christinabaek
    Apr 16, 2025
    Replying to @_christinabaek
    What’s WiSE-FT? Simply take the earliest checkpoint and a later model and interpolate their weights. This can get you best-of-both-world gains. ▶️ Previously: get higher out-of-distribution acc _and_ in-distribution acc ▶️ Our paper: achieve higher Pass@1 _and_ Pass@K. 5/N
  • Christina Baek
    @_christinabaek
    Apr 16, 2025
    Replying to @_christinabaek
    This work was completed in collaboration w/ @xingyudang, @wen_kaiyue, @AdtRaghunathan, @zicokolter. Xingyu and Kaiyue are brilliant incoming researchers, I can’t believe they’re just starting grad school. It was amazing to work with them! See more: arxiv.org/abs/2504.10478 10/N
    arxiv.org
    Weight Ensembling Improves Reasoning in Language Models
    We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the...
  • Christina Baek
    @_christinabaek
    Apr 7, 2022
    Maximal coding rate reduction (MCR2) objective has many nice properties, but it's too expensive to feasibly train on large datasets: # of log det terms scales with # of classes. Our reformulation is not only scalable, but also seems to better control for noise.
    Yi Ma
    @YiMaTweets
    Apr 4, 2022
    A paper from recently graduated students: arxiv.org/abs/2204.00077 Two key messages: 1. the rate reduction objective can be computed more efficiently via its Variational Form; 2. the variational form can be naturally interpreted as Dictionary Learning. All things seem connected...
  • Christina Baek
    @_christinabaek
    Dec 9, 2024
    This is a great work by @EungyeupK. It turns out that applying test-time adaptation improves Accuracy+Agreement-on-the-Line trends even in datasets like Camelyon17! A simple explanation: TTA collapses distribution shifts into “scale shifts” in the penultimate representation space
    Eungyeup Kim
    @EungyeupKim
    Dec 9, 2024
    Under distribution shifts, how can we evaluate model performance in OOD without labels? In our #NeurIPS2024 paper, we show how test-time adaptation (TTA) strengthens the fascinating "accuracy/agreement-on-the-line" trend—improving OOD predictability without labels! 🧵👇
  • Christina Baek
    @_christinabaek
    Apr 16, 2025
    Replying to @_christinabaek
    Finally, diverse decoding strategies like min-p + high temperature can't match WiSE-FT. Why? We show a bias-variance tradeoff of test Pass@K: it 🔼 w/ test Pass@1, 🔽 w/ diversity collapse. Unlike WiSE-FT, decoding methods increase diversity at the cost of accuracy. 9/N
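
The WiSE-FT post above (5/N) describes the recipe as taking an early checkpoint and a later model and interpolating their weights. A minimal sketch of that interpolation is below. It is not the authors' code: the mixing coefficient `alpha`, the function name `interpolate_weights`, and the use of plain dicts of float lists as a stand-in for real model checkpoints are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def interpolate_weights(early: Weights, late: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * early + alpha * late, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if early.keys() != late.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Weights = {}
    for name, w_early in early.items():
        w_late = late[name]
        if len(w_early) != len(w_late):
            raise ValueError(f"shape mismatch for {name}")
        # Linear interpolation between the two checkpoints' weights.
        merged[name] = [(1.0 - alpha) * a + alpha * b
                        for a, b in zip(w_early, w_late)]
    return merged


# Toy checkpoints (made-up numbers, assumed for illustration only).
early_ckpt = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
late_ckpt = {"layer.weight": [4.0, 6.0], "layer.bias": [3.0]}
merged = interpolate_weights(early_ckpt, late_ckpt, alpha=0.5)
```

With `alpha=0.0` the result is the early checkpoint and with `alpha=1.0` it is the later one. Which value works best is not stated in the posts and would have to be tuned.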
