Cathy Jiao

Email: cljiao@cs.cmu.edu

I am a PhD student at the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University, advised by Chenyan Xiong. Recently, I spent a wonderful summer in NYC at Spotify Research, hosted by Paul Bennett.

My research focuses on data-centric AI. A central thread of my work is data attribution: quantifying how data influences model training in foundation models, which I explore through efficient approximations and practical applications such as dataset curation and data valuation/pricing. More broadly, I aim to develop frameworks that make data usage more transparent, reliable, and impactful for both research and deployment of foundation models.

Previously, I finished my MS at CMU LTI, where I worked on dialogue systems, advised by Maxine Eskenazi and Aaron Steinfeld. Before grad school, I spent time in industry working on machine learning and deep learning applications for natural language processing. Prior to that, I graduated with distinction from the University of British Columbia with a B.S. in CS & Math.

In my spare time, I enjoy biking around Pittsburgh and cooking. If you work on similar topics or want to chat, feel free to reach out!

News

Nov 01, 2025 :round_pushpin: I will be attending NeurIPS in San Diego. Looking forward to presenting our latest work!
Sep 18, 2025 :tada: DATE-LM was accepted to NeurIPS 2025. We introduce a rigorous, applications-driven benchmark for large-scale evaluation of data attribution methods in LLMs. Check out our :page_facing_up: paper, :computer: code, and :trophy: leaderboard.
Sep 18, 2025 :tada: Our work on Fairshare Data Pricing was accepted to NeurIPS 2025, introducing a data-influence–based framework for fair pricing of LLM training datasets. Check out our :page_facing_up: paper.
Feb 01, 2025 :tada: Our work on In-Context Probing (ICP) for Data Attribution was accepted to NAACL 2025. We showed that simple probing of LLMs may serve as a practical proxy for gradient-based data attribution, enabling efficient identification of influential training samples. Check out our :page_facing_up: paper and :computer: code.

Selected Publications

(See Google Scholar for all)

*= equal contribution

  1. DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models
    Cathy Jiao* , Yijun Pan* , Emily Xiao* , Daisy Sheng , Niket Jain , Hanzhang Zhao , Ishita Dasgupta , Jiaqi W. Ma , and Chenyan Xiong
    2025
  2. Fairshare Data Pricing via Data Valuation for Large Language Models
    Luyang Zhang* , Cathy Jiao* , Beibei Li , and Chenyan Xiong
    2025
  3. NAACL
    On the Feasibility of In-Context Probing for Data Attribution
    Cathy Jiao , Gary Gao , Aditi Raghunathan , and Chenyan Xiong
    2025
  4. IWSDS
    Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation
    Jessica Huynh , Cathy Jiao , Prakhar Gupta , Shikib Mehri , Payal Bajaj , Vishrav Chaudhary , and Maxine Eskenazi
    2023
  5. EMNLP
    Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning
    Prakhar Gupta , Cathy Jiao , Yi-Ting Yeh , Shikib Mehri , Maxine Eskenazi , and Jeffrey P. Bigham
    2022