Cathy Jiao


Email: cljiao@cs.cmu.edu

I am a PhD student at the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University, advised by Chenyan Xiong. Recently, I spent a wonderful summer in NYC at Spotify Research, hosted by Paul Bennett.

My research focuses on data-centric AI: designing methods/frameworks to better understand, curate, and evaluate the data for large language models. A central thread of my work is data attribution – identifying how individual data points shape model outputs – which I explore through efficient approximations and practical applications such as dataset curation and data valuation/pricing. More broadly, I aim to develop frameworks that make data usage more transparent, reliable, and impactful for both research and deployment.

Previously, I completed my master's degree at CMU LTI, where I was advised by Maxine Eskenazi and Aaron Steinfeld. Before grad school, I spent time in industry working on machine learning and deep learning applications for natural language processing. Prior to that, I graduated with distinction from the University of British Columbia with a B.S. in CS & Math.

In my spare time, I enjoy cooking and biking around Pittsburgh. If you work on similar topics or want to chat, feel free to reach out!

Selected Publications

(See Google Scholar for all)

* = equal contribution

  1. DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models
    Cathy Jiao*, Yijun Pan*, Emily Xiao*, Daisy Sheng, Niket Jain, Hanzhang Zhao, Ishita Dasgupta, Jiaqi W. Ma, and Chenyan Xiong
    NeurIPS 2025
  2. Fairshare Data Pricing for Large Language Models
    Luyang Zhang*, Cathy Jiao*, Beibei Li, and Chenyan Xiong
    NeurIPS 2025
  3. On the Feasibility of In-Context Probing for Data Attribution
    Cathy Jiao, Gary Gao, Aditi Raghunathan, and Chenyan Xiong
    NAACL 2025
  4. A Survey of Data Attribution: Methods, Applications, and Evaluation in the Era of Generative AI
    Junwei Deng, Yuzheng Hu, Pingbang Hu, Ting-Wei Li, Shixuan Liu, Jiachen T. Wang, Dan Ley, Qirun Dai, Benhao Huang, Jin Huang, and others
    2025
  5. Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation
    Jessica Huynh, Cathy Jiao, Prakhar Gupta, Shikib Mehri, Payal Bajaj, Vishrav Chaudhary, and Maxine Eskenazi
    IWSDS 2023
  6. Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning
    Prakhar Gupta, Cathy Jiao, Yi-Ting Yeh, Shikib Mehri, Maxine Eskenazi, and Jeffrey P. Bigham
    EMNLP 2022