Cathy Jiao

Email: cljiao@cs.cmu.edu

Hello! I am a PhD student at the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University, advised by Chenyan Xiong.

I work broadly on data attribution for large language models (i.e., determining how much individual training samples contribute to model outputs). My current work investigates approximations for data attribution methods, with applications to dataset curation and data valuation/pricing. Recently, I have also become interested in methodologies for evaluating data attribution methods.

Previously, I completed my master's at CMU LTI, where I was advised by Maxine Eskenazi and Aaron Steinfeld. Before grad school, I spent time in industry working on machine learning and deep learning applications for natural language processing. Prior to that, I graduated with distinction from the University of British Columbia with a B.S. in Computer Science and Mathematics.

In my spare time, I enjoy cooking and biking around Pittsburgh. If you work on similar topics or want to chat, feel free to reach out!

[Google Scholar] [Semantic Scholar] [CV]

Publications

* = equal contribution

  1. Fairshare Data Pricing for Large Language Models
    Luyang Zhang*, Cathy Jiao*, Beibei Li, and Chenyan Xiong
    Preprint. 2025
  2. On the Feasibility of In-Context Probing for Data Attribution
    Cathy Jiao, Gary Gao, Aditi Raghunathan, and Chenyan Xiong
    NAACL 2025
  3. Examining Prosody in Spoken Navigation Instructions for People with Disabilities
    Cathy Jiao, Aaron Steinfeld, and Maxine Eskenazi
    NAACL 2024, Workshop on Bridging HCI and NLP
  4. Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation
    Jessica Huynh, Cathy Jiao, Prakhar Gupta, Shikib Mehri, Payal Bajaj, Vishrav Chaudhary, and Maxine Eskenazi
    IWSDS 2023
  5. Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning
    Prakhar Gupta, Cathy Jiao, Yi-Ting Yeh, Shikib Mehri, Maxine Eskenazi, and Jeffrey P. Bigham
    EMNLP 2022
  6. Improving compositional generalization for multi-step quantitative reasoning in question answering
    Armineh Nourbakhsh, Cathy Jiao, Sameena Shah, and Carolyn Rosé
    EMNLP 2022
  7. The DialPort tools
    Jessica Huynh*, Shikib Mehri*, Cathy Jiao*, and Maxine Eskenazi
    SIGDIAL 2022
  8. ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
    Ye Won Byun*, Cathy Jiao*, Shahriar Noroozizadeh*, Jimin Sun*, and Rosa Vitiello*
    CVPR 2022, Embodied AI Workshop