Cathy Jiao


Email: cljiao@cs.cmu.edu

I am a PhD student at the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University, advised by Chenyan Xiong. Recently, I spent a wonderful summer in NYC at Spotify Research, hosted by Paul Bennett.

My research focuses on data-centric AI: designing methods/frameworks to better understand, curate, and evaluate the data for large language models. A central thread of my work is data attribution – identifying how individual data points shape model outputs – which I explore through efficient approximations and practical applications such as dataset curation and data valuation/pricing. More broadly, I aim to develop frameworks that make data usage more transparent, reliable, and impactful for both research and deployment.

Previously, I completed my master's degree at CMU LTI, where I was advised by Maxine Eskenazi and Aaron Steinfeld. Before grad school, I spent time in industry working on machine learning and deep learning applications for natural language processing. Prior to that, I graduated with distinction from the University of British Columbia with a B.S. in CS & Math.

In my spare time, I enjoy cooking and biking around Pittsburgh. If you work on similar topics or want to chat, feel free to reach out!

Selected Publications

(See Google Scholar for all)

* = equal contribution

  1. DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models
    Cathy Jiao*, Yijun Pan*, Emily Xiao*, Daisy Sheng, Niket Jain, Hanzhang Zhao, Ishita Dasgupta, Jiaqi W. Ma, and Chenyan Xiong
    NeurIPS 2025
  2. Fairshare Data Pricing for Large Language Models
    Luyang Zhang*, Cathy Jiao*, Beibei Li, and Chenyan Xiong
    NeurIPS 2025
  3. On the Feasibility of In-Context Probing for Data Attribution
    Cathy Jiao, Gary Gao, Aditi Raghunathan, and Chenyan Xiong
    NAACL 2025
  4. A Survey of Data Attribution: Methods, Applications, and Evaluation in the Era of Generative AI
    Junwei Deng, Yuzheng Hu, Pingbang Hu, Ting-Wei Li, Shixuan Liu, Jiachen T. Wang, Dan Ley, Qirun Dai, Benhao Huang, Jin Huang, and others
    2025
  5. Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation
    Jessica Huynh, Cathy Jiao, Prakhar Gupta, Shikib Mehri, Payal Bajaj, Vishrav Chaudhary, and Maxine Eskenazi
    IWSDS 2023
  6. Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning
    Prakhar Gupta, Cathy Jiao, Yi-Ting Yeh, Shikib Mehri, Maxine Eskenazi, and Jeffrey P. Bigham
    EMNLP 2022