“…For instance, test inputs may not be sufficiently representative of real-world settings [53,72], and performance metrics may not align with users' preferences and perceptions of ideal model performance [47,53,66]. To address this gap, a growing body of work in HCI aims to design performance evaluations grounded in downstream deployment contexts and in the needs and goals of downstream stakeholders (e.g., [18,57,80,81]). This work typically involves exploring users' domain-specific information needs [19,46], working directly with downstream stakeholders to co-design evaluation datasets and metrics [80], and building tools that allow users to specify their own test datasets and performance metrics [18,27,28,55,81].…”