RATE: A Reliability-Aware Tester-Based Evaluation Framework of User Simulators

Labhishetty, Sahiti; Zhai, ChengXiang

doi:10.1007/978-3-030-99736-6_23

Cited by 5 publications

(3 citation statements)

References 22 publications

“…Our evaluation methodology aims at letting the click model decide about the relative system performance that is known with high confidence or based on some reasonable heuristics. In the literature, this approach was recently introduced as the Tester-based approach [51,52]. The click model's system ranking is compared to the reference system ranking, and the rank correlation, determined by Kendall's τ, is an indicator of the simulation quality.…”

Section: Discussion (mentioning)

confidence: 99%

“…In order to compare the fidelity of user simulation, Labhishetty and Zhai [51,52] introduced the Tester-based approach. The key idea is based on the definition of Testers that are composed of single retrieval systems for which the relative retrieval effectiveness is known.…”

Section: User Simulations (mentioning)

confidence: 99%

“…In our experiments, we include two types of system rankings, and selecting them is motivated by the Tester-based approach by Labhishetty and Zhai [51,52]. According to them, a user simulator (in this study, it is the click model) can be validated by its ability to distinguish the retrieval performance of methods for which we know the relative system effectiveness with high confidence or based on heuristics.…”

Section: Experimental Systems (mentioning)

confidence: 99%

Validating Synthetic Usage Data in Living Lab Environments

Breuer, Fuhr, Schaer

2024

J. Data and Information Quality

Evaluating retrieval performance without editorial relevance judgments is challenging, but instead, user interactions can be used as relevance signals. Living labs offer a way for small-scale platforms to validate information retrieval systems with real users. If enough user interaction data is available, click models can be parameterized from historical sessions to evaluate systems before exposing users to experimental rankings. However, interaction data is sparse in living labs, and little is studied about how click models can be validated for reliable user simulations when click data is available in moderate amounts. This work introduces an evaluation approach for validating synthetic usage data generated by click models in data-sparse human-in-the-loop environments like living labs. We ground our methodology on the click model’s estimates about a system ranking compared to a reference ranking for which the relative performance is known. Our experiments compare different click models and their reliability and robustness as more session log data becomes available. In our setup, simple click models can reliably determine the relative system performance with already 20 logged sessions for 50 queries. In contrast, more complex click models require more session data for reliable estimates, but they are a better choice in simulated interleaving experiments when enough session data is available. While it is easier for click models to distinguish between more diverse systems, it is harder to reproduce the system ranking based on the same retrieval algorithm with different interpolation weights. Our setup is entirely open, and we share the code to reproduce the experiments.
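The comparison described in the citation statements above, where a click model's system ranking is checked against a reference ranking via a Kendall rank correlation, can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `kendall_tau`, the system names, and the choice of the tie-free tau-a variant are all assumptions made for the example.

```python
from itertools import combinations


def kendall_tau(reference, simulated):
    """Kendall's tau-a between two rankings of the same systems.

    Both arguments are lists of system names ordered best-first.
    Assumes no ties and identical sets of systems (an assumption of
    this sketch, not something stated in the source).
    """
    if len(reference) < 2:
        raise ValueError("need at least two systems to compare")
    if len(set(reference)) != len(reference) or set(reference) != set(simulated):
        raise ValueError("rankings must contain the same distinct systems")
    if len(simulated) != len(reference):
        raise ValueError("rankings must have the same length")

    # Position of each system in the simulated (click-model) ranking.
    pos = {system: rank for rank, system in enumerate(simulated)}

    concordant = discordant = 0
    # In every pair (a, b) produced here, a is ranked above b in the reference.
    for a, b in combinations(reference, 2):
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1

    n_pairs = len(reference) * (len(reference) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical system names: the reference order is assumed to be known,
# the simulated order is what a click model would produce.
reference_ranking = ["sys_a", "sys_b", "sys_c", "sys_d"]
simulated_ranking = ["sys_a", "sys_c", "sys_b", "sys_d"]
tau = kendall_tau(reference_ranking, simulated_ranking)
print(f"Kendall's tau = {tau:.3f}")
```

A value near 1 would suggest that the simulated ranking reproduces the reference order, while a value near -1 would suggest it reverses it. Real experiments would likely need a tie-aware variant.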
