“…This puts a major imperative on obtaining high-quality crowdsourced human judgments. Previous research employing crowdsourced judgments has focused on metrics including ease of answering, information flow, and coherence (Li et al., 2016; Dziri et al., 2018), naturalness (Asghar et al., 2018), interestingness (Asghar et al., 2017; Santhanam and Shaikh, 2019), fluency or readability (Zhang et al., 2018), and engagement (Venkatesh et al., 2018). While experimental designs primarily use Likert scales, Belz and Kow (2010) argue that discrete scales, such as Likert scales, can be unintuitive, and that some individuals may avoid extreme values in their judgments.…”
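The central-tendency concern raised by Belz and Kow (2010) is easy to see in a toy simulation. Below is a minimal, purely illustrative Python sketch (the function names, the `bias` and `noise` parameters, and the uniform latent-quality model are all hypothetical assumptions, not part of any cited work): annotators are modeled as shrinking their judgments toward the scale midpoint, so a 5-point Likert scale collects far fewer extreme ratings than the underlying quality distribution would warrant, while a continuous slider preserves the spread.

```python
import random

def likert_response(quality, bias=0.3):
    """Map a latent quality score in [0, 1] to a 1-5 Likert rating.

    `bias` shrinks the score toward the midpoint, simulating
    annotators who avoid the endpoints of a discrete scale.
    """
    shrunk = 0.5 + (quality - 0.5) * (1.0 - bias)
    return min(5, int(shrunk * 5) + 1)

def continuous_response(quality, noise=0.05):
    """Rate the same latent score on a continuous [0, 1] slider."""
    return max(0.0, min(1.0, quality + random.uniform(-noise, noise)))

random.seed(0)
qualities = [random.random() for _ in range(10_000)]
ratings = [likert_response(q) for q in qualities]

# A uniform latent distribution should yield ~2,000 ratings per point,
# but the simulated central-tendency bias starves the endpoints (1 and 5).
print({point: ratings.count(point) for point in range(1, 6)})
```

Under these assumptions, the endpoint categories receive only about a third of the ratings they would get from unbiased annotators, which is one way the distortion attributed to discrete scales could surface in collected judgments.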