“…Despite this success, most previous systems are developed with, and evaluated on, datasets that contain exclusively single-hop questions (ones that require a single document or paragraph to answer) or two-hop ones. As a result, their design is often tailored exclusively to single-hop (e.g., Chen et al., 2017; Wang et al., 2018b) or multi-hop questions (e.g., Nie et al., 2019; Min et al., 2019; Feldman and El-Yaniv, 2019; Zhao et al., 2020a; Xiong et al., 2021). Even when the model is designed to work with both, it is often trained and evaluated on exclusively single-hop or multi-hop settings (e.g., Asai et al., 2020).…”