“…Previous methods include affinity-based methods (Aksoy et al., 2018; Levin et al., 2007), sampling-based methods (He et al., 2011; Shahrian et al., 2013), and deep-learning-based methods (Hou & Liu, 2019; Liu et al., 2021a; Lu et al., 2019; Sun et al., 2021). Other methods rely on different auxiliary inputs, e.g., a background image (Lin et al., 2020; Sengupta et al., 2020), a coarse map (Dai et al., 2022; Yu et al., 2021), or even language descriptions (Li et al., 2023).…”