“…In recent years, many efforts ( Zhu et al, 2017 ), e.g., developing novel network architectures ( Murray et al, 2019 , Cheng et al, 2020 , Bi et al, 2020 , Niazmardi et al, 2017 , Lin et al, 2020 , Zhu et al, 2018 ) and pipelines ( Byju et al, 2000 , Xu et al, 2020 , Wang et al, 2019 , Zhu et al, 2019 ), publishing large-scale datasets ( Xia et al, 2017 , Jin et al, 2018 ), introducing multi-modal and multi-temporal data ( Hu et al, 2020 , Tuia et al, 2016 , Ru et al, 2020 , Li et al, 2020a ), have been deployed to address this task, and most of them treat it as a single-label classification problem. A common assumption shared by these researches is that an aerial image belongs to only one scene category, while in real-world scenarios, it is more often that there exist various scenes in a single image (cf.…”