“…In the literature, numerous works have shown that understanding objects' attributes can greatly facilitate object recognition and detection, even with few or no examples of the visual objects [6,18,25,43,53]. For example, Farhadi et al. proposed to shift the goal of object recognition from 'naming' to 'description', which allows not only naming familiar objects with attributes but also saying something about unfamiliar objects ("hairy and four-legged", not just "unknown") [6]; Lampert et al. considered open-set object recognition, which aims to recognise objects from a human-specified high-level description, e.g., arbitrary semantic attributes such as shape, color, or even geographic information, instead of training images [18]. However, the problem considered in these seminal works tends to be a simplification by today's standards. For example, attribute classifiers are often trained and evaluated on object-centric images under the closed-set scenario, i.e., assuming that bounding boxes or segmentation masks are given [13,29,38], or sometimes even that the object category is known as a prior [26,29].…”