“…Nevertheless, pure text embeddings perform consistently best for training classes [30,31,32,33,35] in object detection. The projection from visual to semantic space is done by a linear layer [30,37,35], a single [31,34,33] or two-layer MLP [32], and learned with max-margin losses [30,31,38,37], a softplus-margin focal loss [35], or a cross-entropy loss [33,32]. Zhang et.…”
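To make the described projection concrete, below is a minimal sketch (assuming PyTorch) of the two-layer MLP variant: region features are projected into the text-embedding space and scored by similarity against fixed class text embeddings, trained here with a cross-entropy loss, which is only one of the loss choices listed above. All names, dimensions, and the temperature value are illustrative assumptions, not taken from any of the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualToSemanticHead(nn.Module):
    """Projects visual region features into the text-embedding (semantic) space."""
    def __init__(self, visual_dim=1024, text_dim=512, hidden_dim=1024):
        super().__init__()
        # Two-layer MLP projection; a single nn.Linear would give the linear-layer variant.
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, region_feats, class_text_embeds):
        # region_feats: (N, visual_dim) features of detected regions
        # class_text_embeds: (C, text_dim) frozen text embeddings of class names
        v = F.normalize(self.proj(region_feats), dim=-1)
        t = F.normalize(class_text_embeds, dim=-1)
        return v @ t.t()  # (N, C) cosine-similarity logits

# Training step with cross-entropy over the similarity logits (illustrative values).
head = VisualToSemanticHead()
feats = torch.randn(8, 1024)           # dummy region features
text_embeds = torch.randn(20, 512)     # dummy class text embeddings
labels = torch.randint(0, 20, (8,))    # ground-truth class indices
logits = head(feats, text_embeds)
loss = F.cross_entropy(logits / 0.07, labels)  # temperature-scaled cross-entropy
loss.backward()
```

A max-margin or softplus-margin focal loss, as cited above, would replace only the final loss computation; the projection head itself stays the same.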