“…A schema of the complete system is shown in Figure 1 on page 1. In contrast to previous works such as [6,17,19,24], which build on separate image and text models, we start from the hypothesis of a common embedding of images and text, realized by CLIP. As shown in [22], similar concepts expressed in text and images tend to share similar features, or at least lie "near" each other in the common space.…”
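
To make the notion of being "near" in a common space concrete, the following is a minimal sketch that assumes nearness is measured by cosine similarity, which is the usual choice for CLIP-style embeddings but is not stated in the excerpt. The vectors are hand-written toy values, not CLIP outputs, and the names `cosine_similarity` and `nearest_caption` are hypothetical helpers introduced only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_caption(image_vec, caption_vecs):
    """Return the caption whose embedding is most similar to the image embedding."""
    return max(
        caption_vecs,
        key=lambda caption: cosine_similarity(image_vec, caption_vecs[caption]),
    )


# Toy 3-dimensional embeddings (assumption: real CLIP embeddings have
# hundreds of dimensions and come from the trained encoders).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
}

best = nearest_caption(image_embedding, text_embeddings)
print(best)
```

Because images and text live in one space under this hypothesis, a single similarity function can compare an image against any caption without a separate cross-modal alignment step.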