Recognizing Textual Entailment (RTE) is one of the most fundamental tasks for natural language processing applications such as question answering and machine translation. One of the main challenges in logic-based approaches to this task is the lack of background knowledge. This study proposes a logical inference system that injects phrasal knowledge by comparing the visual representations of phrases, based on the intuition that visual representations help people judge entailment relations. First, we obtain candidate phrase pairs for phrasal knowledge from logical inference. Second, using a vision-and-language model, we acquire the visual representations of these phrases in the form of images or embedding vectors. Finally, we compare the obtained visual representations to determine whether to inject the knowledge corresponding to each candidate pair. In addition to the simple similarity between phrases, we also consider asymmetric relations when comparing visual representations. Our logical inference system improved accuracy on the SICK dataset compared with a previous logical inference system, SPSA (Selector of Predicates with Shared Arguments). Moreover, our asymmetric evaluation functions using vision-and-language models are effective at capturing the entailment relations of word pairs in HyperLex.

INDEX TERMS Natural language processing, recognizing textual entailment, vision and language.
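To make the comparison step concrete, the following is a minimal sketch, not the authors' exact pipeline: it encodes a candidate phrase pair with a CLIP-style vision-and-language model and compares the embeddings with a symmetric cosine score and an illustrative asymmetric score over sets of image embeddings. The checkpoint name, the directed average-max scoring function, and any thresholds are assumptions made only for illustration.

```python
# Sketch of comparing the visual representations of a candidate phrase pair.
# Assumptions: the "openai/clip-vit-base-patch32" checkpoint and the directed
# average-max score below are illustrative, not the paper's exact method.
import torch
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)


def embed_phrases(phrases):
    """Return L2-normalized CLIP text embeddings for a list of phrases."""
    inputs = processor(text=phrases, return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(emb, dim=-1)


def symmetric_similarity(premise, hypothesis):
    """Cosine similarity between two phrase embeddings (order-independent)."""
    e_p, e_h = embed_phrases([premise, hypothesis])
    return float(e_p @ e_h)


def directed_score(premise_img_embs, hypothesis_img_embs):
    """Illustrative asymmetric score over two sets of normalized image
    embeddings (e.g., from model.get_image_features on images retrieved or
    generated for each phrase): match every premise image to its closest
    hypothesis image and average, so score(p -> h) != score(h -> p)."""
    sims = premise_img_embs @ hypothesis_img_embs.T  # pairwise cosines
    return float(sims.max(dim=1).values.mean())


if __name__ == "__main__":
    # A candidate phrase pair of the kind produced by the logical inference step.
    print(symmetric_similarity("a man is riding a horse",
                               "a person is riding an animal"))
```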