Large vision-language models (LVLMs) have recently achieved rapid progress, exhibiting great perception and reasoning abilities concerning visual information. However, when faced with prompts whose solution spaces differ in size, LVLMs do not always give consistent answers regarding the same knowledge point. This inconsistency of answers across different solution spaces is prevalent in LVLMs and erodes trust. To this end, we provide a multi-modal benchmark, ConBench, to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point. Based on the ConBench tool, we are the first to reveal the tapestry and obtain the following findings: (1) In the discriminative realm, the larger the solution space of the prompt, the lower the accuracy of the answers. (2) We establish the relationship between the discriminative and generative realms: the accuracy of a discriminative question type exhibits a strong positive correlation with its Consistency with the caption. (3) Compared to open-source models, closed-source models exhibit a pronounced advantage in terms of Consistency. Eventually, we ameliorate the consistency of LVLMs by trigger-based diagnostic refinement, indirectly improving their captioning performance. We hope this paper will help the research community better evaluate their models and encourage future advancements in the consistency domain.
Introduction

Recently, benefiting from notable advancements in large language models (LLMs) [1; 25; 2], the realm of large vision-language models (LVLMs) has undergone a revolutionary transformation. These novel LVLMs [18; 24; 3; 8; 15; 13] try to combine visual signals with textual semantics and spark cognitive brilliance across modalities. Although LVLMs can generate high-quality responses to task prompts, we discover that, for correctly answered cases, simply modifying the prompt can lead LVLMs to provide contradictory responses. In Figure 1 (a.2), LLaVA-7B [18] properly describes the picture as "It is a man wearing a dinosaur costume.", but when prompted with "Is the dinosaur played by humans? Please answer yes or no.", it responds with "No, they are dinosaurs". This phenomenon of Inconsistency is widely observed across mainstream LVLMs, and a preliminary study has been conducted only on LLMs [14]. In practice, in contrast to the fixed question patterns designed in existing multimodal benchmarks, users tend to pose questions in arbitrary ways. Therefore, it is necessary to ensure that LVLMs predict correct and consistent answers, even when faced with queries in various formats.

However, there are currently no benchmarks or research studies that specifically focus on evaluating the Consistency of LVLM responses. Such single-prompt evaluation approaches [12; 10; 28; 21; 6] lead to a disconnect between benchmark accuracy and real-world user experience.