2017 IEEE Conference on Visual Analytics Science and Technology (VAST)
DOI: 10.1109/vast.2017.8585552
Visual Analysis for Wildlife Preserve based on Muti-systems

Cited by 1 publication (2 citation statements)
References 3 publications
“…One line of research aims to augment a pre-trained language model with visual modality by learning a mapping from an external visual encoder to the frozen language model [2,20,26], demonstrating a zero-shot capability to perform multi-modal tasks leveraging knowledge stored in the language model. Recently, PaLI [6] fine-tuned on OK-VQA achieved SOTA performance and outperformed previous retrieval-based methods [11,27] by a substantial margin. However, our results on the INFOSEEK benchmark show that PaLI falls significantly behind pipeline systems with access to a knowledge base (KB), particularly on Time and Numerical questions.…”
Section: Related Work
confidence: 97%
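As a rough illustration of the first approach quoted above (learning a mapping from an external visual encoder into a frozen language model), the PyTorch sketch below projects pooled visual features into the language model's embedding space as a short "visual prefix". All names, dimensions, and tensors here are placeholder assumptions for illustration, not the architecture of any cited system.

```python
import torch
import torch.nn as nn

class VisualPrefixMapper(nn.Module):
    """Projects pooled features from a frozen visual encoder into the
    embedding space of a frozen language model as a short "visual prefix".
    All dimensions are placeholders, not those of any cited model."""

    def __init__(self, vis_dim: int = 768, lm_dim: int = 1024, prefix_len: int = 8):
        super().__init__()
        # Only this projection is trained; the visual encoder and the LM stay frozen.
        self.proj = nn.Linear(vis_dim, lm_dim * prefix_len)
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, vis_dim) pooled output of the frozen visual encoder
        prefix = self.proj(image_features)                    # (batch, lm_dim * prefix_len)
        return prefix.view(-1, self.prefix_len, self.lm_dim)  # (batch, prefix_len, lm_dim)


# Usage: prepend the projected visual prefix to the text token embeddings
# before running the frozen language model over the combined sequence.
mapper = VisualPrefixMapper()
img_feats = torch.randn(2, 768)          # stand-in for frozen visual-encoder outputs
text_embeds = torch.randn(2, 16, 1024)   # stand-in for the frozen LM's token embeddings
lm_input = torch.cat([mapper(img_feats), text_embeds], dim=1)  # (2, 8 + 16, 1024)
```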
“…Various approaches have been proposed to address knowledge-based VQA tasks [33] by incorporating external knowledge into visionlanguage models. One approach is to retrieve information from an external knowledge base [16,32,53] and employ fusion-in-decoder (FID) [17] to perform language QA [11,27]. Another approach is to transform the image into a text caption and use an LLM [5,7] to answer questions [15,56].…”
Section: Related Work
confidence: 99%
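The second statement mentions fusion-in-decoder (FID) over retrieved knowledge. As a minimal, self-contained sketch of that idea (encode each question-plus-retrieved-passage pair independently, concatenate the encoder outputs, and let a single decoder attend over all of them), the toy model below uses placeholder vocabulary, dimensions, and random data; it is not the cited systems' retriever or reader, and a real training setup would also apply a causal mask to the decoder.

```python
import torch
import torch.nn as nn

class ToyFiD(nn.Module):
    """Fusion-in-decoder sketch: passages are encoded separately and fused
    only in the decoder's cross-attention."""

    def __init__(self, vocab: int = 1000, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, passages: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
        # passages: (batch, n_passages, seq_len) token ids, each passage already
        # concatenated with the question text by some upstream retriever.
        b, n, s = passages.shape
        enc = self.encoder(self.embed(passages.reshape(b * n, s)))  # encode each passage separately
        enc = enc.reshape(b, n * s, -1)                             # fuse: concatenate along sequence
        dec = self.decoder(self.embed(answer_ids), enc)             # decoder attends over all passages
        return self.lm_head(dec)


model = ToyFiD()
passages = torch.randint(0, 1000, (2, 3, 20))   # 2 examples, 3 retrieved passages each
answer_ids = torch.randint(0, 1000, (2, 5))
logits = model(passages, answer_ids)            # (2, 5, 1000)
```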