Visio-Linguistic Brain Encoding
2022 · Preprint
DOI: 10.48550/arxiv.2204.08261

Cited by 1 publication (1 citation statement) · References: 0 publications
“…Further exploration shows that V2 and V3 are better predicted by multimodal models, with higher layers in image transformers (i.e., unimodal) correlating with late visual areas and vice versa (Oota et al., 2022). It is also suggested that CLIP’s learning process may capture how abstract concepts penetrate early vision in a top-down manner.…”
Section: Introduction · Citation type: mentioning · Confidence: 99%
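The statement above refers to a layer-wise brain-encoding comparison: features from each layer of a vision or vision-language model are used to predict fMRI responses in visual areas, and prediction accuracy is compared across layers and regions. The sketch below illustrates that general analysis pattern, assuming ridge regression as the encoding model; all data shapes, layer names, and ROI labels are hypothetical placeholders, not the setup used by Oota et al. (2022).

```python
# Minimal sketch of a layer-wise voxelwise encoding analysis, assuming
# ridge regression. All arrays here are random placeholders standing in
# for real stimulus features and fMRI responses.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def encoding_score(layer_features, voxel_responses, alphas=(0.1, 1.0, 10.0, 100.0)):
    """Predict every voxel from one layer's features; return the mean
    held-out Pearson correlation across voxels (higher = this layer
    predicts the ROI better)."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        layer_features, voxel_responses, test_size=0.2, random_state=0)
    model = RidgeCV(alphas=alphas).fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)
    rs = [pearsonr(Y_hat[:, v], Y_te[:, v])[0] for v in range(Y_te.shape[1])]
    return float(np.mean(rs))

# Hypothetical data: 200 stimuli, 12 transformer layers of 768-d features,
# and 50-voxel responses for two early visual ROIs.
rng = np.random.default_rng(0)
features_by_layer = {f"layer_{i}": rng.standard_normal((200, 768)) for i in range(12)}
responses_by_roi = {"V2": rng.standard_normal((200, 50)),
                    "V3": rng.standard_normal((200, 50))}

for roi, Y in responses_by_roi.items():
    scores = {name: encoding_score(X, Y) for name, X in features_by_layer.items()}
    best = max(scores, key=scores.get)
    print(roi, "best predicted by", best, f"(r = {scores[best]:.3f})")
```

Running the same comparison with features from a unimodal image transformer versus a multimodal model such as CLIP, and plotting the best-predicting layer per ROI, is the kind of evidence behind the layer-to-area correspondence the citation describes.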