Enhancing Speech Recognition Decoding via Layer Aggregation

Wullach, Tomer; Chazan, Shlomo E.

doi:10.48550/arxiv.2203.11325

Cited by 1 publication

(1 citation statement)

References 0 publications


“…Multi-layer feature utilization has been demonstrated to be an effective method of making full use of the information contained in different layers of the model to improve the representation and generalization capabilities of computer vision [23,30,42,44,50,59,81,86,92], natural language processing [1,3,11,12,26,53,64,69,76,79] and multi-modal models [13,49].…”

Section: Multi-layer Feature Utilization (mentioning)

confidence: 99%

BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning

Xu, Wu, Rosenman et al. 2022

Preprint


Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a cross-modal encoder, or feed the last-layer uni-modal features directly into the top cross-modal encoder, ignoring the semantic information at the different levels in the deep uni-modal encoders. Both approaches possibly restrict vision-language representation learning and limit model performance. In this paper, we introduce multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables comprehensive bottom-up interactions between visual and textual representations at different semantic levels, resulting in more effective cross-modal alignment and fusion. Our proposed BridgeTower, pre-trained with only 4M images, achieves state-of-the-art performance on various downstream vision-language tasks. On the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art METER model by 1.09% with the same pretraining data and almost no additional parameters and computational cost. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code is available at https://github.com/microsoft/BridgeTower.
