“…• Fusion: returns a linear aggregation of the intermediate representations of a frozen model for each token and task, i.e., $\mathcal{A}(B) = \sum_{l=1}^{L} (A \odot B)_{l,:,:}$, where $A$ is an $L \times T$ attention matrix and $A \odot B$ is the Hadamard product between $A$ and every feature view of $B$. The attention matrix $A$ is based on a $d$-dimensional task-specific attention vector $q$ learned during training, i.e., $A_{l,t} = \frac{\exp(q \cdot B_{l,t,:})}{\sum_{l'=1}^{L} \exp(q \cdot B_{l',t,:})}$ (Cao et al., 2022).…”
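A minimal NumPy sketch of this fusion step, assuming hypothetical shapes ($L$ layers, $T$ tokens, $d$ hidden dimensions) and using $q$ as the name of the learned task-specific attention vector; in practice $q$ is trained, whereas here it is randomly initialized for illustration:

```python
import numpy as np

# Hypothetical sizes: L frozen-model layers, T tokens, d hidden dimensions.
L, T, d = 4, 5, 8
rng = np.random.default_rng(0)

B = rng.normal(size=(L, T, d))  # intermediate representations B[l, t, :]
q = rng.normal(size=d)          # task-specific attention vector (learned during training)

# A[l, t] = softmax over layers l of q . B[l, t, :]
scores = B @ q                                            # shape (L, T)
A = np.exp(scores - scores.max(axis=0, keepdims=True))    # subtract max for stability
A /= A.sum(axis=0, keepdims=True)                         # each token's column sums to 1

# Fusion: per-token weighted sum of layer representations,
# i.e. sum over l of (A ⊙ B)[l, :, :], with A broadcast over the feature axis.
fused = (A[:, :, None] * B).sum(axis=0)                   # shape (T, d)

print(fused.shape)
```

The max-subtraction before the exponential does not change the softmax result; it only avoids numerical overflow for large scores.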