Nomenclature
Σ    Covariance matrix
G    Gram/kernel matrix
k(•)    Kernel function
P(•)    Probability density
P(•)    Token mixing process
Re(•)    Function that extracts the real component of a complex number
a_i    Element at ith position of column vector a
A*_{:j}    Column vector in jth row of A
A_{i,j}    Element in ith row, jth column of A
[…]    matrix of the embedding dimension
F_s (L×L)    Vandermonde matrix of the sequence dimension
W    Weight matrix learned with element-wise non-linearity (e.g., ReLU, GELU)
W_C (L×L)    Weight matrix of a single convolution kernel
W_K (D×N)    Weight matrix of attention key (for self-attention, N = M)
W_Q (D×M)    Weight matrix of attention query
W_V (D×M)    Weight matrix of attention value
X    Resulting tokens with inductive bias introduced into X
X (L×D)    Input sequence of length L and embedding dimension D, where L […] D
*    Correspondence to