“…Similar to our approach, several recent methods achieve computational efficiency by skipping computation on a subset of input tokens. However, the selection mechanisms differ considerably, including pooling (Nawrot et al., 2022), token merging (Bolya et al., 2023), learned sigmoid gates (Bapna et al., 2020), and early exiting (Schuster et al., 2022). CODA instead introduces a differentiable router, which improves trainability and model performance, and additionally tackles the problem of large-model adaptation.…”
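To make the idea of a differentiable router concrete, the sketch below shows one common way to route a subset of tokens while keeping the selection trainable: score every token, hard-select the top-k for the expensive computation, and scale the kept tokens by a soft gate so gradients flow back into the scorer. The class name `SoftTopKRouter` and the sigmoid-weighted top-k gating are illustrative assumptions for this sketch, not CODA's actual router, whose soft top-k relaxation differs in its details.

```python
import torch
import torch.nn as nn


class SoftTopKRouter(nn.Module):
    """Minimal sketch of a differentiable token router (hypothetical,
    not CODA's implementation): keep the top-k scored tokens and gate
    them by their sigmoid scores so the scorer receives gradients."""

    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)  # per-token routing score
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model)
        logits = self.score(x).squeeze(-1)        # (batch, seq_len)
        topk = logits.topk(self.k, dim=-1)        # hard choice of k tokens
        gates = torch.sigmoid(topk.values)        # soft gate -> differentiable
        selected = torch.gather(
            x, 1, topk.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))
        )                                          # (batch, k, d_model)
        # Downstream heavy layers run only on `selected`; gating by
        # `gates` lets training adjust which tokens are worth keeping.
        return selected * gates.unsqueeze(-1), topk.indices


# Usage: route 16 of 128 tokens through the expensive path.
router = SoftTopKRouter(d_model=512, k=16)
x = torch.randn(2, 128, 512)
kept, kept_idx = router(x)  # kept: (2, 16, 512)
```

The design point this illustrates is the one the excerpt credits to CODA: because the gate values are continuous, the router can be trained end-to-end with the rest of the model, unlike purely hard selection schemes.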