Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Paper Explained)
#ai #attention #transformer #deeplearning
Transformers are famous for two things: their superior performance and their insane compute and memory requirements. This paper reformulates the attention mechanism in terms of kernel functions and obtains a linear formulation, which reduces these requirements. Surprisingly, this formulation also surfaces an interesting connection between autoregressive transformers and RNNs.
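To make the kernel trick concrete: if the softmax similarity is replaced by a feature-map dot product sim(q, k) = φ(q)·φ(k), the sums over keys can be precomputed once, turning the O(n²) attention matrix into O(n) work. A minimal numpy sketch, using the φ(x) = elu(x) + 1 feature map from the paper (function names here are my own, not from any released library):

```python
import numpy as np

def elu_feature_map(x):
    # φ(x) = elu(x) + 1; elu(x) = x for x > 0, exp(x) - 1 otherwise
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K: (n, d) queries/keys; V: (n, d_v) values
    Qf = elu_feature_map(Q)            # (n, d)
    Kf = elu_feature_map(K)            # (n, d)
    KV = Kf.T @ V                      # (d, d_v), shared across all queries
    Z = Qf @ Kf.sum(axis=0)            # (n,) per-query normalizer
    return (Qf @ KV) / Z[:, None]      # (n, d_v), computed in O(n) not O(n^2)
```

The key point is that `KV` and `Kf.sum(axis=0)` are computed once and reused for every query, which is exactly the state an RNN would carry; the output matches the quadratic form φ(Q)φ(K)ᵀV with row-wise normalization.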
OUTLINE:
0:00 - Intro & Overview
1:35 - Softmax Attention & Transformers
8:40 - Quadra