I still need to read the article carefully, but sparse attention is an interesting technique that has been used before (as in BigBird), though it has often performed (much) worse than full attention.
The sliding-window component that performs full attention locally is indeed useful (much like the Blockwise Parallel Transformer), but the sparse patterns are the part that doesn't intuitively resonate with me.
The model might select random words from the context, which could easily be unfortunate if it ends up picking irrelevant ones.
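To make concrete what I mean by that, here's a minimal sketch (not the article's implementation) of a BigBird-style sparse attention mask that combines a local sliding window with a few randomly chosen keys per query; the window size and random-key count are illustrative values I picked, not anything from the paper:

```python
import numpy as np

def sparse_attention_mask(seq_len, window=4, n_random=2, seed=0):
    """Boolean mask: True where a query position may attend to a key position."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for q in range(seq_len):
        # Local sliding window: each query attends to nearby keys.
        lo, hi = max(0, q - window), min(seq_len, q + window + 1)
        mask[q, lo:hi] = True
        # Random keys: this is the part that can land on irrelevant words.
        mask[q, rng.choice(seq_len, size=n_random, replace=False)] = True
    return mask

print(sparse_attention_mask(8).astype(int))
```

The random keys give some long-range connectivity at low cost, but there's nothing steering them toward the tokens that actually matter, which is exactly my concern.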
The graph on the first page, in my opinion, seems like a needless flex.