
I need to read the article carefully, but sparse attention is an interesting technique that has been used before (as in BigBird) and has often proved to perform (way) worse than full attention. The sliding component that performs full attention is indeed useful (much like the Blockwise Parallel Transformer), but the sparse patterns are the part that doesn't intuitively resonate with me.

The model might select random words in the context. That could be unfortunate if it ends up attending to irrelevant words.
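To make the concern concrete, here is a minimal sketch (not from the paper; the function name and parameters are my own) of a BigBird-style sparse mask: each query gets a sliding window of nearby keys plus a few randomly sampled keys, and nothing guarantees the random picks are relevant.

```python
import numpy as np

def sparse_attention_mask(seq_len, window=3, n_random=2, seed=0):
    """Boolean mask: True where query q may attend to key k.

    Combines a sliding-window component (full attention over
    nearby tokens) with a random component (a handful of keys
    sampled uniformly, which may well be irrelevant)."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for q in range(seq_len):
        # sliding-window component: attend to the 2*window+1 nearest tokens
        lo, hi = max(0, q - window), min(seq_len, q + window + 1)
        mask[q, lo:hi] = True
        # random component: uniformly sampled keys, relevance not guaranteed
        mask[q, rng.choice(seq_len, size=n_random, replace=False)] = True
    return mask

mask = sparse_attention_mask(16)
# Each query attends to far fewer than all 16 keys, which is the
# whole point of sparsity, and also the source of the worry above.
print(mask.mean())
```

Full attention would make `mask.mean()` equal to 1.0; here it stays well below that, so the quality question is whether the window plus a few random keys capture what actually matters.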

The graph on the first page, in my opinion, seems like a needless flex



> The graph on the first page, in my opinion, seems like a needless flex

Indeed - they used half the first page of their paper to show a chart that illustrates... nothing...



