
My guess is that a tiny LSTM is less computationally expensive than a tiny transformer. Since the paper proposes replacing every weight with this tiny shared net, you want it to be as tiny and performant as possible.
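For a ballpark sense of the sizes involved, here's a rough PyTorch sketch (the hidden size of 8 is an arbitrary "tiny" choice for illustration, not from the paper). Note the transformer block also pays an attention cost over the whole sequence on every call, while the LSTM cell is a fixed-size recurrent update:

    import torch.nn as nn

    def n_params(m):
        return sum(p.numel() for p in m.parameters())

    # a "tiny" recurrent cell vs. a comparably tiny transformer block
    tiny_lstm = nn.LSTMCell(input_size=8, hidden_size=8)
    tiny_transformer = nn.TransformerEncoderLayer(
        d_model=8, nhead=1, dim_feedforward=16, batch_first=True)

    print(n_params(tiny_lstm))         # 576 parameters
    print(n_params(tiny_transformer))  # 600 parameters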


Transformers are more complicated than RNNs and require more fine-tuning. I'm guessing the RNNs were used to simplify the problem. I'm not even sure transformers would work here, given where they're dropping them into the process.


If you want it tiny and performant and choose an LSTM over transformers, why not skip the LSTM and use GRUs?
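Back-of-the-envelope (again a PyTorch sketch with an illustrative hidden size of 8): a GRU cell has three gates to the LSTM's four, so at the same hidden size it carries roughly 25% fewer parameters:

    import torch.nn as nn

    def n_params(m):
        return sum(p.numel() for p in m.parameters())

    print(n_params(nn.LSTMCell(8, 8)))  # 576
    print(n_params(nn.GRUCell(8, 8)))   # 432, ~25% smaller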



