While I agree with the beginning of your post, you lost me here:
> The bitter lesson [1] is going to eventually come for all of these. Eventually we'll figure out how to machine-learn the heuristic rather than hard code it.
Finding the patterns that get inefficiently re-learned over and over, and encoding them explicitly as smart inductive biases for better sample efficiency, is what ML research is.
The "bitter lesson" doesn't mean "throw the towel and your brain and just buy more GPUs". It means that the inductive biases / modeling strategies that win will always be the ones that are more hardware-friendly.
I agree with you that learning certain things is wasteful.
For instance, one could imagine an RNN that learned to do some approximation of tree search for playing games like Chess and Go. But we have very good reason to think that tree search is basically exactly what you want, so even systems like AlphaGo implement the tree search outside the neural net, while still using a learned model to heuristically guide that search.
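For concreteness, here's a minimal sketch (not AlphaGo's actual code) of what "a learned model guiding the search" looks like: a PUCT-style selection rule where a policy network's prior steers which branch of the hand-coded tree search to expand next. The `Node` class, the `c_puct` value, and the `+ 1` smoothing are illustrative assumptions.

```python
import math

class Node:
    """One state in the search tree."""
    def __init__(self, prior):
        self.prior = prior        # probability the policy net assigns to the move leading here
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}        # move -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.5):
    """PUCT-style selection: prefer moves with a high running value estimate,
    but explore moves the learned policy likes that have few visits so far."""
    total = sum(child.visit_count for child in node.children.values())
    def score(child):
        exploration = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visit_count)
        return child.value() + exploration
    return max(node.children.items(), key=lambda kv: score(kv[1]))
```

The learned parts are only the prior (and, in full systems, the leaf value estimate); the search procedure itself stays hand-coded, which is exactly the division of labor being described.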
The reference to the bitter lesson here is that feature engineering has, thus far, typically lost out to more general end-to-end methods in the long run.
This paper tries to do feature engineering by hand-coding an exponentially decaying mechanism, where tokens further in the past are assumed to be less important.
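The paper isn't named here, so purely as an illustration of what such a hand-coded decay might look like: causal attention where a token at distance d in the past has its weight multiplied by gamma**d. The value of gamma and the multiplicative form are my assumptions, not necessarily what the paper does.

```python
import numpy as np

def decayed_causal_attention(q, k, v, gamma=0.95):
    """Single-head causal attention with a hand-coded exponential decay:
    a token d positions in the past gets its attention weight scaled by gamma**d.
    q, k, v have shape (seq_len, dim). gamma is an illustrative choice."""
    seq_len, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)                                   # (seq_len, seq_len) logits
    dist = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]  # dist[i, j] = i - j
    causal = dist >= 0
    # adding d * log(gamma) to a logit multiplies the softmax weight by gamma**d
    scores = np.where(causal, scores + dist * np.log(gamma), -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Contrast this with letting the model learn end-to-end how much to attend to distant tokens; the claim in this thread is that the learned version tends to win out as scale grows.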
My comment is that this type of hand-engineering will lose out to methods that are more end-to-end learned. These methods do not necessarily need to be hugely computationally intensive ("buy more GPUs").
That said, I could see it being the case that in the short term we do just buy more GPUs and learn a general end-to-end algorithm, and only later figure out how to re-implement that learned algorithm in code significantly more efficiently.
By and large, we don't really know what inductive biases we ought to be shoving into models. Sometimes we think we do, but we're wrong more often than not. So methods with the fewest inductive biases work better.