
> so memory is not used

Not used for more than the duration of inference, but definitely used during inference.



If you generate only a single timestep, then when recomputing during inference you can go layer by layer: you don't need to preserve the features of the previous layers, because each layer depends only on the one immediately below it. So your memory needs don't depend on the number of layers.
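
A minimal sketch of that first situation, assuming PyTorch and a toy stack of encoder layers (all sizes here are made up): the whole prompt goes through one forward pass, activations are overwritten layer by layer, and only the last position of the final layer is kept for the single next-token prediction. Peak memory is roughly one layer's worth of activations, independent of the layer count.

    import torch
    import torch.nn as nn

    d_model, n_layers, seq_len = 64, 12, 128
    layers = nn.ModuleList([
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        for _ in range(n_layers)
    ])

    x = torch.randn(1, seq_len, d_model)    # the whole prompt at once
    with torch.no_grad():                   # inference only, no autograd graph kept
        h = x
        for layer in layers:                # overwrite h layer by layer;
            h = layer(h)                    # earlier layers' features can be freed
    last_features = h[:, -1]                # all that's needed for one next token
    # (causal masking is omitted to keep the sketch short)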

But in a standard transformer architecture you typically generate multiple timesteps, feeding each output back as the input for the next timestep, so you preserve all the features to avoid recomputing them at every timestep. So your memory once again depends on the number of layers in your network.
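
For reference, this is roughly what that preserved state looks like (commonly called a key/value cache); the numbers below are arbitrary. Each layer keeps its own key and value tensors for every past position, so the cache grows linearly with the number of layers as well as with sequence length.

    import torch

    n_layers, n_heads, d_head, max_len = 12, 4, 16, 1024
    kv_cache = [
        {
            "k": torch.zeros(1, n_heads, max_len, d_head),
            "v": torch.zeros(1, n_heads, max_len, d_head),
        }
        for _ in range(n_layers)
    ]
    # cache size ~ n_layers * 2 * n_heads * max_len * d_head floats
    print(sum(t.numel() for layer in kv_cache for t in layer.values()))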

But if you are memory constrained, you can modify the architecture a little (and the training procedure) to put yourself back in the first situation, where you only ever handle a single timestep: use the transformer to extract a fixed-size context vector per layer summarizing all of the past (including your most recent input prompt), and use another transformer to generate the words in sequence conditioned on that context vector.
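
Here is a rough sketch of how I read that idea (module names and shapes are my own, not a reference implementation): one transformer compresses the whole past into a fixed-size context vector per layer, and a second transformer cross-attends to that small memory while generating, so decoding memory no longer grows with the length of the past.

    import torch
    import torch.nn as nn

    d_model, n_layers = 64, 6

    class ContextCompressor(nn.Module):
        """Summarize the past into one fixed-size vector per layer."""
        def __init__(self):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
                for _ in range(n_layers)
            ])

        def forward(self, past):                  # past: (1, past_len, d_model)
            contexts, h = [], past
            for layer in self.layers:
                h = layer(h)
                contexts.append(h[:, -1])         # fixed-size summary of this layer
            return torch.stack(contexts, dim=1)   # (1, n_layers, d_model)

    # A second transformer generates tokens while attending only to the
    # compressed context, whose size does not depend on past_len:
    dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
    decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)

    past = torch.randn(1, 512, d_model)           # arbitrarily long history
    memory = ContextCompressor()(past)            # fixed-size context
    tgt = torch.randn(1, 1, d_model)              # the timestep being generated
    out = decoder(tgt, memory)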



