The sequence of model activations is being compressed. S4 treats each activation channel as an independent sequence, applies a learned analogue of the Laplace transform, and drops the less significant components.
This is similar to the basic compression you get with PCA or Fourier transforms. Those transforms are fully invertible until you drop the less significant components; once you do, you can only reconstruct a degraded version of the input, but the transform makes it easy to pick which components are safe to drop.
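To make that concrete, here is a minimal sketch of transform-then-truncate compression using an FFT. The toy signal, the choice of transform, and the number of kept coefficients are illustrative assumptions, not anything taken from S4 itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 1-D "activation channel": a smooth signal plus a little noise.
t = np.linspace(0, 1, 256)
x = (np.sin(2 * np.pi * 3 * t)
     + 0.5 * np.sin(2 * np.pi * 7 * t)
     + 0.05 * rng.standard_normal(t.size))

# Forward transform: on its own this is fully invertible.
X = np.fft.rfft(x)

# Keep only the k largest-magnitude coefficients; zero out the rest.
k = 8
keep = np.argsort(np.abs(X))[-k:]
X_truncated = np.zeros_like(X)
X_truncated[keep] = X[keep]

# Inverse transform now reconstructs a degraded (but close) version of x.
x_hat = np.fft.irfft(X_truncated, n=x.size)

print(f"kept {k} of {X.size} coefficients")
print("relative reconstruction error:",
      np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

The same pattern applies with PCA: project onto the principal components, discard the low-variance ones, and project back to get an approximate reconstruction.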