Note that this is similar to how, for example, MPEG handles intermediate frames and motion vectors. It first encodes a full frame using essentially regular JPEG; then, for the following frames, it does motion estimation by splitting the image into 8x8 blocks and, for each block, searching the previous frame for the position that best matches it. The difference in position is called the motion vector for that block.
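Here is a minimal sketch of that block-matching step, assuming grayscale frames as numpy arrays, 8x8 blocks, an exhaustive +/-16 pixel search, and a sum-of-absolute-differences cost; these are illustrative choices, not the exact parameters of any particular MPEG profile.

```python
import numpy as np

BLOCK = 8     # block size used for matching (assumed)
SEARCH = 16   # motion vectors limited to +/-16 pixels

def motion_estimate(prev, cur):
    """Return one (dy, dx) motion vector per block of `cur`."""
    h, w = cur.shape
    vectors = {}
    for by in range(0, h - BLOCK + 1, BLOCK):
        for bx in range(0, w - BLOCK + 1, BLOCK):
            block = cur[by:by+BLOCK, bx:bx+BLOCK].astype(np.int32)
            best, best_cost = (0, 0), None
            # Exhaustive search of the +/-SEARCH neighbourhood in the previous frame.
            for dy in range(-SEARCH, SEARCH + 1):
                for dx in range(-SEARCH, SEARCH + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + BLOCK > h or x + BLOCK > w:
                        continue
                    cand = prev[y:y+BLOCK, x:x+BLOCK].astype(np.int32)
                    cost = np.abs(block - cand).sum()  # sum of absolute differences
                    if best_cost is None or cost < best_cost:
                        best, best_cost = (dy, dx), cost
            vectors[(by, bx)] = best
    return vectors
```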
It can then take all the "best fit" blocks from the previous frame and use them to generate a prediction of the next frame. It then computes the difference between this prediction and the actual frame, and stores that difference along with the set of motion vectors used to generate the prediction image.
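Continuing the sketch above (reusing numpy, BLOCK, and the vectors returned by motion_estimate), the prediction and the stored residual might look like this:

```python
def predict_and_residual(prev, cur, vectors):
    """Rebuild a prediction of `cur` from the best-fit blocks of `prev`,
    then return the residual the encoder would store alongside the vectors."""
    prediction = np.zeros_like(cur)
    for (by, bx), (dy, dx) in vectors.items():
        prediction[by:by+BLOCK, bx:bx+BLOCK] = prev[by+dy:by+dy+BLOCK,
                                                    bx+dx:bx+dx+BLOCK]
    # Residual = actual frame minus prediction (computed in a signed type).
    residual = cur.astype(np.int32) - prediction.astype(np.int32)
    return prediction, residual
```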
If nothing much has changed, say just the camera moving about a bit, the difference between the prediction and the actual frame data is very small and compresses easily. Also, the range of the motion vectors is typically limited to +/-16 pixels, and you only need one per block, so they take up very little space.
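For a rough, illustrative sense of scale (using the assumed 8x8 blocks above): a 640x480 frame has 4800 blocks, and at about 12 bits per vector (6 bits per component for a +/-16 range) that is roughly 7 KB of motion vectors, compared with about 300 KB for the raw luma plane.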