Sure, you can slice a video up into images and process them separately - that's apparently how Gemini Pro works: it uses one frame from every second of video.
But you still need a REALLY long context length to work with that information - the magic combination here is 1,000,000 tokens plus good multi-modal image inputs.
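To put rough numbers on why the context length matters, here's a back-of-the-envelope sketch. The one-frame-per-second rate and the 1,000,000 token window come from the comment above; the per-frame token cost is NOT something I know for certain, so `tokens_per_frame` is a hypothetical placeholder you'd swap for the real figure, and the function names are made up for illustration.

```python
def frame_token_budget(video_seconds, tokens_per_frame, frames_per_second=1):
    """Tokens needed to feed every sampled frame of a video into the context."""
    return video_seconds * frames_per_second * tokens_per_frame


def max_video_seconds(context_tokens, tokens_per_frame, frames_per_second=1):
    """Longest video (in whole seconds) whose sampled frames fit in the context."""
    return context_tokens // (tokens_per_frame * frames_per_second)


# ASSUMPTION: 250 tokens per frame is a placeholder, not a documented value.
TOKENS_PER_FRAME = 250
CONTEXT_TOKENS = 1_000_000

seconds = max_video_seconds(CONTEXT_TOKENS, TOKENS_PER_FRAME)
print(f"{seconds} seconds of video fit, about {seconds / 60:.0f} minutes")
```

With that placeholder, the full window holds 4,000 frames, so a little over an hour of video at one frame per second, and that's before leaving any room for the prompt or the answer. Shrink the window to something like 32,000 tokens and the same math gives you about two minutes.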