It doesn’t appear to be using the sound from the video, but elsewhere in the report on Gemini 1.5 Pro it mentions the model can take audio directly as an input, without first transcribing it to text (including a chart making the point that this is much more accurate than transcribing the audio with Whisper and then querying the transcript with GPT-4).
But I don’t think it went into detail about how exactly that works, and I’m not sure if the API/front end has a good way to handle that.
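For what it’s worth, here’s a minimal sketch of what passing audio directly might look like with the Python `google-generativeai` SDK, assuming its File API accepts audio files the same way it accepts other media (I haven’t verified this against 1.5 Pro specifically, and the file name and prompt are just placeholders):

```python
import google.generativeai as genai

# Assumes you have an API key with access to a Gemini 1.5 Pro model.
genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-pro")

# Upload the raw audio via the File API rather than transcribing it first.
audio_file = genai.upload_file(path="recording.mp3")  # hypothetical file

# Pass the audio reference and a text prompt together in one request.
response = model.generate_content(
    [audio_file, "Summarize what is said in this recording."]
)
print(response.text)
```

If that works the way the report implies, the model would be reasoning over the audio itself rather than a lossy Whisper transcript, which is presumably where the accuracy gap in that chart comes from.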