At the end of the article, a single image of the bookshelf uploaded to Gemini is 258 tokens. Gemini then responds with a listing of book titles, coming to 152 tokens.
Does anyone understand where the information for the response came from? That is, does Gemini hold onto the original uploaded non-tokenized image, then run an OCR on it to read those titles? Or are all those book titles somehow contained in those 258 tokens?
If it's the latter, it seems amazing that these tokens contain that much information.
I'm not sure about Gemini, but OpenAI's GPT-4V bills at roughly a token per 40x40px square. It isn't clear to me that these are actually processed as units; rather, it seems like they tried to approximate the cost structure to match text.
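For a sense of what that pricing rule would imply, here's a quick back-of-envelope in Python, assuming exactly the "1 token per 40x40px square" figure above (the real billing formula may well differ):

```python
import math

def approx_image_tokens(width_px: int, height_px: int, patch_px: int = 40) -> int:
    """Rough estimate assuming one billed token per patch_px x patch_px square."""
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)

print(approx_image_tokens(640, 480))  # 16 * 12 = 192 tokens under this assumption
```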
Remember, if it's using a similar tokeniser to GPT-4 (cl100k_base, iirc), each token is drawn from a vocabulary of ~100,000 entries, i.e. it has a one-hot dimension of ~100,000.
So 258 x 100,000 is a space of 25,800,000 values; stored as f16 (a total guess) that's about 51.6 MB, comfortably more than enough to represent the image at OK quality as a JPG.
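Writing that back-of-envelope out explicitly (the vocabulary size and the f16 choice are the guesses stated above, not known Gemini details):

```python
tokens = 258
vocab_dim = 100_000   # guessed from cl100k_base's vocabulary size
bytes_per_value = 2   # f16, a total guess

values = tokens * vocab_dim                          # 25,800,000
print(f"{values * bytes_per_value / 1e6:.1f} MB")    # 51.6 MB if each token were a full f16 vector over the vocab
```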
But they are not a "single integer" either, as in a byte... I don't have any good examples to hand, but I'm pretty sure the tokens are in the range of thousands of dimensions. Each one has to encode the properties of the image patch it derives from, and even a small 40x40 RGB patch has plenty of information you have to retain.
In the given example the video was condensed to a sequence of 258 tokens, and it was clearly a very minimalist, almost entirely OCR-based extraction from the video.
Yeah, but we're not talking about LLMs here, we're talking about vision transformers, which don't use the same kind of token vocabulary to produce embeddings from the input that LLMs do. Per token, pixel data is much denser than a few characters.
I looked it up: the original ViT models directly projected, for example, 16x16 pixel patches into 768-dimensional "tokens". So a 224x224 image ended up as 14*14 = 196 "tokens", each of which is a 768-dimensional vector. The positional encoding is just added to this vector.
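For the shapes involved, here's a minimal NumPy sketch of that projection, using the ViT-Base numbers quoted above (the weights are random placeholders, not a trained model):

```python
import numpy as np

patch, dim = 16, 768                     # 16x16 pixel patches -> 768-dim "tokens"
image = np.random.rand(224, 224, 3)      # stand-in for a real RGB image

# Cut the image into non-overlapping 16x16 patches and flatten each one (16*16*3 = 768 values).
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)   # (196, 768)

# Learned linear projection in a real ViT; random here just to show the shapes.
W = np.random.randn(patch * patch * 3, dim)
pos = np.random.randn(patches.shape[0], dim)   # positional encoding, added element-wise

tokens = patches @ W + pos
print(tokens.shape)   # (196, 768): 14*14 = 196 patch tokens, each a 768-dimensional vector
```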
>Yeah but we're not talking about LLMs here but vision transformers
We ultimately are. Gemini is a multimodal model whose core function is an LLM. This doesn't mean that everything flows through the same pathway -- different modalities have different paths -- but eventually there is a fusion through which a common representation appears. It's where the worlds combine. That common parlance is often tokens, though it obviously depends upon the architecture, and we simply don't have those details for Gemini (the paper is extremely superficial). The fact that it will ingest massive videos and then post-facto answer arbitrary queries about them is a good clue, however.
>This blog-post has the specific number
It's a great link and an enjoyable read, and while the ViT plays a critical role in virtually all image analysis pipelines, including in Gemini, where it is part of OCR, object detection, etc., the numbers you are referring to do not map to tokens.
E.g. the 768 dimensions are nothing more than the underlying image data for the tile, i.e. 16x16x3 channels. I'm unaware of any ViT resources that refer to those vectors (vectorized because that's the form GPUs like) as tokens. This system could lazily reuse the term, but the way processing happens in ViTs would make that a completely irrational overlap of terms.
The role that a token plays in that description is the classifier -- basically the output that classifies each tile.
Ultimately, the number of tokens that Google or OpenAI assign to processing an image or video is a billing artifact, because tokens are the measure by which things are billed. However, you can ask these systems for the tokens representing an image, and it will be exactly what one would expect. Indeed, the brilliance of image (and thus video) analysis in these multimodal systems is not nearly as deep as first glances might assume, and often it can derive nothing more than the most obvious classifications, e.g. classifications made without knowing anything about what the user specifically wants. It is usually fantastic at things like OCR, which happens to be a very common need.
These systems obviously have different usage patterns. I can do simultaneous processing, where the image and the command work in concert and the image analysis deep-dives on exactly those elements that are wanted (but that would otherwise be ignored). Or I can do the classic pattern of feeding in a video or an image and then asking questions, where the dominant model is to tokenize the video or images using the common flow (OCR, object detection, etc.), create a token narrative, and then answer the question from that narrative.
The whole matter of tokens from video is one with a lot of ambiguity, and it is often presented as if they were some unique, weird encoding of the contents of the video.
But logically the only possible tokenization of videos (or images, or a series of images a la video) is basically an image-to-text model that takes each frame and generates descriptive language -- English, in Gemini's case -- to describe the contents of the video.
e.g. A bookshelf with a number of books. The books seen are "...", "...", etc. A figurine of a squirrel. A stuffed owl.
And so on. So the tokenization by design would include the book titles as the primary information, as that's the easiest, most proven extraction from images.
From a video, such tokenization would also include time-flow information. But ultimately a lot of the examples people view are far less comprehensive than they think.
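Concretely, the flow described above would look something like the sketch below. This is a rendering of the claim in this comment, not a documented description of Gemini's internals, and describe_frame() / text_tokenizer() are hypothetical placeholders:

```python
# Hypothetical sketch of the claimed flow: each frame becomes a textual description
# (OCR, object labels, etc.) and only that text is tokenized.

def video_to_token_narrative(frames_at_1fps, describe_frame, text_tokenizer):
    """frames_at_1fps: one frame per second of video, as assumed above."""
    narrative = []
    for second, frame in enumerate(frames_at_1fps):
        caption = describe_frame(frame)           # e.g. 'Bookshelf. Titles: "...", "...". A stuffed owl.'
        narrative.append(f"[t={second}s] {caption}")   # time-flow information for video
    return text_tokenizer("\n".join(narrative))   # ordinary text tokens; questions get answered from this narrative
```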
It isn't surprising that so many demonstrations of multimodal models include an image with text on it somewhere, leaning on OCR.
>The visual encoding of Gemini models is inspired by our own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al., 2022), with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b).
These are the papers Google say the multimodality in Gemini is based on.
The images are encoded. The encoding process tokenizes the images, and the transformer is trained to predict text given both the text and image encodings.
There is no conversion to text for Gemini. That's not where the token number comes from.
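For what "discrete image tokens" means in the cited image-generation papers, as I understand them, it is a VQ-style codebook lookup: continuous patch embeddings get snapped to the nearest entry in a learned codebook and only the integer indices are kept. A minimal sketch with illustrative sizes (none of these numbers are Gemini's):

```python
import numpy as np

codebook = np.random.randn(8192, 256)         # 8192 learned code vectors, 256-dim each (illustrative sizes)
patch_embeddings = np.random.randn(258, 256)  # continuous encoder outputs for one image

# Squared distance from every patch embedding to every codebook entry,
# computed as ||x||^2 - 2*x.c + ||c||^2 to avoid a huge broadcast.
dists = ((patch_embeddings ** 2).sum(1, keepdims=True)
         - 2 * patch_embeddings @ codebook.T
         + (codebook ** 2).sum(1))
image_token_ids = dists.argmin(axis=1)        # shape (258,): plain integer token IDs

# These integer IDs are what an autoregressive transformer can ingest (or emit) alongside text token IDs.
print(image_token_ids[:5])
```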
>As much as I would love to waste my time replying again to your nonsense, instead I'll just politely chuckle and move on. Good luck.
You have your head so far up your ass that even direct confirmation from the model builders themselves won't sway you. The comment wasn't for you; it links sources for the original poster and for the curious.
You see I don't have to hide behind a veneer of "Trust me bro. It works like this".
>even direct confirmation from the model builders themselves
Linking papers that you clearly haven't read and can't contextually apply -- as with the ViT or your misunderstanding of image tiling -- is not the sound strategy you hope it is. It doesn't confirm your claims.
I'm not asking anyone to "Trust me bro". So...have you called the Gemini Pro 1.5 API and tokenized an image or a video yet?
There is a certain element of this that is just spectacularly obvious to anyone who spends even a moment of critical thought on it -- if they're so capable. Your claim is that a high-resolution image is tiled into a 16x16 array... and that the magic model can, at some later point, magically extract any and all details on demand, such as OCR, from that 16x16. This betrays a fundamental ignorance of even the most basic information theory.
Again, I would love to just block you and avoid the defensive insults you keep hurling, but this site lacks the ability. Stop replying to me, however many more contextually nonsensical citations you think will save face. Thanks.
This is not at all how this works. There's no separate model. Yes, there's unique tokenization, if not for the video as a whole then for each image. The whole video is ~1800 tokens because Gemini gets video as a series of images in context at 1 frame/s. Each image is about 258 tokens because a token, in image-transformer terms, is literally a patch of the image.
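Taking the figures in this comment at face value, the arithmetic works out to a clip of only a few seconds:

```python
tokens_per_frame = 258     # per-image figure from the article
total_video_tokens = 1800  # approximate figure quoted above
frames = total_video_tokens / tokens_per_frame
print(f"~{frames:.0f} frames, i.e. ~{frames:.0f} seconds of video at 1 frame/s")  # ~7
```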
You can literally convert the tokens returned from a video to text. What do you even think tokens are?
Like seriously, before you write another word on this feel free to call the API and retrieve tokens for a video or image. Now go through the magical process of converting those tokens back to their text form. It isn't some magical hyper-dimensional, inside-out spatial encoding that yields impossible compression.
This process is obvious and logical if actually thought through.
>Each image is about 258 tokens
Because Google set that as the "budget" and truncates accordingly. Again, call the API with an image or video and then convert those tokens to text.
>You can literally convert the tokens returned from a video to text. What do you even think tokens are?
Tokens are patches of each image.
It's amazing to me how people will confidently spout utter nonsense. It only takes looking at the technical report for the Gemini models to see that you're completely wrong.
>The visual encoding of Gemini models is inspired by our own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al., 2022), with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b).
>It's amazing to me how people will confidently spout utter nonsense.
Ok.
You seem to be conflating some things, which became evident when you suddenly dropped the ViT paper in as evidence. During the analysis of images, tiles and transformers (such as a ViT) are used. This is the model of processing the image to obtain useful information, such as doing OCR (you might notice that word is used repeatedly in the Google paper).
But to actually use the image, context has to be drawn from it. This is pretty bog standard OCR, object detection and classification, sentiment analysis, etc. This yields tokens.
Have you called the API and generated tokens from an image yet? Try it. You'll find they aren't as magical and mysterious as you believe, and your quasi-understanding of a ViT is not relevant to the tokens retrieved from a multimodal LLM.
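If you do want to try that call, a minimal sketch with the google-generativeai Python SDK looks roughly like this; the file name and key are placeholders, and it's worth checking the current docs for exactly what count_tokens accepts and what, if anything, exposes the underlying token strings:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")          # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")

img = Image.open("bookshelf.jpg")                # hypothetical local image
# Reports the token count Gemini would charge for this image + prompt.
print(model.count_tokens([img, "List the book titles on the shelf."]))
```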
There is the notion of semantic image tokens, which is an internal property of the image-analysis engine (and, conversely, the generation engine), but it is not what we're talking about. If an image were somehow collapsed into a 16x16 array of integers and the model could still tell you the words on the books and the objects that appear, that would be amazing. Too amazing.
>But to actually use the image, context has to be drawn from it. This is pretty bog standard OCR, object detection and classification, sentiment analysis, etc. This yields tokens
None of that is necessary for an autoregressive transformer. You can train the transformer to predict text tokens given interleaved image and text input tokens in the context window.
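A shape-level sketch of that interleaving, with made-up dimensions (the real Flamingo/PaLI architectures are of course more involved than a plain concatenation):

```python
import numpy as np

d_model = 1024
image_tokens = np.random.randn(258, d_model)  # stand-in for the vision encoder's output for one image
text_tokens = np.random.randn(12, d_model)    # stand-in for the embedded text prompt

# The decoder sees a single interleaved sequence and is trained to predict the next text token;
# no explicit OCR or captioning stage is required for that objective to work.
context = np.concatenate([image_tokens, text_tokens], axis=0)
print(context.shape)  # (270, 1024)
```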
Google have already told us how this works. Read the Flamingo or PaLI papers. You are wrong. Very wrong.
It's incredible that people will crucify LLMs for "hallucinating" but then there are humans like you running around.
Well, aside from the edited-in bit about OCR. Of course there isn't a separate run to do OCR, because that was literally the first step of the image analysis. You know, before the conversion to simple tokens.
You understand that OCR is the process of extracting text from images, right? You know, such as what Gemini does, and which they reference repeatedly in their paper. I have absolutely no idea why you repeatedly make some bizarre distinction about it being a "separate process".
Okay, it's been fun talking to you but feel free to have the last word. Good luck.