Huh, are the image vector embeddings implicitly doing OCR as well? It seems like the meme search is matching on the text rendered in the memes as well as the visual content, though it's not entirely clear.
CLIP does not have explicit OCR support, but it does somewhat coincidentally have a slight understanding of text. This is because its training captions often contain (some of) the text that appears in the image.
I think the SigLIP models' dataset (WebLI) includes OCRed text too, so they have very good text understanding. I tested a bunch of things for my own meme search engine.
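A minimal sketch of that kind of test, assuming the Hugging Face `transformers` SigLIP checkpoint `google/siglip-base-patch16-224` (the checkpoint name, file path, and prompts are illustrative, not from the thread):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "google/siglip-base-patch16-224"  # assumed checkpoint; any SigLIP variant should work
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("meme.png")  # hypothetical meme image with rendered caption text
texts = [
    "a meme that says 'one does not simply'",  # prompt matching the rendered text
    "a photo of a golden retriever",           # unrelated control prompt
]

# SigLIP's text tower expects max_length padding
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP was trained with a sigmoid loss, so apply sigmoid per image-text pair
# rather than a softmax over prompts
probs = torch.sigmoid(outputs.logits_per_image)[0]
for text, p in zip(texts, probs):
    print(f"{p:.3f}  {text}")
```

If the model really has picked up text understanding from OCR-heavy training data, the prompt quoting the rendered text should score noticeably higher than the control.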