Why would anyone expect cosine-similarity to be a useful metric? In the real world, the arbitrary absolute position of an object in the universe (if it could be measured) isn't that important; it's the directions and distances to nearby objects that matter most.
It's my understanding that the delta between two word embeddings gives a direction, and the magic comes from using those directions to get to new words. The oft-cited example is King - Man + Woman = Queen [1]
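As a rough illustration of that arithmetic (a minimal sketch, assuming gensim and its downloadable pretrained Google News word2vec vectors; the model name is just what gensim's downloader calls them):

    import gensim.downloader as api

    # ~1.6 GB download of the pretrained Google News word2vec vectors
    wv = api.load("word2vec-google-news-300")

    # king - man + woman, nearest neighbours ranked by cosine similarity
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # 'queen' is expected at or near the top of the list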
The scale of word embeddings (i.e. their distance from the origin) mainly measures how common the word is in the training corpus. This is a feature of almost all training objectives since word2vec (though some normalize the vectors).
Uncommon words have more information content than common words, so common words having a larger embedding scale is an issue here.
If you want to measure similarity, you need a scale-free measure. Cosine similarity (an angular distance) gives you that without normalizing.
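A quick way to see the scale-free point with plain numpy (made-up vectors, just for illustration):

    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([1.0, 2.0, 3.0])
    b = 2 * a                        # same direction, twice the norm (think: a much more frequent word)

    print(np.linalg.norm(a - b))     # ~3.74 -- Euclidean distance is sensitive to the scale
    print(cosine(a, b))              # 1.0   -- cosine similarity only sees the direction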
If you normalize your vectors, cosine similarity is equivalent to Euclidean distance (they order pairs the same way). Normalizing your vectors also destroys information, which we'd rather avoid.
There's no real hard theory for why the angle between embeddings is meaningful beyond this practical knowledge, to my understanding.
If all the vectors are on the unit sphere, then cosine similarity = dot product. But then the dot product is just a linear transformation away from the (squared) Euclidean distance: |x - y|^2 = 2 - 2 x·y.
If you're using it in a machine learning model, things that are one linear transform away are more or less the same (might need more parameters/layers/etc.)
If you're using it for classical statistics uses (analytics), right, they're not equivalent and it would be good to remember this distinction.
To be very explicit: if |x| = |y| = 1, we have |x - y|^2 = |x|^2 - 2 x·y + |y|^2 = 2 - 2 x·y = 2 - 2 cos(θ). So they are not identical, but minimizing the Euclidean distance between two unit vectors is the same as maximizing their cosine similarity.
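A quick numeric sanity check of that identity (plain numpy, random unit vectors):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=128); x /= np.linalg.norm(x)   # unit vector
    y = rng.normal(size=128); y /= np.linalg.norm(y)   # unit vector

    lhs = np.linalg.norm(x - y) ** 2
    rhs = 2 - 2 * (x @ y)                              # 2 - 2*cos(θ)
    print(np.isclose(lhs, rhs))                        # True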
Cosine-similarity is a useful metric. The cases where it is useful are models that have been trained specifically to produce a meaningful cosine distance (e.g. OpenAI's CLIP [1], Sentence Transformers [2]) - but those are exactly the kinds of models most people are using when they use cosine distances.
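For example (a minimal sketch with the sentence-transformers package; the model name "all-MiniLM-L6-v2" is just a commonly used default, not something from the comments above):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([
        "The cat sat on the mat.",
        "A feline rested on the rug.",
        "Quarterly earnings beat expectations.",
    ])

    # pairwise cosine similarities; the two cat sentences should score
    # much higher with each other than with the earnings sentence
    print(util.cos_sim(emb, emb))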
> It's my understanding that the delta between two word embeddings gives a direction, and the magic comes from using those directions to get to new words... it's the directions and distances to nearby objects that matter most
Cosine similarity is a kind of "delta" / inverse distance between the representations of two entities, in the case of these models.
It’s a mistake to think of vectors as coordinates of objects in space, though. You can visualize them like that, but that’s not what they are. The vectors are the objects.
A vector is just a list of n numbers. Embedded into an n-dimensional space, a vector is a distance in a direction. It isn’t ‘the point you get to by going that distance in that direction from the origin of that space’. You don’t need a space to have an origin for the embedding to make sense - for ‘cosine similarity’ to make sense.
Cosine similarity is just ‘how similar is the direction these vectors point in’.
The geometric intuition of ‘angle between’ actually does a disservice here when we are talking about high dimensional vectors. We’re talking about things that are much more similar to functions than spatial vectors, and while you can readily talk about the ‘normalized dot product’ of two functions it’s much less reasonable to talk about the ‘cosine similarity’ between them - it just turns out that mathematically those are equivalent.
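To make the ‘functions’ framing concrete (a small numpy sketch treating sampled functions as high-dimensional vectors):

    import numpy as np

    t = np.linspace(0, 2 * np.pi, 1000)
    f = np.sin(t)                  # a "function" represented as a 1000-dimensional vector
    g = np.sin(t + 0.3)            # a slightly shifted version of it
    h = np.cos(2 * t)              # a quite different function

    def normalized_dot(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    print(normalized_dot(f, g))    # ~0.95: nearly the same "direction"
    print(normalized_dot(f, h))    # ~0.0: roughly orthogonal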
From my experience trying to train embeddings with transformers, using cosine similarity is less restrictive for the model than Euclidean distance. Both work, but cosine similarity seems to give slightly better performance.
Another thing you have to keep in mind is that these embeddings live in n-dimensional space. Intuitions about the real world do not apply there.
The word2vec inspired tricks like king-man+woman only work if the embedding is trained with synonym/antonym triplets to give them the semantic locality that allows that kind of vector math. This isn't always done, even some word2vec re-implementations skip this step completely. Also, not all embeddings are word embeddings.
My understanding was that Word2Vec[1] was trained on Wikipedia and other such texts, not artificially constructed things like the triplets you suggest. There's an inherent structure present in human languages that enable the "magic" of embeddings to work, as far as I can tell.
Has there been any rigorous evaluation of word2vec calculating 'king-man+woman=queen' associations? I only recall the author providing some cherry-picked examples from their results, which I suppose makes it a seminal AI paper.
The original paper included source code, and that has their test data and results -- it gets ~77% accuracy on about 20k example word analogies (with 99.7% coverage), and 78% accuracy on phrases with 77% coverage (and a much smaller data set, 3,218 examples). You can see the test sets here:
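For anyone who wants to reproduce that kind of number, a sketch with gensim (assuming the questions-words.txt analogy set that ships with gensim's test data; exact accuracy depends on the vectors and settings):

    import gensim.downloader as api
    from gensim.test.utils import datapath

    wv = api.load("word2vec-google-news-300")

    # ~19.5k analogy questions split into semantic and syntactic sections
    score, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
    print(score)   # overall accuracy on the analogies it could answer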
Ah, yes, I may have misremembered or misunderstood when looking at the training data and model definition; it would have been about a decade ago now. Or perhaps I was thinking of an unrelated experiment that used chosen analogies during training, but yeah, that isn't part of the original paper.
There is an inherent structure in language. Embeddings do not and will not capture it. It's why they do not work. Their ability to form grammatical sentences with high accuracy is part of the illusion that you have been understood.
> It's my understanding that the delta between two word embeddings gives a direction, and the magic comes from using those directions to get to new words. The oft-cited example is King - Man + Woman = Queen [1]
When did this view fall from favor?
[1] https://www.technologyreview.com/2015/09/17/166211/king-man-...