Why would anyone expect cosine-similarity to be a useful metric? In the real world, the arbitrary absolute position of an object in the universe (if it could be measured) isn't that important; it's the directions and distances to nearby objects that matter most.
It's my understanding that the delta between two word embeddings gives a direction, and the magic comes from using those directions to get to new words. The oft-cited example is King - Man + Woman = Queen [1]
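As a rough illustration of that arithmetic (a minimal sketch, assuming gensim and its downloadable pretrained Google News word2vec vectors; the model name is just what gensim's downloader calls them):

    import gensim.downloader as api

    # ~1.6 GB download of the pretrained Google News word2vec vectors
    wv = api.load("word2vec-google-news-300")

    # king - man + woman, nearest neighbours ranked by cosine similarity
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # 'queen' is expected at or near the top of the list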
The scale of word embeddings (i.e. their distance from the origin) mainly measures how common the word is in the training corpus. This is a feature of almost all training objectives since word2vec (though some normalize the vectors).
Uncommon words have more information content than common words, so common words having a larger embedding scale is an issue here.
If you want to measure similarity, you need a scale-free measure. Cosine similarity (an angular distance) gives you that without normalizing.
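A quick way to see the scale-free point with plain numpy (made-up vectors, just for illustration):

    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([1.0, 2.0, 3.0])
    b = 2 * a                        # same direction, twice the norm (think: a much more frequent word)

    print(np.linalg.norm(a - b))     # ~3.74 -- Euclidean distance is sensitive to the scale
    print(cosine(a, b))              # 1.0   -- cosine similarity only sees the direction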
If you normalize your vectors, cosine similarity is equivalent to Euclidean distance (they order pairs the same way). Normalizing your vectors also destroys information, which we'd rather avoid.
There's no real hard theory for why the angle between embeddings is meaningful beyond this practical knowledge, to my understanding.
If all the vectors are on the unit sphere, then cosine similarity = dot product. But then the dot product is just a linear transformation away from the (squared) Euclidean distance: |x - y|^2 = 2 - 2 x·y.
If you're using it in a machine learning model, things that are one linear transform away are more or less the same (might need more parameters/layers/etc.)
If you're using it for classical statistics uses (analytics), right, they're not equivalent and it would be good to remember this distinction.
To be very explicit: if |x| = |y| = 1, we have |x - y|^2 = |x|^2 - 2 x·y + |y|^2 = 2 - 2 x·y = 2 - 2 cos(θ). So they are not identical, but minimizing the Euclidean distance between two unit vectors is the same as maximizing their cosine similarity.
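A quick numeric sanity check of that identity (plain numpy, random unit vectors):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=128); x /= np.linalg.norm(x)   # unit vector
    y = rng.normal(size=128); y /= np.linalg.norm(y)   # unit vector

    lhs = np.linalg.norm(x - y) ** 2
    rhs = 2 - 2 * (x @ y)                              # 2 - 2*cos(θ)
    print(np.isclose(lhs, rhs))                        # True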
Cosine-similarity is a useful metric. The cases where it is useful are models that have been trained specifically to produce a meaningful cosine distance (e.g. OpenAI's CLIP [1], Sentence Transformers [2]) - but those are exactly the kinds of models most people are using when they use cosine distances.
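For example (a minimal sketch with the sentence-transformers package; the model name "all-MiniLM-L6-v2" is just a commonly used default, not something from the comments above):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([
        "The cat sat on the mat.",
        "A feline rested on the rug.",
        "Quarterly earnings beat expectations.",
    ])

    # pairwise cosine similarities; the two cat sentences should score
    # much higher with each other than with the earnings sentence
    print(util.cos_sim(emb, emb))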
> It's my understanding that the delta between two word embeddings gives a direction, and the magic comes from using those directions to get to new words... it's the directions and distances to nearby objects that matter most
Cosine similarity is a kind of "delta" / inverse distance between the representations of two entities, in the case of these models.
It’s a mistake to think of vectors as coordinates of objects in space, though. You can visualize them like that, but that’s not what they are. The vectors are the objects.
A vector is just a list of n numbers. Embedded into an n-dimensional space, a vector is a distance in a direction. It isn’t ‘the point you get to by going that distance in that direction from the origin of that space’. You don’t need a space to have an origin for the embedding to make sense - for ‘cosine similarity’ to make sense.
Cosine similarity is just ‘how similar is the direction these vectors point in’.
The geometric intuition of ‘angle between’ actually does a disservice here when we are talking about high dimensional vectors. We’re talking about things that are much more similar to functions than spatial vectors, and while you can readily talk about the ‘normalized dot product’ of two functions it’s much less reasonable to talk about the ‘cosine similarity’ between them - it just turns out that mathematically those are equivalent.
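To make the ‘functions’ framing concrete (a small numpy sketch treating sampled functions as high-dimensional vectors):

    import numpy as np

    t = np.linspace(0, 2 * np.pi, 1000)
    f = np.sin(t)                  # a "function" represented as a 1000-dimensional vector
    g = np.sin(t + 0.3)            # a slightly shifted version of it
    h = np.cos(2 * t)              # a quite different function

    def normalized_dot(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    print(normalized_dot(f, g))    # ~0.95: nearly the same "direction"
    print(normalized_dot(f, h))    # ~0.0: roughly orthogonal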
From my experience trying to train embeddings with transformers, using cosine similarity is less restrictive for the model than Euclidean distance. Both work, but cosine similarity seems to give slightly better performance.
Another thing you have to keep in mind is that these embeddings live in n-dimensional space. Intuitions about the real world do not apply there.
The word2vec inspired tricks like king-man+woman only work if the embedding is trained with synonym/antonym triplets to give them the semantic locality that allows that kind of vector math. This isn't always done, even some word2vec re-implementations skip this step completely. Also, not all embeddings are word embeddings.
My understanding was that Word2Vec[1] was trained on Wikipedia and other such texts, not artificially constructed things like the triplets you suggest. There's an inherent structure present in human languages that enable the "magic" of embeddings to work, as far as I can tell.
Has there been any rigorous evaluation of word2vec calculating 'king-man+woman=queen' associations? I only recall the author providing some cherry-picked examples from their results, which I suppose makes it a seminal AI paper.
The original paper included source code, and that has their test data and results -- it gets ~77% accuracy on about 20k example word analogies (with 99.7% coverage), and 78% accuracy on phrases with 77% coverage (and a much smaller data set, 3,218 examples). You can see the test sets here:
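For anyone who wants to reproduce that kind of number, a sketch with gensim (assuming the questions-words.txt analogy set that ships with gensim's test data; exact accuracy depends on the vectors and settings):

    import gensim.downloader as api
    from gensim.test.utils import datapath

    wv = api.load("word2vec-google-news-300")

    # ~19.5k analogy questions split into semantic and syntactic sections
    score, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
    print(score)   # overall accuracy on the analogies it could answer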
Ah, yes, I may have misremembered or misunderstood when looking at the training data and model definition; it would have been about a decade ago now. Or perhaps I was thinking of an unrelated experiment that used chosen analogies during training, but yeah, that isn't part of the original paper.
There is an inherent structure in language. Embeddings do not and will not capture it. It's why they do not work. Their ability to form grammatical sentences with high accuracy is part of the illusion that you have been understood.
> It's my understanding that the delta between two word embeddings gives a direction, and the magic comes from using those directions to get to new words. The oft-cited example is King - Man + Woman = Queen [1]
When did this view fall from favor?
[1] https://www.technologyreview.com/2015/09/17/166211/king-man-...