Whether or not the cosine similarity of either pair is higher depends on the mapping you choose from the strings to the embedding vectors. That mapping can be whichever function you like, and your result will depend entirely on it.
If you choose a straight linear mapping of tokens to a number, then you'd be right.
Extending that, if you choose any mapping which does not do a more extensive remapping from raw syntactic structure to some sort of semantic representation, you'd be right.
That is precisely why we increasingly use learned models to create embeddings, rather than simpler approaches, before applying a similarity metric, whether cosine similarity or something else.
Put another way, there is no inherent reason why you couldn't even have a model where the embeddings for 1 and 3 are identical, and so it is meaningless to talk about the cosine similarity of your sentences without setting out your assumptions about how you will create embeddings from them.
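To make the point concrete, here is a minimal sketch assuming one particular "simplistic" mapping, a bag-of-words vector over a tiny hypothetical vocabulary. Under that mapping, two sentences with opposite meanings come out highly similar, purely because of their shared surface structure:

```python
import math

def cosine(a, b):
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Naive mapping (an assumption for illustration): each sentence becomes
# a vector of token counts over a fixed vocabulary.
vocab = ["i", "love", "hate", "cats"]

def bag_of_words(sentence):
    tokens = sentence.lower().split()
    return [tokens.count(w) for w in vocab]

a = bag_of_words("i love cats")   # [1, 1, 0, 1]
b = bag_of_words("i hate cats")   # [1, 0, 1, 1]
print(cosine(a, b))               # ~0.667, despite opposite meanings
```

A learned embedding model is free to place those same two sentences far apart, or anywhere else, which is exactly why the choice of mapping has to be stated before the similarity number means anything.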
> meaningless to talk about the cosine similarity of your sentences without setting out your assumptions about how you will create embeddings from them.
I agree, but from a generality point of view, you have to settle on a few shared assumptions to compare between models at all. If you can't, then benchmarks are useless too, outside of extremely narrow measures.
I only addressed structure in the parent, and sure, that can be too generic a statement since it touches only on structure. But I would still assert that structure is an important feature, and arguably a required or at least dominant one when you want to deliver a product for general use.
Given that, I don't think I get much more incorrect by going beyond a few dimensions.
> Discrete entities are often embedded via a learned mapping to dense real-valued vectors in a variety of domains.
From that sentence alone, it is clear that a comparison based on the similarity of the textual version of the sentences is irrelevant to the evaluation in the paper. The paper consistently talks in terms of "learned embeddings" rather than simplistic direct mappings of words.
It's meaningful to talk about cosine similarity for anything that you can quantify in ways such that the cosine similarity reflects a measure you care about. Same applies for any function. If it works, it's meaningful to talk about it whether or not it has a reasonable interpretation beyond that.
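As a hypothetical illustration of that point, the vectors don't need to come from text at all. Suppose we count each user's visits to a few page categories (the data here is invented for the example); cosine similarity then reflects similarity of interest *profiles* while ignoring overall activity level, which may be exactly the measure we care about:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Invented counts of visits to [news, sports, cooking] pages.
alice = [10, 0, 5]   # mostly news, some cooking
bob   = [2, 0, 1]    # same profile, far less active
carol = [0, 8, 0]    # only sports

print(cosine(alice, bob))    # ~1.0: same profile despite different scale
print(cosine(alice, carol))  # 0.0: no overlap in interests
```

If a quantification like this tracks what you actually want to measure, cosine similarity is meaningful for it; if not, no interpretation of the number will rescue it.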