Is there a decent way of converting to a structure with a very constrained vocabulary?
For example, given some input text, converting it to something like {"OID-189": "QQID-378", "OID-478": "QQID-678"}, where the OID and QQID dictionaries can each contain millions of items, each defined by a description. The mapping rule would essentially be: pick whatever is closest in semantic space to the descriptions given in the dictionary.
I know this should be solvable with local LLMs and BERT cosine similarity (it isn't exactly that, but it's a start on the idea), but is there a way to do this with decoder models rather than encoder models plus other logic?
You can fine-tune custom GPT-3 models, and Azure now has vector database integration for GPT-based models in the cloud. You can feed it the data and ask it for the embedding lookup, etc.
You can also host a vector database yourself and fill it with embeddings from the OpenAI GPT-3 API.
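The lookup itself is just nearest-neighbour search over embedding vectors. Here's a minimal sketch with toy hand-written vectors standing in for real embeddings (the `QQID-*` IDs and the vectors are made up for illustration); a vector database would replace the brute-force scan with an approximate-nearest-neighbour index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy stand-ins for embeddings of QQID descriptions; in practice these
# would come from an embedding API or a local embedding model.
qqid_vectors = {
    "QQID-378": [0.9, 0.1, 0.0],
    "QQID-678": [0.1, 0.8, 0.2],
}

def nearest_qqid(vec):
    # Brute-force scan over all candidates; a vector DB does this
    # with an ANN index so it scales to millions of entries.
    return max(qqid_vectors, key=lambda k: cosine(vec, qqid_vectors[k]))

print(nearest_qqid([0.85, 0.15, 0.05]))  # prints "QQID-378"
```

The same loop runs per input field: embed the field's text, look up the nearest OID description, embed the field's value, look up the nearest QQID description, and emit the pair.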