> There should be somewhere in the corpus, "the is spelled t h e" that this system can use to pull this out.
Such an approach would require an enormous table containing every written word, including first and last names, and it would still fail for made-up words.
A more tractable approach would be to give it the map between individual tokens and their component letters, but then you have the problem that this mapping depends on the specific encoding (tokenization) used by the model, which varies between models. You could teach it to the model during fine-tuning, though.
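
To make the token-to-letters idea concrete, here is a rough sketch. The vocabulary `TOY_VOCAB` and the helper names are made up for illustration, and the greedy longest-match tokenizer is a simplification, not how any particular model's tokenizer actually works. The point is only that the table has one entry per token rather than one per word, so it also covers words never seen before.

```python
# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries
# and the entries differ from model to model.
TOY_VOCAB = ["the", "th", "er", "e", "t", "h", "r", "a", "n", "o", "s"]


def build_letter_map(vocab):
    """Map each token to the list of letters it is made of."""
    return {token: list(token) for token in vocab}


def tokenize(word, vocab):
    """Greedy longest-match split of a word into tokens (a simplification)."""
    known = set(vocab)
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in known:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return tokens


def spell(word, vocab):
    """Spell a word by looking up the letters of each of its tokens."""
    letter_map = build_letter_map(vocab)
    letters = []
    for token in tokenize(word, vocab):
        letters.extend(letter_map[token])
    return letters


print(tokenize("other", TOY_VOCAB))
print(spell("other", TOY_VOCAB))
```

The table here is only as large as the vocabulary, but it has to be rebuilt for each model, which is the dependence on the encoding mentioned above.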