
The problem with GPT and other LLMs is that they don't tokenize at the word or morpheme level; they use byte-pair-encoding subword pieces driven by corpus statistics, so you get tokens like `!"` instead of two separate tokens. That makes it harder to write custom tools on top of, unlike e.g. the output/model of things like the Universal Dependencies project.
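Rough illustration with OpenAI's tiktoken library (the cl100k_base encoding used by recent GPT models; adjust as needed):

    # Inspecting GPT-style BPE tokenization (pip install tiktoken).
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for tid in enc.encode('He said, "stop!"'):
        print(tid, repr(enc.decode([tid])))
    # The pieces follow BPE merge statistics, not word/morpheme
    # boundaries, so punctuation runs like !" can come out fused.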


Do you strictly need that level of tokenisation precision to meet your high-level goals?


This is my first reaction as well. Talking about tokenization and POS tagging is getting lost in the weeds when one has goals like this:

>I also want to be able to assess how much of the text is about a given topic, so that if I'm interested in reading a detective story from e.g. the Project Gutenberg collection, I don't want it to pick up a story where a detective is only mentioned in one paragraph.

This is more of an NLU problem than an NLP problem, isn't it? It's like tracking how much of a Harry Potter book is Voldemort content without knowing ahead of time that he may be referred to as He Who Must Not Be Named, You-Know-Who, The Dark Lord, and so on. You'd first have to identify the thing you're interested in, then learn when the characters or the author invent new ways to refer to it, and carry all of those forward to find new instances. Fun!
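A crude sketch of just the counting part, assuming you already have an alias list (hand-written here; in reality growing it via coreference resolution or entity linking is the hard bit):

    # Fraction of paragraphs mentioning an entity, given a
    # hand-maintained (hypothetical) alias set.
    import re

    ALIASES = ["Voldemort", "He Who Must Not Be Named",
               "You-Know-Who", "The Dark Lord"]
    pattern = re.compile("|".join(re.escape(a) for a in ALIASES),
                         re.IGNORECASE)

    def topic_coverage(paragraphs):
        hits = sum(1 for p in paragraphs if pattern.search(p))
        return hits / max(len(paragraphs), 1)

    paragraphs = open("book.txt").read().split("\n\n")  # placeholder path
    print(f"{topic_coverage(paragraphs):.0%} of paragraphs mention him")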


I also want to tag and highlight those parts of the document. For that, I need to know where the label starts and ends, which you can't really do when you don't have control over the tokens.

It's also hard to write custom inference/tagging rules, like in the case you mentioned w.r.t. Voldemort, if you don't know what the tokens look like.
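Classic pipelines do expose those boundaries, for what it's worth. A minimal spaCy sketch (assumes the en_core_web_sm model is installed):

    # spaCy keeps character offsets for every token and entity,
    # which is exactly what a highlighter needs.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Sherlock Holmes examined the letter in Baker Street.")
    for ent in doc.ents:
        # start_char/end_char index into the original string
        print(ent.text, ent.label_, ent.start_char, ent.end_char)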


spaCy is a decent suggestion here. It has pretty good ways of writing tagging rules.
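For example, a PhraseMatcher rule that folds several aliases into one label (alias list illustrative; assumes en_core_web_sm):

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load("en_core_web_sm")
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    matcher.add("VOLDEMORT", [nlp.make_doc(t) for t in
                ("Voldemort", "You-Know-Who", "The Dark Lord")])

    doc = nlp("The Dark Lord, whom some call You-Know-Who, returned.")
    for match_id, start, end in matcher(doc):
        span = doc[start:end]  # character offsets come along for free
        print(span.text, span.start_char, span.end_char)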

All of this does seem excessive just to choose a book genre, though. I would imagine the number of books left after a simplistic clustering pass would be small enough to flip through, so I really don't understand the use case at all.

If you have very few books (a few thousand), then you can apply more fine-grained analyses, such as contextualized embedding methods, in a reasonable amount of computation. But if the point is to select a book, there's no real benefit, since simple two-second term-frequency methods would already narrow the choices down to only a few books.
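The two-second baseline, roughly (paths and query terms are made up):

    # Raw term frequency over a small library: enough to shortlist.
    from collections import Counter
    from pathlib import Path

    QUERY = {"detective", "murder", "inspector", "clue"}

    def score(path):
        words = Counter(path.read_text(errors="ignore").lower().split())
        return sum(words[t] for t in QUERY) / max(sum(words.values()), 1)

    for book in sorted(Path("library").glob("*.txt"), key=score,
                       reverse=True)[:5]:
        print(f"{score(book):.4%}  {book.name}")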

If you have billions of books, contextualized embeddings become quite expensive to produce and use (weeks or months of processing, petabytes of storage, etc.), so it's not really feasible for an individual. But the extra querying capability does help narrow the large set down.
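At the small end it can look as simple as this (model name illustrative; at scale you'd swap the brute-force comparison for an ANN index like FAISS):

    # Semantic search (pip install sentence-transformers).
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = ["The inspector studied the footprints.",
              "She watered the roses at dawn."]
    query = model.encode("a detective investigating a crime",
                         convert_to_tensor=True)
    corpus = model.encode(chunks, convert_to_tensor=True)
    for hit in util.semantic_search(query, corpus, top_k=2)[0]:
        print(f"{hit['score']:.2f}  {chunks[hit['corpus_id']]}")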


Perhaps a spaCy pipeline using GPT and Hugging Face?
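Something like this for the spaCy + Hugging Face half, at least; the transformer-backed pipeline wraps an HF model under the hood (GPT would have to come in via the separate spacy-llm extension, not sketched here):

    import spacy

    # pip install spacy[transformers]
    # python -m spacy download en_core_web_trf
    nlp = spacy.load("en_core_web_trf")
    doc = nlp("Hercule Poirot boarded the Orient Express in Istanbul.")
    print([(ent.text, ent.label_) for ent in doc.ents])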



