Best LLM/NLP for finding hapax legomena?

Daemon Silverstein · 2 months ago

Best LLM/NLP for finding hapax legomena?

jacksilver@lemmy.world · 2 months ago

I'm not the original commenter, but to add some more context: the words usually removed in traditional NLP applications are called “stop words”, and they are usually low-information words like “the”, “and”, and “but”.

However, LLMs don’t skip stop words; they actually need them to better understand the context of the sentence. That said, LLMs are not great for statistical analysis, and a simple word count would be more consistent and faster.
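As a rough illustration of the “simple word count” idea, here is a minimal Python sketch. The stop-word list is a tiny made-up sample (real lists are much longer), and `word_counts` is just a hypothetical helper name, not something from a particular library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "and", "but", "a", "of", "to", "in"}

# Lowercase words, allowing one internal apostrophe (e.g. "don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def word_counts(text, drop_stop_words=True):
    """Count word occurrences, optionally skipping stop words."""
    words = WORD_RE.findall(text.lower())
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return Counter(words)


sample = "The cat and the dog saw the bird, but the bird saw nothing."
counts = word_counts(sample)
print(counts.most_common())
```

The same input always gives the same counts, which is the consistency an LLM can’t guarantee for this kind of task.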

Daemon Silverstein · 2 months ago

They cut anything but the middle part of the frequency distribution: the most frequent words, as you said, as well as words that occurred just once across the entire dataset. At least that’s what Wikipedia states on Hapax legomenon:

In the fields of computational linguistics and natural language processing (NLP), esp. corpus linguistics and machine-learned NLP, it is common to disregard hapax legomena (and sometimes other infrequent words), as they are likely to have little value for computational techniques. This disregard has the added benefit of significantly reducing the memory use of an application, since, by Zipf’s law, many words are hapax legomena.[13]
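To make the quoted point a bit more concrete, here is a small Python sketch that lists the hapax legomena of a text and measures what share of the distinct words they make up. The function names and the sample sentence are only illustrative assumptions; on a real corpus the share will differ, though by Zipf’s law it tends to be large.

```python
import re
from collections import Counter

# Lowercase words, allowing one internal apostrophe.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def hapax_legomena(text):
    """Return the words that occur exactly once, sorted alphabetically."""
    counts = Counter(WORD_RE.findall(text.lower()))
    return sorted(w for w, n in counts.items() if n == 1)


def hapax_share(text):
    """Fraction of distinct words that occur exactly once."""
    counts = Counter(WORD_RE.findall(text.lower()))
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n == 1) / len(counts)


sample = "to be or not to be that is the question"
print(hapax_legomena(sample))
print(hapax_share(sample))
```

Dropping those words shrinks the vocabulary a program has to keep in memory, which is the saving the quote refers to.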

I thought of LLMs because they’re trained on really, really big and vast datasets, datasets that we normally can’t access, let alone process on our personal computers (mine is a Linux laptop with 12GB of RAM; it’s a good Core i5 machine, but not enough for really big datasets). I mean, there are lots of downloadable datasets on platforms such as Kaggle and Hugging Face, as well as internet archives of plain-text articles, books, BBS and so on, but I guess they’re just a tiny fraction of the datasets used to train OpenAI’s GPT, Meta’s Llama and Google’s Gemini. And I have a “gut feeling” that somewhere, somehow, those least-mentioned things (words, entire concepts, places, mythological figures and ancient deities, forgotten philosophical nomenclatures and so on) are lurking and waiting to be excavated from beneath these vast depths of datasets.

Maybe the ideal scenario would be having entire datasets and applying parsers and tokenizers to all of them (as the original comment suggested, parsers such as PEG or FLEX), then cutting out the slice of words/tokens that appeared just once across all of them. For this to work properly, several datasets are really needed. For example: I tried it with two versions of the Bible (because it’s an example of a long book readily available throughout the Web and ready to be parsed; I used both a JSON containing KJV verses and a JSON containing BBE verses) and I got around 3200 unique occurrences using the “poor man’s technique” I described in the other comment (Node.js + regex + a JS dictionary object to count occurrences, not the best of approaches). If I were to add more English versions/translations, maybe this would converge to more specific unique words.
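The counting approach described above could be sketched roughly like this, in Python rather than the Node.js I actually used. `hapaxes_across` is a hypothetical helper name, and the two “versions” are made-up toy lines, not real Bible text. Each source is read line by line, so only the running counts have to fit in RAM, not the datasets themselves.

```python
import re
from collections import Counter
from typing import Iterable, Set

# Lowercase words, allowing one internal apostrophe.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def hapaxes_across(sources: Iterable[Iterable[str]]) -> Set[str]:
    """Words that occur exactly once across ALL sources combined.

    Each source is any iterable of text lines (a list, an open file, ...).
    """
    totals = Counter()
    for lines in sources:
        for line in lines:
            totals.update(WORD_RE.findall(line.lower()))
    return {word for word, n in totals.items() if n == 1}


# Made-up toy "translations" standing in for two versions of one book.
version_a = ["the old king spoke softly", "the king slept"]
version_b = ["the aged king spoke gently", "the king rested"]

alone = hapaxes_across([version_a])
combined = hapaxes_across([version_a, version_b])
print(sorted(alone))
print(sorted(combined))
```

In the toy data, “spoke” is unique within the first version but stops being unique once the second version is added, which is the kind of convergence I’d expect from adding more translations. Open file handles can be passed as sources too, since iterating over a file yields its lines.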