First of all, let me explain what “hapax legomena” are: the term refers to words (and, by extension, concepts) that occur just once throughout an entire corpus of text. An example is the word “hebenon”, which occurs just once in Shakespeare’s Hamlet. Therefore, “hebenon” is a hapax legomenon. The “hapax legomenon” concept itself is a kind of hapax legomenon, IMO.
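
To make the definition concrete, here is a minimal sketch of how hapax legomena could be counted in a text. The function name `hapax_legomena`, the crude regex tokenizer and the toy sample sentence are my own assumptions for illustration; a real corpus would need proper tokenization.

```python
from collections import Counter
import re

def hapax_legomena(text):
    """Return, in sorted order, the words that occur exactly once in `text`."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count == 1)

# Toy sample, not a real corpus.
sample = "the juice of cursed hebenon in a vial and in the porches of my ears"
print(hapax_legomena(sample))
```

In this sample, “hebenon” is reported as a hapax while “the”, “of” and “in” are not, since each of those appears twice.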

According to Wikipedia, hapax legomena are generally discarded from NLP as they hold “little value for computational techniques”. By extension, the same applies to LLMs, I guess.
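
As far as I understand it, “discarding” usually means applying a minimum-frequency cutoff when the vocabulary is built, so anything seen only once is replaced by a placeholder. A minimal sketch of that idea follows; `build_vocab`, `encode`, `min_count` and the `<unk>` placeholder are names I made up for illustration, not the API of any particular library.

```python
from collections import Counter

def build_vocab(tokens, min_count=2):
    """Keep only tokens seen at least `min_count` times (hapaxes drop out at 2)."""
    counts = Counter(tokens)
    return {token for token, count in counts.items() if count >= min_count}

def encode(tokens, vocab, unk="<unk>"):
    """Replace every out-of-vocabulary token with the placeholder `unk`."""
    return [token if token in vocab else unk for token in tokens]

tokens = "a b a c b d".split()
vocab = build_vocab(tokens)      # "c" and "d" occur once, so they are dropped
print(encode(tokens, vocab))
```

With a cutoff of 2, every hapax collapses into the same placeholder, which is exactly the loss of information I’m asking about.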

While “hapax legomena” originally refers to words/tokens, I’m extending it to the entire concepts that these extremely obscure words describe.

I am a curious mind, actively seeking knowledge, and I’m constantly trying to learn a myriad of “random” topics across the many fields of human knowledge, especially rare/unknown concepts (that’s how I learnt about “hapax legomena”, for example). I use three LLMs on a daily basis (GPT-3, LLama and Gemini), expecting to learn about words, historical/mythological figures and concepts unknown to me, lost in the vastness of human knowledge. But I now know, according to Wikipedia, that general LLMs won’t point me to anything “obscure” enough.

This leads me to wonder: are there LLMs and/or NLP models/datasets that do not discard hapax? Are there LLMs that favor less frequent data over more frequent data?

  • Audalin@lemmy.world
    5 days ago

    My intuition:

    • There’re “genuine” instances of hapax legomena which probably have some semantic sense, e.g. a rare concept, a wordplay, an artistic invention, an ancient inside joke.
    • There’s various noise because somebody let their cat on the keyboard, because OCR software failed in one small spot, because somebody was copying data using a noisy channel without error correction, because somebody had a headache and couldn’t be bothered, because whatever.
    • Once a dataset is too big to be manually reviewed by experts, the amount of general noise is far far far larger than what you’re looking for. At the same time you can’t differentiate between the two using statistics alone. And if it was manually reviewed, the experts have probably published their findings, or at least told a few colleagues.
    • Transformers are VERY data-hungry. They need enormous datasets.

    So I don’t think this approach will help you a lot even for finding words and phrases. And everything I’ve said can be extended to semantic noise too, so your extended question also seems a hopeless endeavour when approached specifically with LLMs or big data analysis of text.