Best LLM/NLP for finding hapax legomena?

Daemon Silverstein · 2 months ago


hendrik@palaver.p3x.de · edited 2 months ago

I don’t think LLMs are the right tool for this. They’re built to find statistically likely correlations, patterns etc. That’s why they tend to give the correct answer (at least to simple questions) and why they produce legible output in the first place. And you want kind of the opposite of that: something that’s unlikely. But that goes against how they work. Of course you can ask them to surprise you and do something unexpected, and they’ll try to do something with that. I doubt it’ll achieve much, and I think it’s a fundamental limitation. A better tool would be traditional statistics, going through datasets and counting frequencies and you can find your hapax legomena precisely. And I mean some linguists probably already studied this and you could also read their publications…

I’ve also fooled around with LLMs and in my experience, they don’t perform well on uncommon things. If it’s barely in their dataset, they’ll struggle with the concept and fabricate something. I’ve never got any correct output from them in those cases.

(And that’s not the only fundamental limitation. They also couldn’t count the number of ‘r’s in “strawberry” until someone put the correct answer into their dataset. It takes them immense effort to learn maths and do calculations, because they’ve been built for words. The same goes for many other things. And your question is very similar to the “strawberry” thing. LLMs are known to fail in these cases because of how they work. At least currently.)

Daemon Silverstein · 2 months ago

I’ve also fooled around with LLMs and in my experience, they don’t perform well on uncommon things. If it’s barely in their dataset, they’ll struggle with the concept and fabricate something.

Me too. They hallucinate. And sometimes I learn things through these hallucinations, when I ask them about an uncommon thing. However, they won’t give me the uncommon thing; I’m usually the one who feeds it into the prompt for them to hallucinate about. Indeed, what I’m seeking is likely the exact opposite of what’s expected from LLMs: the extremely uncommon, close to complete hallucination and stochastic behavior.

A better tool would be traditional statistics, going through datasets and counting frequencies and you can find your hapax legomena precisely

I’m used to doing it in a laughable “poor man’s way” via Node.js + RegEx + a JS key-value dictionary object (whose key is the token and whose value is a number that increases each time this token is found during iteration): downloading some JSON/TXT/CSV dataset, reading and parsing it, then iterating over its tokens. It consumes a lot of memory, time and CPU (though I try to use a sleep/delay between every N iterations in order to free the CPU from high loads). I know there are better ways, and a temperature/param-adjustable LLM seemed to me like a better way, hoping that there’s some exception among the many publicly available LLMs that wouldn’t discard hapaxes.
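For reference, the dictionary-counting approach described above can be sketched in a few lines. This is only a minimal illustration in Python rather than Node.js, and it rests on assumptions that are not from the thread: the name `find_hapaxes` is hypothetical, a “token” is taken to be a lowercased run of letters (optionally with internal apostrophes), and the whole text is assumed to fit in memory.

```python
import re
from collections import Counter

# Assumption: a token is a run of Unicode letters, optionally joined by
# internal apostrophes (straight or curly). Real corpora may need more care.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def find_hapaxes(text):
    """Return the tokens that occur exactly once in `text`, sorted."""
    # Counter plays the role of the key-value dictionary: token -> count.
    counts = Counter(m.group(0).lower() for m in TOKEN_RE.finditer(text))
    return sorted(tok for tok, n in counts.items() if n == 1)


sample = "The cat saw the dog. The dog saw a honorificabilitudinitatibus."
print(find_hapaxes(sample))
# → ['a', 'cat', 'honorificabilitudinitatibus']
```

Note that a hapax is only a hapax relative to the corpus it was counted in, so the result depends entirely on the dataset and on the tokenization rule chosen.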

And I mean some linguists probably already studied this and you could also read their publications

The things I want to discover and learn weren’t/aren’t so well studied. I mean, human knowledge is a really vast universe of concepts, names and ideas, and some of them got buried by time (sometimes for centuries or millennia). Someone has to dig them up because they could hold value, knowledge value. One of my purposes with this inquiry into the unknown is to find these really forgotten ideas and concepts, things never studied before, and try to study them, learn about them. That’s how things were rediscovered throughout the entire human history: treasures are buried by the passage of time, a curious person digs them up, and humanity gets to know them once again. And a potential source of knowledge lurking in oblivion is big data, or big datasets.

hendrik@palaver.p3x.de · edited 2 months ago

Is going through text really that computationally expensive? I guess the English language only uses a few thousand words frequently, plus some names and rare words. I’d imagine you can comfortably keep them in RAM next to a counter variable for each bucket. That should allow going through practically any book on earth on a regular computer, if I’m not mistaken. I’m not sure if that’s I/O bound or CPU bound, but it shouldn’t be that hard. It’s something that gets taught in the first 3 semesters of computer science at university.
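To make the memory argument concrete, here is a rough sketch that reads a text line by line, so only the counter table (one entry per distinct word) stays in RAM, not the whole file. The function names `count_tokens` and `hapaxes_in_file`, the letters-only tokenization and the UTF-8 default are assumptions for illustration, not something from this thread.

```python
import re
from collections import Counter

# Assumption: a token is simply a run of Unicode letters, lowercased.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def count_tokens(lines):
    """Count tokens from any iterable of lines, one line at a time."""
    counts = Counter()
    for line in lines:
        counts.update(tok.lower() for tok in TOKEN_RE.findall(line))
    return counts


def hapaxes_in_file(path, encoding="utf-8"):
    """Return the sorted tokens that occur exactly once in the file."""
    # Iterating over the file handle yields lines lazily, so memory use
    # grows with the vocabulary size rather than with the file size.
    with open(path, encoding=encoding) as handle:
        counts = count_tokens(handle)
    return sorted(tok for tok, n in counts.items() if n == 1)
```

Calling `hapaxes_in_file("book.txt")` (with a hypothetical file name) would return the words that appear only once in that file. No sleep/delay between iterations should be needed for this kind of loop.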

Regarding the hallucinations: There are two use-cases. If you want some creative output that doesn’t need to be correct, you’re fine. You’ll be doing art like the people who manipulate electronic children’s toys and musical instruments to coerce some strange sounds out of them. I think that’s called “circuit bending”. You could also de-tune the parameters of an LLM, tinker around a bit and mess with the settings. Feed it random garbage prompts and see what it’ll do. I guess that’s an interesting art project.
But if you want something that has to do with factuality or needs to be correct, the hallucinations will get in your way. A “hapax legomenon” or unique word is a well-defined (objective) thing. It doesn’t really help if the LLM returns some pretend answer. It might look interesting at first, but it won’t be a unique word by the real-world definition. And that’s why I don’t think an LLM can help in this case.

I’ve tried asking them for the title and author of some children’s story which I heard at a first communion ceremony at church. I tried googling that, but all the church pages attribute the story to some random authors. So I tried asking Llama and ChatGPT, but they wholeheartedly make something up. I’ve tried like 20 times, but all they return is made up and false. So it doesn’t help. And I guess those more contemporary religious books just aren’t in any dataset. And the LLM will just do something random in this case. As it’ll do with everything that’s rare or missing in the dataset (and that it can’t infer).

Another thing I did (concerning language) is ask AI about idioms and figures of speech. Initially I did this because I’m not a native English speaker and figures of speech are very nice concepts. They can make your text more flowery or funny, and they always come with some interesting origin story. But you have to learn and memorize them for later use, because they vary widely from country to country. LLMs are really good at translating, and they indeed do well with that. And occasionally they’ll hallucinate some idiom, which can be hilarious. It won’t be something that fits the definition of the term. But it definitely sparks my creativity at times. At least it makes me laugh.

And writing prose and longer stories with AI also shows their preference for likely things. They always try to push my stories towards some lame and common story arcs and do super obvious plot twists. And lots of models (not all of them) always push towards resolving story arcs and a happy ending. And that’s difficult to impossible to overcome. It tends to get better with their size and “intelligence”, but I don’t think any of the current LLMs is close to being useful for that.

So summed up: You said in another comment that computational linguistics discards unique words because they have little value and additionally get in the way. There is probably some reason for that; computer scientists generally aren’t stupid and I suppose they tried, and put some thought into it. An LLM just can’t make sense of the concept. It needs more training data to learn something. A unique word will just mess with the weights and shift them in some random direction, likely degrading the LLM in some minuscule way. That’s why they discard them. And even if they didn’t, the LLM couldn’t memorize a word if it’s only there once. And if you put it into the dataset multiple times, an LLM could learn it… But it won’t be unique anymore. So I don’t see how it’d work. And also my experience tells me they generally don’t do well with rare things.