The creator of an open source project that scraped the internet to track the ever-changing popularity of words in human language says they are sunsetting the project because generative AI spam has poisoned the internet to the point where the project no longer has any utility.

Wordfreq is a program that tracked the ever-changing ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze how language habits shifted as slang and popular culture changed, and was a resource for academics who study such things. In a note on the project’s GitHub, creator Robyn Speer wrote that the project “will not be updated anymore.”

  • Lvxferre@mander.xyz
    3 months ago

    At least in theory you could still do NLP from online sources, but the sheer amount of work necessary to ensure that you got the bots out makes it unfeasible.

    “So I don’t want to work on anything that could be confused with generative AI, or that could benefit generative AI.”

    Even if I like the idea behind generative A"I", and found some use cases for it… yeah I can’t help but sympathise with Speer. Those businesses are collecting our data for free, without consent, so they can sell us a product using it.

    • T156@lemmy.world
      3 months ago

      “At least in theory you could still do NLP from online sources, but the sheer amount of work necessary to ensure that you got the bots out makes it unfeasible.”

      Not just that, but the growing number of sites that block scraping tools, or deploy countermeasures against them, also adds to the work.

      Several years ago, it would have been easy and cheap to noodle up a quick Twitter or Reddit bot to churn through posts and spit out the results on the other side. These days, you need to pay for that, and in some cases, pay quite a lot.

      X (formerly known as Twitter), for example, wants to charge $100/month, and Reddit wants $0.24 per 1,000 API calls.

      You can scrape, of course, but that risks getting you banned, assuming you don’t run into barriers first. The website formerly known as Twitter, for example, no longer lets you see parent tweets or replies unless you’re logged in.