• Devorlon@lemmy.zip
      2 days ago

      I’ve been researching this for uni and you’re not too far off. There are a bunch of benchmarks out there; LLMs are run against a set of questions and given a score based on their responses.

      The questions can be multiple choice or open ended. If they’re open ended, the response is marked by another LLM.
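
      Roughly what that scoring loop looks like (a minimal sketch; names like ask_model and ask_judge_model are placeholders, not any real benchmark's API):

```python
def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to the model being evaluated."""
    raise NotImplementedError

def ask_judge_model(prompt: str) -> str:
    """Placeholder: send `prompt` to the LLM acting as the marker."""
    raise NotImplementedError

def score_question(question: dict) -> float:
    answer = ask_model(question["prompt"])
    if question["type"] == "multiple_choice":
        # Multiple choice: exact match against the known correct option.
        return 1.0 if answer.strip().upper() == question["correct_option"] else 0.0
    # Open ended: a second LLM marks the answer against a reference.
    verdict = ask_judge_model(
        f"Question: {question['prompt']}\n"
        f"Reference answer: {question['reference']}\n"
        f"Candidate answer: {answer}\n"
        "Reply PASS or FAIL."
    )
    return 1.0 if "PASS" in verdict.upper() else 0.0

def run_benchmark(questions: list[dict]) -> float:
    # Benchmark score = average over all questions.
    return sum(score_question(q) for q in questions) / len(questions)
```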

      There are a couple of initiatives to create benchmarks with known answers that are updated frequently, so they don’t need to be marked by another LLM, but where the questions aren’t in the tested LLM’s training dataset. This matters because a lot of the apparent advancement of LLMs on these benchmarks is just the creators including the test questions and answers in the training data.
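
      The core idea of those fresh-question benchmarks, sketched below under the assumption that each question carries a publication date (the field names are made up, not any specific benchmark's schema):

```python
from datetime import date

def uncontaminated(questions: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only questions published after the model's training-data cutoff,
    so the questions and answers can't already be in its training set."""
    return [q for q in questions if q["published"] > training_cutoff]

questions = [
    {"prompt": "Old question", "published": date(2023, 1, 15)},
    {"prompt": "Newly written question", "published": date(2025, 1, 10)},
]

# A model with a 2024-06-01 training cutoff is only scored on the new question.
print(uncontaminated(questions, date(2024, 6, 1)))
```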