Reminds me of the stuff on this wiki page: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet
Idk anything about how the data is collected here or there, but it seems like going by the URL alone exaggerates how dominant English is.
Another one with much too much Other.
Especially when two of the named languages (German and French) are around 20th in L1 speakers.
I’m also interested in knowing how they decide what language a URL is in when lots of languages share words, even more so when you remove diacritics, as is common in URIs. For example, is something like
https://example.org/noticia/n-12345.html a Portuguese or Spanish URL?
I wonder that too. How do you separate cross-language homonyms and nonsense words in URLs?
For any individual page, I guess you base it on the page content if the URL language is ambiguous. Like anything with language, feels like it’d be fuzzy and hard to determine.
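To make the ambiguity concrete, here's a rough Python sketch. It is not how any real survey classifies URLs. The word lists and the function names (`strip_diacritics`, `url_tokens`, `candidate_languages`) are made up for illustration. It just shows that once the accent is dropped, Portuguese "notícia" and Spanish "noticia" become the same URL token, so the example URL matches both languages.

```python
import re
import unicodedata
from urllib.parse import urlparse

# Tiny hypothetical word lists, for illustration only.
# A real classifier would need far larger vocabularies or a trained model.
WORDS = {
    "pt": {"notícia", "notícias"},
    "es": {"noticia", "noticias"},
}


def strip_diacritics(text):
    """Remove combining marks, e.g. 'notícia' becomes 'noticia'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def url_tokens(url):
    """Split the URL path into lowercase ASCII word-like tokens."""
    path = urlparse(url).path.lower()
    # Keep only runs of a-z longer than two characters (drops 'n', digits, etc.).
    return [t for t in re.split(r"[^a-z]+", path) if len(t) > 2]


def candidate_languages(url):
    """Return every language whose diacritic-free vocabulary matches a token."""
    tokens = set(url_tokens(url))
    hits = set()
    for lang, words in WORDS.items():
        folded = {strip_diacritics(w) for w in words}
        if tokens & folded:
            hits.add(lang)
    return hits


print(candidate_languages("https://example.org/noticia/n-12345.html"))
```

With these toy lists the URL comes back as a candidate for both "pt" and "es", which is the point: the URL by itself can't settle it, so you'd have to fall back on the page content or some other signal.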
Not that I necessarily doubt someone has collected the data, just not sure how internet statistics are figured out.

