Outside of language models, this has already happened: take a look at apps such as Merlin Bird ID (identifies birds fairly well by sound and somewhat okay visually), WhoBird (identifies birds by sound), and Seek (visually identifies plants, fungi, insects, and animals). All of them work offline. IMO these are much better uses of ML than spammer-friendly text generation.
I don’t watch a lot of YouTube, but the DuckDuckGo browser (on Android and Windows, at least) has a Duck Player that removes all of the cruft around videos and is private afaik.