I do agree with your “averaging machine” argument. It makes a lot of sense given how LLMs are trained as essentially massive statistical models.
For image generation models I think a good analogy is to say it’s not drawing, but rather sculpting - it starts with a big block of white noise and then takes away all the parts that don’t look like the prompt. It iterates until the result is mostly stable (that is, until it can’t make the image look much more like the prompt than it already does). That’s why you can get radically different images from the same prompt - the starting block of white noise is different, so which parts of that noise look most prompt-like, and therefore get emphasized, will differ.
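The sculpting idea can be sketched in a toy way. This is not real diffusion math - there's no learned denoiser, no noise schedule - and every name here (`MODES`, `denoise_step`, `generate`) is made up for illustration. The stand-in for "images that match the prompt" is a couple of fixed target points, and each step nudges the noise toward whichever target it's already closest to. Different starting noise lands near different targets, which is the "same prompt, radically different image" effect:

```python
import random

# Toy stand-ins for the many possible images that match a prompt.
# A real model has a vast, implicit set of these; we use two points.
MODES = [[1.0, 1.0], [-1.0, -1.0]]

def nearest_mode(x):
    # Which "prompt-matching image" is this noise already most like?
    return min(MODES, key=lambda m: sum((xi - mi) ** 2 for xi, mi in zip(x, m)))

def denoise_step(x, strength=0.2):
    # One sculpting pass: move a fraction of the way toward the
    # nearest prompt-matching target (the parts that "look like the
    # prompt" get emphasized, the rest gets carved away).
    m = nearest_mode(x)
    return [xi + strength * (mi - xi) for xi, mi in zip(x, m)]

def generate(seed, steps=60):
    rng = random.Random(seed)
    # The "big block of white noise" - different per seed.
    x = [rng.gauss(0, 1) for _ in range(2)]
    for _ in range(steps):
        x = denoise_step(x)
    return x

# Same "prompt" (same MODES), different seeds: the result depends
# entirely on which target the initial noise happened to be nearer.
print(generate(seed=1))
print(generate(seed=2))
```

Because each step only pulls the point toward the mode it is already nearest to, the outcome is decided by the starting noise, not by the update rule - a crude analogue of why re-rolling the seed changes the image while the prompt stays fixed.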