I'm not trying to produce a specific picture; I'm trying to explore how much of the prompt is actually parsed and how phrases actually control the generation. So far the answer seems to be "not very much". Individual keywords make it likely that specific elements will show up in the picture, but those keywords are phrases of no more than three or four words, and they get collapsed to single words if keeping the whole phrase is even vaguely inconvenient.
That is, I'm more interested in whether there's any meaning to the generator in "a humanoid fox painting a picture of thing that thing things", and in how many of those qualifiers make it through the process.
Until recently, the two-word phrase "humanoid animal-name" was interpreted as just "animal-name", and "animal-name doing thing" more often than not produced a picture of "a human doing thing" with an animal in there somewhere.
So when "a picture of a humanoid ferret wearing a fedora and a trenchcoat, leaning against a lamppost while staking out a gin joint" gets the humanoid ferret, kind of the right clothes, and a lamppost... it's still not very good, but it's way better than it has been.
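If anyone wants to poke at this more systematically, here's a rough sketch of how the test prompts could be built. It only generates the prompt strings; it doesn't call any image generator, and the function names (`prompt_ladder`, `leave_one_out`) are made up for illustration. The idea is to add qualifiers one at a time, or drop one at a time, then feed each variant to whatever generator you use and see which qualifiers actually show up.

```python
def _join(subject, qualifiers):
    """Join a subject and its qualifiers, dropping a dangling trailing comma."""
    return " ".join([subject] + list(qualifiers)).rstrip(",")


def prompt_ladder(subject, qualifiers):
    """Return prompts that add one qualifier at a time to the subject."""
    return [_join(subject, qualifiers[:i]) for i in range(len(qualifiers) + 1)]


def leave_one_out(subject, qualifiers):
    """Return (dropped_qualifier, prompt) pairs, each omitting one qualifier."""
    variants = []
    for i, dropped in enumerate(qualifiers):
        kept = qualifiers[:i] + qualifiers[i + 1:]
        variants.append((dropped, _join(subject, kept)))
    return variants


if __name__ == "__main__":
    # The ferret prompt from above, split by hand into subject + qualifiers.
    subject = "a picture of a humanoid ferret"
    qualifiers = [
        "wearing a fedora and a trenchcoat,",
        "leaning against a lamppost",
        "while staking out a gin joint",
    ]
    for prompt in prompt_ladder(subject, qualifiers):
        print(prompt)
    for dropped, prompt in leave_one_out(subject, qualifiers):
        print(f"without {dropped!r}: {prompt}")
```

Where to split the prompt into qualifiers is a judgment call; the split above is just one guess at the units the generator might care about.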