For what I think you have in mind, I suspect it will eventually not be "image to image", but "<ai thing> to <ai thing> + image" — that's the only way the result could be remotely repeatable. That "ai thing" is probably the persistence you're talking about.
I think that thing will necessarily contain representations of dimension, behavior (physics/bones), and "style". Without the "ai thing", using only an image/text, the character would have to be impossibly well represented in the model for it to guess all of these things predictably. For example, what that character looks like from a side profile, or from behind. What if it's an alien, and its arms should always bend backwards? Could a text representation ever completely describe this, with good reproducibility? Probably not. But I assume some non-human representation would have a better chance.
As it stands, when a known character is required, the behavior of these models can be considered "destructive" to the input image more often than not. For this reason, I think artists are safe, for the time being. :)