I'm modestly surprised that those few angles give us enough data to build out a full 3D render, but I suppose I shouldn't be: that's tech that has been in high demand and well understood for years (those kinds of front-cut / side-cut images are what 3D artists use to block out their initial prototypes when they're working from real-life models).
DreamFusion doesn't directly build a 3D model from those generated images.
It starts with a completely random 3D voxel model, renders it from random angles, then asks Stable Diffusion how plausible each render is as an image of "X, side view".
It then sprinkles some noise on the rendering, has Stable Diffusion improve it a little, then adjusts the voxels to better produce that improved image (using differentiable rendering).
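That loop can be sketched in a few lines. This is a toy, not the paper's code: `params` stands in for the 3D model, `render` for the differentiable renderer (here just the identity), and `predict_noise` for Stable Diffusion's denoiser (here a hand-written function that pulls toward a fixed target image). All of those names are my own inventions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

target = np.full((8, 8), 0.7)     # what the toy "diffusion model" considers plausible
params = rng.normal(size=(8, 8))  # randomly initialized "3D model"

def render(p):
    # Identity stand-in for differentiable rendering: image == params,
    # so the renderer's Jacobian is the identity.
    return p

def predict_noise(noisy, sigma):
    # Toy denoiser: predicts the noise that would explain `noisy` if the
    # clean image were `target`. The real model conditions on a text prompt.
    return (noisy - target) / sigma

for step in range(200):
    image = render(params)
    sigma = rng.uniform(0.1, 1.0)       # random noise level each step
    eps = rng.normal(size=image.shape)  # "sprinkle some noise"
    noisy = image + sigma * eps
    eps_pred = predict_noise(noisy, sigma)
    # Score-distillation-style update: the gap between predicted and true
    # noise, pushed back through the (identity) renderer.
    params -= 0.05 * (eps_pred - eps)

print(np.abs(render(params) - target).max())  # params drift toward the target
```

The key trick is that the gradient flows through the renderer into the 3D parameters, so the 2D denoiser never has to know anything about geometry.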
Thank you for the clarification; I hadn't grokked the algorithm yet.
That's interesting for a couple of reasons. I can see why that works. It also implies that for closed objects, the voxel data in the interior (where no rendered view can see it) will remain pure noise, since there's no gradient signal to pick a color, or even the presence or absence of a voxel, there.
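A tiny compositing example makes the point concrete: a voxel hidden behind an opaque one contributes nothing to the pixel, so it gets exactly zero gradient and is never pulled away from its initial noise. (Two voxels on one ray, composited front to back; all the numbers are arbitrary.)

```python
def pixel_color(front_alpha, front_color, back_alpha, back_color):
    # Front-to-back alpha compositing for two voxels along one ray.
    return front_alpha * front_color + (1 - front_alpha) * back_alpha * back_color

def grad_wrt_back_color(front_alpha, eps=1e-6):
    # Finite-difference gradient of the pixel w.r.t. the hidden voxel's color.
    base = pixel_color(front_alpha, 0.3, 0.9, 0.5)
    bumped = pixel_color(front_alpha, 0.3, 0.9, 0.5 + eps)
    return (bumped - base) / eps

print(grad_wrt_back_color(front_alpha=1.0))  # opaque front voxel: zero gradient
print(grad_wrt_back_color(front_alpha=0.5))  # translucent front voxel: nonzero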
Given the way the language model works, these words could have multiple meanings. I wonder if training a form of textual inversion to represent these concepts more directly might improve the results. You could even try teaching it to represent more fine-grained degree adjustments.
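For anyone unfamiliar, the core of textual inversion is just: freeze the model, and gradient-descend on a single new token embedding until the frozen model maps it onto the concept you want. Here's a hedged toy version where the "frozen model" is a fixed random linear map rather than a real diffusion model; `W`, `target`, and `embedding` are all stand-ins I made up.

```python
import numpy as np

rng = np.random.default_rng(1)

W = rng.normal(size=(16, 8))          # frozen "model" weights, never updated
true_embedding = rng.normal(size=8)
target = W @ true_embedding           # output representing the concept to capture
embedding = rng.normal(size=8)        # the ONLY trainable parameter

for _ in range(2000):
    out = W @ embedding
    grad = W.T @ (out - target)       # gradient of 0.5 * ||W e - target||^2 w.r.t. e
    embedding -= 0.01 * grad          # update the embedding; W stays frozen

print(np.linalg.norm(W @ embedding - target))  # small residual: the new "token" now evokes the concept
```

The appeal for view-direction prompts like "side view" is the same as for styles: a learned embedding can land anywhere in embedding space, not just on existing words, so in principle it could encode a more precise (or more fine-grained) version of the concept than any natural-language phrase.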