I'm modestly surprised that those few angles give us enough data to build out a full 3D render, but I suppose I shouldn't be: that's tech that has been in high demand and well understood for years (those kinds of front-cut / side-cut images are what 3D artists use to block out their initial prototypes when they're working from real-life models).
DreamFusion doesn't directly build a 3D model from those generated images.
It starts with a completely random 3D voxel model, renders it from random angles, then asks Stable Diffusion how plausible each render is as an image of "X, side view".
It then sprinkles some noise on the rendering, has Stable Diffusion improve it a little, then adjusts the voxels to better produce that improved image (using differentiable rendering).
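That loop can be sketched in a few lines. This is a toy, not the paper's code: `params` stands in for the 3D model, `render` for the differentiable renderer (here just the identity), and `predict_noise` for Stable Diffusion's denoiser (here a hand-written function that pulls toward a fixed target image). All of those names are my own inventions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

target = np.full((8, 8), 0.7)     # what the toy "diffusion model" considers plausible
params = rng.normal(size=(8, 8))  # randomly initialized "3D model"

def render(p):
    # Identity stand-in for differentiable rendering: image == params,
    # so the renderer's Jacobian is the identity.
    return p

def predict_noise(noisy, sigma):
    # Toy denoiser: predicts the noise that would explain `noisy` if the
    # clean image were `target`. The real model conditions on a text prompt.
    return (noisy - target) / sigma

for step in range(200):
    image = render(params)
    sigma = rng.uniform(0.1, 1.0)       # random noise level each step
    eps = rng.normal(size=image.shape)  # "sprinkle some noise"
    noisy = image + sigma * eps
    eps_pred = predict_noise(noisy, sigma)
    # Score-distillation-style update: the gap between predicted and true
    # noise, pushed back through the (identity) renderer.
    params -= 0.05 * (eps_pred - eps)

print(np.abs(render(params) - target).max())  # params drift toward the target
```

The key trick is that the gradient flows through the renderer into the 3D parameters, so the 2D denoiser never has to know anything about geometry.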
Thank you for the clarification; I hadn't grokked the algorithm yet.
That's interesting for a couple of reasons. I can see why that works. It also implies that for closed objects, the voxel data in the interior (where no rendered view can see it) will remain pure noise, since there's no gradient signal to pick a color, or even the presence or absence of a voxel, there.
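A tiny compositing example makes the point concrete: a voxel hidden behind an opaque one contributes nothing to the pixel, so it gets exactly zero gradient and is never pulled away from its initial noise. (Two voxels on one ray, composited front to back; all the numbers are arbitrary.)

```python
def pixel_color(front_alpha, front_color, back_alpha, back_color):
    # Front-to-back alpha compositing for two voxels along one ray.
    return front_alpha * front_color + (1 - front_alpha) * back_alpha * back_color

def grad_wrt_back_color(front_alpha, eps=1e-6):
    # Finite-difference gradient of the pixel w.r.t. the hidden voxel's color.
    base = pixel_color(front_alpha, 0.3, 0.9, 0.5)
    bumped = pixel_color(front_alpha, 0.3, 0.9, 0.5 + eps)
    return (bumped - base) / eps

print(grad_wrt_back_color(front_alpha=1.0))  # opaque front voxel: zero gradient
print(grad_wrt_back_color(front_alpha=0.5))  # translucent front voxel: nonzero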
Given the way the language model works, these words could have multiple meanings. I wonder if training a form of textual inversion to represent these concepts more directly might improve the results. You could even try teaching it to represent more fine-grained degree adjustments.
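For anyone unfamiliar, the core of textual inversion is just: freeze the model, and gradient-descend on a single new token embedding until the frozen model maps it onto the concept you want. Here's a hedged toy version where the "frozen model" is a fixed random linear map rather than a real diffusion model; `W`, `target`, and `embedding` are all stand-ins I made up.

```python
import numpy as np

rng = np.random.default_rng(1)

W = rng.normal(size=(16, 8))          # frozen "model" weights, never updated
true_embedding = rng.normal(size=8)
target = W @ true_embedding           # output representing the concept to capture
embedding = rng.normal(size=8)        # the ONLY trainable parameter

for _ in range(2000):
    out = W @ embedding
    grad = W.T @ (out - target)       # gradient of 0.5 * ||W e - target||^2 w.r.t. e
    embedding -= 0.01 * grad          # update the embedding; W stays frozen

print(np.linalg.norm(W @ embedding - target))  # small residual: the new "token" now evokes the concept
```

The appeal for view-direction prompts like "side view" is the same as for styles: a learned embedding can land anywhere in embedding space, not just on existing words, so in principle it could encode a more precise (or more fine-grained) version of the concept than any natural-language phrase.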