EpiGRAF: Rethinking training of 3D GANs

NeurIPS 2022

Ivan Skorokhodov·Sergey Tulyakov·Yiqun Wang·Peter Wonka

KAUST · Snap Inc.

# abstract

A very recent trend in generative modeling is building 3D-aware generators from 2D image collections. To induce the 3D bias, such models typically rely on volumetric rendering, which is expensive to employ at high resolutions. During the past months, there appeared 10+ works (e.g., StyleNeRF, CIPS-3D, StyleSDF, EG3D, MVC-GAN, GIRAFFE-HD, VolumeGAN, etc.) that address this scaling issue by training a separate 2D decoder to upsample a low-resolution image (or a feature tensor) produced from a pure 3D generator. But this solution comes at a cost: not only does it break multi-view consistency (i.e. shape and texture change when the camera moves), but it also learns the geometry at low fidelity. In this work, we show that it is possible to obtain a high-resolution 3D generator with SotA image quality by following a completely different route of simply training the model patch-wise. We revisit and improve this optimization scheme in two ways. First, we design a location- and scale-aware discriminator to work on patches of different proportions and spatial positions. Second, we modify the patch sampling strategy based on an annealed beta distribution to stabilize training and accelerate the convergence. The resulting model, named EpiGRAF, is an efficient, high-resolution, pure 3D generator, and we test it on four datasets (two introduced in this work) at \(256^2\) and \(512^2\) resolutions. It obtains state-of-the-art image quality and high-fidelity geometry, and trains \({\approx} 2.5 \times\) faster than the upsampler-based counterparts.

Architecture

Our generator (left) is purely NeRF-based and uses the tri-plane backbone with the StyleGAN2 decoder (but without the 2D upsampler). Our discriminator (right) is also based on StyleGAN2, but is modulated by the patch location and scale parameters. We use the patch-wise optimization for training with our proposed Beta scale sampling, which allows our model to converge \(\times\)2-3 faster than the upsampler-based architectures despite the generator modeling geometry in full resolution.
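To make the patch-wise setup more concrete, here is a minimal NumPy sketch of how a patch scale and location could be drawn and then used to crop a real image for the discriminator. Everything specific in it is an illustrative assumption rather than something stated on this page: the `Beta(1, beta)` form, the linear annealing schedule, the values `beta_start`, `beta_end` and `min_scale`, the uniform offsets, and the nearest-neighbour resampling. The function names are hypothetical as well.

```python
import numpy as np


def annealed_beta(step, anneal_steps, beta_start=1e-3, beta_end=0.8):
    """Linearly interpolate the Beta parameter over training (assumed schedule)."""
    t = min(max(step / anneal_steps, 0.0), 1.0)
    return beta_start + t * (beta_end - beta_start)


def sample_patch_params(rng, step, anneal_steps, min_scale=0.25):
    """Draw a patch scale in [min_scale, 1] and a top-left offset in [0, 1 - scale].

    With a small Beta parameter, Beta(1, beta) puts most of its mass near 1,
    so early patches cover almost the whole image; later scales spread out.
    """
    u = rng.beta(1.0, annealed_beta(step, anneal_steps))
    scale = min_scale + (1.0 - min_scale) * u
    offset = rng.uniform(0.0, 1.0 - scale, size=2)  # (row, col), normalized
    return scale, offset


def extract_patch(image, scale, offset, patch_res):
    """Nearest-neighbour crop of a patch_res x patch_res patch from an H x W (x C) image."""
    h, w = image.shape[:2]
    # Pixel centers of the patch, in normalized image coordinates.
    centers = (np.arange(patch_res) + 0.5) / patch_res * scale
    rows = np.clip(np.floor((offset[0] + centers) * h), 0, h - 1).astype(int)
    cols = np.clip(np.floor((offset[1] + centers) * w), 0, w - 1).astype(int)
    return image[np.ix_(rows, cols)]
```

In a setup like this, the same `(scale, offset)` pair would be passed to the discriminator as conditioning, and the generator would render only the rays that fall inside the sampled patch.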

Geometry visualization on FFHQ 512x512

Our generator models the geometry at the full dataset resolution, which allows it to capture high-fidelity details. In these videos, one can observe that it models high-frequency details better than EG3D: 1) its hair structure is more detailed; 2) the eyes and mouth produced by EG3D are over-smoothed. The videos have a resolution of ~2048x1024, so we recommend viewing them in full-screen mode so as not to miss the high-frequency details.

To produce those visualizations, we followed EG3D's pipeline: 1) generated MRC files of the density field at the 512x512x512 volume resolution; 2) visualized in ChimeraX via the turn and record commands. For EG3D, we set the marching cubes level parameter to 10 (as recommended by the repo). We used step=1 (the minimal value) for surface resolution and full lighting for both methods. We used the provided original EG3D checkpoint, named ffhq512-128.pkl. Truncation was set to 0.7 for both generators.
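Step 1 of this pipeline amounts to evaluating the density field on a dense 3D grid. The sketch below shows one way to do that in chunks with NumPy, so that a 512x512x512 grid does not need all query points in memory at once. The names `sample_density_volume` and `toy_sphere_density`, the `[-bound, bound]` cube, and the chunk size are assumptions made for illustration; a trained generator's density query would take the place of the toy function, and writing the result to an MRC file is not shown.

```python
import numpy as np


def sample_density_volume(density_fn, resolution=512, bound=1.0, chunk=1 << 18):
    """Evaluate density_fn on a resolution^3 grid spanning [-bound, bound]^3.

    density_fn maps an (N, 3) float array of points to an (N,) array of densities.
    Points are processed in chunks to keep peak memory bounded.
    """
    axis = np.linspace(-bound, bound, resolution, dtype=np.float32)
    total = resolution ** 3
    volume = np.empty(total, dtype=np.float32)
    for start in range(0, total, chunk):
        flat_idx = np.arange(start, min(start + chunk, total))
        i, j, k = np.unravel_index(flat_idx, (resolution,) * 3)
        points = np.stack([axis[i], axis[j], axis[k]], axis=-1)
        volume[start:start + len(flat_idx)] = density_fn(points)
    return volume.reshape(resolution, resolution, resolution)


def toy_sphere_density(points, radius=0.5):
    """Stand-in density: 1 inside a sphere, 0 outside (hypothetical, for testing)."""
    return (np.linalg.norm(points, axis=-1) < radius).astype(np.float32)
```

A small call such as `sample_density_volume(toy_sphere_density, resolution=16)` is enough to check shapes before running at full resolution.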

Geometry visualization on Megascans

Our generator models the geometry at the full dataset resolution and is able to fit data where the global structure differs a lot between objects.

Curated samples on FFHQ

Random samples on Cats

Random samples on Megascans Plants

Random samples on Megascans Food

Latent interpolations on Megascans Plants

Latent interpolations on Megascans Food

Curated samples for background separation on FFHQ

In contrast to upsampler-based models, our generator is purely NeRF-based, so it can directly incorporate advancements from the NeRF literature. In this example, we simply copy-pasted the code from NeRF++ for background separation via the inverse sphere parametrization. For this experiment, we didn't use pose conditioning in the discriminator, which we otherwise use for FFHQ and Cats to avoid flat surfaces (without it, we have the same issues as EG3D and GRAM). We found that when background separation is enabled, the model learns to produce non-flat surfaces on its own, i.e. without direct guidance from the discriminator.
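For readers unfamiliar with the inverse sphere parametrization, the following is a minimal sketch of the coordinate mapping as it is commonly described for NeRF++: a background point at distance r > 1 from the origin is represented by its unit direction together with 1/r, which keeps all four coordinates bounded. This is an independent illustration, not the code used in the experiment, and the function name `inverse_sphere` is hypothetical.

```python
import numpy as np


def inverse_sphere(points):
    """Map (N, 3) points outside the unit sphere to (N, 4) bounded coordinates.

    Each point x with r = ||x|| > 1 becomes (x / r, 1 / r): a unit direction
    plus an inverse distance in (0, 1). Points with r <= 1 belong to the
    foreground model and are not meant to be passed here.
    """
    r = np.linalg.norm(points, axis=-1, keepdims=True)
    return np.concatenate([points / r, 1.0 / r], axis=-1)
```

For example, the point (3, 4, 0) lies at distance 5 and maps to (0.6, 0.8, 0, 0.2).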

# bibtex

@inproceedings{epigraf,
    title={Epi{GRAF}: Rethinking training of 3D {GAN}s},
    author={Ivan Skorokhodov and Sergey Tulyakov and Yiqun Wang and Peter Wonka},
    booktitle={Advances in Neural Information Processing Systems},
    editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
    year={2022},
    url={https://openreview.net/forum?id=TTM7iEFOTzJ}
}
