Using NeRF Depth Smoothness for Improving Human Model Generation


Title

This project aims to improve the quality of 3D human image generation using recent NeRF optimizations.

Principal Author: Anubhav Garg

Abstract

The field of 3D human generation in computer graphics focuses on creating realistic and detailed three-dimensional models of human bodies. This involves techniques such as computer-aided design (CAD), digital sculpting, and generative models to produce virtual representations of human anatomy, including geometry, texture, and appearance. The applications of 3D human generation are diverse, ranging from the entertainment industry (video games, movies, and virtual reality experiences) to fields such as medical simulation, virtual clothing fitting, and biomechanics research. Advanced techniques leverage deep learning, neural rendering, and GANs to achieve high-quality and diverse human models with accurate geometry, realistic textures, and lifelike animations. Recently, EVA3D [1] used a NeRF [2] model to generate state-of-the-art human body samples. However, limitations remain, such as poorly differentiated edges and a lack of color diversity. The authors of [3] propose a depth smoothness regularization that can be used to smooth edges. In this project, we aim to apply this regularization to improve human body generation.

Builds on

This page builds on models such as NeRF, GANs, and the SMPL body model, along with related generative AI concepts.

Related Pages

Generative Adversarial Networks have been used for many years for image generation, and Conditional GANs for image translation. 3D avatar generation also appears in works such as AvatarCLIP.

Content

Background

Neural Radiance Fields (NeRF) is a model for generating novel views of an object or scene from many input views taken from different angles. In NeRF, the volumetric representation of the scene is optimized as a vector-valued function of continuous 5D coordinates, consisting of a 3D location and a 2D viewing direction. This representation is parameterized as a fully connected deep network that outputs volume density and view-dependent emitted RGB radiance for each 5D coordinate. Techniques from volume rendering are used to composite these values along a camera ray, allowing any pixel to be rendered. The rendering process is fully differentiable, enabling optimization of the scene representation by minimizing the rendering error over all camera rays from a collection of standard RGB images.
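To make the rendering step concrete, below is a minimal NumPy sketch of the volume-rendering quadrature described above; the function name and toy inputs are illustrative, not taken from any NeRF codebase. It composites an expected color and depth from per-sample densities and radiances along a single ray.

    import numpy as np

    def composite_along_ray(rgb, sigma, t_vals):
        """Volume-rendering quadrature along one camera ray, as in NeRF.

        rgb:    (N, 3) emitted RGB radiance at N sample points on the ray
        sigma:  (N,)   volume densities at those points
        t_vals: (N,)   depths of the sample points along the ray
        """
        # Distances between adjacent samples; the last is padded with a large value.
        deltas = np.append(t_vals[1:] - t_vals[:-1], 1e10)
        # Per-segment opacity: alpha_i = 1 - exp(-sigma_i * delta_i).
        alpha = 1.0 - np.exp(-sigma * deltas)
        # Transmittance: probability that light reaches sample i unoccluded.
        trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1] + 1e-10))
        weights = alpha * trans
        color = (weights[:, None] * rgb).sum(axis=0)  # rendered pixel color
        depth = (weights * t_vals).sum()              # expected termination depth
        return color, depth

    # Toy usage: 64 random samples along one ray.
    rng = np.random.default_rng(0)
    color, depth = composite_along_ray(rng.random((64, 3)), rng.random(64),
                                       np.linspace(2.0, 6.0, 64))

The expected depth returned here is exactly the quantity that the depth smoothness regularizer of [3] operates on, which is what makes that regularizer straightforward to attach to any NeRF-based pipeline.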

Since NeRF was introduced, 3D synthetic data generation has advanced rapidly. The original NeRF model has been modified extensively, as it was slow to train and required roughly 30-100 images of a single scene or object before novel views could be rendered from arbitrary angles.

GANs are also used for 3D generation. In contrast to 2D GANs, 3D GANs combine a 3D-structure-aware inductive bias in the generator architecture with a NeRF to ensure view-consistent results. This inductive bias can be represented using explicit voxel grids or neural implicit representations, which have been successful in single-scene "overfitting" scenarios. However, these representations are not suitable for training high-resolution 3D GANs because they are inefficient in memory and computation: state-of-the-art neural volume rendering at high resolution is computationally infeasible, since training a 3D GAN requires rendering tens of millions of images. EVA3D addresses the problem of generating 3D-aware human models from 2D images; to achieve high-resolution generation, it proposes composing multiple NeRFs, as sketched below.
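The compositional idea can be sketched as follows, assuming each body part owns a small radiance field restricted to a bounding box. The window function and all names here are illustrative simplifications, not EVA3D's actual implementation.

    import numpy as np

    def query_composed(parts, x):
        """Blend predictions from several part-level radiance fields at points x.

        parts: list of (bbox_min, bbox_max, field_fn) triples, one per body part;
               field_fn maps (M, 3) points to ((M, 3) rgb, (M,) sigma).
        x:     (M, 3) query points.
        """
        rgb_acc = np.zeros((len(x), 3))
        sigma_acc = np.zeros(len(x))
        w_acc = np.full(len(x), 1e-8)  # avoids division by zero outside all boxes
        for lo, hi, field_fn in parts:
            inside = np.all((x >= lo) & (x <= hi), axis=1)
            if not inside.any():
                continue
            rgb, sigma = field_fn(x[inside])
            # A smooth window whose weight falls off toward the box boundary,
            # so overlapping parts blend instead of creating hard seams.
            u = (x[inside] - lo) / (hi - lo)           # normalized box coordinates
            w = np.prod(np.sin(np.pi * u), axis=1)
            rgb_acc[inside] += w[:, None] * rgb
            sigma_acc[inside] += w * sigma
            w_acc[inside] += w
        return rgb_acc / w_acc[:, None], sigma_acc / w_acc

Restricting each sub-network to a small box is what makes high-resolution rendering affordable: most query points fall inside only one or two boxes, so only a fraction of the networks need to be evaluated per sample.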

EVA3D

To generate high-quality 3D human models, two key factors must be addressed: 1) the representation of 3D humans, and 2) the training strategies for the generative networks. Many existing 3D-aware GANs rely on static volume modeling, which fails to adequately capture the articulated nature of human bodies. EVA3D instead proposes an articulated 3D human representation that allows explicit control over the pose and shape of the human model: the human is modeled in its canonical pose in canonical space, while arbitrary poses and shapes can be rendered in observation space (a simplified sketch of this mapping follows). Efficiency of the representation is crucial for high-quality 3D human generation, as previous methods have struggled with low resolution due to inefficient human representations.
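The observation-to-canonical mapping is typically implemented by inverting linear blend skinning with the SMPL model's skinning weights. Below is a minimal sketch of that standard approximation; the names and the blending scheme are illustrative rather than EVA3D's exact formulation.

    import numpy as np

    def observation_to_canonical(x_obs, joint_R, joint_t, skin_w):
        """Warp posed (observation-space) points back to the canonical pose.

        x_obs:   (M, 3)    query points sampled along rays in observation space
        joint_R: (J, 3, 3) per-joint rotations of the current pose
        joint_t: (J, 3)    per-joint translations
        skin_w:  (M, J)    skinning weights of each point w.r.t. each joint
        """
        # Blend the per-joint rigid transforms with the skinning weights...
        R = np.einsum('mj,jab->mab', skin_w, joint_R)  # (M, 3, 3)
        t = skin_w @ joint_t                           # (M, 3)
        # ...then apply the inverse blended transform to undo the pose.
        return np.einsum('mab,mb->ma', np.linalg.inv(R), x_obs - t)

    # Toy usage: two joints, four points with normalized random weights.
    rng = np.random.default_rng(0)
    w = rng.random((4, 2)); w /= w.sum(axis=1, keepdims=True)
    x_can = observation_to_canonical(rng.random((4, 3)),
                                     np.stack([np.eye(3)] * 2),
                                     rng.random((2, 3)), w)

Querying the canonical-space radiance fields at the warped points lets a single learned representation render arbitrary poses and shapes.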

In addition to the representation, training strategies have a significant impact on the performance of 3D human generative models. The choice of dataset, for instance, plays a crucial role. Fashion datasets, such as DeepFashion [4], which closely align with real-world human image distributions, are often preferred over datasets with limited diversity. However, fashion datasets may have limitations in terms of human poses and viewing angles, which can pose challenges in unsupervised learning of 3D GANs and novel view/pose synthesis. Hence, careful consideration of training strategies is necessary to address these issues.
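Finally, the depth smoothness regularizer of RegNeRF [3], which this project proposes to add to EVA3D's training objective, penalizes depth differences between neighboring rays of a rendered patch. The sketch below simplifies the patch sampling and loss weighting relative to the paper.

    import numpy as np

    def depth_smoothness_loss(depth_patch):
        """Depth smoothness regularizer in the spirit of RegNeRF [3].

        depth_patch: (S, S) expected ray-termination depths rendered for a
        small patch of rays. Squared depth differences between horizontally
        and vertically adjacent pixels are penalized, encouraging locally
        smooth geometry and cleaner silhouette edges.
        """
        dx = depth_patch[:, 1:] - depth_patch[:, :-1]  # horizontal neighbors
        dy = depth_patch[1:, :] - depth_patch[:-1, :]  # vertical neighbors
        return (dx ** 2).sum() + (dy ** 2).sum()

    # Toy usage on an 8x8 rendered depth patch.
    patch = np.linspace(2.0, 3.0, 64).reshape(8, 8)
    loss = depth_smoothness_loss(patch)

In training, this term would be weighted by a hyperparameter (a choice of this project rather than of either paper) and added to the adversarial loss, with depth patches rendered from randomly sampled viewpoints.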

Annotated Bibliography

[1] EVA3D: https://arxiv.org/pdf/2210.04888.pdf

[2] NeRF: https://arxiv.org/abs/2003.08934

[3] RegNeRF: https://m-niemeyer.github.io/regnerf/index.html

[4] DeepFashion: https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html



Some rights reserved
Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0