A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses
Caoyuan Ma, Yu-Lun Liu, Zhixiang Wang, Wu Liu, Xinchen Liu, Zheng Wang
Wuhan University, JD Explore Academy, National Yang Ming Chiao Tung University, The University of Tokyo, National Institute of Informatics

Teaser

Overview. We synthesize images of performers in diverse poses in a simple yet effective way. (a) Novel pose: to test the ability to synthesize images of diverse poses, we input a simple-action video and animate the performer with novel poses. (b) Few shot: to test the ability to avoid overfitting, we use only a few input frames to synthesize novel-pose images. (c) Compared to similar methods, HumanNeRF-SE uses less than 1% of the learnable parameters and 1/20 of the training time, and achieves better results in few-shot experiments. †LPIPS = 1,000×LPIPS.

Abstract

We present HumanNeRF-SE, which synthesizes diverse novel-pose images from simple input. Previous HumanNeRF studies require large neural networks to fit the human appearance and prior knowledge, and subsequent methods build on this approach with incremental improvements. Instead, we rebuild this approach, combining explicit and implicit human representations with both general and specific mapping processes. Our key insight is that an explicit shape can filter the information used to fit the implicit representation, and that a frozen general mapping combined with a point-specific mapping effectively avoids overfitting and improves pose generalization. This combined explicit-implicit architecture is extremely effective: our model can synthesize images under arbitrary poses from few-shot input, and it renders images 15 times faster through reduced computational complexity, without using any existing acceleration modules. Compared to state-of-the-art HumanNeRF studies, HumanNeRF-SE achieves better performance with fewer learnable parameters and less training time (see Figure 1).

Video (coming soon)

Structure

Framework of HumanNeRF-SE. (a) We first voxelize the observation space into a voxel volume V. For a voxel containing vertices, its value is the number of vertices (as one occupancy channel) together with the corresponding SMPL skinning weights. (b) We perform a channel-by-channel convolution on the volume. All sampling points query the convolved volume to obtain their spatial-aware features, and points with zero occupancy are filtered out. (c) We query the nearest skinning weights of the remaining points in the volume and use them for rigid deformation. The spatial-aware features are fed to a neural network that corrects the rigid result and yields the final point coordinates in the canonical space. (d) The sampling points in the canonical space obtain their colors and densities through the NeRF network; the densities of filtered points are forced to be zero.
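To make steps (a)-(d) concrete, here is a minimal PyTorch sketch of the pipeline, assuming SMPL's 24 joints and 6,890 vertices, a 64³ grid, and points pre-normalized to [-1, 1]. All function names, shapes, and the weight-averaging rule are our illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

RES, J = 64, 24  # assumed grid resolution; SMPL has 24 joints

def voxelize(verts, skin_w):
    """(a) Build a (1+J)-channel volume: vertex count (occupancy) + SMPL weights.

    Averaging weights over vertices in a voxel is an assumption for this sketch.
    """
    vol = torch.zeros(1 + J, RES, RES, RES)
    idx = ((verts + 1) / 2 * (RES - 1)).round().long().clamp(0, RES - 1)
    for (x, y, z), w in zip(idx.tolist(), skin_w):
        vol[0, z, y, x] += 1.0   # (D, H, W) layout to match grid_sample's (x, y, z) grids
        vol[1:, z, y, x] += w
    vol[1:] /= vol[0].clamp(min=1.0)
    return vol

def query(vol, pts, mode):
    """Sample a (C, D, H, W) volume at normalized points pts in [-1, 1]^3."""
    grid = pts.view(1, 1, 1, -1, 3)
    out = F.grid_sample(vol.unsqueeze(0), grid, mode=mode, align_corners=True)
    return out.view(vol.shape[0], -1).t()  # (N, C)

verts = torch.rand(6890, 3) * 2 - 1   # SMPL vertices in observation space
skin_w = torch.rand(6890, J)          # per-vertex SMPL skinning weights
vol = voxelize(verts, skin_w)

# (b) Depthwise (channel-by-channel) 3D convolution spreads each channel into a
# spatially aware feature volume; sampling points then query it.
conv = torch.nn.Conv3d(1 + J, 1 + J, 3, padding=1, groups=1 + J, bias=False)
feat_vol = conv(vol.unsqueeze(0)).squeeze(0)

pts = torch.rand(4096, 3) * 2 - 1                  # ray sampling points
occ = query(vol[:1], pts, mode='nearest')[:, 0]    # raw occupancy at each point
keep = occ > 0                                     # filter empty-space samples
feats = query(feat_vol, pts, mode='bilinear')[keep]

# (c) Nearest-neighbor weight lookup drives the rigid (LBS-style) deformation;
# a small MLP on the spatial feature would correct the rigid result.
w = query(vol[1:], pts[keep], mode='nearest')      # (M, J) skinning weights
G = torch.eye(4).expand(J, 4, 4)                   # per-bone transforms (identity placeholder)
T = torch.einsum('mj,jab->mab', w, G)              # blended transform per point
x_h = F.pad(pts[keep], (0, 1), value=1.0)          # homogeneous coordinates
x_can = torch.einsum('mab,mb->ma', T, x_h)[:, :3]  # canonical-space coordinates
# x_can = x_can + correction_mlp(feats)            # hypothetical learned refinement

# (d) x_can would be fed to the NeRF network for color and density; points
# outside `keep` get density forced to zero, so empty space never renders.
```

Note the computational payoff implied by the caption: because filtering happens before any network is evaluated, only occupied-space samples reach the deformation MLP and the NeRF network.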

Demo

We use ROMP to estimate SMPL parameters as a prior from videos found online. Because of the fast motion in dance videos, the estimated SMPL is jittery; our method still clearly outperforms the baseline.
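One quick way to see the jitter mentioned above is to compare SMPL pose parameters between consecutive frames. In this sketch, `smpl_poses.npy` is a hypothetical (T, 72) array of per-frame axis-angle body poses (24 joints × 3), e.g. exported from a ROMP run; the file name and layout are assumptions.

```python
import numpy as np

poses = np.load('smpl_poses.npy')                        # (T, 72) axis-angle poses, assumed layout
delta = np.linalg.norm(np.diff(poses, axis=0), axis=1)   # frame-to-frame pose change
print(f'mean pose change: {delta.mean():.3f} rad')
print(f'max  pose change: {delta.max():.3f} rad')
# Spikes in `delta` mark the jittery estimates caused by fast dance motion.
```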

Cite us!

@inproceedings{ma2024humannerfse,
  title={HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses},
  author={Ma, Caoyuan and Liu, Yu-Lun and Wang, Zhixiang and Liu, Wu and Liu, Xinchen and Wang, Zheng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}

Acknowledgements

This project also received help from Guopeng Li, Runqi Wang, Ziqiao Zhou, Xianzheng Ma, and Lixiong Chen. Thank you all.

The website template was borrowed from Michaël Gharbi.