HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses
Caoyuan Ma, Yu-Lun Liu, Zhixiang Wang, Wu Liu, Xinchen Liu, Zheng Wang
Wuhan University, JD Explore Academy, National Yang Ming Chiao Tung University, The University of Tokyo, National Institute of Informatics

Teaser

HumanNeRF-SE efficiently synthesizes images of performers in diverse poses, blending simplicity with effectiveness. It outperforms previous methods by synthesizing a wider range of novel poses (a), generalizes without overfitting when given limited input frames (b), and uses fewer than 1% of the learnable parameters, reducing training time by 95% while delivering superior results in the few-shot scenario (c). Reported LPIPS values are LPIPS × 1,000.

Abstract

We present HumanNeRF-SE, a simple yet effective method that synthesizes diverse novel-pose images from simple input. Previous HumanNeRF works require a large number of optimizable parameters to fit the human images. Instead, we reload these approaches by combining explicit and implicit human representations to design both a generalized rigid deformation and a specific non-rigid deformation. Our key insight is twofold: an explicit shape can reduce the number of sampling points used to fit the implicit representation, and frozen blending weights from SMPL, which construct the generalized rigid deformation, effectively avoid overfitting and improve pose generalization. Our architecture, involving both explicit and implicit representations, is simple yet effective. Experiments demonstrate that our model can synthesize images under arbitrary poses from few-shot input and speeds up image synthesis by 15× through reduced computational complexity, without using any existing acceleration modules. Compared to state-of-the-art HumanNeRF studies, HumanNeRF-SE achieves better performance with fewer learnable parameters and less training time.
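To make the key insight concrete, the sketch below shows the generalized rigid deformation as inverse linear blend skinning (LBS) with frozen SMPL blending weights. This is a minimal PyTorch illustration, not the released implementation; the names (`skin_weights`, `bone_transforms`) and tensor layouts are assumptions.

import torch
import torch.nn.functional as F

def rigid_deform(points, skin_weights, bone_transforms):
    """Warp observation-space points to the canonical pose by inverse LBS.

    points:          (N, 3) sampled points in observation space
    skin_weights:    (N, J) frozen SMPL blending weights (never optimized)
    bone_transforms: (J, 4, 4) canonical-to-observation bone transforms
    """
    # Blend the per-bone transforms with the frozen SMPL weights.
    blended = torch.einsum('nj,jab->nab', skin_weights, bone_transforms)
    # Invert the blended transform: observation -> canonical.
    points_h = F.pad(points, (0, 1), value=1.0)  # homogeneous coordinates
    canonical = torch.einsum('nab,nb->na', torch.linalg.inv(blended), points_h)
    return canonical[:, :3]

Because the blending weights come from SMPL and stay frozen, this step introduces no pose-specific learnable parameters; only a small non-rigid correction network is trained, which is what keeps the model from overfitting to the training poses.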

Video (coming soon)

Structure

Framework of HumanNeRF-SE. (a) We first voxelize the observation space into a voxel volume $\mathbf{V}$. For each voxel containing vertices, we store the number of vertices (as an occupancy channel) together with the corresponding SMPL blending weights. (b) We apply a channel-by-channel convolution to the volume. Every sampling point is queried in the convolved volume to obtain its spatial-aware feature; points with zero occupancy are filtered out. (c) For the remaining points, we query the nearest blending weights in the volume and use them for rigid deformation. The spatial-aware features feed a neural network that corrects the rigid result, yielding the final point coordinates in canonical space. The canonical-space sampling points then obtain their colors and densities from the NeRF network; the densities of filtered points are forced to zero. A minimal sketch of these stages follows.
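The PyTorch sketch below illustrates stages (a) and (b) of the caption, with stage (c) summarized in comments. It is an illustration under stated assumptions, not the official code: the resolution `RES`, joint count `J`, the 3×3×3 kernel, and the normalization of coordinates to [-1, 1]³ are all assumptions.

import torch
import torch.nn.functional as F

RES, J = 64, 24  # assumed voxel resolution and SMPL joint count

def voxelize(vertices, smpl_weights):
    """(a) Build volume V: channel 0 counts vertices per voxel (occupancy),
    channels 1..J accumulate their SMPL blending weights.
    Vertex coordinates are assumed normalized to [-1, 1]^3."""
    vol = torch.zeros(1 + J, RES, RES, RES)
    idx = ((vertices + 1) / 2 * (RES - 1)).round().long().clamp(0, RES - 1)
    for (x, y, z), w in zip(idx, smpl_weights):
        vol[0, z, y, x] += 1.0   # occupancy channel
        vol[1:, z, y, x] += w    # SMPL weight channels
    return vol

def spatial_features(vol, pts, conv):
    """(b) Channel-by-channel (depthwise) convolution over the volume,
    then trilinear lookup of every sampling point; points that land in
    zero-occupancy space are filtered out."""
    smoothed = conv(vol.unsqueeze(0))               # (1, 1+J, RES, RES, RES)
    grid = pts.view(1, -1, 1, 1, 3)                 # (x, y, z) in [-1, 1]
    feats = F.grid_sample(smoothed, grid, align_corners=True)
    feats = feats.view(1 + J, -1).t()               # (N, 1+J) spatial-aware features
    keep = feats[:, 0] > 0                          # occupancy filter
    return feats, keep

# One filter per channel, i.e. a depthwise convolution.
conv = torch.nn.Conv3d(1 + J, 1 + J, kernel_size=3, padding=1,
                       groups=1 + J, bias=False)

# (c) The kept points take the nearest frozen SMPL weights from `vol` and are
# rigidly warped to canonical space (see the LBS sketch above); the features
# drive a small MLP that corrects the rigid result before NeRF evaluates color
# and density, and the filtered points are assigned zero density.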

Demo

We use ROMP to estimate SMPL parameters as a prior from web videos. Because of the fast motion in dance videos, the estimated SMPL is jittery; even so, our method remains clearly superior to the baseline.
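The jitter noted above can be reduced with simple temporal smoothing of the estimated pose sequence before animation. The NumPy sketch below is a hypothetical moving-average filter over per-frame SMPL pose parameters; it is not part of our method, and a more careful pipeline would smooth per-joint rotations in quaternion space rather than raw axis-angle values.

import numpy as np

def smooth_poses(poses, window=5):
    """Low-pass filter jittery per-frame SMPL pose estimates.

    poses:  (T, 72) axis-angle pose parameters, one row per video frame
    window: odd moving-average width, in frames
    """
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(poses, ((pad, pad), (0, 0)), mode='edge')  # repeat ends
    out = np.empty_like(poses, dtype=float)
    for d in range(poses.shape[1]):  # filter each parameter independently
        out[:, d] = np.convolve(padded[:, d], kernel, mode='valid')
    return out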

Cite us!

@inproceedings{ma2024humannerfse,
  title={HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses},
  author={Ma, Caoyuan and Liu, Yu-Lun and Wang, Zhixiang and Liu, Wu and Liu, Xinchen and Wang, Zheng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={1460--1470},
  year={2024}
}

Acknowledgements

This project also received help from Guopeng Li, Runqi Wang, Ziqiao Zhou, Xianzheng Ma, and Lixiong Chen. Thank you all.

The website template was borrowed from Michaël Gharbi.