MultiGO++: Monocular 3D Clothed Human Reconstruction via Geometry-Texture Collaboration

Nanjie Yao*, Gangjian Zhang*, Wenhao Shen, Jian Shu, Yu Feng, and Hao Wang

Our framework integrates three core components. Texturally, we employ a multi-source texture synthesis strategy to generate diverse synthetic data for training, along with a lightweight texture encoder for effective feature extraction. Geometrically, we introduce a Region-aware Shape Extraction Module that improves human shape estimation through part-based shape feature extraction and interaction, coupled with a Fourier Geometry Encoder for efficient geometric learning. Systematically, we propose a Dual Reconstruction U-Net that uses feature residuals to balance geometric and texture features, enabling mutual enhancement of cross-modal features throughout reconstruction. Additionally, to improve 3D mesh quality and extraction efficiency, we design a Gaussian-enhanced remeshing strategy supervised by the normals of a generated Gaussian avatar.
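To make the "feature residuals" idea concrete, below is a minimal PyTorch sketch (not the authors' code) of one stage of a dual-branch block in which a geometry branch and a texture branch each add a projected residual of the other branch's features. All module names, channel counts, and the exchange order are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualStage(nn.Module):
    """One hypothetical stage of a dual-branch reconstruction block: each branch
    processes its own modality and is refined by a residual from the other branch."""
    def __init__(self, channels: int):
        super().__init__()
        self.geo_block = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU())
        self.tex_block = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU())
        # 1x1 projections for the cross-modal residuals (assumed design choice)
        self.tex_to_geo = nn.Conv2d(channels, channels, 1)
        self.geo_to_tex = nn.Conv2d(channels, channels, 1)

    def forward(self, f_geo: torch.Tensor, f_tex: torch.Tensor):
        g = self.geo_block(f_geo)
        t = self.tex_block(f_tex)
        # Residual exchange: each modality is corrected by the other's features.
        g = g + self.tex_to_geo(t)
        t = t + self.geo_to_tex(g)
        return g, t

if __name__ == "__main__":
    stage = DualStage(channels=64)
    f_geo = torch.randn(1, 64, 32, 32)  # geometry features (illustrative shape)
    f_tex = torch.randn(1, 64, 32, 32)  # texture features (illustrative shape)
    g, t = stage(f_geo, f_tex)
    print(g.shape, t.shape)
```

The residual form keeps each branch's own signal dominant while letting the other modality act as a correction, which is one plausible way to "balance" geometric and texture features as described above.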

Abstract

Monocular 3D clothed human reconstruction aims to generate a complete and realistic textured 3D avatar from a single image. Existing methods are commonly trained under multi-view supervision with annotated geometric priors; during inference, these priors are estimated from the monocular input by a pre-trained network. Such methods suffer from three key limitations: texturally, the unavailability of training data; geometrically, inaccurate external priors; and systematically, biased single-modality supervision, all of which lead to suboptimal reconstruction. To address these issues, we propose a novel reconstruction framework, named MultiGO++, which consists of three core parts: (1) a multi-source texture synthesis strategy that constructs 15,000+ textured 3D human scans to improve texture estimation in challenging scenarios; (2) a region-aware shape extraction module that extracts features of each body region and models their interactions to obtain geometry information, together with a Fourier geometry encoder that mitigates the modality gap for effective geometry learning; (3) a dual reconstruction U-Net that leverages cross-modal collaborative features to refine and generate high-fidelity textured 3D human meshes. Extensive experiments on two benchmarks and many in-the-wild cases show the superiority of our method over state-of-the-art approaches.
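As a rough illustration of what a "Fourier geometry encoder" could compute, the sketch below applies the common Fourier-feature recipe (sine/cosine embeddings at multiple frequencies) to 3D geometric coordinates before they are fused with image features. The frequency count, input source (e.g., vertices of a body prior), and downstream fusion are assumptions, not the paper's exact design.

```python
import torch

def fourier_encode(xyz: torch.Tensor, num_freqs: int = 6) -> torch.Tensor:
    """Map (N, 3) coordinates to (N, 3 + 3 * 2 * num_freqs) Fourier features."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=xyz.dtype, device=xyz.device)  # 1, 2, 4, ...
    scaled = xyz.unsqueeze(-1) * freqs                                  # (N, 3, num_freqs)
    feats = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)   # (N, 3, 2 * num_freqs)
    return torch.cat([xyz, feats.flatten(start_dim=1)], dim=-1)

points = torch.randn(1024, 3)       # e.g., sampled vertices of an estimated body prior (assumed input)
embedded = fourier_encode(points)   # (1024, 39) with num_freqs = 6
print(embedded.shape)
```

Embedding raw coordinates into higher-frequency features of this kind is a standard way to make low-dimensional geometry easier for a network to align with high-dimensional image features, which is one reading of how such an encoder could help bridge the modality gap.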

Quantitative Comparison on Geometry Estimation against State-of-the-art Approaches

Quantitative Comparison on Texture Estimation against State-of-the-art Approaches