Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps

ICCV 2025

Chong Cheng*, Sicheng Yu*, Zijian Wang, Yifan Zhou, Hao Wang✉️

The Hong Kong University of Science and Technology (Guangzhou)

Abstract

3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking but lack geometric priors in outdoor scenes. Other approaches introduce separate tracking modules, but these accumulate errors under significant camera movement, leading to scale drift. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: \(\textbf{S3PO-GS}\). Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy.

[Figure: localization and novel view synthesis on KITTI]

\(\textbf{Localization and novel view synthesis results on KITTI.}\) Our method, S3PO-GS, maintains robust tracking and high-quality novel view synthesis even during large-angle turns. This is achieved through our self-consistent 3DGS pointmap tracking and our patch-based pointmap dynamic mapping module.

SLAM System Pipeline

The system begins by initializing a 3D Gaussian map. For each new input frame \(T_n\), we rasterize the 3DGS pointmap of the adjacent keyframe \(T_{ak}\), match it with the input image, and establish 2D-3D correspondences to estimate a scale self-consistent pose. The estimated pose is further refined using photometric loss. If \(T_n\) is selected as a keyframe, we obtain its rendered pointmap \(X^r\) and pre-trained pointmap \(X^p\), then crop both into patches with similar distributions. After patch normalization, the correct points are selected to compute a scaling factor, which is then used to adjust \(X^p\). Once the incorrect points are replaced, \(X^r\) is used to insert new Gaussians. Finally, the aligned pre-trained pointmap is used to jointly optimize the 3D Gaussian map, enabling precise and robust localization and mapping.
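
Because the 2D-3D correspondences are anchored in the rendered pointmap of the existing map, the recovered pose inherits the map's global scale rather than an arbitrary per-frame scale. As a rough illustration of this kind of pose solve (not the paper's implementation), the sketch below runs PnP with RANSAC over such matches; the function name, inputs, and thresholds are assumptions.

```python
import cv2
import numpy as np

def pose_from_pointmap_matches(pts3d, pts2d, K):
    """Hypothetical sketch: recover a scale self-consistent pose via PnP.

    pts3d: (N, 3) world-frame points read from the rendered 3DGS pointmap
           of the adjacent keyframe at matched pixel locations.
    pts2d: (N, 2) corresponding pixel coordinates in the new frame.
    K:     (3, 3) camera intrinsic matrix.
    Returns a 4x4 world-to-camera transform, or None if PnP fails.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64),
        pts2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        iterationsCount=100,
        reprojectionError=2.0,  # assumed inlier threshold in pixels
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok or inliers is None:
        return None
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T
```

In the system itself, this coarse estimate is then refined by minimizing the photometric loss between the rendered and observed images.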

[Figure: S3PO-GS system pipeline]
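
The key step of the patch-based module is estimating a single scaling factor from reliable patches so that the pre-trained pointmap \(X^p\) matches the scale of the rendered pointmap \(X^r\). A minimal NumPy sketch of such patch-wise robust scale estimation follows; the patch size, the depth-ratio statistic, and the outlier tolerance are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def patch_scale_align(Xr, Xp, patch=16, tol=0.2):
    """Hypothetical sketch: align pre-trained pointmap Xp to Xr's scale.

    Xr, Xp: (H, W, 3) rendered and pre-trained pointmaps; we compare
            their per-pixel depths (z component) patch by patch.
    Returns the scaling factor and the rescaled copy of Xp.
    """
    zr, zp = Xr[..., 2], Xp[..., 2]
    H, W = zr.shape
    ratios = []
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            r = zr[i:i + patch, j:j + patch]
            p = zp[i:i + patch, j:j + patch]
            valid = (r > 0) & (p > 0)
            if valid.sum() < patch:  # skip patches with too few valid depths
                continue
            ratios.append(np.median(r[valid] / p[valid]))
    if not ratios:
        return 1.0, Xp  # no reliable patches; leave Xp unscaled
    ratios = np.asarray(ratios)
    s0 = np.median(ratios)  # robust initial estimate
    inliers = ratios[np.abs(ratios / s0 - 1.0) < tol]  # drop outlier patches
    s = float(np.median(inliers)) if inliers.size else float(s0)
    return s, Xp * s
```

Points whose patch ratio disagrees with the consensus would then be treated as incorrect and replaced before new Gaussians are inserted, following the pipeline description above.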

Comparison with other methods

We compare our method with other RGB-only SLAM approaches that support novel view rendering on three datasets. We report ATE RMSE [m] for tracking, and PSNR, SSIM, and LPIPS for novel view rendering. Best results are in \(\textbf{bold}\); second-best are \(\underline{underlined}\). Our method achieves state-of-the-art NVS performance across all datasets, with the best tracking accuracy on KITTI and DL3DV and tracking accuracy comparable to GlORIE-SLAM on Waymo.

[Table: quantitative comparison on Waymo, KITTI, and DL3DV]
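
For reference, ATE RMSE is conventionally computed by first aligning the estimated trajectory to the ground truth with a closed-form (Umeyama) similarity transform, then taking the RMSE of the remaining position errors. The generic sketch below shows this metric; it is not the paper's evaluation code.

```python
import numpy as np

def ate_rmse(est, gt):
    """ATE RMSE [m] after closed-form Sim(3) (Umeyama) alignment.

    est, gt: (N, 3) estimated and ground-truth camera positions.
    Aligns `est` to `gt` with the optimal similarity transform,
    then returns the RMSE of the residual translations.
    """
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g                    # centered point sets
    U, S, Vt = np.linalg.svd(G.T @ E / len(est))    # cross-covariance SVD
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:                   # avoid reflections
        D[2, 2] = -1.0
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / E.var(0).sum()   # similarity scale
    aligned = s * (R @ est.T).T + (mu_g - s * R @ mu_e)
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

Sim(3) alignment is the standard protocol for evaluating monocular SLAM, since absolute scale is not observable from a single camera.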

Novel View Rendering

Our method produces high-fidelity images that capture intricate details of vehicles, streets, and buildings. The rendered depth maps are more accurate in regions with complex depth variations, such as tree branches and roadside vehicles.

Comparison of tracking trajectories

Under large viewpoint changes, MonoGS struggles to track, while OpenGS-SLAM exhibits instability. In contrast, our method achieves superior robustness.

[Figures: tracking trajectory comparisons]