3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking, which lacks geometric priors in outdoor scenes. Other approaches introduce separate tracking modules, but these accumulate errors under significant camera movement, leading to scale drift. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: \(\textbf{S3PO-GS}\). Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy.
\(\textbf{Localization and novel view synthesis results on KITTI.}\) Our method, S3PO-GS, maintains robust tracking and high-quality novel view synthesis even during large-angle turns. This is achieved through our self-consistent 3DGS pointmap tracking and the patch-based pointmap dynamic mapping module.
The system begins by initializing a 3D Gaussian map. For each new input frame \(T_n\), we rasterize the 3DGS pointmap of the adjacent keyframe \(T_{ak}\), match it with the input image, and establish 2D-3D correspondences to estimate a scale self-consistent pose. The estimated pose is further refined using a photometric loss. If \(T_n\) is selected as a keyframe, we obtain its rendered pointmap \(X^r\) and pre-trained pointmap \(X^p\), then crop both into patches with similar distributions. After patch normalization, the correct points are selected to compute a scaling factor, which is then used to adjust \(X^p\). Once the incorrect points are replaced, \(X^r\) is used to insert new Gaussians. Finally, the aligned pre-trained pointmap is used to jointly optimize the 3D Gaussian map, enabling precise and robust localization and mapping. Both stages are sketched in code below.
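The tracking stage admits a short illustration. The sketch below is a minimal, hypothetical rendering of the 2D-3D pose estimation: it assumes the correspondences between the new frame and the rasterized keyframe pointmap have already been matched, and it stands in for the paper's solver with OpenCV's RANSAC PnP. Because the 3D points come from the 3DGS pointmap itself, the recovered pose inherits the map's scale, which is the sense in which tracking is scale self-consistent. The function name and thresholds are illustrative, not from the paper.

```python
import cv2
import numpy as np

def estimate_pose_pnp(pts3d, pts2d, K):
    """Pose of the new frame from 2D-3D matches against the adjacent
    keyframe's rasterized 3DGS pointmap.
    pts3d: (N, 3) map-frame points, pts2d: (N, 2) pixels, K: 3x3 intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
        reprojectionError=2.0, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("too few consistent 2D-3D correspondences")
    R, _ = cv2.Rodrigues(rvec)            # rotation vector -> matrix
    T = np.eye(4)                         # world-to-camera pose
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T, inliers
```

The patch-based alignment can be sketched in the same spirit. The per-patch statistic, the rule for selecting "correct" points, and the replacement threshold below are assumptions standing in for the paper's patch normalization: each patch votes with a median depth ratio, patches whose internal ratios disagree are discarded, a single global scale factor rescales \(X^p\), and points that still disagree with the rendered geometry are replaced by \(X^r\).

```python
import numpy as np

def align_pretrained_pointmap(X_r, X_p, patch=16, spread_tol=0.3, rel_tol=0.2):
    """Align a pre-trained pointmap X_p (H, W, 3) to the scale of the
    rendered 3DGS pointmap X_r (H, W, 3). Thresholds are illustrative."""
    H, W, _ = X_r.shape
    ratios = []
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            zr = X_r[i:i + patch, j:j + patch, 2].ravel()  # rendered depths
            zp = X_p[i:i + patch, j:j + patch, 2].ravel()  # pre-trained depths
            ok = (zr > 0) & (zp > 0)
            if ok.sum() < patch:          # skip sparsely covered patches
                continue
            r = zr[ok] / zp[ok]           # per-point scale candidates
            med = np.median(r)
            # keep only patches whose internal scale estimates agree
            if np.median(np.abs(r - med)) / med < spread_tol:
                ratios.append(med)
    s = np.median(ratios) if ratios else 1.0   # single global scale factor
    X_p_aligned = s * X_p                      # resolves the scale ambiguity

    # replace points that still disagree with the rendered geometry
    rel_err = np.abs(X_p_aligned[..., 2] - X_r[..., 2]) \
        / np.maximum(X_r[..., 2], 1e-6)
    bad = rel_err > rel_tol
    X_p_aligned[bad] = X_r[bad]
    return X_p_aligned, s
```

In a full pipeline, the aligned pointmap returned here would supply the geometric prior for the joint optimization of the 3D Gaussian map described above.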
We compare our method with other RGB-only SLAM approaches supporting novel view rendering on three datasets. We report ATE RMSE [m] for tracking, and PSNR, SSIM, and LPIPS for novel view rendering. Best results are in \(\textbf{bold}\), second-best \(\underline{underlined}\). Our method achieves NVS SOTA performance across all datasets, with the best tracking accuracy on KITTI and DL3DV, and tracking accuracy comparable to GlORIE-SLAM on Waymo.
Our method produces high-fidelity images that capture intricate details of vehicles, streets, and buildings. The rendered depth maps are more accurate in regions with complex depth variations, such as tree branches and roadside vehicles.
Under large viewpoint changes, MonoGS struggles to track, while OpenGS-SLAM exhibits instability. In contrast, our method achieves superior robustness.