
LongStream: Long-Sequence Streaming Autoregressive Visual Geometry

Chong Cheng1,2 Xianda Chen1 Tao Xie2,3 Wei Yin2 Weiqiang Ren2 Qian Zhang2 Xiaoyang Guo2 Hao Wang1
1The Hong Kong University of Science and Technology (Guangzhou) 2Horizon Robotics 3Zhejiang University

Demo

18 FPS streaming (autoregressive).
Kilometer-scale sequences.
Metric-scale 3D geometry.
Main: rendered point cloud video. Left: input RGB and depth. Use “Global 3D” to inspect the stitched point cloud (when available); per-frame point clouds are shown below.

Per-frame point clouds

Frames are loaded on demand (hundreds of frames per scene, no bulk download).


Abstract

Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail on long sequences: they anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses, reformulating long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning, which fully disentangles geometry from scale estimation to suppress drift. Finally, we address Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination with cache-consistent training and periodic cache refresh, which suppress attention degradation over ultra-long sequences and narrow the gap between training and inference. Experiments show that LongStream achieves state-of-the-art performance, delivering stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS.

Method

Streaming architecture with gauge-decoupled pose and cache-consistent training.

LongStream framework overview
Framework overview (click to zoom).

Key ideas

Given streaming inputs, patch tokens are extracted by a ViT encoder and augmented with keyframe, normal-frame, and scale tokens. Tokens are fused via causal attention with a shared KV cache, which is consistently used in both training and inference for cache-consistent streaming modeling. The network predicts keyframe-relative poses T_{i←k}, depth, pointmap, and global scale, enabling stable metric-scale reconstruction over long sequences.
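As a rough illustration of the token layout and causal KV-cache fusion described above, the sketch below appends per-frame patch tokens, a frame-type token, and a scale token to a growing cache and attends over it. All names and shapes here (make_frame_tokens, D, P, the constant frame-type tokens) are illustrative assumptions, not the released LongStream implementation.

```python
# Minimal sketch of the streaming token layout and causal KV-cache fusion.
# Names and shapes are assumptions for illustration, not the paper's code.
import torch
import torch.nn.functional as F

D, P = 64, 16                       # token dim and patch tokens per frame (assumed)

def make_frame_tokens(patch_tokens, is_keyframe, scale_token):
    """Concatenate patch tokens with a frame-type token and a scale token."""
    frame_token = torch.ones(1, D) if is_keyframe else torch.zeros(1, D)
    return torch.cat([patch_tokens, frame_token, scale_token], dim=0)  # (P+2, D)

kv_cache = []                        # grows one frame at a time (streaming)
scale_token = torch.randn(1, D)      # stand-in for a learned global-scale token

for t in range(5):                   # pretend 5 incoming frames
    patch_tokens = torch.randn(P, D)                 # stand-in for ViT encoder output
    q = make_frame_tokens(patch_tokens, is_keyframe=(t == 0), scale_token=scale_token)
    kv_cache.append(q)                               # causal: cache holds past + current frames only
    kv = torch.cat(kv_cache, dim=0)                  # (T*(P+2), D)
    # single-head attention of current-frame tokens over the whole cache
    attn = F.softmax(q @ kv.T / D**0.5, dim=-1)      # (P+2, T*(P+2))
    fused = attn @ kv                                # fused tokens -> pose / depth / scale heads
    print(f"frame {t}: cache={kv.shape[0]} tokens, fused={tuple(fused.shape)}")
```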

  • Gauge-decoupled pose learning. Discard first-frame anchoring and predict keyframe-relative poses to keep long-horizon inference stable.
  • Orthogonal metric-scale learning. Fully disentangle geometry from scale estimation to suppress drift.
  • Cache-consistent training + refresh. Use cache-consistent training and periodic cache refresh to mitigate attention sink and long-term KV-cache contamination.
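The streaming-time behavior of the first and third ideas can be pictured with a short sketch: keyframe-relative poses T_{i←k} are composed onto the current keyframe's world pose, and the KV cache is periodically refreshed. The constants (KEYFRAME_EVERY, REFRESH_EVERY) and the predict_relative_pose stub are assumptions for illustration only; the paper's actual keyframe selection and refresh policy may differ.

```python
# Minimal sketch of keyframe-relative pose accumulation and periodic cache
# refresh during a streaming rollout. All constants and stubs are assumptions.
import numpy as np

KEYFRAME_EVERY = 10      # promote every 10th frame to keyframe (assumed)
REFRESH_EVERY  = 80      # periodically rebuild the KV cache (assumed)

def predict_relative_pose(frame_idx):
    """Stand-in for the network head that outputs T_{i<-k} as a 4x4 SE(3) matrix."""
    T = np.eye(4)
    T[:3, 3] = [0.1 * frame_idx % 1.0, 0.0, 0.5]    # dummy forward motion
    return T

T_world_from_key = np.eye(4)         # pose of the current keyframe in the world
trajectory, kv_cache = [], []

for i in range(200):
    if i % KEYFRAME_EVERY == 0 and i > 0:
        # Re-anchor: the most recent frame becomes the new keyframe, so every
        # prediction stays a short-baseline, constant-difficulty local task.
        T_world_from_key = trajectory[-1]
    T_i_from_key = predict_relative_pose(i)              # network output T_{i<-k}
    T_world_from_i = T_world_from_key @ T_i_from_key     # compose into a global pose
    trajectory.append(T_world_from_i)

    kv_cache.append(f"tokens_of_frame_{i}")              # placeholder for cached KV entries
    if len(kv_cache) >= REFRESH_EVERY:
        # Simplified stand-in for periodic refresh: drop stale entries to limit
        # long-term contamination, keeping only the most recent frames
        # (which include the current keyframe).
        kv_cache = kv_cache[-KEYFRAME_EVERY:]

print(f"frames: {len(trajectory)}, final position: {trajectory[-1][:3, 3]}")
```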

Attention Sink Analysis

How cache-consistent training stabilizes long-horizon streaming inference.

Attention analysis with and without cache-consistent training
Additional attention visualization. We visualize frame-level attention to show how the model uses historical frames during streaming inference. Token attention is aggregated into an S×S frame–frame matrix. We sum over target-frame tokens and average over source-frame tokens. Full-window inference exposes up to 80 visible frames, while sliding-window inference exposes only 10. The batch-trained baseline exhibits a clear temporal bias. It assigns disproportionately high attention to the first frame as an attention sink and also to distant frames, while under-attending recent frames that are most relevant for local geometric consistency. A geometry model should primarily rely on temporally adjacent frames. This bias correlates with rapid RPE growth and unstable long-horizon predictions. Sliding-window inference further reveals a training–inference mismatch. When the first frame remains visible, attention increasingly concentrates on it and recent evidence is ignored, so performance degrades faster. When the first frame is removed, the baseline loses the anchor it has learned to rely on and the rollout collapses. With cache-consistent KV-cache training, attention becomes more balanced and allocates relatively more weight to nearby frames, improving temporal geometric coherence under both window settings. As the effective history approaches 80 frames, attention gradually shifts toward earlier history, consistent with cache saturation. Light blue denotes attention to the keyframe.
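For concreteness, the frame-level aggregation described in the caption can be sketched as follows: token-to-token attention is reduced to an S×S frame-frame matrix by summing over target-frame tokens and averaging over source-frame tokens. S, P, and the random attention map are placeholders, and the exact reduction used in the paper may differ in detail.

```python
# Sketch of aggregating token-level attention into an S x S frame-frame matrix.
# Shapes and the random attention map are placeholders, not real model outputs.
import torch

S, P = 8, 16                                   # frames and tokens per frame (assumed)
attn = torch.rand(S * P, S * P).softmax(-1)    # stand-in token-level attention (rows = targets)

# Group both token axes by frame: (target_frame, target_token, source_frame, source_token).
attn_grouped = attn.reshape(S, P, S, P)
frame_attn = attn_grouped.sum(dim=1).mean(dim=-1)   # sum over targets, average over sources -> (S, S)

# Causal masking: a target frame only attends to itself and earlier frames.
causal = torch.tril(torch.ones(S, S, dtype=torch.bool))
frame_attn = frame_attn * causal
print(frame_attn)
```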

Results

Quantitative ATE and qualitative comparisons across long-range and small-scale benchmarks.

What to look at

  • Gauge-decoupled design. Keyframe-relative poses and orthogonal scale decoupling eliminate first-frame anchor dependence and mitigate failures in long-sequence extrapolation.
  • Attention sink and cache contamination. We identify them as primary causes of long-horizon degradation; cache-consistent training and periodic cache refresh stabilize temporal attention and reduce geometric drift.
  • Streaming performance. State-of-the-art accuracy across indoor and outdoor benchmarks, with real-time throughput and stable metric scale on long sequences.
Memory and runtime comparison
Memory & runtime. LongStream keeps memory and latency stable in streaming settings, while some baselines grow with sequence length and can hit OOM.
KITTI qualitative comparisons
KITTI qualitative comparison. Stable long-range trajectories and geometry under streaming rollout.
TUM-7Scenes qualitative comparisons
TUM / 7Scenes qualitative comparison. Stable geometry and camera trajectories on small indoor scenes.
Table 1. KITTI ATE (lower is better; * = OOM or tracking lost).
| Method | 00 (4542 frames, 3.7 km) | 01 (1101 frames, 2.5 km) | 02 (4661 frames, 5.1 km) | 03 (801 frames, 0.6 km) | 04 (271 frames, 0.4 km) | 05 (2761 frames, 2.2 km) | 06 (1101 frames, 1.2 km) | 07 (1101 frames, 0.7 km) | 08 (4071 frames, 3.2 km) | 09 (1591 frames, 1.7 km) | 10 (1201 frames, 0.9 km) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastVGGT | * | 705.39 | * | 62.38 | 10.27 | 157.74 | 124.43 | 69.27 | * | 190.10 | 194.75 | 189.29 |
| MASt3R-SLAM | * | 530.37 | * | 18.87 | 88.98 | 159.430 | 92.00 | * | * | * | * | 177.93 |
| VGGT-SLAM | * | 607.16 | * | 169.83 | 13.12 | * | * | * | * | * | * | 263.37 |
| CUT3R | 185.89 | 651.52 | 296.98 | 148.06 | 22.17 | 155.61 | 132.54 | 77.03 | 238.39 | 205.94 | 193.39 | 209.78 |
| TTT3R | 190.93 | 546.84 | 218.77 | 105.28 | 11.62 | 153.12 | 132.94 | 70.95 | 180.57 | 211.01 | 133.00 | 177.73 |
| STream3R | 190.98 | 681.95 | 301.40 | 158.25 | 102.73 | 159.85 | 135.03 | 90.37 | 261.15 | 216.31 | 207.49 | 227.77 |
| StreamVGGT | 191.93 | 653.06 | 303.35 | 157.50 | 108.24 | 160.46 | 133.71 | 89.00 | 263.95 | 216.69 | 209.80 | 226.15 |
| Ours | 92.55 | 46.01 | 134.70 | 3.81 | 1.95 | 84.69 | 23.12 | 14.93 | 62.07 | 85.61 | 21.48 | 51.90 |
Table 2. Quantitative comparison on TUM, Oxford Spires, and Waymo (ATE, lower is better).
| Method | TUM | Oxford Spires | Waymo |
| --- | --- | --- | --- |
| FastVGGT | 0.418 | 36.577 | 1.281 |
| MASt3R-SLAM | 0.082 | 37.728 | 7.625 |
| VGGT-SLAM | 0.123 | 31.003 | 7.431 |
| CUT3R | 0.542 | 32.440 | 9.396 |
| TTT3R | 0.308 | 36.214 | 3.486 |
| STream3R | 0.633 | 37.569 | 42.203 |
| StreamVGGT | 0.627 | 37.255 | 45.101 |
| Ours | 0.076 | 19.815 | 0.737 |
Table 3. Quantitative comparison on vKITTI in terms of ATE (lower is better).
| Method | Scene 01 (447 m) | Scene 02 (223 m) | Scene 06 (270 m) | Scene 18 (339 m) | Scene 20 (837 m) | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| FastVGGT | 3.435 | 0.311 | 0.120 | 2.050 | 101.667 | 31.427 |
| MASt3R-SLAM | 83.771 | 20.206 | 3.840 | 68.875 | 231.064 | 98.714 |
| VGGT-SLAM | 25.128 | 0.237 | 0.281 | 1.641 | 68.840 | 23.667 |
| CUT3R | 50.968 | 29.913 | 0.820 | 29.012 | 127.583 | 55.276 |
| TTT3R | 29.877 | 11.785 | 0.598 | 7.445 | 71.208 | 28.099 |
| STream3R | 68.280 | 26.450 | 8.185 | 43.597 | 198.279 | 82.815 |
| StreamVGGT | 71.616 | 15.349 | 10.274 | 23.900 | 221.407 | 83.916 |
| Ours | 1.422 | 0.185 | 0.303 | 0.683 | 4.030 | 1.610 |

BibTeX

A formal citation will be provided with the public release; in the meantime, please use the entry below.

@misc{longstream,
  title  = {LongStream: Long-Sequence Streaming Autoregressive Visual Geometry},
  author = {Chong Cheng and Xianda Chen and Tao Xie and Wei Yin and Weiqiang Ren and Qian Zhang and Xiaoyang Guo and Hao Wang},
  year   = {2026},
  url    = {https://github.com/3DAgentWorld/LongStream},
}