
LongStream: Long-Sequence Streaming Autoregressive Visual Geometry

Chong Cheng1,2 Xianda Chen1 Tao Xie2,3 Wei Yin2 Weiqiang Ren2 Qian Zhang2 Xiaoyang Guo2 Hao Wang1
1The Hong Kong University of Science and Technology (Guangzhou) 2Horizon Robotics 3Zhejiang University

Demo

18 FPS streaming (autoregressive).
Kilometer-scale sequences.
Metric-scale 3D geometry.
Main: rendered point cloud video. Left: input RGB and depth. Use “Global 3D” to inspect the stitched point cloud (when available); per-frame point clouds are shown below.

Per-frame point clouds

Frames are loaded on demand (hundreds of frames per scene, no bulk download).


Abstract

Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail on long sequences: they anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses, reformulating long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning, which fully disentangles geometry from scale estimation to suppress drift. Finally, we address Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination with cache-consistent training and periodic cache refresh, which suppress attention degradation over ultra-long sequences and narrow the gap between training and inference. Experiments show that LongStream achieves state-of-the-art performance, delivering stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS.

Method

Streaming architecture with gauge-decoupled pose and cache-consistent training.

LongStream framework overview
Framework overview (click to zoom).

Key ideas

Given streaming inputs, patch tokens are extracted by a ViT encoder and augmented with keyframe, normal-frame, and scale tokens. Tokens are fused via causal attention with a shared KV cache, which is consistently used in both training and inference for cache-consistent streaming modeling. The network predicts keyframe-relative poses T_{i←k}, depth, pointmap, and global scale, enabling stable metric-scale reconstruction over long sequences.
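As a rough illustration of the token layout and causal KV-cache fusion described above, the sketch below appends per-frame patch tokens, a frame-type token, and a scale token to a growing cache and attends over it. All names and shapes here (make_frame_tokens, D, P, the constant frame-type tokens) are illustrative assumptions, not the released LongStream implementation.

```python
# Minimal sketch of the streaming token layout and causal KV-cache fusion.
# Names and shapes are assumptions for illustration, not the paper's code.
import torch
import torch.nn.functional as F

D, P = 64, 16                       # token dim and patch tokens per frame (assumed)

def make_frame_tokens(patch_tokens, is_keyframe, scale_token):
    """Concatenate patch tokens with a frame-type token and a scale token."""
    frame_token = torch.ones(1, D) if is_keyframe else torch.zeros(1, D)
    return torch.cat([patch_tokens, frame_token, scale_token], dim=0)  # (P+2, D)

kv_cache = []                        # grows one frame at a time (streaming)
scale_token = torch.randn(1, D)      # stand-in for a learned global-scale token

for t in range(5):                   # pretend 5 incoming frames
    patch_tokens = torch.randn(P, D)                 # stand-in for ViT encoder output
    q = make_frame_tokens(patch_tokens, is_keyframe=(t == 0), scale_token=scale_token)
    kv_cache.append(q)                               # causal: cache holds past + current frames only
    kv = torch.cat(kv_cache, dim=0)                  # (T*(P+2), D)
    # single-head attention of current-frame tokens over the whole cache
    attn = F.softmax(q @ kv.T / D**0.5, dim=-1)      # (P+2, T*(P+2))
    fused = attn @ kv                                # fused tokens -> pose / depth / scale heads
    print(f"frame {t}: cache={kv.shape[0]} tokens, fused={tuple(fused.shape)}")
```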

  • Gauge-decoupled pose learning. Discard first-frame anchoring and predict keyframe-relative poses to keep long-horizon inference stable.
  • Orthogonal metric-scale learning. Fully disentangle geometry from scale estimation to suppress drift.
  • Cache-consistent training + refresh. Use cache-consistent training and periodic cache refresh to mitigate attention sink and long-term KV-cache contamination.
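The streaming-time behavior of the first and third ideas can be pictured with a short sketch: keyframe-relative poses T_{i←k} are composed onto the current keyframe's world pose, and the KV cache is periodically refreshed. The constants (KEYFRAME_EVERY, REFRESH_EVERY) and the predict_relative_pose stub are assumptions for illustration only; the paper's actual keyframe selection and refresh policy may differ.

```python
# Minimal sketch of keyframe-relative pose accumulation and periodic cache
# refresh during a streaming rollout. All constants and stubs are assumptions.
import numpy as np

KEYFRAME_EVERY = 10      # promote every 10th frame to keyframe (assumed)
REFRESH_EVERY  = 80      # periodically rebuild the KV cache (assumed)

def predict_relative_pose(frame_idx):
    """Stand-in for the network head that outputs T_{i<-k} as a 4x4 SE(3) matrix."""
    T = np.eye(4)
    T[:3, 3] = [0.1 * frame_idx % 1.0, 0.0, 0.5]    # dummy forward motion
    return T

T_world_from_key = np.eye(4)         # pose of the current keyframe in the world
trajectory, kv_cache = [], []

for i in range(200):
    if i % KEYFRAME_EVERY == 0 and i > 0:
        # Re-anchor: the most recent frame becomes the new keyframe, so every
        # prediction stays a short-baseline, constant-difficulty local task.
        T_world_from_key = trajectory[-1]
    T_i_from_key = predict_relative_pose(i)              # network output T_{i<-k}
    T_world_from_i = T_world_from_key @ T_i_from_key     # compose into a global pose
    trajectory.append(T_world_from_i)

    kv_cache.append(f"tokens_of_frame_{i}")              # placeholder for cached KV entries
    if len(kv_cache) >= REFRESH_EVERY:
        # Simplified stand-in for periodic refresh: drop stale entries to limit
        # long-term contamination, keeping only the most recent frames
        # (which include the current keyframe).
        kv_cache = kv_cache[-KEYFRAME_EVERY:]

print(f"frames: {len(trajectory)}, final position: {trajectory[-1][:3, 3]}")
```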

Attention Sink Analysis

How cache-consistent training stabilizes long-horizon streaming inference.

Attention analysis with and without cache-consistent training
Additional attention visualization. We visualize frame-level attention to show how the model uses historical frames during streaming inference. Token attention is aggregated into an S×S frame–frame matrix. We sum over target-frame tokens and average over source-frame tokens. Full-window inference exposes up to 80 visible frames, while sliding-window inference exposes only 10. The batch-trained baseline exhibits a clear temporal bias. It assigns disproportionately high attention to the first frame as an attention sink and also to distant frames, while under-attending recent frames that are most relevant for local geometric consistency. A geometry model should primarily rely on temporally adjacent frames. This bias correlates with rapid RPE growth and unstable long-horizon predictions. Sliding-window inference further reveals a training–inference mismatch. When the first frame remains visible, attention increasingly concentrates on it and recent evidence is ignored, so performance degrades faster. When the first frame is removed, the baseline loses the anchor it has learned to rely on and the rollout collapses. With cache-consistent KV-cache training, attention becomes more balanced and allocates relatively more weight to nearby frames, improving temporal geometric coherence under both window settings. As the effective history approaches 80 frames, attention gradually shifts toward earlier history, consistent with cache saturation. Light blue denotes attention to the keyframe.
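For concreteness, the frame-level aggregation described in the caption can be sketched as follows: token-to-token attention is reduced to an S×S frame-frame matrix by summing over target-frame tokens and averaging over source-frame tokens. S, P, and the random attention map are placeholders, and the exact reduction used in the paper may differ in detail.

```python
# Sketch of aggregating token-level attention into an S x S frame-frame matrix.
# Shapes and the random attention map are placeholders, not real model outputs.
import torch

S, P = 8, 16                                   # frames and tokens per frame (assumed)
attn = torch.rand(S * P, S * P).softmax(-1)    # stand-in token-level attention (rows = targets)

# Group both token axes by frame: (target_frame, target_token, source_frame, source_token).
attn_grouped = attn.reshape(S, P, S, P)
frame_attn = attn_grouped.sum(dim=1).mean(dim=-1)   # sum over targets, average over sources -> (S, S)

# Causal masking: a target frame only attends to itself and earlier frames.
causal = torch.tril(torch.ones(S, S, dtype=torch.bool))
frame_attn = frame_attn * causal
print(frame_attn)
```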

Results

Quantitative ATE and qualitative comparisons across long-range and small-scale benchmarks.

What to look at

  • Gauge-decoupled design. Keyframe-relative poses and orthogonal scale decoupling eliminate first-frame anchor dependence and mitigate failures in long-sequence extrapolation.
  • Attention sink and cache contamination. We identify them as primary causes of long-horizon degradation; cache-consistent training and periodic cache refresh stabilize temporal attention and reduce geometric drift.
  • Streaming performance. State-of-the-art accuracy across indoor and outdoor benchmarks, with real-time throughput and stable metric scale on long sequences.
Memory and runtime comparison
Memory & runtime. LongStream keeps memory and latency stable in streaming settings, while some baselines grow with sequence length and can hit OOM.
KITTI qualitative comparisons
KITTI qualitative comparison. Stable long-range trajectories and geometry under streaming rollout.
TUM-7Scenes qualitative comparisons
TUM / 7Scenes qualitative comparison. Stable geometry and camera trajectories on small indoor scenes.
Table 1. KITTI ATE (lower is better; * = OOM or tracking lost).
| Method | 00 (4542 frames, 3.7 km) | 01 (1101 frames, 2.5 km) | 02 (4661 frames, 5.1 km) | 03 (801 frames, 0.6 km) | 04 (271 frames, 0.4 km) | 05 (2761 frames, 2.2 km) | 06 (1101 frames, 1.2 km) | 07 (1101 frames, 0.7 km) | 08 (4071 frames, 3.2 km) | 09 (1591 frames, 1.7 km) | 10 (1201 frames, 0.9 km) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastVGGT | * | 705.39 | * | 62.38 | 10.27 | 157.74 | 124.43 | 69.27 | * | 190.10 | 194.75 | 189.29 |
| MASt3R-SLAM | * | 530.37 | * | 18.87 | 88.98 | 159.430 | 92.00 | * | * | * | * | 177.93 |
| VGGT-SLAM | * | 607.16 | * | 169.83 | 13.12 | * | * | * | * | * | * | 263.37 |
| CUT3R | 185.89 | 651.52 | 296.98 | 148.06 | 22.17 | 155.61 | 132.54 | 77.03 | 238.39 | 205.94 | 193.39 | 209.78 |
| TTT3R | 190.93 | 546.84 | 218.77 | 105.28 | 11.62 | 153.12 | 132.94 | 70.95 | 180.57 | 211.01 | 133.00 | 177.73 |
| STream3R | 190.98 | 681.95 | 301.40 | 158.25 | 102.73 | 159.85 | 135.03 | 90.37 | 261.15 | 216.31 | 207.49 | 227.77 |
| StreamVGGT | 191.93 | 653.06 | 303.35 | 157.50 | 108.24 | 160.46 | 133.71 | 89.00 | 263.95 | 216.69 | 209.80 | 226.15 |
| Ours | 92.55 | 46.01 | 134.70 | 3.81 | 1.95 | 84.69 | 23.12 | 14.93 | 62.07 | 85.61 | 21.48 | 51.90 |
Table 2. Quantitative comparison on TUM, Oxford Spires, and Waymo (ATE, lower is better).
| Method | TUM | Oxford Spires | Waymo |
| --- | --- | --- | --- |
| FastVGGT | 0.418 | 36.577 | 1.281 |
| MASt3R-SLAM | 0.082 | 37.728 | 7.625 |
| VGGT-SLAM | 0.123 | 31.003 | 7.431 |
| CUT3R | 0.542 | 32.440 | 9.396 |
| TTT3R | 0.308 | 36.214 | 3.486 |
| STream3R | 0.633 | 37.569 | 42.203 |
| StreamVGGT | 0.627 | 37.255 | 45.101 |
| Ours | 0.076 | 19.815 | 0.737 |
Table 3. Quantitative comparison on vKITTI in terms of ATE (lower is better).
| Method | Scene 01 (447 m) | Scene 02 (223 m) | Scene 06 (270 m) | Scene 18 (339 m) | Scene 20 (837 m) | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| FastVGGT | 3.435 | 0.311 | 0.120 | 2.050 | 101.667 | 31.427 |
| MASt3R-SLAM | 83.771 | 20.206 | 3.840 | 68.875 | 231.064 | 98.714 |
| VGGT-SLAM | 25.128 | 0.237 | 0.281 | 1.641 | 68.840 | 23.667 |
| CUT3R | 50.968 | 29.913 | 0.820 | 29.012 | 127.583 | 55.276 |
| TTT3R | 29.877 | 11.785 | 0.598 | 7.445 | 71.208 | 28.099 |
| STream3R | 68.280 | 26.450 | 8.185 | 43.597 | 198.279 | 82.815 |
| StreamVGGT | 71.616 | 15.349 | 10.274 | 23.900 | 221.407 | 83.916 |
| Ours | 1.422 | 0.185 | 0.303 | 0.683 | 4.030 | 1.610 |

BibTeX

A formal citation will be provided with the public release; in the meantime, please use the entry below.

@misc{longstream,
  title  = {LongStream: Long-Sequence Streaming Autoregressive Visual Geometry},
  author = {Chong Cheng and Xianda Chen and Tao Xie and Wei Yin and Weiqiang Ren and Qian Zhang and Xiaoyang Guo and Hao Wang},
  year   = {2026},
  url    = {https://github.com/3DAgentWorld/LongStream},
}