HorizonStream

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

Chong Cheng1,2 Peilin Tao2,3 Nanjie Yao1 Guanzhi Ding1 Xianda Chen4 Yuansen Du2 Xiaoyang Guo2 Wei Yin2 Weiqiang Ren2 Qian Zhang2 Zhengqing Chen2,‡ Hao Wang1,†
1HKUST(GZ) 2Horizon Robotics 3CASIA 4CSU

Abstract

Stable 3D streaming beyond 10K frames, without reset.

Online 3D reconstruction must estimate camera pose and scene geometry causally with a bounded state. Existing streaming methods often drift, jitter, or collapse on long sequences because their influence patterns mismatch the temporal heterogeneity of geometry: short-lived correspondences and persistent global scale must coexist, yet sliding windows impose hard cutoffs, while ungated recurrence and causal attention can saturate caches and form spike-like attention sinks.

We address this by formalizing geometric propagation as an evidence influence kernel and introducing HorizonStream, a long-horizon Transformer that explicitly factorizes it. Geometric Linear Attention learns channel-wise decay rates for bounded, multi-timescale temporal propagation; Geometric Local Attention with Spatiotemporal RoPE performs reliable short-range 3D matching while suppressing attention sinks; and Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance.

01

Reveals why streaming methods fail on long sequences: hard cutoffs, KV-cache saturation, and attention sinks.

02

Models geometric propagation as an evidence influence kernel with temporal, spatial, and metric factors.

03

Separates long-horizon evidence, local 3D matching, and metric recovery instead of forcing one attention pattern to do all three.

04

Trains on only 48-frame clips, then scales to 10K+ frames with constant memory and linear time.

Comparison

Qualitative comparison.

Figure: reconstructions from LongStream, LingBot-Map, and HorizonStream, each also shown against ground truth (GT).

Method

Gated geometric propagation.

Figure: Overview of the HorizonStream framework.

Channel-wise retention

Geometric Linear Attention

Learns channel-wise gates to retain persistent geometry and discount stale evidence across windows.
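
As an illustration, the channel-wise retention idea can be sketched as a gated linear-attention recurrence with a fixed-size state. This is a minimal sketch of the generic technique, not the paper's implementation: the function name `gated_linear_attention`, the sigmoid parameterization of the decay, and the update rule S_t = diag(a) S_{t-1} + k_t v_t^T are all assumptions made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_linear_attention(queries, keys, values, decay_logits):
    """Stream over frames while keeping one fixed-size state S (d_k x d_v).

    Assumed recurrence (generic gated linear attention, hypothetical here):
        S_t = diag(a) @ S_{t-1} + k_t v_t^T
        o_t = S_t^T q_t
    where a = sigmoid(decay_logits) holds one retention rate per key channel.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    decay = [sigmoid(z) for z in decay_logits]      # each rate lies in (0, 1)
    state = [[0.0] * d_v for _ in range(d_k)]       # size independent of stream length
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Discount old evidence per channel, then write the new key-value outer product.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = decay[i] * state[i][j] + k[i] * v[j]
        # Read out with the current query.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs, state
```

In this sketch, channels with a retention rate near 1 accumulate evidence across the whole stream, while channels with a rate near 0 keep little more than the latest frame. Because every rate is below 1, the state stays bounded for bounded inputs, and its size does not grow with the number of frames.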

Head-wise reliability

Geometric Local Attention

Uses head-wise gates and spatiotemporal RoPE to suppress attention sinks and prevent stale evidence from saturating the recurrent state.
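
The relative-position property that makes rotary embeddings suitable for local matching can be checked with a small sketch. Assumptions for this example only: channels are split into three equal groups rotated by the frame index t and the patch coordinates (x, y), the frequency base is 10000 as in standard RoPE, and all helper names are hypothetical. The text above does not specify how HorizonStream's spatiotemporal RoPE allocates channels, and the head-wise gates are not modeled here.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Standard 1D RoPE: rotate consecutive channel pairs by position * frequency."""
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = position * base ** (-i / half)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def spatiotemporal_rope(vec, t, x, y):
    """Rotate three equal channel groups by the t, x, and y coordinates (assumed layout)."""
    n = len(vec) // 3
    if 3 * n != len(vec) or n % 2 != 0:
        raise ValueError("dimension must be divisible by 3 with even-sized groups")
    return (
        rotate_pairs(vec[:n], t)
        + rotate_pairs(vec[n:2 * n], x)
        + rotate_pairs(vec[2 * n:], y)
    )


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))
```

Rotating a query at (t, x, y) and a key at (t', x', y') this way gives a dot product that depends only on the offsets (t - t', x - x', y - y'). The same pair of features therefore scores the same at frame 10,005 as at frame 5 when the offsets match, which is one reason relative encodings are attractive for streams far longer than the training clips.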

Metric consistency

Metric Readout Tokens (MRT) + relative pose fusion

Reads scale and pose from high-retention geometric channels to prevent long-horizon degradation.

Citation

BibTeX

@misc{cheng2026horizonstreamlonghorizonattentionstreaming,
  title         = {HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction},
  author        = {Chong Cheng and Peilin Tao and Nanjie Yao and Guanzhi Ding and Xianda Chen and Yuansen Du and Xiaoyang Guo and Wei Yin and Weiqiang Ren and Qian Zhang and Zhengqing Chen and Hao Wang},
  year          = {2026},
  eprint        = {2605.23889},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.23889}
}