HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

Chong Cheng¹,² Peilin Tao²,³ Nanjie Yao¹ Guanzhi Ding¹ Xianda Chen⁴ Yuansen Du² Xiaoyang Guo² Wei Yin² Weiqiang Ren² Qian Zhang² Zhengqing Chen²,‡ Hao Wang¹,†

¹HKUST(GZ) ²Horizon Robotics ³CASIA ⁴CSU

[Figure panels: RGB Stream, Camera Pose, Depth, Point Cloud Render]

Abstract

Stable 3D streaming beyond 10K frames, without reset.

Online 3D reconstruction must estimate camera pose and scene geometry causally with a bounded state. Existing streaming methods often drift, jitter, or collapse on long sequences because their influence patterns mismatch the temporal heterogeneity of geometry: short-lived correspondences and persistent global scale must coexist, yet sliding windows impose hard cutoffs, while ungated recurrence and causal attention can saturate caches and form spike-like attention sinks.

We address this by formalizing geometric propagation as an evidence influence kernel and introducing HorizonStream, a long-horizon Transformer that explicitly factorizes it. Geometric Linear Attention learns channel-wise decay rates for bounded, multi-timescale temporal propagation; Geometric Local Attention with Spatiotemporal RoPE performs reliable short-range 3D matching while suppressing attention sinks; and Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance.

Reveals why streaming methods fail on long sequences: hard cutoffs, KV-cache saturation, and attention sinks.

Models geometric propagation as an evidence influence kernel with temporal, spatial, and metric factors.

Separates long-horizon evidence, local 3D matching, and metric recovery instead of forcing one attention pattern to do all three.

Trains on only 48-frame clips, then scales to 10K+ frames with constant memory and linear time.
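To make the idea of bounded, multi-timescale propagation concrete, the sketch below assumes the simplest possible temporal factor: a frame that is n steps old is weighted by decay**n for a fixed per-channel decay rate. This exponential form, the helper names half_life and total_weight, and the three decay values are illustrative assumptions, not the kernel or the learned rates from the paper.

```python
import math


def half_life(decay: float) -> float:
    """Number of frames until a past frame's weight decay**n drops to one half."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    return math.log(0.5) / math.log(decay)


def total_weight(decay: float, num_frames: int) -> float:
    """Sum of decay**n for n = 0 .. num_frames - 1 (finite geometric series)."""
    return (1.0 - decay ** num_frames) / (1.0 - decay)


# Hypothetical per-channel decay rates, chosen only for illustration.
decays = {"short-lived": 0.5, "medium": 0.99, "persistent": 0.9999}

for name, a in decays.items():
    print(
        f"{name:12s} half-life = {half_life(a):8.1f} frames, "
        f"total weight over 10,000 frames = {total_weight(a, 10_000):8.1f}, "
        f"limit = {1.0 / (1.0 - a):8.1f}"
    )
```

Under this assumption, a decay rate close to 1 keeps evidence for thousands of frames while a small rate forgets within a few frames, and in both cases the accumulated weight stays below 1 / (1 - decay) however long the stream runs.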

Comparison

Qualitative comparison.

[Figure panels: LongStream, LingBot-Map, HorizonStream; side-by-side: LingBot-Map vs. HorizonStream]

As the input stream grows longer, LingBot-Map begins to lose accuracy and exhibit jitter, while HorizonStream maintains stable accuracy.

Method

Gated geometric propagation.

Overview of the HorizonStream framework.

Channel-wise retention

Geometric Linear Attention

Learns channel-wise gates to retain persistent geometry and discount stale evidence across windows.
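A minimal sketch of channel-wise gated linear attention, under stated assumptions, may help here. It uses the generic recurrence S_t = diag(decay) S_{t-1} + k_t v_t^T with output o_t = S_t^T q_t, and checks that this equals an explicit decayed sum over all past frames. The fixed decay vector stands in for the learned gates (which may well be input-dependent in the actual model), and the function names are hypothetical.

```python
def gated_linear_attention(queries, keys, values, decay):
    """Recurrent form with a fixed-size state.

    queries, keys: lists of length-d_k lists; values: list of length-d_v lists;
    decay: length-d_k list of per-channel retention rates in (0, 1).
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # The state is d_k x d_v regardless of how many frames have been seen.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for c in range(d_k):
            for j in range(d_v):
                # Discount old evidence per key channel, then add the new frame.
                state[c][j] = decay[c] * state[c][j] + k[c] * v[j]
        outputs.append(
            [sum(q[c] * state[c][j] for c in range(d_k)) for j in range(d_v)]
        )
    return outputs


def explicit_kernel_form(queries, keys, values, decay):
    """Same outputs as a decayed sum over all past frames (quadratic time)."""
    d_k, d_v = len(keys[0]), len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * d_v
        for s in range(t + 1):
            # Influence of frame s on frame t, channel by channel.
            w = sum(q[c] * decay[c] ** (t - s) * keys[s][c] for c in range(d_k))
            for j in range(d_v):
                out[j] += w * values[s][j]
        outputs.append(out)
    return outputs


# Tiny hand-made stream: 3 frames, d_k = 2, d_v = 1.
q = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
decay = [0.5, 0.9]  # hypothetical: channel 0 forgets fast, channel 1 slowly

recurrent = gated_linear_attention(q, k, v, decay)
explicit = explicit_kernel_form(q, k, v, decay)
print(recurrent)
print(explicit)
```

The recurrent form does a constant amount of work per frame and never stores past keys or values, which is the property that gives constant memory and linear time in this simplified setting.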

Head-wise reliability

Geometric Local Attention

Uses head-wise gates and spatiotemporal RoPE to suppress attention sinks and prevent stale evidence from saturating the recurrent state.
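The role of a head-wise gate can be illustrated with a small sketch. Softmax weights always sum to one, so a head with no useful match in its window must still place its attention somewhere, which is one common explanation for attention sinks. Scaling the head output by a sigmoid gate lets the head contribute almost nothing instead. The code below is an assumption-laden illustration: the function gated_local_head is hypothetical, the gate is passed in as a plain scalar logit rather than predicted by the network, and spatiotemporal RoPE is omitted entirely.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_local_head(query, keys, values, gate_logit, window):
    """One attention head over the last `window` frames, scaled by a scalar gate."""
    # Local attention: only the most recent `window` frames are visible.
    keys, values = keys[-window:], values[-window:]
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(qi * ki for qi, ki in zip(query, k)) for k in keys]
    weights = softmax(scores)
    attended = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    # Head-wise gate: a value near 0 silences the head, near 1 passes it through.
    gate = sigmoid(gate_logit)
    return [gate * a for a in attended]


query = [1.0, 0.0]
keys = [[5.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
values = [[100.0], [1.0], [3.0]]

print(gated_local_head(query, keys, values, gate_logit=0.0, window=2))
print(gated_local_head(query, keys, values, gate_logit=-50.0, window=2))
```

With window=2 the first frame is invisible to the head, and a strongly negative gate logit drives the head output toward zero even though the softmax weights inside the window still sum to one.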

Metric consistency

MRT + relative pose fusion

Reads scale and pose from high-retention geometric channels to prevent long-horizon degradation.

Citation

BibTeX

@misc{cheng2026horizonstreamlonghorizonattentionstreaming,
  title         = {HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction},
  author        = {Chong Cheng and Peilin Tao and Nanjie Yao and Guanzhi Ding and Xianda Chen and Yuansen Du and Xiaoyang Guo and Wei Yin and Weiqiang Ren and Qian Zhang and Zhengqing Chen and Hao Wang},
  year          = {2026},
  eprint        = {2605.23889},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.23889}
}