Reveals why streaming methods fail on long sequences: hard cutoffs, KV-cache saturation, and attention sinks.
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
Abstract
Stable 3D streaming beyond 10K frames, without reset.
Online 3D reconstruction must estimate camera pose and scene geometry causally with a bounded state. Existing streaming methods often drift, jitter, or collapse on long sequences because their influence patterns mismatch the temporal heterogeneity of geometry: short-lived correspondences and persistent global scale must coexist, yet sliding windows impose hard cutoffs, while ungated recurrence and causal attention can saturate caches and form spike-like attention sinks.
We address this by formalizing geometric propagation as an evidence influence kernel and introducing HorizonStream, a long-horizon Transformer that explicitly factorizes it. Geometric Linear Attention learns channel-wise decay rates for bounded, multi-timescale temporal propagation; Geometric Local Attention with Spatiotemporal RoPE performs reliable short-range 3D matching while suppressing attention sinks; and Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance.
Models geometric propagation as an evidence influence kernel with temporal, spatial, and metric factors.
Separates long-horizon evidence, local 3D matching, and metric recovery instead of forcing one attention pattern to do all three.
Trains on only 48-frame clips, then scales to 10K+ frames with constant memory and linear time.
Comparison
Qualitative comparison.
Method
Gated geometric propagation.
Geometric Linear Attention
Learns channel-wise gates to retain persistent geometry and discount stale evidence across windows.
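The effect of channel-wise decay can be illustrated with a small recurrence. The following is a minimal sketch, not the authors' implementation: it assumes a standard gated linear attention update of the form S <- diag(g) S + k v^T, uses fixed decay constants where the model would learn them, and the names step, read, and decay are hypothetical. It shows that a fast-decay channel discounts old evidence while a high-retention channel keeps it, with a state whose size never grows.

```python
from typing import List

Matrix = List[List[float]]


def step(state: Matrix, decay: List[float], k: List[float], v: List[float]) -> Matrix:
    """One recurrent update: S[i][j] <- decay[i] * S[i][j] + k[i] * v[j]."""
    return [
        [decay[i] * state[i][j] + k[i] * v[j] for j in range(len(v))]
        for i in range(len(k))
    ]


def read(state: Matrix, q: List[float]) -> List[float]:
    """Query the state: o[j] = sum_i q[i] * S[i][j]."""
    return [sum(q[i] * row[j] for i, row in enumerate(state)) for j in range(len(state[0]))]


# Assumption: two key channels with hand-picked decay rates.
# Channel 0 forgets quickly; channel 1 retains everything.
decay = [0.5, 1.0]
state = [[0.0], [0.0]]

# Frame 0 writes one unit of evidence into both channels.
state = step(state, decay, k=[1.0, 1.0], v=[1.0])

# Ten further frames carry no new evidence, so only the decay acts.
for _ in range(10):
    state = step(state, decay, k=[0.0, 0.0], v=[0.0])

fast = read(state, [1.0, 0.0])[0]  # evidence left in the fast-decay channel
slow = read(state, [0.0, 1.0])[0]  # evidence left in the retained channel
print(fast, slow)
```

Because the state is a fixed-size matrix updated once per frame, memory stays constant and total cost grows linearly with sequence length, which matches the scaling behavior described in the abstract.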
Geometric Local Attention
Uses head-wise gates and spatiotemporal RoPE to suppress attention sinks and prevent stale evidence from saturating the recurrent state.
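A rough sketch of how these two ingredients could fit together for a single head follows. Everything here is an assumption made for illustration: the axis layout (t, y, x), the way channel pairs are assigned to axes, the frequency base, and the placement of a sigmoid gate on the head output are not taken from the source, and rope_3axis and gated_local_attention are hypothetical names. The sketch shows two properties: rotary encoding makes attention scores depend only on relative spatiotemporal offsets, and an output gate lets a head emit almost nothing without having to park its softmax mass on a sink token.

```python
import math
from typing import List, Sequence, Tuple

Pos = Tuple[float, float, float]  # assumed layout: (frame index, patch row, patch column)


def rope_3axis(x: Sequence[float], pos: Pos, base: float = 100.0) -> List[float]:
    """Rotate consecutive channel pairs; pair i is driven by axis i % 3 of pos."""
    out = list(x)
    n_pairs = len(x) // 2
    bands = max(1, n_pairs // 3)
    for i in range(n_pairs):
        angle = pos[i % 3] * base ** (-(i // 3) / bands)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out


def gated_local_attention(
    q: Sequence[float],
    q_pos: Pos,
    keys: Sequence[Sequence[float]],
    key_pos: Sequence[Pos],
    values: Sequence[Sequence[float]],
    gate_logit: float,
) -> List[float]:
    """Softmax attention over a local window, scaled by one sigmoid gate for the head."""
    q_rot = rope_3axis(q, q_pos)
    scale = 1.0 / math.sqrt(len(q))
    scores = [
        scale * sum(a * b for a, b in zip(q_rot, rope_3axis(k, p)))
        for k, p in zip(keys, key_pos)
    ]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    dim = len(values[0])
    return [gate * sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


# Toy window with two keys; all numbers are arbitrary illustration values.
q = [1.0, 0.0, 0.5, -0.5, 0.2, 0.3]
keys = [[0.3, 0.1, -0.2, 0.4, 0.0, 1.0], [1.0, -1.0, 0.5, 0.5, 0.2, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0]]
q_pos = (5.0, 2.0, 3.0)
key_pos = [(4.0, 2.0, 3.0), (5.0, 1.0, 4.0)]

out = gated_local_attention(q, q_pos, keys, key_pos, values, gate_logit=2.0)

# Shifting every position by the same offset leaves the result unchanged.
offset = (100.0, 7.0, -3.0)
shifted_q_pos = tuple(p + o for p, o in zip(q_pos, offset))
shifted_key_pos = [tuple(p + o for p, o in zip(kp, offset)) for kp in key_pos]
out_shifted = gated_local_attention(q, shifted_q_pos, keys, shifted_key_pos, values, gate_logit=2.0)

# A strongly negative gate logit closes the head.
out_closed = gated_local_attention(q, q_pos, keys, key_pos, values, gate_logit=-30.0)
```

The shift invariance is what would let a model trained on short clips apply the same local matching pattern at frame 10,000 as at frame 10, since absolute frame indices never enter the scores.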
Metric Readout Tokens (MRT) + relative pose fusion
Reads scale and pose from high-retention geometric channels to prevent long-horizon degradation.
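The source does not spell out how relative poses are fused, so the following is only one plausible reading, sketched under stated assumptions: each step yields a relative rigid transform whose translation is known up to scale, a separately read metric scale multiplies that translation, and the result is chained onto the running world pose. The function fuse and the (R, t) pose representation are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Mat3 = List[List[float]]
Vec3 = List[float]
Pose = Tuple[Mat3, Vec3]  # rotation R and translation t, mapping local to world


def matvec(R: Mat3, v: Vec3) -> Vec3:
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def matmul(A: Mat3, B: Mat3) -> Mat3:
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def fuse(pose: Pose, rel: Pose, scale: float) -> Pose:
    """Chain a relative motion onto the world pose, applying a metric scale to its translation."""
    R, t = pose
    R_rel, t_rel = rel
    moved = matvec(R, [scale * c for c in t_rel])
    return matmul(R, R_rel), [t[i] + moved[i] for i in range(3)]


identity: Pose = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])

# Assumed relative motion: move one (unscaled) unit forward, then turn 90 degrees about z.
rot_z_90: Mat3 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
rel: Pose = (rot_z_90, [1.0, 0.0, 0.0])

# Assumed metric scale of 2.0, standing in for a value read out by the model.
pose = fuse(identity, rel, scale=2.0)
pose = fuse(pose, rel, scale=2.0)
R_world, t_world = pose
```

In this sketch an error in the scale multiplies every subsequent translation, which gives some intuition for why reading scale from a persistent, slowly decaying state would matter over long horizons.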
Citation
BibTeX
@misc{cheng2026horizonstreamlonghorizonattentionstreaming,
title = {HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction},
author = {Chong Cheng and Peilin Tao and Nanjie Yao and Guanzhi Ding and Xianda Chen and Yuansen Du and Xiaoyang Guo and Wei Yin and Weiqiang Ren and Qian Zhang and Zhengqing Chen and Hao Wang},
year = {2026},
eprint = {2605.23889},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.23889}
}