Reference Sliding Window Attention Architecture

The core innovation behind Unlimited OCR is the Reference Sliding Window Attention mechanism. It addresses a limitation of autoregressive transformer decoders by decoupling the length of the generated sequence from the size of the key-value cache.

The Problem: Autoregressive KV Cache Expansion

In traditional vision-language models, text generation is an autoregressive process. To predict the next token, the decoder must attend to all previously generated tokens. The keys and values for this history are stored in a key-value (KV) cache so they are not recomputed at every step.

As the output grows, especially when transcribing hundreds of pages of text, the key-value cache grows linearly with the number of generated tokens. This growth leads to two main problems:

Memory Exhaustion: The key-value cache can eventually exceed the GPU's video memory, leading to out-of-memory errors.
Computational Slowdown: Each new token must attend to every cached token, so with tens of thousands of tokens in the cache each generation step becomes progressively slower.

Due to these bottlenecks, standard systems must process long documents in fragments, clearing the cache between pages and sacrificing cross-page continuity.
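The linear growth described above can be estimated with a back-of-the-envelope calculation. The sketch below is illustrative only: the layer count, head count, head dimension, and 16-bit precision are hypothetical defaults, not published Unlimited OCR values.

```python
def kv_cache_bytes(num_tokens, num_layers=24, num_kv_heads=16,
                   head_dim=64, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens positions.

    All default dimensions are hypothetical placeholders chosen for
    illustration; they do not describe the actual model.
    """
    # Factor of 2: one key vector and one value vector per head, per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


for tokens in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB")
```

Whatever the real dimensions are, the cost per token is a fixed number of bytes, so doubling the output length doubles the cache.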

The Solution: Reference Sliding Window Attention (RSWA)

Unlimited OCR removes this constraint by restructuring how the decoder's attention layers select what to attend to. When generating a token, the model splits its attention into two distinct components:

Static Reference Context: The model retains complete visibility of the optical page-image tokens. These tokens represent the document pages and sit static in memory, never expanding.
Dynamic Sliding Window: Instead of attending to the entire history of generated text, the model attends only to the most recent 128 tokens. Any text generated before this active window is evicted from the key-value cache.

By evicting tokens older than the 128-token threshold, the active key-value cache is capped at the reference tokens plus 128 generated tokens. Once the window is full, computational demand and memory consumption per step remain constant for the rest of the parsing run. This architecture allows the system to parse hundreds of pages in a single, uninterrupted decoding run, without clearing the cache between pages.
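The two components can be pictured as a small data structure. This is a minimal sketch, not the project's implementation: the class name `RSWACache` is hypothetical, and plain strings stand in for the per-layer key and value tensors a real cache would hold.

```python
from collections import deque


class RSWACache:
    """Toy cache: fixed reference entries plus a bounded window of generated entries.

    Hypothetical illustration; real entries would be key/value tensors.
    """

    def __init__(self, reference_tokens, window=128):
        self.reference = tuple(reference_tokens)  # page-image tokens, never evicted
        self.window = deque(maxlen=window)        # oldest entry is dropped when full

    def append(self, token):
        """Record a newly generated token, evicting the oldest if needed."""
        self.window.append(token)

    def visible(self):
        """Everything the next decoding step may attend to."""
        return list(self.reference) + list(self.window)

    def __len__(self):
        return len(self.reference) + len(self.window)


# One page of 256 image tokens, then 10,000 generated text tokens.
cache = RSWACache(reference_tokens=[f"img{i}" for i in range(256)])
for step in range(10_000):
    cache.append(f"txt{step}")

print(len(cache))            # 384: 256 reference + 128 window entries
print(cache.visible()[256])  # txt9872, the oldest generated token still visible
```

Using `deque(maxlen=...)` makes eviction implicit: appending to a full window discards the oldest entry, so the cache size never exceeds the reference length plus the window length.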

Document Image Token Compression

To keep the static reference context manageable, Unlimited OCR uses an optical encoder that compresses a high-resolution page image (1024 by 1024 pixels) down to a compact grid of 256 tokens.

At this sixteen-times downsampling ratio, the 256 tokens (a 16-by-16 grid in which each token covers a 64-by-64-pixel region of the page) still preserve critical layout information, including the relative positioning of text lines, tables, and columns. The decoder references these spatial tokens during text generation to support accurate positioning, layout extraction, and coordinate mapping.
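The arithmetic behind these figures can be checked directly. The sketch below assumes a square token grid. It also assumes, purely for illustration, an intermediate stage of 16-by-16-pixel patches, which is one way a sixteen-times ratio could arise; the text does not state how the encoder reaches 256 tokens.

```python
import math


def token_grid(image_side_px=1024, num_tokens=256):
    """Return (tokens per side, pixels covered per token side) for a square grid."""
    grid_side = math.isqrt(num_tokens)
    if grid_side * grid_side != num_tokens:
        raise ValueError("num_tokens must be a perfect square")
    if image_side_px % grid_side != 0:
        raise ValueError("image side must divide evenly into the grid")
    return grid_side, image_side_px // grid_side


grid_side, px_per_token = token_grid()
print(grid_side, px_per_token)  # 16 64

# Assumption (not from the source): the encoder first cuts the page
# into 16-by-16-pixel patches before compressing them.
assumed_patch_px = 16
patch_tokens = (1024 // assumed_patch_px) ** 2
print(patch_tokens, patch_tokens // 256)  # 4096 16
```

Under that assumed patch size, 4,096 patch tokens reduce to 256, which matches a sixteen-times reduction in token count.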

Mathematical Efficiency

By capping the length of the dynamic text context, the attention cost of each generation step drops from linear in the number of tokens generated so far to constant. Summed over a whole document, total attention cost therefore drops from quadratic to linear in output length. Once the 128-token window is full, the attention overhead for token number ten thousand is identical to that of any earlier token.
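A simple count of attended positions illustrates the difference. This sketch counts positions only, as a rough proxy for attention cost; the function names are hypothetical, the 256 reference tokens assume a single page, and steps are 1-indexed so that step 1 has no generated history.

```python
def attended_full(step, num_reference=256):
    """Positions attended at a given step when the cache is never trimmed."""
    return num_reference + (step - 1)


def attended_rswa(step, num_reference=256, window=128):
    """Positions attended at a given step with a capped sliding window."""
    return num_reference + min(step - 1, window)


def total_cost(per_step, num_steps):
    """Sum of attended positions over an entire generation run."""
    return sum(per_step(step) for step in range(1, num_steps + 1))


for step in (1, 129, 10_000):
    print(step, attended_full(step), attended_rswa(step))

# Doubling the output length roughly quadruples the uncapped generated-history
# term, but only doubles the capped total.
print(total_cost(attended_full, 20_000) / total_cost(attended_full, 10_000))
print(total_cost(attended_rswa, 20_000) / total_cost(attended_rswa, 10_000))
```

The per-step count for the capped variant stops changing at step 129, when the window first holds 128 generated tokens.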

This flat per-token cost keeps generation speed steady even on very long documents. Reported benchmarks show that Unlimited OCR sustains consistent parsing throughput as document length grows, which supports the viability of this design for enterprise-scale digitization.

Model Specifications

Model Size: 3 Billion Parameters
Architecture: Mixture of Experts (MoE)
Active Inference Parameters: 500 Million
Supported Sequence Length: Unlimited (via RSWA)
Repository Link: Official GitHub Repo