Baidu Open Source Release

Unlimited OCR: High-Performance Multi-Page Document Text Extraction

Process hundreds of pages in a single forward pass at stable speeds with constant key-value cache memory constraints.

Loading Unlimited OCR Live Space...

The Architecture and Impact of Unlimited OCR

Document parsing and text extraction have historically struggled with page length limits. As documents grow from short pages to entire books, standard deep learning models face severe bottlenecks. The processing speed drops, and memory consumption climbs to unsustainable levels. This occurs because typical model architectures must store every generated token in a key-value memory cache, creating a growing computational burden.

To address this limitation, researchers at Baidu introduced a new model known as Unlimited OCR. Built on top of a 3 billion parameter baseline, this system introduces a modern attention design that allows it to process hundreds of pages in a single pass. By employing a sliding memory window, the model prevents the key-value cache from growing indefinitely. This design ensures that page number one hundred is parsed with the same speed and memory footprint as page number one.

This development marks a significant step forward for institutions and companies that manage vast paper archives. Converting physical books, government files, and complex financial reports into structured digital text requires a system that is both fast and resource-efficient. Unlimited OCR meets this need by combining high text-recognition accuracy with a compact runtime footprint, making it suitable for deployment on standard consumer-grade hardware.

Why Standard Document Parsers Slow Down

Typical document parsing systems rely on standard attention layers within their decoding components. In these systems, as each new character or word is output, the model must calculate its relationship to every single word that came before it. This requires keeping all past tokens active in a memory structure called the key-value cache. As a result, the size of this cache grows linearly with the length of the document.

For short documents, this linear growth is manageable. However, when attempting to parse a complete academic paper, a manual, or an entire book in a single pass, the cache becomes so large that it exhausts the available video memory on the graphics processor. Even if the system does not crash, the time required to calculate attention across thousands of tokens increases, causing the processing speed to drop.

To work around this issue, older systems parse documents page by page in a loop. The system loads the image of a single page, extracts the text, clears the memory cache, and then repeats the process for the next page. While this approach keeps memory usage stable, it introduces substantial overhead. It prevents the model from understanding cross-page context, increases total processing time, and complicates the system architecture.

Mathematically, standard attention requires quadratic computation scaling with respect to the generated sequence length. As the sequence extends into tens of thousands of tokens, the attention matrix calculations grow exponentially, consuming processor cycles. Additionally, GPU memory allocation becomes fragmented due to the continuous updates to the key-value cache, leading to frequent memory reallocation delays and hardware slowdowns.

Under heavy cache pressure, system performance degrades as the GPU struggles to fetch older tokens from memory. The constant paging of memory values in and out of active processing units creates a bottleneck that limits real-time text generation. Unlimited OCR directly addresses this issue by rewriting the fundamental memory allocation strategy of the decoding engine.

Memory Allocation Scaling Diagram

Comparison of memory consumption as sequence length increases

Standard Attention System

The key-value cache memory scales quadratically/linearly without limits, leading to GPU out-of-memory states on multi-page files.

Token 1001.2 GB

Token 5,00012.5 GB

Token 20,00024.0+ GB (Crash)

Unlimited OCR (RSWA)

The key-value cache memory remains flat and capped due to token eviction after the 128-token active window.

Token 1006.8 GB

Token 5,0008.2 GB

Token 20,0008.5 GB (Stable)

The RSWA Solution: Keeping the Memory Cache Flat

Unlimited OCR solves this memory bottleneck by replacing standard attention layers with a mechanism called Reference Sliding Window Attention. This mechanism borrows an intuitive idea from how humans transcribe documents by hand. When a person copies text from a page, they keep the entire page image within their field of view. However, they do not need to keep the entire sequence of words they have already written active in their immediate focus. Instead, they only need to focus on the last few words they wrote to maintain flow and grammatical consistency.

In Unlimited OCR, the model replicates this human approach through a two-part attention structure. First, the full document image context remains fully visible to the model throughout the process. This image context is compressed and sits static in memory, never changing. Second, when generating the next text character, the model only attends to a small window of recently generated tokens—specifically the last 128 tokens. Any token generated prior to this window is evicted from the active key-value cache.

By evicting old tokens, the key-value cache stops growing. Its memory footprint remains completely flat throughout the entire run. This means that if the model is generating token number ten or token number ten thousand, the memory usage and the computation required per token are identical. This flat memory design makes one-shot, book-length document parsing practically fast and highly stable.

The attention mask implementation in Reference Sliding Window Attention forces the attention query matrices to only multiply with the keys and values of the most recent 128 tokens, alongside the static page-image keys and values. By structuring the attention calculations in this manner, the computational complexity per step is capped. Developers can process files containing over forty thousand characters without experiencing hardware memory spikes or throttling issues.

This design operates under the assumption that OCR transcription is primarily a local mapping task. The details required to translate a specific line on a page are contained within the corresponding portion of the page image, not in the text that was written pages ago. This insight allows the model to discard historical text tokens without compromising document parsing accuracy.

Reference Sliding Window Attention Schematic

Active attention targets during token generation

Image Context

IMGIMGIMGIMGIMG...IMG256 Page-Image Tokens (Static, locked in memory)

Evicted History

T1T2T3...T9871Discarded from KV Cache (Memory freed)

Active Window

T9872T9873...T9999T10000Last 128 Tokens + Current Generation (Active Attention)

Image Compression and Tokenization

A key component of this architecture is the deep encoder system. The model inherits a highly optimized optical page encoder that processes high-resolution page images. Specifically, a 1024 by 1024 pixels document image is compressed down to just 256 tokens. This represents a sixteen-times compression ratio.

By compressing the complex layout and text patterns of a high-resolution page image into a small number of tokens, the model keeps the primary context manageable. These compressed image tokens represent the static reference that the decoder consults throughout the extraction process. Because these tokens do not change during generation, they do not contribute to cache expansion.

This high compression ratio, combined with Reference Sliding Window Attention, allows the system to hold the context of dozens of pages in memory simultaneously. The model can stream out the parsed text in real time while maintaining a small memory footprint, which is a major advantage for large-scale operations.

The optical encoder relies on a series of downsampling layers that preserve spatial layout information. This means that although the document is compressed into 256 tokens, the relative positions of text blocks, tables, titles, and margins are preserved. When the decoding model outputs text, it references these spatial tokens to determine layout structure, allowing it to output structured data formats alongside the raw text.

This structural preservation ensures that the model does not lose track of column order or reading hierarchy, even when dealing with multi-column formats. The static image reference acts as an anchor, preventing the model from skipping paragraphs or repeating lines, which are common errors in standard autoregressive text parsers.

Model Specifications and Local Deployment

Unlimited OCR is a 3 billion parameter model. It uses a Mixture of Experts architecture. Under this architecture, only a fraction of the model parameters are active during the inference stage. For Unlimited OCR, approximately 500 million parameters are active at any given moment.

This sparse parameter activation translates to high computational efficiency. The model can run locally on standard consumer-grade GPUs. Testing shows that the system consumes under 7 gigabytes of video memory when idle and around 8.5 gigabytes during active text generation. This makes it accessible to individual developers and small businesses running commodity hardware.

The model is integrated with Hugging Face transformers, allowing developers to run it with minimal setup. It can be served locally using Gradio or through SGLang. SGLang is recommended for production environments because it supports high-throughput concurrent serving and creates a standard API endpoint.

By implementing Mixture of Experts, the model balances size with speed. The total 3 billion parameters provide the model with a deep knowledge base for recognizing diverse fonts, complex layouts, and multilingual text. At the same time, the active 500 million parameters ensure that the inference calculations can be completed rapidly on standard GPUs, keeping hosting costs low.

Local hosting is straightforward due to the model's small VRAM footprint. For instance, developers can load the model in 8-bit or 4-bit precision to reduce memory consumption even further, allowing the system to run on laptops or edge devices. The integration with standard Hugging Face pipelines means that setting up a local parsing script requires fewer than twenty lines of Python code.

Mixture of Experts (MoE) Routing Layout

Active parameter routing layout during inference

Total parameters: 3.0 Billion. Active parameters per step: 500 Million (represented by highlighted blocks). This sparse activation keeps VRAM demands low.

Benchmarks: Outperforming Larger Models

In performance evaluations, Unlimited OCR demonstrates high accuracy and speed. On the OmniDocBench version 1.5 benchmark, Unlimited OCR achieved an accuracy score of 93%. This is a significant improvement over its baseline predecessor, which scored 87%. On OmniDocBench version 1.6, Unlimited OCR achieved a score of 94%, outperforming competitor models of similar size.

Furthermore, Unlimited OCR outperforms several larger vision-language models on text-centric tasks. For example, a 35 billion parameter general model scored 89% on the same benchmark, falling behind the 3 billion parameter Unlimited OCR. This shows that specialized models can deliver superior performance on specific tasks while requiring far fewer computational resources.

The OmniDocBench dataset is design-focused, containing thousands of complex pages with academic tables, multi-column research text, diverse fonts, and hand-drawn layout diagrams. High performance on this benchmark indicates that Unlimited OCR handles realistic, messy document formatting. The 6% performance gap over its predecessor highlights that managing the key-value cache efficiently also improves the model's text extraction accuracy.

By maintaining a stable, flat cache, the model avoids context drift. Standard models often lose coherence or repeat text when the key-value cache becomes saturated. Because Unlimited OCR avoids cache saturation, it maintains consistent transcription quality from the beginning of a document to the end, resulting in higher average benchmark scores.

Handling Complex Layouts and Math Formulas

One of the main strengths of Unlimited OCR is its ability to parse complex academic layouts. When processing scientific papers, the model accurately identifies and transcribes tables, columns, and mathematical equations. It outputs mathematical expressions directly in standard LaTeX format, preserving formatting.

In addition to text extraction, the model detects page elements and outputs their bounding box coordinates. Each element is structured with coordinates defining its position on the page. This capability allows developers to reconstruct document layouts, which is valuable for PDF-to-Markdown conversion and layout analysis.

The ability to map coordinates for every parsed word means that Unlimited OCR is not just a text converter, but a complete structural analyzer. Applications can use these coordinates to generate searchable PDFs, highlight specific terms on page images, or separate image figures from body text. This layout metadata is generated in a single pass alongside the text recognition process.

For academic publishing, this layout analysis simplifies the process of digital archiving. Formulas containing fractions, subscripts, and greek letters are converted into clean LaTeX code that can be indexed by search engines. This makes old scientific papers searchable and accessible to modern web technologies.

Real-World Usability and Limitations

Unlimited OCR has been tested on a variety of document types, including printed text, handwritten letters, and multilingual documents. It shows strong performance on clear handwriting, transcribing text and drawing bounding boxes accurately.

However, like all models, it has limits. On highly illegible handwriting or complex medical prescriptions, the system can make transcription errors. In these scenarios, human validation is recommended to ensure data accuracy. In spite of these edge cases, the model's speed and efficiency make it a powerful tool for document digitization.

Users must also account for limitations in document orientation and scan quality. Extremely low-resolution scans, where characters are pixelated, or pages rotated at severe angles may experience lower recognition accuracy. For optimal results, preprocessing images to straighten pages and enhance contrast is recommended.

Nevertheless, the combination of high speed, small memory demand, and layout coordinate output makes Unlimited OCR an exceptional open asset. It provides developers with the capacity to build scalable document processing pipelines without incurring high cloud API fees or investing in expensive local hardware infrastructure.

Summary of Benefits

In summary, Unlimited OCR provides several key advantages for document parsing:

Constant Speed: Processing speed remains stable even as document length increases.
Flat Memory: The key-value cache memory footprint does not grow during generation.
Low Hardware Barrier: Runs on consumer GPUs with under 9 gigabytes of VRAM.
High Accuracy: Outperforms larger models on key document parsing benchmarks.
MIT License: Free to use, modify, and distribute for commercial applications.

Frequently Asked Questions

Select a question from the dropdown menu below:

Or explore all questions in detail:

Unlimited OCR represents a significant departure from standard systems by addressing the memory and speed bottlenecks associated with multi-page documents. In typical models, analyzing a long text causes the key-value cache to expand continuously, slowing down the processor and consuming vast amounts of memory. This model implements Reference Sliding Window Attention to evict old text tokens from active memory while preserving the core document image context, maintaining a constant memory footprint and stable processing speed from the first page to the hundredth.

The core innovation in Unlimited OCR is Reference Sliding Window Attention. When generating text characters, the system only looks at two specific components: the full page image representation, which is kept static in memory, and the last 128 generated text tokens. Any token older than this window is evicted from the active memory cache. This matches how humans copy long pages by hand, focusing on the immediate characters they are writing while referencing the page image. This limits key-value cache growth and keeps computation costs constant.

Yes. Unlimited OCR is a 3 billion parameter model built on a Mixture of Experts architecture. During the active run, only 500 million parameters are loaded for active processing. In practice, the model consumes less than 7 gigabytes of video memory when idle and around 8.5 gigabytes of video memory when actively processing a long document. This low hardware requirement allows it to run smoothly on standard consumer hardware and public research environments like Google Colab without needing enterprise-grade compute setups.

On standard text extraction benchmarks like OmniDocBench version 1.5, Unlimited OCR achieves a high score of 93%, which is 6% higher than its baseline predecessor. On OmniDocBench version 1.6, the model scores nearly 94%, surpassing other open models. This performance makes it highly competitive, even when compared to much larger general purpose language-image models.

Yes. Because the baseline architecture inherited from its predecessor has strong multilingual training, Unlimited OCR excels at recognizing multiple languages. Experimental runs demonstrate accurate extraction in over 30 languages, including Arabic, Hindi, Urdu, Southeast Asian, and various European languages, without needing explicit language tags beforehand.

Unlimited OCR extracts formatted tables, mathematical formulas, and academic layouts with high precision. It outputs complex mathematical expressions directly in standard LaTeX notation, preserving the formatting structure. Furthermore, the model outputs bounding box coordinates alongside the text, identifying the exact position of elements on the document page.

For production environments, deploying via SGLang is highly recommended. Serving the model through SGLang creates an open-compatible API endpoint that handles high-throughput requests efficiently. For local testing, running the model directly through Hugging Face transformers with a simple Gradio web interface is the fastest path.

Yes, the model is fully open source. It is released under the liberal MIT license, which allows developers and organizations to download, modify, integrate, and distribute the model for both commercial and non-commercial applications.

Sliding window attention limits the model's ability to reference text generated far in the past. However, since OCR is a page-by-page mapping task where text correlates directly with the current page image, the model does not need to recall text from page one when transcribing page ten. The static image reference provides the necessary global positioning, so losing the distant text history does not degrade extraction quality.

In the model configurations on Hugging Face, developers can adjust the attention window parameters in the configuration files. A window of 128 tokens is chosen because it balances local text context with GPU efficiency, but for specific tasks, this parameter can be tuned depending on the memory constraints and document structure.