Local Setup & Installation Guide

This document provides a step-by-step walkthrough to run Unlimited OCR on your local system, using either the Hugging Face transformers library or high-performance serving frameworks.

Prerequisites

Before initializing the model, ensure your hardware matches the minimum requirements. You will need an Nvidia graphics processor card with at least 8.5 gigabytes of VRAM to run the model at full precision.

Create a clean Python virtual environment and run the following command to install the required library packages:

Terminal Shell

pip install torch transformers accelerate sentencepiece gradio

Python Local Run Pipeline

The following Python script loads the model in float16 precision and executes a text extraction task on a local page image. The system will download the model weights automatically from the Hugging Face repository upon first execution.

run_ocr.py

import torch
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer
model_id = "baidu/Unlimited-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id, 
    trust_remote_code=True, 
    torch_dtype=torch.float16, 
    device_map="auto"
)

# Place model in evaluation mode
model.eval()

# Load image and set document query
image_path = "document_page.jpg"
prompt = "document parsing"

# Run inference
with torch.no_grad():
    response = model.chat(
        tokenizer=tokenizer,
        image_path=image_path,
        prompt=prompt
    )

print("Extracted Text Output:")
print(response)

Running the Gradio Web Interface

To run a local browser application with a drag-and-drop file uploader, you can run a simple Gradio application. This allows you to select document files, view the extraction results in real time, and extract bounding box coordinates.

A template script wrapper is available in the repository files. Run the application from your terminal:

Terminal Shell

python app.py

Production Deployment with SGLang

For enterprise workloads where processing latency and concurrent stream scaling are essential, running the model via transformers is not recommended. Instead, serve the model using SGLang.

SGLang serves the model weights via an optimized server that exposes an open-compatible API endpoint. This deployment path reduces memory allocation overhead and ensures high throughput when transcribing hundreds of files in parallel.

For advanced setup flags and options, visit the Official GitHub Repository.