
Xiao Ling

Posted on • Originally published at dynamsoft.com

How to Build a Windows Desktop Document Scanner with OCR and Layout Analysis in Python

Paper documents still enter desktop workflows through scanners, but many sample apps stop at acquisition and never connect the pages to a practical OCR workflow. This project fills that gap with a Windows desktop application that scans documents through Dynamic Web TWAIN Service, runs local layout analysis, and lets users compare multiple OCR backends inside one PySide6 interface.

The current app supports three OCR options:

  • OneOCR for Windows-native OCR with text boxes.
  • PaddleOCR for fully local OCR with returned text coordinates.
  • Ollama (GLM-OCR) as an optional backend when you want to experiment with VLM-style OCR.

Ollama is intentionally optional here. On a CPU-only PC, it may be much slower than the other local OCR choices.

What you'll build: A Windows PySide6 desktop app that scans paper documents, visualizes layout regions, and runs full-page or region-level OCR with Dynamic Web TWAIN Service, PP-DocLayoutV3, OneOCR, PaddleOCR, and an optional Ollama backend.

Demo Video: Windows Document Scanner and OCR App

Prerequisites

  • Windows with Python 3.11 or later.
  • PySide6, Pillow, PyTorch, transformers, NumPy, and twain-wia-sane-scanner.
  • Dynamic Web TWAIN Service installed and reachable at http://127.0.0.1:18622.
  • A Dynamic Web TWAIN license key for scanner access.

Optional OCR backends:

  • OneOCR on Windows, plus its runtime files from Windows 11 Snipping Tool.
  • PaddleOCR with either paddle_static or transformers inference.
  • Ollama with glm-ocr:latest if you want to test an LLM OCR backend.

Get a 30-day free trial license

Step 1: Install Python Dependencies and Configure the OCR Environment

Install the same packages used by the sample project.

python -m pip install -r requirements.txt

If you want the PP-DocLayoutV3 layout model to run fully offline, download it once before launching the app:

huggingface-cli download PaddlePaddle/PP-DocLayoutV3_safetensors --local-dir D:\models\PP-DocLayoutV3_safetensors

$env:LAYOUT_MODEL_DIR="D:\models\PP-DocLayoutV3_safetensors"
$env:LAYOUT_LOCAL_ONLY="1"

The current dependency set is:

PySide6>=6.5
Pillow>=9.0
torch>=2.0
transformers>=4.0
numpy>=1.20
twain-wia-sane-scanner>=2.0.3
oneocr>=1.0.12; platform_system == "Windows"
paddleocr>=3.6.0

If you want to enable the optional Ollama backend as well:

ollama pull glm-ocr:latest

Step 2: Define OCR Engine Keys and Runtime Defaults

The Python app defines explicit OCR engine keys and lets the UI switch between them.

import importlib.util
import os

import torch

OCR_ENGINE_OLLAMA = "ollama"
OCR_ENGINE_ONEOCR = "oneocr"
OCR_ENGINE_PADDLEOCR = "paddleocr"
OCR_ENGINE_LABELS = {
    OCR_ENGINE_OLLAMA: "Ollama (GLM-OCR)",
    OCR_ENGINE_ONEOCR: "OneOCR",
    OCR_ENGINE_PADDLEOCR: "PaddleOCR",
}


def _default_paddleocr_engine() -> str:
    if importlib.util.find_spec("paddle") is not None:
        return "paddle_static"
    return "transformers"


PADDLEOCR_ENGINE = os.environ.get("PADDLEOCR_ENGINE", _default_paddleocr_engine())
PADDLEOCR_DEVICE = os.environ.get(
    "PADDLEOCR_DEVICE",
    "gpu" if torch.cuda.is_available() else "cpu",
)
PADDLEOCR_DET_MODEL = os.environ.get("PADDLEOCR_DET_MODEL", "PP-OCRv5_mobile_det")
PADDLEOCR_REC_MODEL = os.environ.get("PADDLEOCR_REC_MODEL", "PP-OCRv5_mobile_rec")
LAYOUT_MODEL_DIR = os.environ.get("LAYOUT_MODEL_DIR", "").strip()
LAYOUT_LOCAL_ONLY = os.environ.get("LAYOUT_LOCAL_ONLY", "").strip().lower() in {
    "1", "true", "yes", "on"
}

This keeps desktop defaults practical. In particular, PaddleOCR now prefers the lighter PP-OCRv5 mobile detection and recognition models, which are much more suitable than heavier defaults on a typical PC.

For layout analysis, the same idea applies: you can either let transformers resolve PaddlePaddle/PP-DocLayoutV3_safetensors through the default endpoint or a mirror you provide, or you can point the app at a pre-downloaded local directory and run without network access.
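
The truthy-string check used for LAYOUT_LOCAL_ONLY could be factored into a small helper so other flags parse the same way. This is a minimal sketch, not code from the sample project: `env_flag` is a hypothetical name, and the optional `env` argument exists only so the helper can be exercised without touching the real process environment.

```python
import os
from typing import Mapping, Optional

# Same truthy spellings the snippet above accepts for LAYOUT_LOCAL_ONLY.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the variable is set to a truthy spelling.

    Hypothetical helper for illustration; whitespace and letter case
    are ignored, and an unset variable counts as False.
    """
    source = os.environ if env is None else env
    return source.get(name, "").strip().lower() in _TRUTHY


if __name__ == "__main__":
    print(env_flag("LAYOUT_LOCAL_ONLY"))
```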

Step 3: Build a Toolbar with Separate Layout and OCR Actions

The current UI makes layout analysis a first-class action. That keeps the workflow explicit and avoids surprising users when OCR silently changes the overlay.

self.layout_action = QAction("Layout Analysis", self)
self.layout_action.setIcon(QIcon.fromTheme("insert-object"))
self.layout_action.triggered.connect(self._on_layout_clicked)
self.layout_action.setEnabled(False)
toolbar.addAction(self.layout_action)

self.ocr_action = QAction("Run OCR", self)
self.ocr_action.setIcon(QIcon.fromTheme("find"))
self.ocr_action.triggered.connect(self._on_ocr_clicked)
self.ocr_action.setEnabled(False)
toolbar.addAction(self.ocr_action)

self.ocr_engine_combo = QComboBox()
self.ocr_engine_combo.currentIndexChanged.connect(self._on_ocr_engine_changed)
toolbar.addWidget(self.ocr_engine_combo)

self.ocr_mode_label = QLabel("Mode:")
toolbar.addWidget(self.ocr_mode_label)

self.ocr_mode_combo = QComboBox()
self.ocr_mode_combo.addItems([
    "Text Recognition", "Table Recognition", "Formula Recognition",
])
toolbar.addWidget(self.ocr_mode_combo)

The Mode control is only relevant to Ollama (GLM-OCR), so the app hides it for the other engines instead of pretending every OCR backend shares the same prompt model.
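
One way to express that visibility rule is a pure predicate plus a thin helper that toggles widgets. The sketch below is an assumption about how this could be wired, not the project's actual handler: `mode_controls_visible` and `apply_mode_visibility` are hypothetical names, and the helper relies only on the `setVisible()` method that Qt widgets provide.

```python
OCR_ENGINE_OLLAMA = "ollama"
OCR_ENGINE_ONEOCR = "oneocr"
OCR_ENGINE_PADDLEOCR = "paddleocr"


def mode_controls_visible(engine_key: str) -> bool:
    """Only the prompt-driven Ollama backend uses the Mode selector."""
    return engine_key == OCR_ENGINE_OLLAMA


def apply_mode_visibility(engine_key: str, *widgets) -> None:
    """Show or hide each widget; works with any object exposing setVisible()."""
    visible = mode_controls_visible(engine_key)
    for widget in widgets:
        widget.setVisible(visible)
```

In the app, a handler such as `_on_ocr_engine_changed` could call `apply_mode_visibility(engine_key, self.ocr_mode_label, self.ocr_mode_combo)` so the label and combo box always appear and disappear together.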

Step 4: Run Layout Analysis Independently Before OCR


Layout Analysis runs PP-DocLayoutV3 by itself and draws an overlay of document regions without triggering any OCR pass.

class LayoutAnalysisWorker(QThread):
    layout_ready = Signal(LayoutResult)
    progress = Signal(str)
    error = Signal(str)

    def __init__(self, image_path: str, parent=None):
        super().__init__(parent)
        self.image_path = image_path

    def run(self):
        self.progress.emit("Running layout analysis...")
        OcrWorker.ensure_model_loaded()
        img = Image.open(self.image_path).convert('RGB')
        inputs = OcrWorker._processor(images=[img], return_tensors='pt')
        inputs = {key: value.to(OcrWorker._device) for key, value in inputs.items()}
        target_sizes = torch.tensor([img.size[::-1]], device=OcrWorker._device)

        with torch.no_grad():
            outputs = OcrWorker._model(**inputs)

        raw_results = OcrWorker._processor.post_process_object_detection(
            outputs, threshold=LAYOUT_THRESHOLD, target_sizes=target_sizes
        )[0]

Internally, the shared layout loader supports both cached and explicit local-directory loading:

model_source = LAYOUT_MODEL_DIR or LAYOUT_MODEL_ID
load_kwargs: Dict[str, Any] = {}
if LAYOUT_LOCAL_ONLY or LAYOUT_MODEL_DIR:
    load_kwargs["local_files_only"] = True

cls._processor = PPDocLayoutV3ImageProcessor.from_pretrained(model_source, **load_kwargs)
cls._model = PPDocLayoutV3ForObjectDetection.from_pretrained(model_source, **load_kwargs)

You can choose your own download source when preparing the model, or skip the network entirely by shipping the model folder with your deployment.

The overlay preserves more than just text targets. Text, table, and formula regions remain interactive, while non-text categories such as image stay visible with their own coloring and labels.

This matters because PP-DocLayoutV3 can tell you that a region is broadly an image block, but it does not distinguish whether that image is a portrait, logo, illustration, or seal. If you want portrait detection, you need an extra classifier or detector on top of the layout model.
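
After post-processing, the detector output still has to become overlay regions with readable labels. The following is a rough sketch of that conversion, assuming the usual transformers object-detection output keys (`scores`, `labels`, `boxes`) have been converted to plain lists with `.tolist()`. `Region` and `to_regions` are hypothetical names, and the top-to-bottom sort is an illustrative choice rather than the project's confirmed ordering.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Region:
    label: str
    score: float
    bbox: tuple  # (x1, y1, x2, y2) in pixels


def to_regions(scores: Sequence[float], labels: Sequence[int],
               boxes: Sequence[Sequence[float]], id2label: Dict[int, str],
               width: int, height: int) -> List[Region]:
    """Turn detector output into pixel-space regions sorted top to bottom."""
    regions = []
    for score, label_id, (x1, y1, x2, y2) in zip(scores, labels, boxes):
        # Clamp to the image so overlay drawing never leaves the page.
        x1, x2 = max(0, min(width, round(x1))), max(0, min(width, round(x2)))
        y1, y2 = max(0, min(height, round(y1))), max(0, min(height, round(y2)))
        if x2 <= x1 or y2 <= y1:
            continue  # skip degenerate boxes
        name = id2label.get(label_id, str(label_id))
        regions.append(Region(name, float(score), (x1, y1, x2, y2)))
    # Assumed reading order: top edge first, then left edge.
    regions.sort(key=lambda r: (r.bbox[1], r.bbox[0]))
    return regions
```

Keeping the label name on each region is what lets the overlay color image blocks differently from text, table, and formula blocks.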

Step 5: Let Coordinate-Aware OCR Replace the Layout Overlay

Some OCR engines can return text coordinates directly. In that case, the app uses OCR-native boxes instead of leaving the earlier layout overlay on screen.

def _engine_returns_text_coordinates(engine_key: str) -> bool:
    return engine_key in {OCR_ENGINE_ONEOCR, OCR_ENGINE_PADDLEOCR}

When the user clicks Run OCR, the app clears any existing layout result for coordinate-aware engines so the overlay gets rebuilt from OCR output:

def _on_ocr_clicked(self):
    if self.selected_item is None or self.worker is not None or self.layout_worker is not None:
        return

    self._set_busy(True)
    self.ocr_result = None
    self.result_text.clear()

    # selected_item cannot be None here: the guard above already returned.
    self.selected_item.ocr_result = None
    if _engine_returns_text_coordinates(self.selected_ocr_engine):
        self.selected_item.layout_result = None

    self.worker = OcrWorker(
        self.selected_item.file_path,
        mode=self.ocr_mode_combo.currentText(),
        engine_key=self.selected_ocr_engine,
    )

That design gives users a predictable workflow:

  • Layout Analysis shows document structure.
  • Run OCR returns text.
  • OneOCR and PaddleOCR can replace the overlay with text boxes.
  • OCR engines without coordinates do not destroy the existing layout overlay.
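
The four rules above reduce to a single decision about which overlay survives an OCR pass. A minimal sketch of that decision as a pure function follows; `layout_after_ocr` and `COORDINATE_ENGINES` are hypothetical names used only for illustration.

```python
from typing import Optional

COORDINATE_ENGINES = {"oneocr", "paddleocr"}


def layout_after_ocr(engine_key: str, existing_layout: Optional[object],
                     ocr_layout: Optional[object]) -> Optional[object]:
    """Pick which overlay to show once an OCR pass finishes."""
    if engine_key in COORDINATE_ENGINES:
        # Coordinate-aware engines rebuild the overlay from OCR boxes;
        # the earlier layout result was already cleared before the run.
        return ocr_layout
    # Engines without coordinates keep whatever layout was on screen.
    return existing_layout
```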

Step 6: Use OneOCR and PaddleOCR as Coordinate-Aware Backends

OneOCR is Windows-only, but it returns OCR line boxes directly. The app converts those boxes into the same normalized overlay format used elsewhere.

if self.engine_key == OCR_ENGINE_ONEOCR:
    self.progress.emit("Running OneOCR...")
    engine = _ensure_oneocr_engine_loaded()
    raw_result = engine.recognize_pil(img)
    layout_result = _build_oneocr_layout_result(raw_result, img.size[0], img.size[1])
    if layout_result.has_coordinates:
        self.layout_ready.emit(layout_result)
    full_text = _clean_content(_extract_oneocr_text(raw_result))
    self.ocr_done.emit(OcrResult(
        text=full_text,
        line_count=layout_result.region_count,
        has_coordinates=layout_result.has_coordinates,
        raw_response=full_text,
    ))
    return

PaddleOCR can do the same thing. In this project, its result payload exposes rec_polys, rec_boxes, and dt_polys, so the app can build overlay boxes directly from OCR output.

if self.engine_key == OCR_ENGINE_PADDLEOCR:
    self.progress.emit("Running PaddleOCR...")
    engine = _ensure_paddleocr_engine_loaded()
    raw_result = _predict_with_paddleocr(engine, image_path=self.image_path)
    layout_result = _build_paddleocr_layout_result(raw_result, img.size[0], img.size[1])
    if layout_result.has_coordinates:
        self.layout_ready.emit(layout_result)
    full_text = _clean_content(_extract_paddleocr_text(raw_result))
    self.ocr_done.emit(OcrResult(
        text=full_text,
        line_count=layout_result.region_count,
        has_coordinates=layout_result.has_coordinates,
        raw_response=full_text,
    ))
    return

That means both engines can behave like coordinate-aware desktop OCR, not just plain text extractors.
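
Both builders need to turn engine-specific polygons into one shared box format. The sketch below shows one way to do that, assuming the overlay uses axis-aligned boxes on a 0..1000 grid. That scale is an assumption for illustration, since the article does not state the exact normalization, and `polygon_to_normalized_bbox` is a hypothetical name.

```python
from typing import Sequence, Tuple


def polygon_to_normalized_bbox(polygon: Sequence[Sequence[float]],
                               width: int, height: int,
                               scale: int = 1000) -> Tuple[int, int, int, int]:
    """Convert a text polygon into an axis-aligned box on a 0..scale grid."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    # Clamp to the image before normalizing so boxes stay inside the page.
    x1, x2 = max(0.0, min(xs)), min(float(width), max(xs))
    y1, y2 = max(0.0, min(ys)), min(float(height), max(ys))
    return (
        round(x1 / width * scale),
        round(y1 / height * scale),
        round(x2 / width * scale),
        round(y2 / height * scale),
    )
```

Normalizing against the page size keeps the overlay independent of the preview zoom level: the widget only has to multiply by its current display width and height.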

Step 7: Keep Ollama as an Optional LLM OCR Backend

The app still supports Ollama (GLM-OCR), but treats it as optional. On a machine without GPU acceleration, it can be much slower than OneOCR or PaddleOCR, so the main architecture does not depend on it.

The integration is still useful when you want prompt-controlled OCR modes such as text, table, or formula recognition.

def _backend_supports_prompt_modes(engine_key: str) -> bool:
    return engine_key == OCR_ENGINE_OLLAMA


def _ocr_generate_b64_list(image_payloads: List[bytes], prompt: str,
                           model: str = OLLAMA_MODEL) -> str:
    payload = _ollama_generate_request(prompt=prompt, image_payloads=image_payloads, model=model)
    return payload.get("response", "") or ""

If you do not need it, you can simply use OneOCR or PaddleOCR and ignore the Ollama setup entirely.
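
For readers who do enable it, a request helper like `_ollama_generate_request` could look roughly like the sketch below. It targets Ollama's `/api/generate` endpoint on the default local port with streaming turned off. The function names, the timeout, and the choice to accept raw image bytes and base64-encode them here are assumptions for illustration, not the project's actual implementation.

```python
import base64
import json
import urllib.request
from typing import Any, Dict, List

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # Ollama's default local endpoint


def build_generate_payload(prompt: str, images: List[bytes],
                           model: str = "glm-ocr:latest") -> Dict[str, Any]:
    """Build the JSON body for a non-streaming /api/generate call."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(data).decode("ascii") for data in images],
        "stream": False,
    }


def generate(prompt: str, images: List[bytes], timeout: float = 300.0) -> str:
    """Send the request and return the model's text (empty string if absent)."""
    body = json.dumps(build_generate_payload(prompt, images)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read()).get("response", "") or ""
```

The generous timeout reflects the point made above: on a CPU-only machine a single page can take a long time, so the call belongs in a worker thread rather than the UI thread.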

Step 8: OCR Individual Layout Regions on Demand

Instead of OCRing every region immediately, the app lets users click a single highlighted block and OCR only that crop.

class RegionOcrWorker(QThread):
    region_ocr_done = Signal(int, str)
    error = Signal(str)

    def run(self):
        try:
            cropped_image = _prepare_region_image(self.image_path, self.bbox_2d)
            content = _ocr_with_engine(
                self.engine_key,
                pil_image=cropped_image,
                task_type=self.task_type,
            )
            display_text = _clean_content(content)
            self.region_ocr_done.emit(self.region_index, display_text)
        except Exception as e:
            import traceback
            self.error.emit(f"Region OCR failed: {e}\n\n{traceback.format_exc()}")

The preview widget distinguishes between interactive OCR targets and non-text layout regions.

def _is_clickable_region(line: OcrLine) -> bool:
    return line.task_type in {"text", "table", "formula"}
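
Before a clicked region is sent to OCR, its box has to be turned into a safe crop rectangle. The sketch below shows one plausible piece of what a helper like `_prepare_region_image` might do, assuming the box is in pixel coordinates (a normalized box would need rescaling first). `padded_crop_box` and the 8-pixel default padding are hypothetical.

```python
from typing import Sequence, Tuple


def padded_crop_box(bbox: Sequence[int], width: int, height: int,
                    padding: int = 8) -> Tuple[int, int, int, int]:
    """Expand a pixel bbox by `padding` and clamp it to the image bounds."""
    x1, y1, x2, y2 = bbox
    # min/max make the function tolerant of corners given in either order.
    left = max(0, min(x1, x2) - padding)
    top = max(0, min(y1, y2) - padding)
    right = min(width, max(x1, x2) + padding)
    bottom = min(height, max(y1, y2) + padding)
    if right <= left or bottom <= top:
        raise ValueError(f"Region {tuple(bbox)} lies outside the image")
    return left, top, right, bottom
```

The returned tuple can be passed straight to Pillow's `Image.crop()`. A few pixels of padding tend to help recognizers that struggle when glyphs touch the crop edge.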

Step 9: Scan Pages Through Dynamic Web TWAIN Service


On the scanner side, the project uses twain-wia-sane-scanner to talk to Dynamic Web TWAIN Service. It enumerates TWAIN and WIA devices, creates a job, streams pages, and saves them as images for the same OCR workflow.

class ScanWorker(QThread):
    page_scanned = Signal(str)
    scan_done = Signal(int)
    error = Signal(str)
    progress = Signal(str)

    def run(self):
        controller = ScannerController(timeout=120, raise_errors=True)
        job_id = ""
        page_count = 0
        job = controller.createJob(self.host, {
            "license": self.license_key,
            "device": self.device,
            "autoRun": False,
            "jobTimeout": 180,
            "scannerFailureTimeout": 90,
            "config": DWT_SCAN_CONFIG,
        })
        job_id = job.get("jobuid", "")
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

        while True:
            image_bytes = controller.getImageStream(self.host, job_id, imageType="image/png")
            if image_bytes is None:
                break

            page_count += 1
            image_path = self.output_dir / f"scan_{timestamp}_{page_count:03d}.png"
            image_path.write_bytes(image_bytes)
            self.page_scanned.emit(str(image_path))

The scan configuration passed to createJob disables the native scanner UI and enables the document feeder so multi-page scans flow unattended:

DWT_SCAN_CONFIG = {
    "IfShowUI": False,
    "PixelType": 2,
    "Resolution": 200,
    "IfFeederEnabled": True,
    "IfDuplexEnabled": False,
}
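
With the feeder enabled, the page loop in ScanWorker.run keeps pulling images until the service returns nothing. That loop can be separated from the scanner API so it is testable without hardware. The sketch below assumes a caller-supplied `next_page` callable; `save_scanned_pages` and the `max_pages` cap are hypothetical additions.

```python
from pathlib import Path
from typing import Callable, List, Optional


def save_scanned_pages(next_page: Callable[[], Optional[bytes]],
                       output_dir: Path, timestamp: str,
                       max_pages: int = 500) -> List[Path]:
    """Pull pages until the source returns None and write each one as a PNG."""
    output_dir.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    while len(saved) < max_pages:  # hard cap guards against a stuck job
        image_bytes = next_page()
        if not image_bytes:
            break  # None or empty bytes: the feeder is out of pages
        path = output_dir / f"scan_{timestamp}_{len(saved) + 1:03d}.png"
        path.write_bytes(image_bytes)
        saved.append(path)
    return saved
```

In the worker, `next_page` would wrap the `getImageStream` call shown earlier, and each returned path could be emitted through `page_scanned`.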

Step 10: Keep the UI Responsive While OCR and Scanning Run

The project also guards against a common desktop pitfall: destroying worker threads while a blocking OCR or scan request is still active.

def _has_running_worker(self) -> bool:
    return any(
        worker is not None and worker.isRunning()
        for worker in (
            self.worker,
            self.layout_worker,
            self.region_worker,
            self.scan_worker,
            self.preload_worker,
        )
    )

When the user tries to close the window, closeEvent checks this helper and shows a message box instead of letting Qt destroy live threads.
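
The close guard itself can be sketched as a small function that vetoes the event while anything is running. This is an illustration under assumptions rather than the project's closeEvent: `handle_close` is a hypothetical name, `event` only needs the `accept()`/`ignore()` methods that QCloseEvent provides, each worker only needs `isRunning()` as on QThread, and `notify` stands in for whatever shows the message box.

```python
from typing import Callable, Iterable, Optional


def handle_close(event, workers: Iterable[Optional[object]],
                 notify: Callable[[str], None]) -> bool:
    """Veto the close while any worker thread is still running.

    Returns True if the window may close, False if the close was blocked.
    """
    busy = any(w is not None and w.isRunning() for w in workers)
    if busy:
        notify("A scan or OCR task is still running. Please wait for it to finish.")
        event.ignore()  # keep the window (and its threads) alive
        return False
    event.accept()
    return True
```

Inside a QMainWindow subclass, `closeEvent(self, event)` could delegate to this function and pass a lambda that calls `QMessageBox.warning`.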

Source Code

https://github.com/yushulx/python-twain-wia-sane-scanner/tree/main/example/pyside6-ocr
