Paper documents still enter desktop workflows through scanners, but many sample apps stop at acquisition and never connect the pages to a practical OCR workflow. This project fills that gap with a Windows desktop application that scans documents through Dynamic Web TWAIN Service, runs local layout analysis, and lets users compare multiple OCR backends inside one PySide6 interface.
The current app supports three OCR options:
-
OneOCRfor Windows-native OCR with text boxes. -
PaddleOCRfor fully local OCR with returned text coordinates. -
Ollama (GLM-OCR)as an optional backend when you want to experiment with VLM-style OCR.
Ollama is intentionally optional here. On a CPU-only PC, it may be much slower than the other local OCR choices.
What you'll build: A Windows PySide6 desktop app that scans paper documents, visualizes layout regions, and runs full-page or region-level OCR with Dynamic Web TWAIN Service, PP-DocLayoutV3, OneOCR, PaddleOCR, and an optional Ollama backend.
Demo Video: Windows Document Scanner and OCR App
Prerequisites
- Windows with Python 3.11 or later.
- PySide6, Pillow, PyTorch, transformers, NumPy, and
twain-wia-sane-scanner. - Dynamic Web TWAIN Service installed and reachable at
http://127.0.0.1:18622. - A Dynamic Web TWAIN license key for scanner access.
Optional OCR backends:
-
OneOCRon Windows, plus its runtime files from Windows 11 Snipping Tool. -
PaddleOCRwith eitherpaddle_staticortransformersinference. - Ollama with
glm-ocr:latestif you want to test an LLM OCR backend.
Step 1: Install Python Dependencies and Configure the OCR Environment
Install the same packages used by the sample project.
python -m pip install -r requirements.txt
If you want the PP-DocLayoutV3 layout model to run fully offline, download it once before launching the app:
huggingface-cli download PaddlePaddle/PP-DocLayoutV3_safetensors --local-dir D:\models\PP-DocLayoutV3_safetensors
$env:LAYOUT_MODEL_DIR="D:\models\PP-DocLayoutV3_safetensors"
$env:LAYOUT_LOCAL_ONLY="1"
The current dependency set is:
PySide6>=6.5
Pillow>=9.0
torch>=2.0
transformers>=4.0
numpy>=1.20
twain-wia-sane-scanner>=2.0.3
oneocr>=1.0.12; platform_system == "Windows"
paddleocr>=3.6.0
If you want to enable the optional Ollama backend as well:
ollama pull glm-ocr:latest
Step 2: Define OCR Engine Keys and Runtime Defaults
The Python app defines explicit OCR engine keys and lets the UI switch between them.
OCR_ENGINE_OLLAMA = "ollama"
OCR_ENGINE_ONEOCR = "oneocr"
OCR_ENGINE_PADDLEOCR = "paddleocr"
OCR_ENGINE_LABELS = {
OCR_ENGINE_OLLAMA: "Ollama (GLM-OCR)",
OCR_ENGINE_ONEOCR: "OneOCR",
OCR_ENGINE_PADDLEOCR: "PaddleOCR",
}
def _default_paddleocr_engine() -> str:
if importlib.util.find_spec("paddle") is not None:
return "paddle_static"
return "transformers"
PADDLEOCR_ENGINE = os.environ.get("PADDLEOCR_ENGINE", _default_paddleocr_engine())
PADDLEOCR_DEVICE = os.environ.get(
"PADDLEOCR_DEVICE",
"gpu" if torch.cuda.is_available() else "cpu",
)
PADDLEOCR_DET_MODEL = os.environ.get("PADDLEOCR_DET_MODEL", "PP-OCRv5_mobile_det")
PADDLEOCR_REC_MODEL = os.environ.get("PADDLEOCR_REC_MODEL", "PP-OCRv5_mobile_rec")
LAYOUT_MODEL_DIR = os.environ.get("LAYOUT_MODEL_DIR", "").strip()
LAYOUT_LOCAL_ONLY = os.environ.get("LAYOUT_LOCAL_ONLY", "").strip().lower() in {
"1", "true", "yes", "on"
}
This keeps desktop defaults practical. In particular, PaddleOCR now prefers the lighter PP-OCRv5 mobile detection and recognition models, which are much more suitable than heavier defaults on a typical PC.
For layout analysis, the same idea applies: you can either let transformers resolve PaddlePaddle/PP-DocLayoutV3_safetensors through the default endpoint or mirror you provide, or you can point the app at a pre-downloaded local directory and run without network access.
Step 3: Build a Toolbar with Separate Layout and OCR Actions
The current UI makes layout analysis a first-class action. That keeps the workflow explicit and avoids surprising users when OCR silently changes the overlay.
self.layout_action = QAction("Layout Analysis", self)
self.layout_action.setIcon(QIcon.fromTheme("insert-object"))
self.layout_action.triggered.connect(self._on_layout_clicked)
self.layout_action.setEnabled(False)
toolbar.addAction(self.layout_action)
self.ocr_action = QAction("Run OCR", self)
self.ocr_action.setIcon(QIcon.fromTheme("find"))
self.ocr_action.triggered.connect(self._on_ocr_clicked)
self.ocr_action.setEnabled(False)
toolbar.addAction(self.ocr_action)
self.ocr_engine_combo = QComboBox()
self.ocr_engine_combo.currentIndexChanged.connect(self._on_ocr_engine_changed)
toolbar.addWidget(self.ocr_engine_combo)
self.ocr_mode_label = QLabel("Mode:")
toolbar.addWidget(self.ocr_mode_label)
self.ocr_mode_combo = QComboBox()
self.ocr_mode_combo.addItems([
"Text Recognition", "Table Recognition", "Figure Recognition",
])
toolbar.addWidget(self.ocr_mode_combo)
The Mode control is only relevant to Ollama (GLM-OCR), so the app hides it for the other engines instead of pretending every OCR backend shares the same prompt model.
Step 4: Run Layout Analysis Independently Before OCR
Layout Analysis runs PP-DocLayoutV3 by itself and draws an overlay of document regions without triggering any OCR pass.
class LayoutAnalysisWorker(QThread):
layout_ready = Signal(LayoutResult)
progress = Signal(str)
error = Signal(str)
def __init__(self, image_path: str, parent=None):
super().__init__(parent)
self.image_path = image_path
def run(self):
self.progress.emit("Running layout analysis...")
OcrWorker.ensure_model_loaded()
img = Image.open(self.image_path).convert('RGB')
inputs = OcrWorker._processor(images=[img], return_tensors='pt')
inputs = {key: value.to(OcrWorker._device) for key, value in inputs.items()}
target_sizes = torch.tensor([img.size[::-1]], device=OcrWorker._device)
with torch.no_grad():
outputs = OcrWorker._model(**inputs)
raw_results = OcrWorker._processor.post_process_object_detection(
outputs, threshold=LAYOUT_THRESHOLD, target_sizes=target_sizes
)[0]
Internally, the shared layout loader supports both cached and explicit local-directory loading:
model_source = LAYOUT_MODEL_DIR or LAYOUT_MODEL_ID
load_kwargs: Dict[str, Any] = {}
if LAYOUT_LOCAL_ONLY or LAYOUT_MODEL_DIR:
load_kwargs["local_files_only"] = True
cls._processor = PPDocLayoutV3ImageProcessor.from_pretrained(model_source, **load_kwargs)
cls._model = PPDocLayoutV3ForObjectDetection.from_pretrained(model_source, **load_kwargs)
You can choose your own download source when preparing the model, or skip the network entirely by shipping the model folder with your deployment.
The overlay preserves more than just text targets. Text, table, and formula regions remain interactive, while non-text categories such as image stay visible with their own coloring and labels.
This matters because PP-DocLayoutV3 can tell you that a region is broadly an image block, but it does not distinguish whether that image is a portrait, logo, illustration, or seal. If you want portrait detection, you need an extra classifier or detector on top of the layout model.
Step 5: Let Coordinate-Aware OCR Replace the Layout Overlay
Some OCR engines can return text coordinates directly. In that case, the app uses OCR-native boxes instead of leaving the earlier layout overlay on screen.
def _engine_returns_text_coordinates(engine_key: str) -> bool:
return engine_key in {OCR_ENGINE_ONEOCR, OCR_ENGINE_PADDLEOCR}
When the user clicks Run OCR, the app clears any existing layout result for coordinate-aware engines so the overlay gets rebuilt from OCR output:
def _on_ocr_clicked(self):
if self.selected_item is None or self.worker is not None or self.layout_worker is not None:
return
self._set_busy(True)
self.ocr_result = None
self.result_text.clear()
if self.selected_item is not None:
self.selected_item.ocr_result = None
if _engine_returns_text_coordinates(self.selected_ocr_engine):
self.selected_item.layout_result = None
self.worker = OcrWorker(
self.selected_item.file_path,
mode=self.ocr_mode_combo.currentText(),
engine_key=self.selected_ocr_engine,
)
That design gives users a predictable workflow:
-
Layout Analysisshows document structure. -
Run OCRreturns text. -
OneOCRandPaddleOCRcan replace the overlay with text boxes. - OCR engines without coordinates do not destroy the existing layout overlay.
Step 6: Use OneOCR and PaddleOCR as Coordinate-Aware Backends
OneOCR is Windows-only, but it returns OCR line boxes directly. The app converts those boxes into the same normalized overlay format used elsewhere.
if self.engine_key == OCR_ENGINE_ONEOCR:
self.progress.emit("Running OneOCR...")
engine = _ensure_oneocr_engine_loaded()
raw_result = engine.recognize_pil(img)
layout_result = _build_oneocr_layout_result(raw_result, img.size[0], img.size[1])
if layout_result.has_coordinates:
self.layout_ready.emit(layout_result)
full_text = _clean_content(_extract_oneocr_text(raw_result))
self.ocr_done.emit(OcrResult(
text=full_text,
line_count=layout_result.region_count,
has_coordinates=layout_result.has_coordinates,
raw_response=full_text,
))
return
PaddleOCR can do the same thing. In this project, its result payload exposes rec_polys, rec_boxes, and dt_polys, so the app can build overlay boxes directly from OCR output.
if self.engine_key == OCR_ENGINE_PADDLEOCR:
self.progress.emit("Running PaddleOCR...")
engine = _ensure_paddleocr_engine_loaded()
raw_result = _predict_with_paddleocr(engine, image_path=self.image_path)
layout_result = _build_paddleocr_layout_result(raw_result, img.size[0], img.size[1])
if layout_result.has_coordinates:
self.layout_ready.emit(layout_result)
full_text = _clean_content(_extract_paddleocr_text(raw_result))
self.ocr_done.emit(OcrResult(
text=full_text,
line_count=layout_result.region_count,
has_coordinates=layout_result.has_coordinates,
raw_response=full_text,
))
return
That means both engines can behave like coordinate-aware desktop OCR, not just plain text extractors.
Step 7: Keep Ollama as an Optional LLM OCR Backend
The app still supports Ollama (GLM-OCR), but it is now treated as optional. On a machine without GPU acceleration, it can be much slower than OneOCR or PaddleOCR, so it should not define the title or the main architecture.
The integration is still useful when you want prompt-controlled OCR modes such as text, table, or formula recognition.
def _backend_supports_prompt_modes(engine_key: str) -> bool:
return engine_key == OCR_ENGINE_OLLAMA
def _ocr_generate_b64_list(image_payloads: List[bytes], prompt: str,
model: str = OLLAMA_MODEL) -> str:
payload = _ollama_generate_request(prompt=prompt, image_payloads=image_payloads, model=model)
return payload.get("response", "") or ""
If you do not need it, you can simply use OneOCR or PaddleOCR and ignore the Ollama setup entirely.
Step 8: OCR Individual Layout Regions on Demand
Instead of OCRing every region immediately, the app lets users click a single highlighted block and OCR only that crop.
class RegionOcrWorker(QThread):
region_ocr_done = Signal(int, str)
error = Signal(str)
def run(self):
try:
cropped_image = _prepare_region_image(self.image_path, self.bbox_2d)
content = _ocr_with_engine(
self.engine_key,
pil_image=cropped_image,
task_type=self.task_type,
)
display_text = _clean_content(content)
self.region_ocr_done.emit(self.region_index, display_text)
except Exception as e:
import traceback
self.error.emit(f"Region OCR failed: {e}\n\n{traceback.format_exc()}")
The preview widget distinguishes between interactive OCR targets and non-text layout regions.
def _is_clickable_region(line: OcrLine) -> bool:
return line.task_type in {"text", "table", "formula"}
Step 9: Scan Pages Through Dynamic Web TWAIN Service
On the scanner side, the project uses twain-wia-sane-scanner to talk to Dynamic Web TWAIN Service. It enumerates TWAIN and WIA devices, creates a job, streams pages, and saves them as images for the same OCR workflow.
class ScanWorker(QThread):
page_scanned = Signal(str)
scan_done = Signal(int)
error = Signal(str)
progress = Signal(str)
def run(self):
controller = ScannerController(timeout=120, raise_errors=True)
job_id = ""
page_count = 0
job = controller.createJob(self.host, {
"license": self.license_key,
"device": self.device,
"autoRun": False,
"jobTimeout": 180,
"scannerFailureTimeout": 90,
"config": DWT_SCAN_CONFIG,
})
job_id = job.get("jobuid", "")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
while True:
image_bytes = controller.getImageStream(self.host, job_id, imageType="image/png")
if image_bytes is None:
break
page_count += 1
image_path = self.output_dir / f"scan_{timestamp}_{page_count:03d}.png"
image_path.write_bytes(image_bytes)
self.page_scanned.emit(str(image_path))
The scan configuration passed to createJob disables the native scanner UI and enables the document feeder so multi-page scans flow unattended:
DWT_SCAN_CONFIG = {
"IfShowUI": False,
"PixelType": 2,
"Resolution": 200,
"IfFeederEnabled": True,
"IfDuplexEnabled": False,
}
Step 10: Keep the UI Responsive While OCR and Scanning Run
The project also guards against a common desktop pitfall: destroying worker threads while a blocking OCR or scan request is still active.
def _has_running_worker(self) -> bool:
return any(
worker is not None and worker.isRunning()
for worker in (
self.worker,
self.layout_worker,
self.region_worker,
self.scan_worker,
self.preload_worker,
)
)
When the user tries to close the window, closeEvent checks this helper and shows a message box instead of letting Qt destroy live threads.
Source Code
https://github.com/yushulx/python-twain-wia-sane-scanner/tree/main/example/pyside6-ocr


Top comments (0)