OCR

Turn paper documents into full-text searchable digital files and manage them in a paperless document management system that incorporates advanced OCR software. Quickly and easily apply all the tools and functions of electronic document management to hard-copy documents and previously scanned files. By leveraging best-of-breed OCR technologies, LogicalDOC can extract text from images and raster PDFs produced by bulk scans on your multi-function device.

Performance drawbacks

OCR processing is CPU-intensive and usually takes a long time for each document, so if you activate OCR, expect indexing your repository to take much longer.

OCR of a scan

You don't have to explicitly request OCR for your files: just store them in LogicalDOC, and the OCR will run automatically at indexing time to extract the text from your images or raster PDFs.

Remember that this is not a zonal OCR: it simply extracts all the text so that you can perform full-text searches.

Configuration of the OCR

You can control how the OCR works by changing the settings in this panel.

  • Enabled: to enable or disable the OCR processing
  • Timeout: maximum number of seconds to process a single file
  • Include: comma-separated list of file name patterns for the files to include
  • Exclude: comma-separated list of file name patterns for the files to exclude
  • Text threshold: used for PDFs only; indicates the weight of textual content relative to the other kinds of content. If the textual content is below this threshold, the document is treated as raster and the OCR is executed
  • Image min. width: minimum width an image must have in order to be processed
  • Rendering res.: sometimes the file needs to be printed as PDF; this parameter specifies the print resolution
  • Rendering res. (barcodes): sometimes the embedded barcodes need to be printed as PDF; this parameter specifies the print resolution
  • Batch: number of pages processed by the OCR at once
  • Engine: what engine to use
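
To make the Include, Exclude, and Batch settings more concrete, here is a minimal Python sketch of how such settings could be interpreted. It is an illustration only, not LogicalDOC's actual implementation: the function names are hypothetical, and the sketch assumes shell-style wildcards, case-insensitive matching, an empty Include list meaning "all files", and Exclude taking precedence over Include.

```python
from fnmatch import fnmatchcase


def parse_patterns(value: str) -> list:
    """Split a comma-separated pattern list, dropping blank entries."""
    return [p.strip() for p in value.split(",") if p.strip()]


def is_ocr_candidate(filename: str, include: str, exclude: str) -> bool:
    """Decide whether a file name passes the Include/Exclude filters.

    Assumed semantics (not confirmed by the product documentation):
    an empty include list matches every file, and an exclude match
    always wins over an include match.
    """
    name = filename.lower()
    included = parse_patterns(include.lower())
    excluded = parse_patterns(exclude.lower())
    if any(fnmatchcase(name, p) for p in excluded):
        return False
    return not included or any(fnmatchcase(name, p) for p in included)


def batches(page_count: int, batch_size: int):
    """Yield (first_page, last_page) ranges, 1-based and inclusive."""
    for start in range(1, page_count + 1, batch_size):
        yield start, min(start + batch_size - 1, page_count)


if __name__ == "__main__":
    print(is_ocr_candidate("scan.PDF", "*.pdf,*.tif", "draft_*"))
    print(list(batches(7, 3)))
```

Under these assumptions, a seven-page document with a Batch value of 3 would be handed to the OCR in three groups of pages: 1 to 3, 4 to 6, and 7.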

Supported OCR engines

You can choose one of the supported OCR engines:

  • Tesseract: the well-known open source OCR engine, long sponsored by Google
    Configuration:
      path: absolute installation path of the tesseract executable

  • OCR Web Service: a lightweight online OCR engine
    Configuration:
      username: your own OCR Web Service username
      licenseCode: your own license code associated with your OCR Web Service account

  • PowerPDF: an OCR engine developed by Nuance
    Configuration:
      path: absolute installation path of PowerPDF
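
Before entering a Tesseract path in the configuration, it can help to confirm that the path is absolute and points to a working executable. The Python sketch below is one possible check, not part of LogicalDOC: the helper names are hypothetical, and it relies only on the `--version` option of the tesseract command-line tool.

```python
import os
import subprocess


def is_executable_file(path: str) -> bool:
    """True if path is absolute and refers to an executable regular file."""
    return (
        os.path.isabs(path)
        and os.path.isfile(path)
        and os.access(path, os.X_OK)
    )


def tesseract_version(path: str, timeout: int = 10) -> str:
    """Run `<path> --version` and return the first line of its output."""
    if not is_executable_file(path):
        raise FileNotFoundError(f"Not an absolute path to an executable: {path}")
    result = subprocess.run(
        [path, "--version"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    # Some builds write the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    # Hypothetical location; replace with the path used on your server.
    candidate = "/usr/bin/tesseract"
    if is_executable_file(candidate):
        print(tesseract_version(candidate))
    else:
        print(f"{candidate} is not usable as the Tesseract path")
```

The `/usr/bin/tesseract` location is only an example; the actual path depends on the operating system and on how Tesseract was installed.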