OCR

Turn paper documents into full-text searchable digital files and manage them in a paperless document management system that incorporates advanced OCR software. Quickly and easily apply all the tools and functions of electronic document management to hard-copy documents and previously scanned files. By leveraging the best of breed OCR technologies, LogicalDOC is able to extract texts from images and raster PDFs acquired from massive scans from your multi-funtion device.

Performance drawbacks

OCR processing usually takes a long time and high CPU load to index a single document, so if you activate OCR, expect to have much higher time to index your repository

OCR of a scan

You don't have to explicitly ask for OCRing your files, just store them in LogicalDOC and the OCR will be used automatically at indexing time to extract the texts from your images or raster PDFs.

Remember that this is not a zonal OCR, it just extracts all the texts in order to allow you to perform full-text searches.

Configuration of the OCR

You can set how the OCR works by changing the configurations in this panel.

  • Enabled: to enable or disable the OCR processing
  • Timeout: maximum number of seconds to process a single file
  • Include: comma-separated list of file name patterns for the files to include
  • Exclude: comma-separated list of file name patterns for the files to exclude
  • Max. size: maximum size of the file to be processed
  • Text threshold: used for PDFs only, indicates the weight of textual contents against the other kind of contents. If the textual contents is less than this threshold, the document is interpreted as raster and the OCR is executed
  • Image min. width: minimum dimension for the images to be processed
  • Rendering res.: Sometimes the file needs to be printed as PDF, this parameter specifies the print resolution
  • Batch: number of pages processed by the OCR at once
  • Batch timeout: maximum number of seconds to process a batch
  • Allowed threads: maximum number of threads allowed to use the OCR concurrently
  • Wait for thread: maximum number of seconds to wait for getting access to the OCR
  • Error on empty extraction: if an error must be raised in case the OCR did not extract anything
  • Record events: if you want to record the OCR events(you will see them in the History tab)
  • Engine: what engine to use

Supported OCR engines

You can choose one of the supported OCR engines

OCR Engine Description Configuration
Tesseract The famous open source OCR engine handled by Google

path: absolute installation path of the tesseract executable (make sure to put this path in the allowed commands)

OCR Web Service A lightweight online OCR engine

username: your own OCR Web Service username

licenseCode: your own license code associated to your OCR Web Service account

Power PDF Advanced An OCR engine developed by Nuance

path: absolute installation path of Power PDF Advanced (make sure to put this path in the allowed commands)

History

In the History tab you see the list of recorded events related to the OCR extractions:

FSVpAnkX0AAUa6o