OCR - LogicalDOC Documentation

Turn paper documents into full-text searchable digital files and manage them in a paperless document management system that incorporates advanced OCR software. Quickly and easily apply all the tools and functions of electronic document management to hard-copy documents and previously scanned files. By leveraging the best of breed OCR technologies, LogicalDOC is able to extract texts from images and raster PDFs acquired from massive scans from your multi-function device.

Performance drawbacks

OCR processing usually takes a long time and high CPU load to index a single document, so if you activate OCR, expect to have much higher time to index your repository

OCR of a scan

You don't have to explicitly ask for OCRing your files, just store them in LogicalDOC and the OCR will be used automatically at indexing time to extract the texts from your images or raster PDFs.

Remember that this is not a zonal OCR, it just extracts all the texts in order to allow you to perform full-text searches.

Configuration of the OCR

You can set how the OCR works by changing the configurations in this panel.

Enabled: to enable or disable the OCR processing
Timeout: maximum number of seconds to process a single file
Include: comma-separated list of file name patterns for the files to include
Exclude: comma-separated list of file name patterns for the files to exclude
Max. size: maximum size of the file to be processed
Text threshold: used for PDFs only, indicates the weight of textual contents against the other kind of contents. If the textual contents is less than this threshold, the document is interpreted as raster and the OCR is executed
Image min. width: minimum dimension for the images to be processed
Rendering res.: Sometimes the file needs to be printed as PDF, this parameter specifies the print resolution
Batch: number of pages processed by the OCR at once
Batch timeout: maximum number of seconds to process a batch
Allowed threads: maximum number of threads allowed to use the OCR concurrently
Wait for thread: maximum number of seconds to wait for getting access to the OCR
Error on empty extraction: if an error must be raised in case the OCR did not extract anything
Record events: if you want to record the OCR events (you will see them in the History tab)
Engine: what engine to use

Supported OCR engines

You can choose one of the supported OCR engines
OCR Engine	Description	Configuration
Tesseract	The famous open source OCR engine handled by Google	path: absolute installation path of the tesseract executable (make sure to put this path in the allowed commands)
OCR Web Service	A lightweight online OCR engine	username: your own OCR Web Service username licenseCode: your own license code associated to your OCR Web Service account
Power PDF Advanced	An OCR engine developed by Tungsten	path: absolute installation path of Power PDF Advanced (make sure to put this path in the allowed commands)

History

In the History tab, you see the list of recorded events related to the OCR extractions: