Turn paper documents into full-text searchable digital files and manage them in a paperless document management system that incorporates advanced OCR software. Quickly and easily apply all the tools and functions of electronic document management to hard-copy documents and previously scanned files. By leveraging the best of breed OCR technologies, LogicalDOC is able to extract texts from images and raster PDFs acquired from massive scans from your multi-funtion device.
OCR processing usually takes a long time and high CPU load to index a single document, so if you activate OCR, expect to have much higher time to index your repository
OCR of a scan
You don't have to explicitly ask for OCRing your files, just store them in LogicalDOC and the OCR will be used automatically at indexing time to extract the texts from your images or raster PDFs.
Remember that this is not a zonal OCR, it just extracts all the texts in order to allow you to perform full-text searches.
Configuration of the OCR
You can set how the OCR works by changing the configurations in this panel.
- Enabled: to enable or disable the OCR processing
- Timeout: maximum number of seconds to process a single file
- Include: comma-separated list of file name patterns for the files to include
- Exclude: comma-separated list of file name patterns for the files to exclude
- Max. size: maximum size of the file to be processed
- Text threshold: used for PDFs only, indicates the weight of textual contents against the other kind of contents. If the textual contents is less than this threshold, the document is interpreted as raster and the OCR is executed
- Image min. width: minimum dimension for the images to be processed
- Rendering res.: Sometimes the file needs to be printed as PDF, this parameter specify the print resolution
- Rendering res. (barcodes): Sometimes the embedded barcodes need to be printed as PDF, this parameter specify the print resolution
- Barcode formats: comma separated list of accepted barcode formats(leave blank for all the supported formats)
- Batch: number of pages processed by the OCR at once
- Batch timeout: maximum number of seconds to process a batch
- Engine: what engine to use
Supported OCR engines
You can choose one of the supported OCR engines
|Tesseract||The famous open source OCR engine handled by Google||
path: absolute installation path of the tesseract executable
|OCR Web Service||A lightweight online OCR engine||
username: your own OCR Web Service username
licenseCode: your own license code associated to your OCR Web Service account
|Power PDF Advanced||An OCR engine developed by Nuance||
path: absolute installation path of Power PDF Advanced