Tesseract HOCR Plugin

Overview

The TESSERACT_HOCR plugin is commonly used in the Page Process module.

This plugin reads the image files listed in the batch.xml file for a batch, generates an HOCR.xml file for each image, and updates the batch.xml accordingly.

Configuration

Configurable Properties

The following properties of the TESSERACT_HOCR plugin can be configured from the UI:

Configurable property | Type of value | Value options | Description
Tesseract Switch | List of values | ON, OFF | Turns this plugin on or off. When the switch is OFF, the plugin does nothing.
Tesseract color switch | List of values | ON, OFF | Tesseract cannot read color TIFFs. When this switch is ON, PNG files are sent for OCR instead. Turn it ON for batch classes that are expected to contain color TIFF images.
Tesseract Language | String | N/A | Selects the language used for OCR. At present, Tesseract supports only a single language per image file. For example, specify ‘eng’ for English or ‘tur’ for Turkish.
Tesseract Version | String | N/A | Defines the Tesseract version installed on the system. For example, specify ‘tesseract_version_3’ for Tesseract 3.0 or ‘tesseract_version_2’ for Tesseract 2.0.
Tesseract Valid Extensions | Multi-select | tif, gif, png | The file extensions that this plugin supports.

Steps of execution

    • This plugin runs in the Page Process phase of Transact, after import processing is complete.
    • The plugin performs OCR on all input images.
    • When processing finishes, it writes the name of each HOCR file to batch.xml and generates HOCR output as HTML and HOCR.xml.
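The per-image OCR step above can be pictured as one Tesseract command-line run per input image. The following is a minimal sketch, not the plugin's actual implementation: this page does not say how the plugin invokes Tesseract, so the use of the `tesseract` command line, the function names, and the `_HOCR` output suffix are all assumptions made for illustration. The sketch only builds the argument lists and does not execute anything.

```python
from pathlib import Path
from typing import Iterable, List


def build_hocr_command(image: Path, output_dir: Path, language: str = "eng") -> List[str]:
    """Return the argument list for one Tesseract hOCR run (nothing is executed)."""
    # Hypothetical naming scheme: <image name>_HOCR as the output base name.
    output_base = output_dir / (image.stem + "_HOCR")
    # "hocr" is the stock Tesseract config file that switches the output to hOCR.
    # "-l" selects the single OCR language, for example "eng" or "tur".
    return ["tesseract", str(image), str(output_base), "-l", language, "hocr"]


def plan_batch(images: Iterable[str], output_dir: str, language: str = "eng") -> List[List[str]]:
    """Build one command per input image, in input order."""
    return [build_hocr_command(Path(name), Path(output_dir), language) for name in images]
```

Each returned list could be passed to `subprocess.run` on a machine where Tesseract is installed. The extension of the file Tesseract writes depends on the installed Tesseract version, so the sketch deliberately stops at the output base name.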

Dependency

This plugin requires only an image as input (PNG if the color switch is ON, TIFF if the color switch is OFF). Therefore, either the “Create OCR Input Plugin” or the “Create Display Image Plugin” must run before this plugin.
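The relationship between the color switch and the expected input format can be expressed as a small lookup. This is an illustrative sketch only: the function name and the mapping constant are hypothetical, and it assumes that TIFF inputs carry the `tif` extension shown in the valid extensions list.

```python
from typing import Iterable, List

# Assumption: each switch position maps to exactly one input extension.
INPUT_EXTENSION_BY_COLOR_SWITCH = {"ON": "png", "OFF": "tif"}


def select_ocr_inputs(file_names: Iterable[str], color_switch: str) -> List[str]:
    """Keep only the files whose extension matches the color switch setting."""
    key = color_switch.strip().upper()
    if key not in INPUT_EXTENSION_BY_COLOR_SWITCH:
        raise ValueError(f"color switch must be ON or OFF, got {color_switch!r}")
    wanted = "." + INPUT_EXTENSION_BY_COLOR_SWITCH[key]
    # Compare case-insensitively so that "PAGE1.PNG" is treated like "page1.png".
    return [name for name in file_names if name.lower().endswith(wanted)]
```

For a folder that holds both `page1.tif` and `page1.png`, the sketch picks the PNG when the switch is ON and the TIFF when it is OFF.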

Troubleshooting

The following table lists error messages that this plugin can produce and explains what each one means.

Error message | Possible root cause
Tesseract Base path not configured. | The environment variable for Tesseract is either not set or points to an incorrect path.
Space found in the name of image: xyz.png. So it cannot be processed | One or more spaces were found in the file name. Remove the spaces from the image name and restart the batch from the Page Process module.
No valid extensions are specified in resources | No extensions were specified for this plugin.
Image Processing or XML updation failed for image: xyz | The image file being processed has a file extension that isn’t included in the list of valid extensions for the plugin.
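Three of the conditions in the table can be checked before a batch is submitted. The sketch below is a hypothetical pre-flight helper, not part of the plugin. It reuses the message texts from the table, but the function name, the order of the checks, and the decision to report several problems at once are assumptions.

```python
from pathlib import Path
from typing import Iterable, List


def preflight_errors(image_name: str, valid_extensions: Iterable[str]) -> List[str]:
    """Return the messages from the table that this file would likely trigger."""
    # Normalize entries such as " .PNG " to "png" and drop empty ones.
    extensions = {e.strip().lower().lstrip(".") for e in valid_extensions if e.strip()}
    if not extensions:
        return ["No valid extensions are specified in resources"]

    errors = []
    if " " in image_name:
        errors.append(
            f"Space found in the name of image: {image_name}. So it cannot be processed"
        )
    suffix = Path(image_name).suffix.lower().lstrip(".")
    if suffix not in extensions:
        errors.append(f"Image Processing or XML updation failed for image: {image_name}")
    return errors
```

An empty result means none of these three conditions applies. It does not guarantee that OCR will succeed, and it does not cover the base path error, which depends on the environment of the server.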

 
