Search Classification Plugin

Introduction

This document describes how to configure and use the Search Classification plugin. The plugin classifies documents in the Page Process module of the workflow using Lucene-based indexing. Classification is how Ephesoft Transact associates a document with a document type. This document applies to Ephesoft Transact 2019.1 and above.

Configuring the Search Classification Plugin

Perform the following steps to configure the SEARCH_CLASSIFICATION plugin in the Page Process module. You must have administrator rights to complete these steps.

  1. Launch Ephesoft Transact and navigate to Administrator > Batch Class Management. Enter login credentials when prompted.
  2. Select an existing batch class and click Open, or create a new batch class. You can also copy or import an existing batch class, then modify it to create a new batch class.
    The following figure illustrates the SEARCH_CLASSIFICATION plugin in a typical batch class configuration.

Navigation to SEARCH_CLASSIFICATION Plugin

The SEARCH_CLASSIFICATION plugin works independently of the MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN in the Page Process module. Both plugins can be present in the module.

  3. Select the SEARCH_CLASSIFICATION plugin to set up the configuration. The Plugin Configuration screen for the SEARCH_CLASSIFICATION plugin displays.

SEARCH_CLASSIFICATION Plugin Configuration Screen

Configurable Properties

The following table lists and defines the configurable properties for the Search Classification plugin:

| Configurable Property | Type of Value | Value Options | Description |
|---|---|---|---|
| Lucene Valid Extensions | List of Values | xml, html | This field defines the valid extension of the input file and is applied when classifying document types for the specified file format. |
| Lucene Min Term Frequency | Integer | NA | This field sets the frequency below which terms are ignored in the source document. |
| Lucene Min Document Frequency | Integer | NA | This field sets the minimum document frequency. A word that does not occur in at least the number of documents specified in this field is ignored. |
| Lucene Min Word Length | Integer | NA | This field sets the minimum word length. Words in the HOCR content that are shorter than this setting are ignored. |
| Lucene Min Query Terms | Integer | NA | This field sets the minimum number of query terms that will be included in any generated query. |
| Lucene Top Level Field | String | NA | This property is used to configure the default field for query terms. |
| Lucene No Of Pages | Integer | NA | This property specifies the number of documents to be returned in a query search. |
| Lucene Index Fields | List of Values | title, summary | This property is used as an index field for searching the document type using Lucene. |
| Lucene Stop Words | List of Values | title, name | This property sets the words to be ignored when classifying a document. |
| Search Classification Switch | List of Values | ON, OFF | This property enables or disables the SEARCH_CLASSIFICATION plugin for the batch class. |
| Search Classification Max Results | Integer | NA | This field defines the maximum number of alternate value results that will be generated in the batch.xml. The default value for this field is 5 in Ephesoft Transact to control the overall size of the batch.xml file. |
| First Page Confidence Score Value | Integer | NA | This property is used to update the confidence score based on the first page type. |
| Middle Page Confidence Score Value | Integer | NA | This property is used to update the confidence score based on the middle page type. |
| Last Page Confidence Score Value | Integer | NA | This property is used to update the confidence score based on the last page type. |

  4. Define the settings, then click Deploy to save and enable the changes.
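
As a rough illustration of how the term-related properties in the table above interact, the following Python sketch filters the words of a page using a minimum word length, a stop-word list, a minimum term frequency, and a minimum document frequency. It is not Ephesoft's or Lucene's implementation. The function name `select_query_terms`, its parameters, the sample data, and the order in which the filters run are all assumptions made for illustration.

```python
from collections import Counter
import re


def select_query_terms(page_text, doc_freq, min_term_freq=1, min_doc_freq=1,
                       min_word_len=3, stop_words=frozenset()):
    """Return the sorted terms of page_text that survive all four filters.

    doc_freq is a hypothetical mapping of term -> number of learned
    documents that contain the term.
    """
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    term_freq = Counter(words)
    kept = []
    for term, freq in term_freq.items():
        if len(term) < min_word_len:             # Lucene Min Word Length
            continue
        if term in stop_words:                   # Lucene Stop Words
            continue
        if freq < min_term_freq:                 # Lucene Min Term Frequency
            continue
        if doc_freq.get(term, 0) < min_doc_freq:  # Lucene Min Document Frequency
            continue
        kept.append(term)
    return sorted(kept)


# Hypothetical sample data, for illustration only.
sample_text = "Invoice number 42. Invoice total due. Name: Invoice Co"
sample_doc_freq = {"invoice": 5, "total": 3, "number": 1, "due": 4, "name": 6}

terms = select_query_terms(sample_text, sample_doc_freq,
                           min_doc_freq=2, stop_words={"name"})
print(terms)
```

Raising `min_term_freq` to 2 in this sample would leave only `invoice`, because it is the only word that appears more than once on the page.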

Search Classification Execution Process

This plugin operates in the Page Process module after all batch-level import processes are complete.

Ephesoft recommends that document learning is completed for the batch class prior to using this plugin. This plugin classifies incoming document images using Lucene-based indexing. This plugin functions in two stages when classifying documents:

  • Learning: The learning process generates indexes for the learned documents. The plugin uses these learned index files, which were created earlier in the workflow, to classify each document.
  • Classification: The learned data serves as the reference for classification. When the plugin classifies a document type, it takes the HOCR content extracted from the image and checks it against the data learned in the learning stage.

The plugin generates HOCR content similar to the RecoStar HOCR and Tesseract HOCR plugins.

  • After all images and documents in the batch instance have been classified, this plugin writes the data to the batch.xml file for the document type that is being classified.
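
The two stages above can be sketched in a few lines of Python. This is a toy model only: it swaps Lucene's indexing and scoring for a simple term-overlap score, and the names `learn`, `classify`, the page type names, and the sample text are hypothetical. The `max_results` parameter is meant to mirror the idea behind Search Classification Max Results, which caps how many alternate results are kept.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def learn(samples):
    """Learning stage: build a toy index of page type -> term counts."""
    index = {}
    for page_type, text in samples:
        index.setdefault(page_type, Counter()).update(tokenize(text))
    return index


def classify(index, page_text, max_results=5):
    """Classification stage: rank page types by term overlap.

    The confidence is the percentage of the page's distinct terms that
    also occur in the learned data for a page type (an assumption, not
    Lucene's scoring). Only the top max_results entries are returned.
    """
    terms = set(tokenize(page_text))
    if not terms:
        return []
    scored = []
    for page_type, counts in index.items():
        matched = sum(1 for term in terms if term in counts)
        scored.append((page_type, round(100 * matched / len(terms), 2)))
    # Highest confidence first; ties are broken by page type name.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:max_results]


# Hypothetical learned samples and incoming page text.
index = learn([
    ("Invoice_First_Page", "invoice number total amount due"),
    ("W2_First_Page", "wage tax statement employer"),
])
results = classify(index, "Invoice total due today")
print(results)
```

In this sketch, three of the four distinct words on the incoming page occur in the learned invoice sample, so that page type ranks first and the other page type is kept as an alternate result.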

Troubleshooting

The following table lists error messages that may occur with this plugin, along with a possible root cause for each.

| Error message | Possible root cause |
|---|---|
| No index files exist inside folder | The document learning is not complete for the batch class. |
| Page Types not configured in Database. | The index data contains invalid indexes for the batch class. |
| CorruptIndexException while reading Index. | The index data is corrupt in the index folder for the batch class. |
| IOException while reading Index | The plugin is unable to open the index data due to corruption in the get index file process, or there is a lock on the index file. |
| No valid extensions are specified in resources | The page contains an invalid HOCR file for processing. |
| No pages found in batch XML. | The pages tag was not found in the incoming batch.xml file. |
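
Some of these conditions can be checked before a batch is run. The following Python sketch is a hypothetical pre-flight check, not part of Ephesoft Transact: the function name `preflight`, its parameters, and the assumption that the pages tag in batch.xml is an element named `Pages` are all illustrative. Each check returns the message of the table row it most closely relates to.

```python
import os
import xml.etree.ElementTree as ET


def preflight(index_dir, hocr_names, valid_extensions, batch_xml_text):
    """Return the table messages whose related condition was detected."""
    findings = []

    # Learning incomplete: the index folder is missing or empty.
    if not os.path.isdir(index_dir) or not os.listdir(index_dir):
        findings.append("No index files exist inside folder")

    # HOCR file with an extension outside Lucene Valid Extensions.
    allowed = {ext.lower().lstrip(".") for ext in valid_extensions}
    for name in hocr_names:
        ext = os.path.splitext(name)[1].lower().lstrip(".")
        if ext not in allowed:
            findings.append("No valid extensions are specified in resources")
            break

    # Assumption: the pages tag is an element named "Pages".
    root = ET.fromstring(batch_xml_text)
    if next(root.iter("Pages"), None) is None:
        findings.append("No pages found in batch XML.")

    return findings
```

A caller might pass the batch class index folder, the names of the HOCR files in the batch instance folder, the configured extension list, and the contents of batch.xml, then log any returned messages before starting the workflow.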

Conclusion

This concludes instructions to configure and troubleshoot the Search Classification plugin for a batch class.
