Available: on-premises, cloud
Introduction
This document describes how to configure and use the Search Classification plugin. The plugin classifies documents in the Page Process module of the workflow using Lucene-based indexing. Classification is how Ephesoft Transact associates each document with a Document Type. This document applies to Ephesoft Transact 2019.1 and above.
Configuring the Search Classification Plugin
Perform the following steps to configure the SEARCH_CLASSIFICATION plugin in the Page Process module. You must have administrator rights to complete these steps.
1. Launch Ephesoft Transact and navigate to Administrator > Batch Class Management. Enter login credentials when prompted.
2. Select an existing batch class and click Open, or create a new batch class. You can also copy or import an existing batch class, then modify it to create a new batch class.
The following figure illustrates the SEARCH_CLASSIFICATION plugin in a typical batch class configuration.
Navigation to SEARCH_CLASSIFICATION Plugin
The SEARCH_CLASSIFICATION plugin works independently of the MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN in the Page Process module. Both plugins can be present in the module.
3. Select the SEARCH_CLASSIFICATION plugin to set up the configuration. The Plugin Configuration screen for the SEARCH_CLASSIFICATION plugin displays.
SEARCH_CLASSIFICATION Plugin Configuration Screen
Configurable Properties
The following table lists and defines the configurable properties for the Search Classification plugin:
Configurable Property | Type of Value | Value Options | Description |
---|---|---|---|
Lucene Valid Extensions | List of Values | xml, html | This field defines the valid extension of the input file and is applied when classifying document types for the specified file format. |
Lucene Min Term Frequency | Integer | NA | This field sets the frequency below which terms will be ignored in the source document. |
Lucene Min Document Frequency | Integer | NA | This field sets the minimum number of documents in which a word must occur. Words that occur in fewer documents than this value are ignored. |
Lucene Min Word Length | Integer | NA | This field sets the minimum word length. Words in the HOCR content that are shorter than this value are ignored. |
Lucene Min Query Terms | Integer | NA | This field sets the minimum number of query terms that will be included in any generated query. |
Lucene Top Level Field | String | NA | This property is used to configure the default field for query terms. |
Lucene No Of Pages | Integer | NA | This property specifies the number of documents to be returned in a query search. |
Lucene Index Fields | List of Values | title, summary | This property is used as an index field for searching the document type using Lucene. |
Lucene Stop Words | List of Values | title, name | This property sets the words to be ignored when classifying a document. |
Search Classification Switch | List of Values | ON, OFF | This property enables or disables the SEARCH_CLASSIFICATION plugin for the batch class. |
Search Classification Max Results | Integer | NA | This field defines the maximum number of alternate value results that will be generated in the batch.xml. The default value for this field is 5 in Ephesoft Transact to control the overall size of the batch.xml file. |
First Page Confidence Score Value | Integer | NA | This property is used to update the confidence score based on the first page type. |
Middle Page Confidence Score Value | Integer | NA | This property is used to update the confidence score based on the middle page type. |
Last Page Confidence Score Value | Integer | NA | This property is used to update the confidence score based on the last page type. |
4. Define the settings, then click Deploy to save and enable the changes.
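The term-filtering properties in the table above (Lucene Min Term Frequency, Lucene Min Document Frequency, Lucene Min Word Length, and Lucene Stop Words) can be easier to tune with a concrete picture of what each threshold removes. The following is a minimal sketch in plain Python, not the plugin's or Lucene's actual code. The function name `select_terms`, its parameters, and the sample values are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_terms(page_text, learned_docs, min_term_freq=2, min_doc_freq=1,
                 min_word_len=3, stop_words=frozenset()):
    """Return the terms of page_text that survive all four filters.

    Hypothetical illustration only: learned_docs stands in for the texts
    of previously learned documents, and the defaults are arbitrary.
    """
    term_counts = Counter(tokenize(page_text))
    learned_term_sets = [set(tokenize(doc)) for doc in learned_docs]

    kept = []
    for term, count in term_counts.items():
        if term in stop_words:
            continue  # stop words are always ignored
        if len(term) < min_word_len:
            continue  # shorter than the minimum word length
        if count < min_term_freq:
            continue  # too rare within the page being classified
        doc_freq = sum(term in terms for terms in learned_term_sets)
        if doc_freq < min_doc_freq:
            continue  # appears in too few learned documents
        kept.append(term)
    return sorted(kept)
```

For example, with a page text of `"Invoice invoice total total of of name name"`, three learned documents (`"invoice total due"`, `"invoice number"`, `"name of customer"`), a minimum document frequency of 2, and `name` as a stop word, only `invoice` remains: `total` occurs in a single learned document, `of` is shorter than three characters, and `name` is a stop word.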
Search Classification Execution Process
This plugin operates in the Page Process module after all batch-level import processes are complete.
Ephesoft recommends that document learning be completed for the batch class before using this plugin. This plugin classifies incoming document images using Lucene-based indexing. This plugin functions in two stages when classifying documents:
- Learning: The plugin generates indexes from the learned files that were created earlier in the workflow. It uses these indexes to classify each document.
- Classification: The plugin takes the HOCR content extracted from the image and verifies it against the data it learned in the learning stage to determine the document type.
The plugin generates HOCR content similar to the RecoStar HOCR and Tesseract HOCR plugins.
After all images and documents in the batch instance have been classified, this plugin writes the data to the batch.xml file for the document type that is being classified.
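The two stages described above can be sketched as a toy example. This is a simplified stand-in that scores pages by shared-term overlap. It does not use Lucene and is not the plugin's implementation. The function names, the page type names, and the 0 to 100 score scale are assumptions made for illustration. The `max_results` parameter mirrors the role of the Search Classification Max Results property.

```python
from collections import Counter
import re


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def learn(samples):
    """Stage 1 (learning): build a toy index of term counts per page type.

    samples is a list of (page_type, text) pairs; the names are hypothetical.
    """
    index = {}
    for page_type, text in samples:
        index.setdefault(page_type, Counter()).update(tokenize(text))
    return index


def classify(index, page_text, max_results=5):
    """Stage 2 (classification): rank learned page types for one page.

    The score is the percentage of the page's distinct terms that also
    occur in the learned data for a page type. Only the best max_results
    alternatives are returned.
    """
    if not index:
        raise ValueError("no learned index; complete learning first")
    page_terms = set(tokenize(page_text))
    scores = []
    for page_type, counts in index.items():
        overlap = sum(1 for term in page_terms if term in counts)
        score = 100 * overlap / len(page_terms) if page_terms else 0.0
        scores.append((page_type, round(score, 2)))
    # Highest score first; ties are broken alphabetically by page type.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:max_results]
```

Calling `classify` on an empty index raises an error in this sketch, which loosely parallels the recommendation to complete document learning before the plugin runs.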
Troubleshooting
The following table lists the error messages that may occur with this plugin, along with the possible root cause of each.
Error message | Possible root cause |
---|---|
No index files exist inside folder | The document learning is not complete for the batch class. |
Page Types not configured in Database. | The index data contains invalid indexes for the batch class. |
CorruptIndexException while reading Index. | The index data is corrupt in the index folder for the batch class. |
IOException while reading Index | The plugin is unable to open the index data, either because of corruption encountered while retrieving the index files or because there is a lock on the index file. |
No valid extensions are specified in resources | The page contains an invalid HOCR file for processing. |
No pages found in batch XML. | The pages tag was not found in the incoming batch.xml file. |
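Two of the conditions in the table above (an empty index folder and a batch.xml file without a pages tag) can be checked before a batch is run. The following is a minimal sketch under stated assumptions: the function name `preflight` is hypothetical, the index folder location must be supplied by the caller, and the element name `Pages` is an assumption about the batch.xml layout that should be confirmed against an actual file.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def preflight(index_dir, batch_xml_path, pages_tag="Pages"):
    """Return a list of problems found; an empty list means none were found.

    Assumptions (not confirmed by the product documentation): the index
    folder holds its files directly, and the pages element is named by
    pages_tag.
    """
    problems = []

    index_path = Path(index_dir)
    has_files = index_path.is_dir() and any(
        entry.is_file() for entry in index_path.iterdir()
    )
    if not has_files:
        problems.append("No index files exist inside folder")

    try:
        root = ET.parse(str(batch_xml_path)).getroot()
    except (OSError, ET.ParseError) as exc:
        problems.append(f"batch.xml could not be read: {exc}")
        return problems

    # Compare local names so that a namespaced tag such as
    # "{namespace}Pages" still matches.
    found = any(
        element.tag.rsplit("}", 1)[-1] == pages_tag for element in root.iter()
    )
    if not found:
        problems.append("No pages found in batch XML.")
    return problems
```

This check only detects missing files and a missing pages element. It does not validate index contents, so corrupt or locked index files would still surface as errors when the plugin runs.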
Conclusion
This concludes instructions to configure and troubleshoot the Search Classification plugin for a batch class.