Key-Value Extraction Plugin

Available: on-premises, cloud

Introduction

In this document, you will learn how to configure and perform key-value (KV) extraction in Ephesoft Transact. With key-value extraction, Ephesoft Transact will search for user-defined keys that are static text on forms and then identify the associated values that are considered the dynamic portions of the document.

This process varies based on the version:

  • In 2020.1 and below, on both Windows and Linux, all KV rules search for the key and value using regular expression (regex) pattern matching.
  • In 2020.1.02 and above on Windows, Intelligent Character Recognition (ICR) and Optical Mark Recognition (OMR) type KV rules use an algorithm that takes a snippet of the value area from a document and sends it through a second ICR/OMR operation. These ICR/OMR type KV rules allow for quicker configuration than Fixed Form Extraction and are an alternative method of ICR/OMR configuration for certain use cases.


Figure 1. Extraction Categories

Prerequisites

This section outlines the prerequisites required to perform key-value extraction.

  1. You will need a batch class with a document type configured. For detailed steps, refer to Add New Document Type.
  2. You will need to add the KEY_VALUE_EXTRACTION plugin to the Extraction module for the batch class. For more information on managing plugins, refer to Configuring Plugins.

Configure KV Extraction Plugin

This section provides information on how to configure the KEY_VALUE_EXTRACTION plugin. This plugin only needs to be configured once per batch class.

To navigate to the plugin:

  1. Go to the Batch Class Management page.
  2. Select your batch class and click Open.
  3. Go to Modules > Extraction > KEY_VALUE_EXTRACTION.

The following are the configurable properties for this plugin.

Configurable property | Options | Description
Enable ICR/OMR Extraction* | ON, OFF | Enables hand print extraction, checkbox detection, and signature detection. This switch must be set to ON for these ICR/OMR type extractions to occur.
ICR/OMR Engine* | Default, RecoStar | Defines the ICR/OMR engine that will be used for hand print extraction, checkbox detection, and signature detection. The default engine is RecoStar.
KV Extraction Switch | ON, OFF | Enables key-value extraction. This switch must be set to ON for key-value extraction to occur.

* Available in 2020.1.02 and above (Windows).

Create Index Fields

Create index fields for your document type. For help, refer to Add New Index Field.

Create Extraction Rule

This section describes how to create an extraction rule. Extraction rules are set at the index field level for a document type. KV extraction uses regular expressions to define the patterns used to find index field keys and their associated values in a document.

Note: You can have multiple extraction rules for a single index field.
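As a rough illustration of the key-then-value idea, the following Python sketch finds a key with one regex and then applies a value regex to the text that follows it. It is only an approximation: Transact works from OCR output and overlay positions, the function name extract_value and the sample line are invented for this example, and Python's regex dialect may differ from the one Transact uses.

```python
import re

def extract_value(line, key_regex, value_regex):
    """Return the first value match that follows the key, or None."""
    key = re.search(key_regex, line)
    if key is None:
        return None  # key not found, so there is no anchor for the value
    # Only look at the text after the key (a simplification of the value zone).
    value = re.search(value_regex, line[key.end():])
    return value.group() if value else None

# Hypothetical line of recognized text.
line = "Invoice Amount (USD): $1,234.50"
amount = extract_value(line, r"Invoice Amount \(USD\)", r"\$\d[\d,]*\.\d{2}")
print(amount)
```

Here the key pattern anchors the search and the value pattern describes the dynamic part of the document, which mirrors how a rule pairs a Key regex with a Value regex.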

Perform the following steps to create and build the extraction rule:

  1. From the Index Fields folder, select your index field.


Figure 5. Select Index Field

This will open the KV Extraction Rule page by default.


Figure 6. Open KV Extraction Rule

  2. Click Add to define a new key-value pair.


Figure 7. Open KV Extraction Rule

Upload Files

  3. The KV Extraction rule builder will display. Expand the Advance KV Test panel by clicking the arrow at the bottom of the page.


Figure 8. Expand Advance KV Test

  4. Click Select Files or drag and drop files into the box to upload a file.

Note: Supported file formats are PDF and TIFF.


Figure 9. Select and Upload Files

Determine Your Extraction Method

If you are using Ephesoft Transact 2020.1 or below, skip to Machine Print and continue configuring the extraction rule. The configuration options described there are the standard KV extraction options from 4.5.0.0 to 2020.1. This section is only applicable to Ephesoft Transact 2020.1.02 or above.

Determine the extraction method that applies to the field you are extracting:

Extraction Method | Description
Machine Print | This is an OCR-based extraction method and has the quickest performance of the available extraction methods. This is the standard KV extraction option, which can extract machine-printed text but cannot extract hand-printed text.
Hand/Machine Print | This is an ICR-based extraction method. Select this to extract hand-printed text.
Signature Detection | This is an OMR-based extraction method. Select this to configure an extraction rule to determine if a signature is present or not. This will return a “true” or “false” value.
Checkbox Detection | This is an OMR-based extraction method. Select this to configure an extraction rule to determine if a checkbox or mark is selected or not (refer to Figure 1 for examples of mark types). This will return a “true” or “false” value.

Machine Print

  1. From the Extraction Method dropdown, select Machine Print.

Note: If you are using Ephesoft Transact 2020.1 or below, you do not need to select the extraction method, as this option is the default.

Edit KV Overlays

Use the overlays to define the key-value pair. The overlays are color-coded boxes that are superimposed on the uploaded files. These overlays are resizable and can be dragged across the image to select the portions of the file to extract.

  2. Drag and resize the key overlay (green) to the text or label that will serve as the key. In the example below, the key is the label “Landlord/Lessor:”


Figure 10. Resize Key Overlay

  3. Click the key overlay to open the Suggest Regex window. This will display the text captured inside the overlay.


Figure 11. Suggest Regex Box

  4. Select your chosen regex and click OK. This regex will be added to the Key field in the left column.

Note: If a key includes special characters that already hold meaning in a regex, such as slashes ( / \ ) or parentheses ( ), you will need to escape these characters with a backslash. For example, the key text “Invoice Amount (USD)” becomes “Invoice Amount \(USD\)” once the parentheses are escaped.

  5. Drag and resize the value (red) overlay to the text area that corresponds to the key. In the example below, the value field contains the text “Dedham Mall, Pad”.


Figure 12. Resize Value Overlay

  6. Click the value overlay to open the Suggest Regex popup window. The system will read the selected text and provide a recommended regex.


Figure 13. Suggest Regex

  7. Select your chosen regex and click OK. This regex will be added to the Value field in the left column.

Note: For more information on using the regex builder, refer to Automatic Regex Suggestion and Creation.
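To see why the escaping note above matters, the short Python sketch below compares an unescaped and an escaped key pattern against the same label. Python's re module is used here only as a stand-in for illustration; the regex engine inside Transact may behave differently in other respects.

```python
import re

label = "Invoice Amount (USD)"

# Unescaped: the parentheses form a capture group, so this pattern
# matches the text "Invoice Amount USD", not the literal label.
unescaped = re.compile("Invoice Amount (USD)")

# Escaped: the backslashes make the parentheses literal characters.
escaped = re.compile(r"Invoice Amount \(USD\)")

unescaped_hit = unescaped.search(label)
escaped_hit = escaped.search(label)

print(unescaped_hit)          # no match on the real label
print(escaped_hit.group())    # the full label is matched
```

The unescaped pattern silently fails to find the key, which is the kind of result to look for when a rule returns nothing during Test KV.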

Configure Extraction Rule

  8. Use the left panel to continue building the extraction rule.

Configurable property | Options | Description
Use Existing Field for Key | N/A | When selected, this checkbox allows a previously extracted value to serve as the key for another field. If it is not selected, the key is the anchor point used to locate the value.
Fuzzy % | None, 10%, 20%, 30% | Specifies whether the key will still be identified when it is not an exact match to the key regex pattern. None: the key must be an exact match to the key regex pattern. 10%, 20%, 30%: allows for that percentage of the characters to be inaccurate. Important: If you select a fuzzy percentage, you will not be able to use a regex for the Key pattern. If you select both a fuzzy percentage and a key regex, you will not receive any results during extraction.
Fetch | ALL, FIRST, LAST | Specifies how many instances of the value should be extracted from the value zone. For example, if ALL is selected, all characters from the value zone will be extracted. If FIRST is selected, only the first matching pattern from the value zone will be extracted. Note: The recommended option when configuring a regex to extract U.S. currency is FIRST.
Page | ALL, FIRST, LAST | Specifies on which page of the document to perform key-value extraction. For example, if ALL is selected, key-value extraction will be performed on all pages in the document. If FIRST is selected, it will only be performed on the first page in the document.
Zone | ALL, TOP, RIGHT, LEFT, MIDDLE, BOTTOM | Specifies on which section of the page to perform key-value extraction. For example, if ALL is selected, key-value extraction will be performed on the entire page. If TOP is selected, it will only be performed on the top section of the page.
Weight | 0-1 | A configurable value (0-1) that is multiplied by the confidence score to calculate a new confidence score. This allows one extraction rule to be prioritized over another within the same index field.
X Offset | N/A | The application will set this value based on the placement and size of the key and value overlays.
Y Offset | N/A | The application will set this value based on the placement and size of the key and value overlays.
Is Wrapped | N/A | When selected, this checkbox allows for values to be extracted that may be wrapped to another line in the document.
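The Fuzzy % setting can be pictured with a small Python sketch. This page does not describe how Transact measures inaccurate characters, so the example assumes a plain Levenshtein edit distance and a hypothetical helper named fuzzy_key_match; treat it as one possible reading, not the product's algorithm. The key is compared as literal text, in line with the restriction that a fuzzy percentage cannot be combined with a key regex.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_key_match(candidate, key, fuzzy_pct):
    """Assumed rule: accept if at most fuzzy_pct percent of the key's characters differ."""
    allowed = int(len(key) * fuzzy_pct / 100)
    return edit_distance(candidate, key) <= allowed

key = "Landlord/Lessor:"       # 16 characters
ocr_text = "Land1ord/Lessor:"  # one misread character

print(fuzzy_key_match(ocr_text, key, 0))   # None: exact match required
print(fuzzy_key_match(ocr_text, key, 10))  # 10%: one error tolerated here
```

Under this assumption, a single OCR misread in a 16-character key is rejected with None and accepted at 10%.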

Extraction is not limited to the exact positions of the key and value defined in the extraction rule. Ephesoft Transact will search the specified fetch, page, and zone for the key pattern and return results if it finds the matching value pattern. This allows Ephesoft Transact to extract data from structured, semi-structured, and unstructured documents.
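The Fetch and Weight settings described above can be sketched in Python as follows. The helper names fetch_values and best_candidate, the sample zone text, and the confidence numbers are all invented for illustration, and the sketch assumes that ALL returns every regex match; the actual behavior in Transact may differ.

```python
import re

def fetch_values(zone_text, value_regex, fetch):
    """Return matches of value_regex in zone_text according to the Fetch option."""
    matches = [m.group() for m in re.finditer(value_regex, zone_text)]
    if fetch == "FIRST":
        return matches[:1]
    if fetch == "LAST":
        return matches[-1:]
    if fetch == "ALL":
        return matches
    raise ValueError(f"unknown Fetch option: {fetch}")

def best_candidate(candidates):
    """Pick the (value, confidence, weight) tuple with the highest weighted score."""
    return max(candidates, key=lambda c: c[1] * c[2])

zone = "Subtotal $90.00 Tax $10.00 Total $100.00"
currency = r"\$\d+\.\d{2}"

first = fetch_values(zone, currency, "FIRST")
last = fetch_values(zone, currency, "LAST")
print(first, last)

# Two rules on the same index field: the lower weight demotes rule A
# even though its raw confidence is higher (90 * 0.5 < 70 * 1.0).
winner = best_candidate([("A", 90, 0.5), ("B", 70, 1.0)])
print(winner)
```

This shows why a weight below 1 is a simple way to make one rule a fallback for another on the same index field.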

Validate, Test, and Apply Extraction Rule

  9. Click Validate Regex to confirm the value regex is a usable regular expression.


Figure 14. Click Validate Regex

  10. Click Test KV to perform a test extraction.


Figure 15. Click Test KV

Note: If you receive unexpected results or want to make any changes, click Back to return to the editor and adjust the extraction rule.

  11. Click Apply KV.


Figure 16. Click Apply KV to Save Changes

  12. This will return you to the KV Extraction Rule page. Click Apply to save this key-value rule to the index field.


Figure 17. Click Apply to Save the KV Rule

You have successfully set up a KV extraction rule for Machine Print.
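If you want to sanity-check a pattern before pasting it into the rule builder, a check similar to the Validate Regex step can be approximated offline. The Python sketch below only confirms that a pattern compiles; the function name is_valid_regex is made up for this example, and because Python's regex syntax is not guaranteed to match the one Transact uses, the Validate Regex button remains the authoritative check.

```python
import re

def is_valid_regex(pattern):
    """Return True if the pattern compiles as a Python regular expression."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(is_valid_regex(r"Invoice Amount \(USD\)"))  # escaped parentheses compile
print(is_valid_regex("Invoice Amount (USD"))      # unbalanced parenthesis does not
```

A pattern that compiles is not necessarily the pattern you intended, so it is still worth running Test KV on real documents.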

Hand/Machine Print

  1. From the Extraction Method dropdown, select Hand/Machine Print.

Edit KV Overlays

Use the overlays to define the key-value pair. The overlays are color-coded boxes that are superimposed on the uploaded files. These overlays are resizable and can be dragged across the image to select the portions of the file to extract.

  2. Drag and resize the key overlay (green) to the text or label that will serve as the key. In the example below, the key is the label “2. VETERAN’S SOCIAL SECURITY NUMBER (SSN)”.


Figure 18. Edit KV Overlays

  3. Click the key overlay to open the Suggest Regex popup window. This will display the text captured inside the overlay.

Figure 19. Suggest Regex Box

Note: If a key includes special characters that already hold meaning in a regex, such as slashes ( / \ ) or parentheses ( ), you will need to escape these characters with a backslash.

For example, the key text “SOCIAL SECURITY NUMBER (SSN)” becomes “SOCIAL SECURITY NUMBER \(SSN\)” once the parentheses are escaped.


Figure 20. Escape Special Characters

  4. Select your chosen regex and click OK. This regex will be added to the Key field in the left column.

Important: The key regex must be unique. For example, the key “VETERAN” would not be a good choice, as the word “VETERAN” appears multiple times across the page.

  5. Drag and resize the value (red) overlay to the text area that corresponds to the value. In the example above, the value field contains the text “123-45-6789”.


Figure 21. Edit Value Overlay

Important: To allow for variations in possible value length, the overlay should stretch the full length of the field.

Configure Extraction Rule

Use the left panel to continue building the extraction rule.

  6. Select the Extraction Type.

Extraction Type | Description
KV + Hand/Machine Print | Use this option when you are extracting data from documents that may be filled out in either hand print or machine print. This extraction type first performs key-value (KV) extraction. If the set Minimum KV Confidence is not met, it then performs hand/machine print extraction and chooses the best result.
Hand/Machine Print | Use this option when you are extracting data from documents that you expect to always be in hand print. This extraction type only performs hand/machine print extraction.

  7. Define the value regex. The need for this regex depends on the Extraction Type selected in step 6.
    a. If you selected KV + Hand/Machine Print, we recommend using a regex specific to the selected value text.
      1. Click the value overlay to open the Suggest Regex popup window. The system will read the selected text and provide a recommended regex. In the example below, we’ve used the suggested regex for an SSN.
      2. Select your chosen regex and click OK. This regex will be added to the Value field in the left column.


Figure 20. Suggested Regex

    b. If you selected Hand/Machine Print, we recommend using the generic regex (.+), as this matches any non-empty value.
      1. Enter “.+” in the Value field.
      2. Click Validate Regex.
  8. Select the Value Type.

Value Type | Description
Alphanumeric | Returned characters will be from an alphanumeric character set. This can be used when you are extracting names, addresses, email addresses, and full dates (for example, May 11, 2020 or 11 May 2020).
Numeric | Returned characters will be from a numeric character set. This can be used when you are extracting phone numbers, Social Security Numbers, amount totals, and all-numeric dates (for example, 5/11/2020 or 2020-5-11).

Known Issue: In versions before 2020.1.03, the Numeric option can still extract alphabetic characters. This issue was resolved in 2020.1.03.

  9. If you selected Hand/Machine Print as the extraction type, skip to step 11.
  10. If you selected KV + Hand/Machine Print as the extraction type, set the Minimum KV Confidence. This is a configurable number (1-100) that acts as a threshold to help Ephesoft Transact determine whether a second pass using ICR extraction is needed.

To determine this number, we recommend testing this across multiple real-world documents with various levels of quality. This will give you a more accurate idea of where the threshold should be. Follow the steps listed below for each document:

    a. Enter any number in this field and press Test KV.
    b. Go to Folder Management > Shared Folders > [your batch class] > test-advanced-extraction > [your document type].
    c. Locate and open the HOCR.xml file for the document you just tested.
    d. Locate the field you are configuring for extraction. Each word will be listed as a separate value.
    e. Locate the tag <OcrConfidence>. In the sample below, this is “51”.


Figure 22. OcrConfidence Tag

    f. Using multiple real-world documents, determine the threshold for when values are extracted properly. In the table below, we have sample results for documents of various quality.

From these samples, we can see accurate results when the OcrConfidence is above 50.

Note: If your documents are of good OCR quality, you may want to keep this number higher. If your documents are of poor OCR quality, you may want to keep this number lower.

    g. Set your Minimum KV Confidence to the number determined in step 10f.
  11. Keep Remove Lines selected.
    This will remove any straight horizontal and vertical lines, such as the boundaries around a text box.
  12. Deselect Auto-Resize K/V Areas.
    This option is in beta and is not currently recommended for normal, unskewed samples. When this option is selected, Ephesoft Transact will automatically detect and adjust for changes in image content size by adjusting the size and position of the key and value search areas. To use this option, draw the green overlay precisely over the key.

Important: Any additional configurable options on this page are not applicable to Hand/Machine Print extraction. Changes to Fuzzy, Fetch, Page, Zone, Weight, or Is Wrapped will not affect the extraction rule.
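The Minimum KV Confidence check in step 10 can be sketched in Python. Only the <OcrConfidence> tag name comes from this page; the surrounding XML layout in SAMPLE_HOCR, the helper names, and the choice to use the lowest word confidence for the field are assumptions made for illustration, so adapt them to the HOCR.xml files you actually see.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names other than OcrConfidence are invented.
SAMPLE_HOCR = """
<Words>
  <Word><Text>123-45</Text><OcrConfidence>51</OcrConfidence></Word>
  <Word><Text>6789</Text><OcrConfidence>88</OcrConfidence></Word>
</Words>
"""

def word_confidences(xml_text):
    """Collect every OcrConfidence value found in the XML text."""
    root = ET.fromstring(xml_text.strip())
    return [float(el.text) for el in root.iter("OcrConfidence")]

def needs_icr_pass(confidences, minimum_kv_confidence):
    """Assumed rule: run the ICR pass when the weakest word is below the threshold."""
    if not confidences:
        return True  # nothing was read by OCR, so fall back to ICR
    return min(confidences) < minimum_kv_confidence

confidences = word_confidences(SAMPLE_HOCR)
print(confidences)
print(needs_icr_pass(confidences, 50))  # threshold below the OCR confidence
print(needs_icr_pass(confidences, 60))  # threshold above the OCR confidence
```

Running a script like this over several tested documents is one way to compare OcrConfidence values side by side when choosing the threshold.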

Validate, Test, and Apply Extraction Rule

  13. Click Validate Regex to confirm the value regex is a usable regular expression.


Figure 23. Click Validate Regex

  14. Click Test KV to perform a test extraction.


Figure 24. Click Test KV

Note: If you receive unexpected results or want to make any changes, click Back to return to the editor and adjust the extraction rule. Refer to Troubleshooting for help.

  15. Click Apply KV.


Figure 25. Click Apply KV to Save Changes

  16. This will return you to the KV Extraction Rule page. Click Apply to save this key-value rule to the index field.


Figure 26. Click Apply to Save the KV Rule

You have successfully set up a KV extraction rule for Hand/Machine Print extraction.
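The difference between a value-specific regex and the generic regex in step 7 can be seen in a short Python comparison. The SSN pattern below is an assumed example, since the regex that Suggest Regex proposes is only shown in the figure, and Python's re module stands in for whatever engine Transact uses.

```python
import re

SSN_SPECIFIC = re.compile(r"\d{3}-\d{2}-\d{4}")  # assumed SSN pattern
GENERIC = re.compile(r".+")                      # generic regex from step 7

clean = "123-45-6789"
noisy = "l23-4S-67B9"  # typical misreads: l for 1, S for 5, B for 8

results = {
    text: (bool(SSN_SPECIFIC.fullmatch(text)), bool(GENERIC.fullmatch(text)))
    for text in (clean, noisy)
}
print(results)
```

The specific pattern rejects the misread value, which gives KV + Hand/Machine Print a reason to fall back to the ICR pass, while the generic pattern accepts anything and so suits the Hand/Machine Print type, where ICR always runs.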

Signature Detection

  1. From the Extraction Method dropdown, select Signature Detection.

Edit KV Overlays

Use the overlays to define the key-value pair. The overlays are color-coded boxes that are superimposed on the uploaded files. These overlays are resizable and can be dragged across the image to select the portions of the file to extract.

  2. Drag and resize the key overlay (green) to the text or label that will serve as the key. In the example below, the key is the label “22A. SIGNATURE OF VETERAN”.


Figure 27. Edit KV Overlays

  3. Click the key overlay to open the Suggest Regex popup window. This will display the text captured inside the overlay.


Figure 28. Suggest Regex

  4. Click OK to use the suggested regex. This regex will be added to the Key field in the left column.

Note: If a key includes special characters that already hold meaning in a regex, such as slashes ( / \ ) or parentheses ( ), you will need to escape these characters with a backslash. For example, the key text “SOCIAL SECURITY NUMBER (SSN)” becomes “SOCIAL SECURITY NUMBER \(SSN\)” once the parentheses are escaped.

  5. Drag and resize the value (red) overlay to the text area that corresponds to the value. In the example above, the value field is the signature box.
  6. In the left column, set the Value field to the generic regex (.+).


Figure 29. Generic Regex

  7. Click Validate Regex.


Figure 30. Click Validate Regex

Configure Extraction Rule

Use the left panel to continue building the extraction rule.

  8. Keep Remove Lines selected.
    This option is recommended, as it removes any straight horizontal and vertical lines (such as the boundary around a signature) during the extraction process.
  9. Deselect Auto-Resize K/V Areas.
    This option is in beta and is not currently recommended for normal samples. When this option is selected, Ephesoft Transact will automatically detect and adjust for changes in image content size by adjusting the size and position of the key and value search areas. To use this option, draw the green overlay precisely over the key.

Important: Any additional configurable options on this page are not applicable to Signature Detection. Changes to Fuzzy, Fetch, Page, Zone, Weight, or Is Wrapped will not affect the extraction rule.

Validate, Test, and Apply Extraction Rule

  10. Click Test KV to perform a test extraction.


Figure 31. Click Test KV

Note: If you receive unexpected results or want to make any changes, click Back to return to the editor and adjust the extraction rule. Refer to Troubleshooting for help.

  11. Click Apply KV.


Figure 32. Click Apply KV to Save Changes

  12. This will return you to the KV Extraction Rule page. Click Apply to save this key-value rule to the index field.


Figure 33. Click Apply to Save the KV Rule

You have successfully set up a KV extraction rule for Signature Detection.

Checkbox Detection

  1. From the Extraction Method dropdown, select Checkbox Detection.
  2. Select the arrow ( △ ) at the bottom of the screen to expand the Advance KV Test area.


Figure 34. Upload Blank Document

  3. Upload a blank version of the document you are configuring.

Note: You may also be able to configure the rule using a filled document. To do so, you will need an empty checkbox somewhere on the form that is the same size and shape as the checkbox you are configuring for extraction. Follow the same steps below, but use the empty checkbox to determine the Pixel Density %.

Edit KV Overlays

Use the overlays to define the key-value pair. The overlays are color-coded boxes which are superimposed on the uploaded files. These overlays are resizable and can be dragged across the image to select the portions of the file to extract.

  4. On the blank document, drag and resize the key overlay (green) to the text or label that will serve as the key. In the example below, the key is “19. AUTHORIZATION”.


Figure 35. Edit KV Overlays

  5. Click the key overlay to open the Suggest Regex popup window. This will display the text captured inside the overlay.


Figure 36. Suggest Regex

  6. Click OK to use the suggested regex. This regex will be added to the Key field in the left column.

Note: If a key includes special characters that already hold meaning in a regex, such as slashes ( / \ ) or parentheses ( ), you will need to escape these characters with a backslash. For example, the key text “SOCIAL SECURITY NUMBER (SSN)” becomes “SOCIAL SECURITY NUMBER \(SSN\)” once the parentheses are escaped.

  7. Drag and resize the value (red) overlay to the text area that corresponds to the key. In the example above, the value field is the empty checkbox.

Important: If your checkboxes vary in size or location, you may need to increase the overlay size after testing.

Configure Extraction Rule

  8. In the left column, set the Value field to the generic regex (.+).


Figure 37. Generic Regex

  9. Click Validate Regex to confirm the value regex is a usable regular expression.


Figure 38. Click Validate Regex

Configure Pixel Density % and Test

The Pixel Density % is a configurable number (0-100) that sets the percentage of the area within the value overlay that must be filled for the checkbox to be considered selected. If this percentage is met, checkbox detection returns “true”.

This number varies depending on the location and size of the checkbox, as well as the size of the mark within the checkbox. To determine this number, we first need to test on a blank checkbox.

  10. On the blank document, set the Pixel Density % to 0.
  11. Select Test KV.

Important: Any additional configurable options on this page are not applicable to Checkbox Detection. Changes to Fuzzy, Fetch, Page, Zone, or Weight will not affect the extraction rule.


Figure 39. Click Test KV

  12. In the Advance KV Test panel, locate the column labeled Confidence %. This is the pixel density percentage of your empty checkbox.


Figure 40. Blank Checkbox Pixel Density %

  13. Click Back to return to the editor.
  14. Set the Pixel Density % to a number slightly higher than your blank checkbox pixel density. In the example above, the blank checkbox is about 20, so we set the Pixel Density % to 25.
  15. Optional: Test extraction on a filled document.
    a. Return to your completed document by clicking the dropdown at the top of the page and selecting the appropriate document.


Figure 41. Return to Completed Document

Note: If you are working with a multi-page document, ensure you are viewing the relevant page.

    b. Without making any changes, click Test KV.
    c. Verify that your filled checkbox is extracted properly. In the Advance KV Test panel:
      1. The Value column should return as “true”.
      2. The Confidence % column should have a number higher than your defined Pixel Density %.


Figure 42. Filled Checkbox Pixel Density %

Note: If you receive unexpected results or want to make any changes, click Back to return to the editor and adjust the extraction rule. Refer to Troubleshooting for help.

  16. Click Apply KV.


Figure 43. Click Apply KV to Save Changes

  17. This will return you to the KV Extraction Rule page. Click Apply to save this key-value rule to the index field.


Figure 44. Click Apply to Save the KV Rule

You have successfully set up a KV extraction rule for Checkbox Detection.
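The Pixel Density % idea used in this section can be sketched in Python. This is a simplified model, not the OMR engine's implementation: the region is a plain grid of grayscale values, the dark_threshold of 128 and the helper names are assumptions, and the synthetic checkbox is invented so the numbers are easy to follow.

```python
def pixel_density(region, dark_threshold=128):
    """Percentage of pixels in the region darker than dark_threshold."""
    total = sum(len(row) for row in region)
    if total == 0:
        return 0.0
    dark = sum(1 for row in region for px in row if px < dark_threshold)
    return 100.0 * dark / total

def is_checked(region, pixel_density_pct):
    """Treat the checkbox as selected once the density meets the threshold."""
    return pixel_density(region) >= pixel_density_pct

def make_box(marked):
    """Build a 10x10 white region with two dark edges and an optional stroke."""
    box = [[255] * 10 for _ in range(10)]
    for i in range(10):
        box[0][i] = 0  # top edge
        box[i][0] = 0  # left edge
    if marked:
        for i in range(1, 10):
            box[i][i] = 0  # diagonal pen stroke
    return box

blank_density = pixel_density(make_box(marked=False))
filled_density = pixel_density(make_box(marked=True))
threshold = 25  # slightly above the blank density, as in step 14

print(blank_density, filled_density)
print(is_checked(make_box(marked=False), threshold))
print(is_checked(make_box(marked=True), threshold))
```

Measuring the blank box first and then adding a small margin is the same reasoning as setting the Pixel Density % to 25 when the blank checkbox reads about 20.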

Rule ID

Transact versions 2022.1.00 and above display the Rule ID column, which is a unique number for each extraction rule. This allows you to locate the specific rule within the batch.xml file.

Figure 45. KV Extraction Rule ID column

Troubleshooting

This section details common troubleshooting tactics for ICR and OMR-type extraction. This is limited to the following extraction methods:

  • Hand/Machine Print
  • Signature Detection
  • Checkbox Detection
Issue | Troubleshoot
Inaccurate results when testing extraction. | Refer to and analyze the snippet and RSP file for potential insights. These files are located in the snippets folder for your batch class. Go to Folder Management > SharedFolders > [your batch class] > test-advanced-extraction > snippets > [your document].
Inaccurate results when testing extraction with Extraction Method set to Hand/Machine Print and Extraction Type set to KV + Hand/Machine Print. | You may need to increase the Minimum KV Confidence. If this number is lower than the OcrConfidence listed in the HOCR.xml file, only OCR extraction will occur, which may give inaccurate results with hand-printed text. If this number is higher than your OcrConfidence, a second pass using ICR extraction will occur. Refer to Configure Extraction Rule for steps on how to find the OcrConfidence.
When testing extraction, the wrong value is pulled from the page. | Make sure your Key regex is unique.
When testing extraction, the value overlay is largely offset from its original location. | Deselect Auto-Resize K/V Areas. This beta algorithm may help to normalize skewed or rotated images but may cause an offset with documents of normal quality.
For Hand/Machine Print extraction, selecting Numeric as the Value Type still allows alphabetic results to be returned. | Only use Numeric for values that you expect to always be numbers. This issue was resolved in Transact 2020.1.03.

 
