Configuring Table Extraction

Applies to: Ephesoft Transact 4.1.0.0 and above

This page describes how to configure a document type for table extraction. These steps are intended for Transact administrators.

Prerequisites

To configure table extraction, the following prerequisites must be in place:

  • A batch class with a document type configured. For detailed steps, refer to Add New Document Type.
  • The TABLE_EXTRACTION plugin must be added to the Extraction module and turned on.

Overview

The process to configure a document type for table extraction follows these basic steps:

  1. Define the Table
  2. Define the Table Columns
  3. Configure Table Extraction Rules
  4. Test Table Extraction

Define the Table

First, you will need to create a table for your document type.

To create a table:

  1. From the Batch Class Management screen, select and open your batch class.
  2. Go to Document Types > <your document type> > Tables.
  3. Click Add.

Figure 1. Add New Table

A new entry will be added to the grid.

Figure 2. New Table

  4. Enter an intuitive Name for the new table.
  5. Configure the remaining fields according to your workflow needs. Refer to the table below for help.

Field | Description
Validation Rule Operator | Specifies whether all Table Validation Rules must be satisfied for a field to be validated (AND), or only one rule (OR).
Remove Invalid Rows | When selected, any rows from the extracted data that do not match the validation rule will be removed. Note: For best results, keep this unchecked during your initial testing, then consider enabling it depending on your test results. This is because documents with lower OCR quality may cause matching rows to be incorrectly removed.
Currency | Specifies the currency format that should be applied to the extracted data. Refer to Configuring Currency Settings for Tables for more information.
Table Cell Value Change Script | When selected, enables the Table Cell Value Change Script.

  6. Click Apply.
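
The interaction between Validation Rule Operator and Remove Invalid Rows can be sketched in a few lines of Python. This is an illustration only, not Transact's implementation: the function names are hypothetical, and each validation rule is modeled as a regular expression that a cell must fully match, which is an assumption made for the example.

```python
import re

def row_is_valid(row, rules, operator="AND"):
    """Return True if the row satisfies the rules under the given operator.

    `rules` maps a column name to a regex the cell must fully match.
    This is a hypothetical stand-in for Table Validation Rules.
    """
    results = [
        re.fullmatch(pattern, row.get(column, "")) is not None
        for column, pattern in rules.items()
    ]
    # AND: every rule must pass. OR: one passing rule is enough.
    return all(results) if operator == "AND" else any(results)

def apply_validation(rows, rules, operator="AND", remove_invalid_rows=False):
    """Pair each row with its validation result; optionally drop invalid rows."""
    checked = [(row, row_is_valid(row, rules, operator)) for row in rows]
    if remove_invalid_rows:
        return [(row, ok) for row, ok in checked if ok]
    return checked

rows = [
    {"Qty": "2", "Amount": "19.99"},
    {"Qty": "l", "Amount": "5.00"},  # OCR misread "1" as a lowercase "l"
]
rules = {"Qty": r"\d+", "Amount": r"\d+\.\d{2}"}

kept = apply_validation(rows, rules, operator="AND", remove_invalid_rows=True)
```

In this sample the second row is a real line item that fails only because of an OCR misread, so enabling removal silently drops it. That is the risk the note about initial testing describes.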

Define the Table Columns

After defining the table, you will need to define the columns within the table.

To configure table columns:

  1. Open the newly created table from the Tables page. This will open the Table Columns page.
  2. Click Add to create a new column.
  3. Enter a Column Name and a Description.

Figure 3. Add New Columns

  4. Enter the Column Order. This is the order of the column within the table, left to right, starting with “1”.
  5. Configure the remaining fields according to your workflow needs. Refer to the table below for help.

Field | Description
Validation Pattern | Enter a validation pattern for the new table column, using either the Regex Pool or the Regex Builder widget. This field is optional.
Alternate Values | Enter a list of alternate values for the column. Operators see these as a drop-down of commonly used values while validating tables. The drop-down is only visible for blank cells (fields for which there was no extracted value). This field is optional.
OCR Confidence Threshold | A number between 1 and 100 that defines the minimum OCR confidence for a value to be validated automatically. If the OCR confidence is below this level, the value is marked for operator validation. You may need to test and refine this number for best results. The default value is 90.
Default Value | If no value is extracted for a row in this column, it will be replaced with the default value specified here. This field is optional.
Additional Configuration: New Row Anchor | Select this checkbox to indicate the start of a new row if a value from this column is extracted.
Additional Configuration: Required | Select this checkbox to mark the column as mandatory for each row. If the row does not have a value from this column, the row is discarded.
Additional Configuration: Currency | Select this checkbox for columns containing only currency.
Additional Configuration: Hidden | Select this checkbox to hide the column from the operator on the Validation screen.

  6. Repeat steps 2–5 for the remaining columns in the table.
  7. Click Apply.
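
As a rough illustration of how the column settings above affect extracted data, the following Python sketch models the OCR Confidence Threshold, Default Value, and Required options. The function names and the row representation are hypothetical. The order in which the default value and the Required check are applied is an assumption made for the example, not documented Transact behavior.

```python
def needs_operator_validation(ocr_confidence, threshold=90):
    """A value below the threshold is marked for operator validation."""
    return ocr_confidence < threshold

def apply_column_settings(rows, column, default_value=None, required=False):
    """Apply Default Value and Required to one column of extracted rows.

    Assumption: the default value is filled in before the Required check.
    """
    result = []
    for row in rows:
        value = row.get(column) or default_value
        if not value and required:
            continue  # row discarded: the mandatory column has no value
        result.append({**row, column: value or ""})
    return result

rows = [
    {"Part": "A1", "Qty": "2"},
    {"Part": "", "Qty": "5"},  # nothing was extracted for Part
]

with_required = apply_column_settings(rows, "Part", required=True)
with_default = apply_column_settings(rows, "Part", default_value="UNKNOWN")
```

With Required alone, the second row is dropped. With a Default Value, it is kept and the empty cell is filled.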

Configure Table Extraction Rules

After defining the table columns, you will need to configure the extraction rule for the table.

To configure a table extraction rule:

  1. In the left navigation, click Table Extraction Rules.
  2. Click Add.

The Extraction Rule Configuration screen displays.

Figure 4. Extraction Rule Configuration Screen

  3. If no preview image appears, click Select Files in the bottom panel to select and upload the sample file.

Extraction Rule

First, you will need to specify the basic configurations for the rule in the Extraction Rule box. Refer to the table below for information on each configurable option.

Figure 5. Extraction Rule

Field | Description
Rule Name | Enter a name for the extraction rule.
Start Pattern | Defines the starting point of the table using a regular expression. This must be unique across all extraction rules in a document type. To configure the start pattern, enter a valid regex in the Start Pattern field, or use the provided overlay:

  1. Move and resize the grey Start Pattern overlay to the starting point of the table.
  2. Click the overlay to display the Suggest Regex box.
  3. Use a suggested regex from the Regex dropdown, select a Predefined Type, or enter your own regex.
  4. Click OK.


Figure 6. Start Pattern

End Pattern | Defines the end point of the table using a regular expression. To configure the end pattern, enter a valid regex in the End Pattern field, or use the provided overlay:

  1. Move and resize the yellow End Pattern overlay to the end point of the table.
  2. Click the overlay to display the Suggest Regex box.
  3. Use a suggested regex from the Regex dropdown, select a Predefined Type, or enter your own regex.
  4. Click OK.

Figure 7. End Pattern

Extract Repeating Tables | Select this option if the table may span multiple pages.
Overlapping Columns Table | Select this option if the headers of two columns may overlap one another.
2-Column Layout | Select this option if your table is split into two columns on the same page. For more information on this option, refer to Table Extraction for 2-Column Layout.
Table Extraction API | Defines the table extraction methods and their operators.

  • Column Coordinates: Extract data based on the defined column coordinates. If you select this API, you must use the overlays to define the coordinates.

Note: If you choose to use the Column Coordinates extraction method, the start value will be set to the lower of the two Column Header and Column Pattern values, and the end value will be set to the higher of the two.

  • Column Header: Extract data based on the defined column headers.
  • Regex Extraction: Extract data based on the defined regex patterns.

If you select more than one extraction method, you will also need to define the rule operator. Rule operators apply to the field to their left.

  • AND: Performs extraction based on all selected methods. This will result in a stricter extraction experience.
  • OR: Performs extraction based on one of the selected methods. This will result in a looser extraction experience.

Note: For best results, begin your configuration process using a single extraction method. This is to make sure that you can successfully extract data from the table, regardless of the quality or accuracy. If further refinement is needed, add more operators.
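
The note on the Column Coordinates method (start set to the lower value, end set to the higher value) can be illustrated with a short Python sketch. The `Box` type and function names are hypothetical. The rule used to decide whether a word belongs to a column (its box must lie fully inside the span) is an assumption for the example, not a description of Transact's internals.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x0: int  # left edge, in pixels
    x1: int  # right edge, in pixels

def column_span(header_overlay=None, data_overlay=None):
    """Return (start, end): the lowest left edge and highest right edge."""
    boxes = [b for b in (header_overlay, data_overlay) if b is not None]
    if not boxes:
        raise ValueError("at least one overlay is required")
    return min(b.x0 for b in boxes), max(b.x1 for b in boxes)

def words_in_column(words, span):
    """Keep the words whose boxes fall entirely inside the column span."""
    start, end = span
    return [text for text, box in words if box.x0 >= start and box.x1 <= end]

span = column_span(header_overlay=Box(100, 180), data_overlay=Box(90, 160))
words = [("Widget", Box(95, 150)), ("19.99", Box(300, 340))]
in_column = words_in_column(words, span)
```

Under this model, an overlay that is too narrow would exclude wide values, so the overlay needs to cover the widest value the column can hold.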

Extract Repeating Tables Check Box

The Extract Repeating Tables check box can be used to extract data from tables that span multiple pages. If the first page and last page are both unique, you can use a unique Start Pattern value from the first page, and a unique End Pattern from the last page. However, some forms require you to duplicate the same page as many times as is necessary to submit the form. As a result, you can’t use an End Pattern from the bottom of the page because it will prevent Transact from continuing to the next page to read the continuation of the table.

To work around this, define a Start Pattern from the first page as you normally would. Instead of defining an End Pattern from the bottom of the page, choose an End Pattern using a string from the top of the page, and make sure that the Extract Repeating Tables check box is selected. When Transact processes a multi-page document configured this way, it continues past the bottom of the first page to the following page and stops reading when it reaches the End Pattern value at the top of that page. Because it is still processing the second page, Transact then resumes reading once it finds the Start Pattern value again. If you use Column Coordinates and Regex Extraction as your table extraction methods, Transact will ignore the text at the bottom of the document and not include it in your table extraction results.
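
The scanning behavior described above can be approximated in Python as follows. This is a simplified model, not Transact's algorithm: pages are lists of text lines, the function name and sample strings are hypothetical, and the `row_pattern` filter stands in for the Regex Extraction method that keeps footer text out of the results. Python's `re` module is used for illustration, and its syntax may differ from the regex dialect Transact accepts.

```python
import re

def extract_table_lines(pages, start_pattern, end_pattern, row_pattern,
                        repeating=False):
    """Collect table rows that lie between start and end patterns."""
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)
    row_re = re.compile(row_pattern)
    collected, inside = [], False
    for page in pages:
        for line in page:
            if not inside:
                # Outside a table: wait for the Start Pattern.
                if start_re.search(line):
                    inside = True
                continue
            if end_re.search(line):
                inside = False
                if not repeating:
                    return collected  # single table: stop at the first end
                continue
            if row_re.search(line):  # footer text does not match and is skipped
                collected.append(line)
    return collected

# The same form page is duplicated; the End Pattern comes from the page top.
pages = [
    ["Schedule B Form 99", "Item Qty", "Bolt 4", "Nut 8", "Continued on next page"],
    ["Schedule B Form 99", "Item Qty", "Washer 2", "Continued on next page"],
]
args = (r"^Item\s+Qty$", r"^Schedule B", r"^\w+\s+\d+$")

single = extract_table_lines(pages, *args, repeating=False)
multi = extract_table_lines(pages, *args, repeating=True)
```

On the first page, the top-of-page string appears before the Start Pattern, so it is ignored. On the second page it ends the first segment, and the repeated header starts the next one.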

Column Configuration

Next, you will need to specify the extraction rules for each table column in the Column Configuration box. Ensure you have defined the table columns before proceeding.

Note: You can collapse the Extraction Rule section to get a better view of the Column Configuration section.

Figure 8. Column Configuration

Configuring table column rules follows these general steps:

  1. Select a Table Column from the drop-down. The first column will be selected by default.
  2. Configure the remaining fields.
  3. Repeat steps 1–2 for each column in the table. All Column Configuration options are tied to the currently selected Table Column.

Refer to the table below for information on each configurable option.

Field | Description
Table Column | Select the column for which you are defining the column extraction rule. You will only be able to select the columns added in Define the Table Columns above.
Column Header Pattern | This field is only required if the Table Extraction API is set to Column Header or Column Coordinates.

Defines the header of the selected column using a regular expression. To configure the column header pattern:

  1. Move and resize the green Column Header overlay to the header text.
  2. Click the overlay to display the Suggest Regex box.
  3. Use a suggested regex from the Regex dropdown, select a Predefined Type, or enter your own regex.
  4. Click OK.

Important: When using the Column Coordinates API, ensure the overlay spans the full possible width of each column, as this will define the Start Coordinates and End Coordinates.

Figure 9. Column Header Pattern

Column Pattern | Defines the expected data within a selected column using a regular expression. To configure the column pattern, enter a valid regex in the Column Pattern field, or use the provided overlays:

  1. Move and resize the red Column Data overlay to a cell in the respective column.
  2. Click the overlay to display the Suggest Regex box.
  3. Use a suggested regex from the Regex dropdown, select a Predefined Type, or enter your own regex.
  4. Click OK.

Figure 10. Column Pattern

Pattern Left | Defines the expected data to the left of a selected column using a regular expression. If you’re using the Regex Extraction API and experience issues with data being extracted from the wrong location, you can use this option to better direct the table extraction.

Note: The expected data must have a unique, reliable pattern—such as a date or part number.

To configure the left pattern, enter a valid regex in the Pattern Left field, or use the provided overlay:

  1. Move and resize the orange Pattern Left overlay to the area on the left side of the column data.
  2. Click the overlay to display the Suggest Regex box.
  3. Use a suggested regex from the Regex dropdown, select a Predefined Type, or enter your own regex.
  4. Click OK.
Pattern Right | Defines the expected data to the right of a selected column using a regular expression. If you’re using the Regex Extraction API and experience issues with data being extracted from the wrong location, you can use this option to better direct the table extraction.

Note: The expected data must have a unique, reliable pattern—such as a date or part number.

To configure the right pattern, enter a valid regex in the Pattern Right field, or use the provided overlay:

  1. Move and resize the purple Pattern Right overlay to the area on the right side of the column data.
  2. Click the overlay to display the Suggest Regex box.
  3. Use a suggested regex from the Regex dropdown, select a Predefined Type, or enter your own regex.
  4. Click OK.
Start Coordinate | This field only applies if the Table Extraction API includes Column Coordinates.

The Start Coordinate is set automatically when the Column Header or Column Data overlay is positioned.

Note: It is not necessary to use both overlays. If both are used, the Start Coordinate will be set to the lowest (left-most) value of both overlays, and the End Coordinate will be set to the highest (right-most) value of both overlays.

End Coordinate | This field only applies if the Table Extraction API includes Column Coordinates.

Note: The End Coordinate is automatically configured based on the position of the Column Header or Column Data overlays.

Extract Data From Column | This field allows you to pinpoint a portion of the column and extract data from that specific area. It functions as a “search within a search”.
Additional Configurations: Use Default Column Configuration | Keep this checkbox selected to use the default values of the column, as configured above in Define the Table Columns.

If this checkbox is cleared, it will enable the remaining Additional Configurations:

  • New Row Anchor
  • Required
  • Currency
Additional Configurations: New Row Anchor | Select this checkbox to indicate the start of a new row if a value from this column is extracted.

For best results, enable this feature for columns that have a single value that does not wrap, which will tell the extraction to create a new row each time a new value is found for this column.

For example, if the Part Number field always has a single value that doesn’t wrap, but Part Description often wraps for several lines, enabling the New Row Anchor for Part Number will tell the system to capture wrapping data for the other columns in the row until a new value is found for Part Number.

Note: This field is disabled if Use Default Column Configuration is selected.

Additional Configurations: Required | Select this checkbox to mark the column as mandatory for each row. If the row does not have a value from this column, the row is discarded.

Note: This field is disabled if Use Default Column Configuration is selected.

Additional Configurations: Currency | Select this checkbox for columns containing only currency.

Note: This field is disabled if Use Default Column Configuration is selected.
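
The New Row Anchor behavior from the Part Number example can be sketched in Python. This is a simplified model under stated assumptions: each extracted text line is represented as a dictionary of column values, the function name is hypothetical, and wrapped text is joined with a single space.

```python
def group_rows(fragments, anchor_column):
    """Merge line fragments into rows, starting a row at each anchor value."""
    rows = []
    for fragment in fragments:
        if fragment.get(anchor_column):
            # A value in the anchor column starts a new row.
            rows.append(dict(fragment))
        elif rows:
            # No anchor value: treat the text as wrapped data for the
            # current row and append it to the matching columns.
            current = rows[-1]
            for column, text in fragment.items():
                if text:
                    current[column] = (current.get(column, "") + " " + text).strip()
        # Fragments that appear before the first anchor value are ignored.
    return rows

fragments = [
    {"Part Number": "A-100", "Part Description": "Stainless steel"},
    {"Part Description": "hex bolt"},  # wrapped description line
    {"Part Number": "B-200", "Part Description": "Washer"},
]
table = group_rows(fragments, anchor_column="Part Number")
```

If Part Description were the anchor instead, the wrapped line "hex bolt" would start a row of its own. That is why a column that never wraps makes the better anchor.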

Test Table Extraction

After you’ve configured the applicable fields for each Table Column, you will need to test the table extraction. To do so, perform the following steps.

  1. If needed, click Validate Regex to check any manually inserted regular expressions.
  2. Determine your testing method based on what kinds of documents you expect to receive in production:
    • OCR: If you expect to receive documents that won’t contain an E-Text layer, select OCR from the top-right drop-down.

Figure 11. Select OCR

    • EText: If you expect to receive some documents that contain an E-Text layer (such as documents that were created using a “print to PDF” feature), and you’ve configured your batch class to use the E-Text feature, select EText from the top-right drop-down.

Figure 12. Select EText

  3. Click Test Table. The extracted results will display in the Test Table Results panel.

Figure 13. Test Table Results

  4. Review the test extraction results. The test extraction works best for single-page tables. If you are testing tables that span multiple pages, run test batches through Transact and view the results in the Validation screen.

Note: Fields highlighted in red indicate low OCR confidence or validation errors for the extracted data. In the example above, some rows are highlighted because the OCR confidence (18) is lower than the default minimum OCR confidence (90). Because these highlighted results were extracted correctly and this sample is of high OCR quality, this indicates that the OCR Confidence Threshold for the column should probably be lowered.

  5. Click Close to exit the Test Table Results. You can continue to test and adjust your configuration as necessary.
  6. When you’re satisfied with your results, click Apply.
  7. Click Apply again to save the rule.

Figure 14. Apply the Rule

Note: You may need to create multiple sets of extraction rules depending on the variations of documents that may be processed. For example, invoices from two vendors may be fundamentally the same but have different layouts. When incoming documents are processed, Transact will execute all table extraction rules against every document, with the Start Pattern value being used to determine whether the row-level extractions should take place.
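
The role of the Start Pattern in choosing between rules can be illustrated with a small Python sketch. The rule names, patterns, and sample text are hypothetical, and the matching logic is an assumption for the example rather than Transact's actual rule engine. The second function checks the requirement that Start Patterns be unique across the rules of a document type.

```python
import re
from collections import Counter

def applicable_rules(rules, document_text):
    """Return the names of rules whose Start Pattern occurs in the document."""
    return [
        name for name, start_pattern in rules.items()
        if re.search(start_pattern, document_text, flags=re.MULTILINE)
    ]

def duplicate_start_patterns(rules):
    """Return the Start Patterns that are used by more than one rule."""
    counts = Counter(rules.values())
    return [pattern for pattern, count in counts.items() if count > 1]

rules = {
    "Vendor A invoice": r"^Item\s+Description\s+Amount",
    "Vendor B invoice": r"^SKU\s+Qty\s+Price",
}
document_text = "Invoice 1001\nSKU Qty Price\nX1 2 9.50\n"

matched = applicable_rules(rules, document_text)
duplicates = duplicate_start_patterns(rules)
```

Only the Vendor B rule applies to this sample, because the Vendor A header row never appears in the text.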

Conclusion

This completes an overview of how to configure table extraction in Ephesoft Transact.
