KB00007775: Issue with Table Extraction Using the Column Header and Column Coordinates

January 22, 2015

Issue

Trouble extracting data from a table using the Column Header and Column Coordinates.

Root Cause

Column Header extraction depends heavily on the OCR recognizing the Column Header Pattern. Variations in the OCR output can cause the table or column not to be extracted properly, or cause complete rows to be skipped. For example, if you configured the extraction rule to look for a column named “Part Number” but the OCR value is “Pert Number”, the column or table will not extract properly.

Column Coordinates extraction uses a zonal pattern: it takes whatever content falls inside the zone you defined. If your zone coordinates are not defined properly, you may also pick up values that belong to the adjacent column.

Solution

To resolve the issues regarding the recognition of Column Headers, you may need to account for variances in the OCR.

Here are some potential solutions:

Try using a different image compression in your import settings (Group4 vs. LZW).
Try a higher DPI for quality retention of the image during batch processing (300 – 600 DPI).
Account for variances in the OCR by using regular expressions.

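For example, instead of matching a specific Column Header Pattern like “Part Number” literally, use a more tolerant Regex such as “P[A-Za-z0-9\s]{7}ber”. This matches a “P”, followed by exactly seven letters, digits, or whitespace characters, followed by “ber”, so OCR variants such as “Pert Number” still match. Avoid the shorter range “A-z” in the character class, because it also matches punctuation such as “[”, “^” and “_”.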
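
The effect of the tolerant pattern can be checked outside the product before you change an extraction rule. The following is a minimal sketch using Python's re module purely for illustration; the regex engine used by your extraction software may differ in details, and the helper name matches_header and the sample strings are hypothetical.

```python
import re

# Tolerant header pattern: "P", exactly seven letters/digits/whitespace
# characters, then "ber". Python's re module is used only for illustration.
HEADER_PATTERN = re.compile(r"P[A-Za-z0-9\s]{7}ber")


def matches_header(ocr_text: str) -> bool:
    """Return True if the whole OCR string fits the tolerant header pattern."""
    return HEADER_PATTERN.fullmatch(ocr_text.strip()) is not None


# Hypothetical OCR readings of the same column header.
samples = ["Part Number", "Pert Number", "Part Numher", "Price"]
results = {text: matches_header(text) for text in samples}

for text, ok in results.items():
    print(f"{text!r}: {'match' if ok else 'no match'}")
```

Note that “Part Numher” is still rejected because the error falls inside the fixed “ber” suffix. If your OCR commonly misreads those characters, the fixed parts of the pattern need to be loosened as well.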

To resolve issues with the Column Coordinates, adjust your zones so that they fit the column and leave enough margin for variations between images. Variations include:

Skewed coordinates.
Changes in resolutions and overall image size.

To ensure the best results, standardize your input images and enforce a minimum quality requirement. For example, Resolution: 2550×3300 pixels, DPI: 300 (an 8.5 × 11 inch page).
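
To see why a change in image size shifts the Column Coordinates, it can help to work out where a zone would land on a smaller or larger scan. The sketch below is illustrative only and is not part of any product API: the Zone type, the scale_zone function, and the sample coordinates are hypothetical, and it assumes zones are stored as pixel values measured on a 2550×3300 reference image.

```python
from typing import NamedTuple, Tuple


class Zone(NamedTuple):
    """A rectangular column zone in pixels (hypothetical representation)."""
    left: int
    top: int
    width: int
    height: int


def scale_zone(zone: Zone, ref_size: Tuple[int, int],
               actual_size: Tuple[int, int]) -> Zone:
    """Rescale a zone defined on a reference image to another image size."""
    ref_w, ref_h = ref_size
    act_w, act_h = actual_size
    return Zone(
        left=round(zone.left * act_w / ref_w),
        top=round(zone.top * act_h / ref_h),
        width=round(zone.width * act_w / ref_w),
        height=round(zone.height * act_h / ref_h),
    )


# Zone drawn on a 2550x3300 (300 DPI) reference page.
part_number_zone = Zone(left=300, top=900, width=600, height=1500)

# The same page scanned at 200 DPI is 1700x2200 pixels.
scaled = scale_zone(part_number_zone, (2550, 3300), (1700, 2200))
print(scaled)
```

On the 200 DPI scan the column starts at pixel 200 instead of 300, so a zone left at its original coordinates would begin partway into the column and extend into its neighbor. Skew is not handled here; it requires deskewing the image before zones are applied.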