Machine Learning | Support for Multilingual Files

What’s New In Transact 4.5?


In Ephesoft Transact v.4.5.0.0, machine learning functionality has been enhanced to provide multilingual support. Now, you can use machine learning to extract data from documents written not only in English but in any language supported by the underlying Transact OCR engine (RecoStar for Microsoft Windows and Nuance for Linux). The system will pick up the languages specified by the <LanguageCode> tag in the HOCR.xml file and will use this data to learn and extract values from the document.
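
To check which languages a batch was actually OCRed with, the <LanguageCode> values can be read back from the HOCR.xml file. The following is a minimal Python sketch, not part of Transact. The sample XML is a heavily simplified, hypothetical stand-in for a real HOCR.xml file, and the element names around <LanguageCode> as well as the code value "de" are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def language_codes(hocr_xml: str) -> list[str]:
    """Return the distinct <LanguageCode> values found in an HOCR XML string."""
    root = ET.fromstring(hocr_xml)
    codes = []
    for element in root.iter():
        # Strip any XML namespace prefix, e.g. "{ns}LanguageCode" -> "LanguageCode".
        tag = element.tag.rsplit("}", 1)[-1]
        text = (element.text or "").strip()
        if tag == "LanguageCode" and text and text not in codes:
            codes.append(text)
    return codes


# Hypothetical, simplified sample; a real HOCR.xml file contains far more data.
SAMPLE = """<HocrPages>
  <HocrPage>
    <LanguageCode>de</LanguageCode>
  </HocrPage>
</HocrPages>"""

print(language_codes(SAMPLE))
```

The function searches every element instead of a fixed path, so it does not depend on where exactly the tag sits in the file.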

Note: The default machine learning dictionaries included in the installer are currently provided for English only. If you need a dictionary for another language, you can create one on the Validation screen during data extraction (an overview of the process is provided below).

To use machine learning for multilingual files:

1. Create or open a Batch Class.

2. Create a Document Type.

3. Create and configure the Index Fields.

4. Navigate to the RECOSTAR_HOCR plugin or NUANCE_HOCR plugin in the Page Process module and select the language in the OCR Country/Language field.

You can select one or more languages, separated by semicolons (;). Once you type a semicolon, the list of available OCR languages is displayed. This suggestion list contains all languages currently supported by the application.

Note: If you do not specify the language in the HOCR plugin, English will be used by default.

Now, every time you run a batch using this Batch Class, the HOCR.xml file will contain the <LanguageCode> tag with the code of the OCR language specified in the RECOSTAR_HOCR or NUANCE_HOCR plugin.

5. Go to the Extraction module, add the MACHINE_LEARNING_BASED_EXTRACTION plugin and click Apply to save your changes.

6. Navigate to the MACHINE_LEARNING_BASED_EXTRACTION plugin configuration screen, turn the Machine Learning Based Extraction Switch ON and click Apply.

Note: To enable machine learning for tables, turn ON the Machine Learning Based Table Extraction Switch as well.

7. Go to the Upload Batch screen and run the batch.

8. If any Index Field is not extracted properly, the batch will stop at the Validation stage.

Open the Validation screen and perform machine learning:

  • Place your cursor in the text box of the index field to be learned in the middle pane of the Validation screen.
  • On the image view pane of the Validation screen, click on the area of the image where the index field is located. An overlay appears on the image and the text box is populated with the index field value.
  • Click on the overlay to open the Suggestion View window.
  • Select a predefined regex or create a new regex and click OK.

The data is now machine learned for the language or languages specified in the HOCR.xml file.

Note: If you are using machine learning to extract data from a document which contains data in multiple languages, the extraction results might not be consistent.
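
The semicolon-separated value entered in the OCR Country/Language field (step 4) can be pictured as a simple list with English as the fallback. The following Python sketch only illustrates that convention. The function name and the language names used as input are hypothetical, and Transact's own parsing may differ.

```python
def split_languages(field_value: str) -> list[str]:
    """Split a semicolon-separated language value into clean, unique names."""
    languages = []
    for part in field_value.split(";"):
        name = part.strip()
        if name and name not in languages:
            languages.append(name)
    # Mirror the documented default: no language specified means English.
    return languages or ["English"]


print(split_languages("English;German; French;"))
print(split_languages(""))
```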

On the Validation screen, you can also add custom dictionaries containing values in other languages.

To create a custom dictionary for a specific language:

On the Validation screen:

  • Click on the overlay created for the Index Field to open the Suggestion View window.
  • Select the Create Type option and in the Type dropdown list select Dictionary.

  • Define the Type Name and add as many dictionary values as required by using the add button. Use the delete button to remove any value.
  • Click OK to save the custom dictionary.

Your dictionary is now added to the list of default dictionaries in .txt format. You can find it in the dictionaries folder at the following location: Ephesoft\SharedFolders\BC{Id}\machine-learning-dictionaries\knowledge-base\dictionaries. This custom dictionary file contains all the values added on the Validation screen.
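
A dictionary file of this kind can also be prepared outside Transact before importing it. The following Python sketch writes a list of values to a .txt file. The one-value-per-line layout and the UTF-8 encoding are assumptions, not confirmed by this page, so compare the result with a dictionary file created on the Validation screen before relying on it. The function name and the sample values are hypothetical.

```python
from pathlib import Path
import tempfile


def write_dictionary(folder: Path, type_name: str, values: list[str]) -> Path:
    """Write dictionary values to <type_name>.txt, one value per line (assumed layout)."""
    cleaned = []
    for value in values:
        value = value.strip()
        if value and value not in cleaned:
            cleaned.append(value)
    target = folder / f"{type_name}.txt"
    # UTF-8 keeps non-English characters such as umlauts intact.
    target.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return target


# Write to a temporary folder so the sketch does not touch a real installation.
with tempfile.TemporaryDirectory() as tmp:
    path = write_dictionary(Path(tmp), "German_Names", ["Jürgen", "Käthe", "Jürgen", " "])
    print(path.name, path.read_text(encoding="utf-8").splitlines())
```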


To import a custom dictionary for a specific language:

  • Navigate to the Batch Class Management screen and select your Batch Class.
  • Go to machine-learning-dictionaries > knowledge-base > dictionaries.
  • In the Upload Machine Learning Dictionary(s) section, click Select Files or drag and drop the file containing the dictionary into the specified area.

The dictionary is imported successfully. Because the dictionary was imported manually, Transact displays a message at this point.

To make changes in the mappings file:

  • Navigate to the Folder Management section and select your Batch Class.

  • Go to the dictionaries folder (machine-learning-dictionaries\knowledge-base\dictionaries) and select the dictionary_mappings_properties file.
  • Click Edit.

  • Provide the following information to perform the dictionary mapping:

    Field   Description
    Key     The dictionary name (e.g. German_Names). This name appears in the Predefined Types list in the Suggestion View window on the Validation screen.
    Value   The dictionary text file (e.g. German_Names.txt) and the Display value:
            0 = do not display the Dictionary Type in the Suggestion View window on the Validation screen
            1 = display the Dictionary Type in the Suggestion View window on the Validation screen

  • Click Save to save your changes.
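
Before saving a mapping, it can help to confirm that each entry is consistent: a non-empty Key, a Value that names an existing dictionary file, and a Display value of 0 or 1. The following Python sketch models one entry as a small record and checks those three points. It deliberately does not reproduce the literal syntax of the mappings file, which this page does not spell out. The class, function, and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryMapping:
    key: str        # dictionary name shown in the Predefined Types list
    file_name: str  # dictionary text file, e.g. "German_Names.txt"
    display: int    # 1 = show in the Suggestion View window, 0 = hide


def check_mapping(mapping: DictionaryMapping, available_files: set[str]) -> list[str]:
    """Return a list of problems found in one mapping entry (empty if none)."""
    problems = []
    if not mapping.key.strip():
        problems.append("key is empty")
    if mapping.file_name not in available_files:
        problems.append(f"file {mapping.file_name!r} is not in the dictionaries folder")
    if mapping.display not in (0, 1):
        problems.append("display must be 0 or 1")
    return problems


# File names assumed to be present in the dictionaries folder.
files = {"German_Names.txt"}
print(check_mapping(DictionaryMapping("German_Names", "German_Names.txt", 1), files))
print(check_mapping(DictionaryMapping("French_Names", "French_Names.txt", 2), files))
```

The first call reports no problems; the second reports a missing file and an invalid Display value.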

Machine learning dictionaries and regular expressions can also be modified in the Folder Management section.

To customize dictionaries and regex for a specific language in the Folder Management screen:

  • On the left menu panel, select Folder Management and double-click on the selected Batch Class.

  • Go to the knowledge-base folder (SharedFolders\BC{Id}\machine-learning-dictionaries\knowledge-base) to find all stored dictionaries and regex.

  • In the dictionaries folder, double-click any dictionary to see its entries. Here, you can add, delete and edit values as required.

  • Click Save to save your changes.
  • In the regex folder, open the regex.txt file to see the list of all pre-defined regular expressions.

  • Here, you can add, delete and edit values as required.

  • Click Save to save your changes.
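
Before adding a new expression to regex.txt, it is worth trying it against a few sample values in the target language. The following Python sketch does this with Python's re module. Python's regex dialect may differ from the one Transact uses, so treat a passing check as a first sanity test only. The function name, the date pattern, and the sample values are hypothetical.

```python
import re


def check_regex(pattern: str, should_match: list[str], should_not_match: list[str]) -> list[str]:
    """Return the sample values on which the pattern behaves unexpectedly."""
    compiled = re.compile(pattern)  # raises re.error if the pattern is malformed
    failures = []
    for sample in should_match:
        if compiled.fullmatch(sample) is None:
            failures.append(f"expected match: {sample}")
    for sample in should_not_match:
        if compiled.fullmatch(sample) is not None:
            failures.append(f"unexpected match: {sample}")
    return failures


# Hypothetical pattern for German-style dates such as 31.12.2018.
DATE_PATTERN = r"\d{2}\.\d{2}\.\d{4}"
print(check_regex(DATE_PATTERN, ["31.12.2018"], ["2018-12-31", "31.12.18"]))
```

An empty list means the pattern matched every value it should and none that it should not.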