Machine Learning Improvements: Machine Learning for Invoices and Enhanced Machine Learning

Introduction

Core Machine Learning for Invoices

Enhanced Machine Learning

Troubleshooting

Web Services

Introduction

In previous versions of Transact, machine learning relied heavily on input and rules provided by the operator. The system learned and saved the values entered on the Validation screen based on the anchors, but only after the user specified the corresponding regex, dictionary or composite type in the Suggestion Box pop-up window.

In Ephesoft Transact v.2019.1, the approach to machine learning has been substantially revised. The new core machine learning is designed to achieve better extraction accuracy and a shorter “time to value”. Following a much more focused approach, the new mechanism has been created only for Invoice batch classes. The entire process is greatly simplified: the system now automatically identifies the document type, and then searches for and finds the configured key-value pairs, as explained in detail below.

Apart from the core machine learning, there is also enhanced machine learning (EML), which can be used for any batch class. This mechanism allows the system to learn from the data entered on the Validation screen. The operator only needs to populate the index fields, and Transact will automatically create and save extraction rules and ML value types.

Core machine learning is designed only for Invoice batch classes as mentioned above. It includes several pre-configured ML fields, which can be mapped against the index fields created by the operator. Once the mapping is done, the system automatically extracts the values based on the in-built learning.

To implement this functionality:

– The Invoice category has been added under the Category column on the Batch Class Management screen.

– The ML Field Mapping column has been added to the Index Fields screen (Batch Class -> Document Types -> Index Fields).

– A new dictionaries folder has been created under SharedFolders at the application level. It contains files with synonyms (“mllocale” folder) and regular expressions (“mlregexes” folder) created for each pre-configured ML index field type for each supported language.

The core machine learning mechanism can automatically detect the language of the HOCR file. The following languages are currently supported under the core machine learning for invoices (one or more languages can be present on a single invoice):

– English

– French

– German

Enhanced machine learning can be used for any batch class and is designed to learn values provided by the operator on the Validation screen. The learned data is saved in a JSON file in the enhanced-machine-learning folder (SharedFolders\<Batch Class ID>\enhanced-machine-learning\<Document Type>) to be used for further batches.

To implement this functionality:

– The ENHANCED_MACHINE_LEARNING_EXTRACTION plugin has been created under the Extraction module with ON/OFF switch.

In the current implementation, the core machine learning functionality is provided only for the Invoice category and only for index fields. More document categories, as well as support for Invoice table machine learning, will be added in future releases.

Important: The machine learning approach used earlier for index field extraction has been discontinued. If you are importing batch classes with machine learning data from previous versions of Transact into Transact v.2019.1, the learned data will be redundant due to the changed ML mechanism.

Machine learning for tables did not undergo any changes and will work as before. The data imported along with batch classes from previous Ephesoft Transact versions will be saved and used as intended. Machine learned table data does not require re-learning after the import.

Core Machine Learning for Invoices

Prerequisites

– The MACHINE_LEARNING_BASED_EXTRACTION plugin should be added to the Extraction Module.

– On the Machine Learning Based Extraction Plugin screen, the Machine Learning Based Extraction Switch should be turned ON.

– Optionally, the ENHANCED_MACHINE_LEARNING_EXTRACTION plugin can be added to the Extraction module if you want to be able to learn values on the Validation screen. By default, the plugin switch is ON.

To perform core machine learning for invoices:

1. On the Batch Class Management screen, create a batch class.

2. On the same screen, select Invoice in the Category drop-down list, and click Apply to save the changes.

3. Navigate to the Document Types screen:

– Click Doc Type to add a new document type.

– Upload Learn File for classification.

– Assign Classification and Extraction Roles as required and click Apply.

4. Navigate to the Index Fields screen:

– Click Add to add new index fields.

– Map the created index fields against the available parameters in the ML Field Mapping drop-down list. This list is populated automatically once the Invoice category is selected on the Batch Class Management screen. Currently, there are 12 parameters available:

  • INVOICE_NUMBER
  • INVOICE_DATE
  • PURCHASE_ORDER
  • DELIVERY_DATE
  • PAYMENT_TERMS
  • TAX_RATE
  • NET_AMOUNT
  • TAX_AMOUNT
  • TOTAL_AMOUNT
  • CURRENCY
  • VAT_NO
  • INVOICE_TYPE (Invoice or Credit Note)

– Click Apply to save the changes.
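The mapping in this step can be pictured as a lookup from operator-defined index fields to the pre-configured ML field types. The sketch below is illustrative only: Transact stores the mapping through the Index Fields screen, and the index field names and the `validate_mapping` helper shown here are hypothetical, not part of the product.

```python
# The 12 pre-configured ML field types listed above.
ML_FIELD_TYPES = frozenset({
    "INVOICE_NUMBER", "INVOICE_DATE", "PURCHASE_ORDER", "DELIVERY_DATE",
    "PAYMENT_TERMS", "TAX_RATE", "NET_AMOUNT", "TAX_AMOUNT",
    "TOTAL_AMOUNT", "CURRENCY", "VAT_NO", "INVOICE_TYPE",
})


def validate_mapping(mapping):
    """Return the index fields (sorted) whose target is not a known ML field type."""
    return sorted(name for name, ml_type in mapping.items()
                  if ml_type not in ML_FIELD_TYPES)


# Hypothetical index field names chosen for illustration.
example_mapping = {
    "InvoiceNo": "INVOICE_NUMBER",
    "IssueDate": "INVOICE_DATE",
    "Total": "TOTAL_AMOUNT",
    "Supplier": "VENDOR_NAME",  # not one of the 12 available parameters
}

print(validate_mapping(example_mapping))
```

A check of this kind can be useful when planning a batch class on paper, since any index field without a matching ML field type will not be extracted by the core machine learning.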

Note: If at some point you change the category of your batch class on the Batch Class Management screen from Invoice to None, all existing index field ML mappings will be deleted and the ML Field Mapping drop-down on the Index Fields screen will be disabled. A confirmation message is displayed before the change is applied.

5. Navigate to the Upload Batch screen, upload the file(s) for processing and click Start Batch.

The machine learning will extract configured values automatically. Extracted values are saved in the batch.xml file (see the final-drop-folder under the SharedFolders).
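To inspect the extracted values outside the Transact UI, the batch.xml file can be read with any XML parser. The following is a minimal sketch that assumes a simplified layout in which each document holds `DocumentLevelField` elements with `Name` and `Value` children. The element names, the sample fragment and the `extracted_fields` helper are assumptions made for illustration, so check them against a real batch.xml from your installation before relying on them.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a batch.xml fragment; element names are assumptions.
SAMPLE_BATCH_XML = """\
<Batch>
  <Documents>
    <Document>
      <Type>Invoice</Type>
      <DocumentLevelFields>
        <DocumentLevelField>
          <Name>InvoiceNo</Name>
          <Value>INV-1001</Value>
        </DocumentLevelField>
        <DocumentLevelField>
          <Name>Total</Name>
          <Value>250.00</Value>
        </DocumentLevelField>
      </DocumentLevelFields>
    </Document>
  </Documents>
</Batch>
"""


def extracted_fields(xml_text):
    """Return one {field name: value} dict per Document element."""
    root = ET.fromstring(xml_text)
    results = []
    for doc in root.iter("Document"):
        fields = {}
        for field in doc.iter("DocumentLevelField"):
            name = field.findtext("Name")
            if name is not None:
                fields[name] = field.findtext("Value", default="")
        results.append(fields)
    return results


print(extracted_fields(SAMPLE_BATCH_XML))
```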

Enhanced Machine Learning

Prerequisites

– The ENHANCED_MACHINE_LEARNING_EXTRACTION plugin should be added to the Extraction module.

– The ENHANCED_MACHINE_LEARNING_EXTRACTION plugin switch should be ON.

To perform enhanced machine learning:

1. On the Batch Class Management screen, create a batch class and click Apply to save the changes.

2. Navigate to the Document Types screen:

– Click Doc Type to add a new document type.

– Upload Learn File for classification.

– Assign Classification and Extraction Roles as required and click Apply.

3. Navigate to the Index Fields screen, add index fields and click Apply.

Note that since this is not an Invoice batch class, the ML Field Mapping drop-down is disabled.

4. Navigate to the Upload Batch screen, upload the file(s) for processing and click Start Batch.

5. When the batch stops at the Validation stage, simply populate index fields with values using image overlays, and click Validate.

The learning happens once the Validate button is clicked. The system creates learning data for the changed fields and saves it in JSON files in the enhanced-machine-learning folder for the corresponding document type at the batch class level (e.g. Ephesoft\SharedFolders\BC9\enhanced-machine-learning\W2).

In this example, W2 is the document type and FT10, FT11 and FT12 are JSON files created for three configured and machine learned index fields.

When similar files are processed again, EML should fetch all three values using saved learning data.
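A quick way to confirm that learning took place is to list the files in the enhanced-machine-learning folder for a document type. The sketch below assumes the learned files carry a `.json` extension and that the folder layout matches the path described above. The `learned_field_files` helper is hypothetical, and the demonstration runs against a temporary directory rather than a real SharedFolders location.

```python
import tempfile
from pathlib import Path


def learned_field_files(shared_folders, batch_class_id, document_type):
    """Return the sorted names of the JSON files learned for one document type."""
    folder = (Path(shared_folders) / batch_class_id
              / "enhanced-machine-learning" / document_type)
    if not folder.is_dir():
        return []  # nothing has been learned for this document type yet
    return sorted(p.name for p in folder.glob("*.json"))


# Demonstration against a throwaway directory that mimics the BC9/W2 example.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "BC9" / "enhanced-machine-learning" / "W2"
    target.mkdir(parents=True)
    for field_id in ("FT10", "FT11", "FT12"):
        (target / f"{field_id}.json").write_text("{}")
    print(learned_field_files(tmp, "BC9", "W2"))
```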

 

Troubleshooting

If the core machine learning for invoices fails to extract required values, you can check the synonyms file in the corresponding language and add synonyms as needed.

Suppose your invoice PDF files have the field “Invoice issue date”. You have created an index field and mapped it against the INVOICE_DATE field type, but the value is not extracted. In this case, you can open the synonyms file used for the English language and check the list of values used for this index type (SharedFolders\dictionaries\mllocale\en\LC_MESSAGES\synonyms.po).

Since “invoice issue date” is not currently included in the synonyms list for this index type, you can add it. To do so, append two semicolons (;;) followed by the required value (preferably with no capital letters and no spaces), then save the synonyms.po file. The next time you execute this batch, the value should be matched and extracted.
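The edit itself amounts to appending one more entry to a list separated by two semicolons. The sketch below shows that string operation only. It does not read or write synonyms.po, the starting synonym values are made up, and the `add_synonym` helper is hypothetical. Lowercasing the value and removing its spaces is one reading of the "no capital letters and no spaces" guidance.

```python
SEPARATOR = ";;"


def add_synonym(existing, new_value):
    """Append new_value to a ';;'-separated synonym string unless already present."""
    # Lowercase and drop all whitespace, following the guidance above.
    normalized = "".join(new_value.lower().split())
    current = [s for s in existing.split(SEPARATOR) if s]
    if normalized in current:
        return existing
    return SEPARATOR.join(current + [normalized])


# Hypothetical starting list for the INVOICE_DATE field type.
print(add_synonym("invoicedate;;dateofinvoice", "Invoice issue date"))
```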

 

Web Services

Note that machine learning web services and SDK are suspended in release 2019.1.

The following machine learning web services are no longer available:

  • /dcma/rest/generateMachineLearningModel
  • /dcma/rest/generateBatchClassLearningModel
  • /dcma/rest/machineExtraction
  • /dcma/rest/machineExtractionForBatchClass

Machine Learning support is removed from the following web services:

  • /dcma/rest/OcrClassifyExtractSearchablePDF
  • /dcma/rest/initiateOcrClassifyExtract
  • /dcma/rest/ocrClassifyExtract
  • /dcma/rest/v2/ocrClassifyExtract
  • /dcma/rest/v2/ocrClassifyExtractBase64
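Client code that still calls these services may benefit from a guard that flags affected paths before a request is sent. The following is a minimal sketch built only from the two lists above. The `ml_status` helper and its return strings are hypothetical and are not part of any Transact SDK.

```python
# Web services that are no longer available in release 2019.1.
REMOVED_ENDPOINTS = frozenset({
    "/dcma/rest/generateMachineLearningModel",
    "/dcma/rest/generateBatchClassLearningModel",
    "/dcma/rest/machineExtraction",
    "/dcma/rest/machineExtractionForBatchClass",
})

# Web services that remain but no longer include machine learning support.
ML_SUPPORT_REMOVED = frozenset({
    "/dcma/rest/OcrClassifyExtractSearchablePDF",
    "/dcma/rest/initiateOcrClassifyExtract",
    "/dcma/rest/ocrClassifyExtract",
    "/dcma/rest/v2/ocrClassifyExtract",
    "/dcma/rest/v2/ocrClassifyExtractBase64",
})


def ml_status(path):
    """Classify a REST path by its machine learning status in release 2019.1."""
    if path in REMOVED_ENDPOINTS:
        return "removed"
    if path in ML_SUPPORT_REMOVED:
        return "available without machine learning"
    return "not listed"


print(ml_status("/dcma/rest/machineExtraction"))
print(ml_status("/dcma/rest/v2/ocrClassifyExtract"))
```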