Batch.xml and XSD Schema for Developers

Introduction

This reference document is for developers who want to add functionality to, or customize, their solution beyond a regular installation of Ephesoft Transact. This document applies to Ephesoft Transact 2019.1 and above.

  • Developer tasks could entail web services, scripting, interoperation between Ephesoft Transact and external applications, or other custom integrations of Ephesoft Transact.
  • This document describes the layout and content of the batch.xml file so that this metadata can be consumed by other applications.
  • With the information contained in this document, a developer can write a workflow script that manipulates this batch-level information.

Overview of Batch.xml Schema and XSD

The batch.xml file schema and matching XSD describe the metadata and multi-level information recorded for every batch processed in the Ephesoft Transact workflow. The batch.xml file contains metadata for each batch instance at the batch, document, and page levels.
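As a rough illustration of these three levels, the following Python sketch walks a small hand-written sample rather than a real batch.xml. The nesting used here (a Batch root that holds Documents/Document, each holding Pages/Page) and all sample values are assumptions made for the example; verify them against the XSD for your release.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. The nesting and values are assumptions for illustration.
SAMPLE = """<Batch>
  <BatchInstanceIdentifier>BI1</BatchInstanceIdentifier>
  <Documents>
    <Document>
      <Identifier>DOC1</Identifier>
      <Pages>
        <Page><Identifier>PG0</Identifier></Page>
        <Page><Identifier>PG1</Identifier></Page>
      </Pages>
    </Document>
  </Documents>
</Batch>"""


def outline(batch_xml):
    """Return the batch identifier and a map of document ID to page IDs."""
    root = ET.fromstring(batch_xml)
    batch_id = root.findtext("BatchInstanceIdentifier")
    documents = {}
    for doc in root.iterfind("Documents/Document"):
        pages = [page.findtext("Identifier") for page in doc.iterfind("Pages/Page")]
        documents[doc.findtext("Identifier")] = pages
    return batch_id, documents


print(outline(SAMPLE))
```

Reading the file once into a plain dictionary like this keeps later checks independent of the XML library.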

Field Levels in the Batch.xml Schema

The batch.xml file and XSD support the following field levels. This hierarchy of fields applies to the batch.xml schema for each batch instance that has begun the workflow process:

  • Batch-level Fields — The fields on this level apply to the entire batch instance as a whole.
  • Document Fields — The fields on this level apply to all documents in the batch instance.
  • Document-level Fields — The fields on this level apply to individual documents within the batch instance.
  • Page Fields — The fields on this level apply to all pages within the batch instance.
  • Page-Level Fields — The fields on this level apply to each individual page within the batch instance.
  • Email Metadata in the Batch.xml Schema — Email heading metadata is available on multiple levels of the batch.xml schema for any batch instance that uses email import.

Batch-Level Fields

Refer to the following table for the batch-level fields in the Ephesoft batch.xml schema and XSD.

Note: For email metadata in batch-level fields, refer to Email Metadata in the Batch.xml Schema.

Each entry in the following table lists the batch-level field name, followed by its description and the associated module and plugin.

BatchInstanceIdentifier

This value is the Identifier column in the batch_instance table. Each batch in Ephesoft Transact has a unique batch identifier.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

BatchClassIdentifier

This value is the Identifier column in the batch_class table. Each batch in Ephesoft Transact is run under a batch class that is a single unit for all configurations and workflow definitions.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

BatchClassName

This value is the batch_class_name column in the batch_class table. Each batch in Ephesoft Transact is run under a batch class that is a single unit for all configurations and workflow definitions. A foreign key relation is established between the ID column of the batch_class table and the batch_class_id column in the batch_instance table.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

Signature

This value is added only when batch encryption is enabled. It is used to ensure that the batch.xml file can only be read and updated by Ephesoft Transact.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

BatchClassDescription

This is the value of the batch_class_description column in the batch_class table. Each batch in Ephesoft Transact is run under a batch class that is a single unit for all configurations and workflow definitions. A foreign key relation is established between the ID column of the batch_class table and the batch_class_id column in the batch_instance table.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

BatchClassVersion

This is the version number of the batch class under which the batch was processed. This is the value of the batch_class_version column in the batch_class table. Each batch in Ephesoft Transact is run under a batch class that is a single unit for all configurations and workflow definitions. A foreign key relation is established between the ID column of the batch_class table and the batch_class_id column in the batch_instance table.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

BatchName

This is the value of the batch_name column in the batch_instance table.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

BatchDescription

This is the batch instance description that is provided at the time of batch creation.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

BatchPriority

This is the value of the batch_priority column in the batch_instance table. Priority can be a value between 1 and 100, with a lower number indicating a higher priority. If no priority is assigned using custom code, the batch priority is the priority from the batch class, which is assigned when the batch class is created or imported.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

BatchStatus

This value specifies the current batch status. Possible values for this field are:

  • New
  • Ready
  • Running
  • ReadyForReview
  • ReadyForValidation
  • Error
  • Finished

Module: All modules can modify this field.
Plugin: All plugins can modify this field.
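A consumer of batch.xml may want to reject status values it does not recognize. The following Python sketch checks BatchStatus against the values listed for this field. It assumes BatchStatus is a direct child of the root element, and the one-line sample is hand-written for illustration.

```python
import xml.etree.ElementTree as ET

# Status values taken from the BatchStatus description in this document.
KNOWN_STATUSES = frozenset({
    "New", "Ready", "Running", "ReadyForReview",
    "ReadyForValidation", "Error", "Finished",
})


def read_batch_status(batch_xml):
    """Return the BatchStatus text, or raise if it is missing or unlisted."""
    # Assumption: BatchStatus sits directly under the root element.
    status = ET.fromstring(batch_xml).findtext("BatchStatus")
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unexpected BatchStatus: {status!r}")
    return status


SAMPLE = "<Batch><BatchStatus>ReadyForReview</BatchStatus></Batch>"
print(read_batch_status(SAMPLE))
```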

BatchReviewedBy

This field cites users who performed batch review and enables administrators to audit users who are active in the batch instance. This node applies at the batch level and the individual page level.

  • Batch level: This node indicates a single user who completed the batch.

Module: Page Process
Plugins: SEARCH_CLASSIFICATION, MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN

BatchValidatedBy

This field cites users who performed validation and enables administrators to audit all users who are active in the batch instance. This node applies at the batch level and the individual page level, with these parameters.

  • Batch level: This node indicates a single user who completed the batch.

Module: Page Process
Plugins: SEARCH_CLASSIFICATION, MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN

BatchCreationDate

This is the value of the creation_date column in the batch_class table. This is the date and time when the batch was created in Ephesoft Transact.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

BatchLocalPath

This is the Ephesoft system folder path where the batch instance folder will be available. This value will be the same across all batches in the system.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

BatchSource

This value states which ingestion mechanism was used to import the batch into the system. Possible values are:

  • Upload Batch
  • Web Scanner
  • Email Import
  • UNC Folder
  • Web Service
  • CMIS Import
  • Snapdoc

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER

UNCFolderPath

This is the path where the source file for the batch is available. This is a unique path for each batch in the system.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

ETextMode

The batch.xml has an additional tag, ETextMode, at the root level to define the EText mode, with values of Automatic, Always, and Never. The value is populated from the following plugin property:

  • For Windows, the EText Recostar Project property in the Recostar_HOCR plugin of the Page Process module.
  • For Linux, the Process PDFs as EText Files property in the Nuance_HOCR plugin of the Page Process module.

Module: Page Process
Plugins: Recostar_HOCR, Nuance_HOCR

DocumentClassificationTypes

This value specifies what classification type was used for the document assembly.

Module: Document Assembly
Plugin: DOCUMENT_ASSEMBLER

Document Fields

The document fields in the batch.xml schema apply to all documents in the batch instance. Document fields exist at a higher level than document-level fields.

Note: For email metadata in document fields, refer to Email Metadata in the Batch.xml Schema.

The following example illustrates the docField for a batch instance:

<xs:complexType name="docField">
  <xs:complexContent>
    <xs:extension base="field">
      <xs:sequence>
        <xs:element name="AlternateValues" minOccurs="0" maxOccurs="1">
          <xs:complexType>
            <xs:sequence>
              <xs:element minOccurs="0" maxOccurs="unbounded" name="AlternateValue" type="field" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="PreviousValue" type="docField" minOccurs="0" maxOccurs="1" />
        <xs:element name="Category" type="xs:string" minOccurs="0" maxOccurs="1" />
        <xs:element name="hidden" type="xs:boolean" minOccurs="0" maxOccurs="1" />
        <xs:element name="widgetType" type="xs:string" minOccurs="0" maxOccurs="1" />
        <xs:element name="scriptEnabled" type="xs:boolean" minOccurs="0" maxOccurs="1" />
        <xs:element name="Message" type="xs:string" minOccurs="0" maxOccurs="1" />
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

The following table lists and defines the fields contained in the docField section of the batch.xml schema.

Each entry lists the document field name, followed by its description and the associated module and plugin.

AlternateValues

This field contains alternate values for a page-level field. It stores alternative classification information with confidence levels.

In Ephesoft Transact 4.5.X.X and prior releases, a page can be classified into 10 different types during classification. The type with the highest confidence value is set as the page-level field value, and all other possible types for the page are stored as alternate values. This tag contains the LearnedFileName tag for each alternate value.

In Ephesoft Transact 2019.1 and later releases, the default number of types is changed to 5.

Module: Page Process
Plugins: SEARCH_CLASSIFICATION, MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN
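To show how alternate values might be read, the following Python sketch ranks the AlternateValue entries of one field by confidence. The AlternateValues/AlternateValue nesting follows the docField definition; the Value and Confidence child elements, the PageLevelField wrapper, and the sample names are assumptions, because the base field type is not reproduced in this document.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. Value and Confidence children are assumed, not confirmed.
SAMPLE_FIELD = """<PageLevelField>
  <Name>Search_Engine_Classification</Name>
  <Value>Invoice_First_Page</Value>
  <Confidence>82.5</Confidence>
  <AlternateValues>
    <AlternateValue><Value>Invoice_Last_Page</Value><Confidence>40.0</Confidence></AlternateValue>
    <AlternateValue><Value>Receipt_First_Page</Value><Confidence>61.2</Confidence></AlternateValue>
  </AlternateValues>
</PageLevelField>"""


def ranked_alternates(field_xml):
    """Return (value, confidence) pairs, highest confidence first."""
    field = ET.fromstring(field_xml)
    pairs = [
        (alt.findtext("Value"), float(alt.findtext("Confidence", "0")))
        for alt in field.iterfind("AlternateValues/AlternateValue")
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for value, confidence in ranked_alternates(SAMPLE_FIELD):
    print(value, confidence)
```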

Document-Level Fields

Refer to the following table for the document-level fields in the Ephesoft batch.xml schema and XSD.

Note: For email metadata in document-level fields, refer to Email Metadata in the Batch.xml Schema.

Each entry lists the document-level field name, followed by its description and the associated modules and plugins.

Identifier

This is the document identifier for a document. The sequence for document numbering is DOC0, DOC1 … DOCn.

Note: Based on the current implementation, where each file becomes a separate batch, there will be only one document in the XML after Folder Import. All pages will belong to this one document.

The page-to-document grouping will change after the Document_Assembler_Plugin within the Document Assembly module is executed. The pages may be grouped into multiple documents.

Modules: Folder Import, Document Assembly
Plugins: IMPORT_BATCH_FOLDER_PLUGIN, DOCUMENT_ASSEMBLER_PLUGIN

Type

This is the document type assigned to the document.

Note: Based on the current implementation, where each file becomes a separate batch, there will be only one document in the XML after Folder Import. All pages will belong to this one document, which is named Unknown. The page-to-document grouping will change after the Document_Assembler_Plugin within the Document Assembly module is executed. The pages may be grouped into multiple documents, and the document type as which the pages were classified is assigned to this tag. The document types that belong to the batch class assigned to the batch are available in the database table document_type (column document_type_name). This table has a foreign key reference to the ID column of the batch_class table, which associates document types with the batch class.

Modules: Folder Import, Document Assembly
Plugins: IMPORT_BATCH_FOLDER_PLUGIN, DOCUMENT_ASSEMBLER_PLUGIN

Description

This tag contains the document_type_description that corresponds to the assigned document_type_name above. It is picked up from the database table document_type.

Modules: Folder Import, Document Assembly
Plugins: IMPORT_BATCH_FOLDER_PLUGIN, DOCUMENT_ASSEMBLER_PLUGIN

Size

This element contains the document’s multipage PDF file size in bytes. This field is populated only by the IBM_CM_PLUGIN.

Module: Export
Plugin: IBM_CM_PLUGIN

Confidence

This value is the confidence with which the document was assembled. If the confidence is greater than the minimum confidence threshold assigned to the document, then the document is not marked for operator review.

Module: Folder Import
Plugins: IMPORT_BATCH_FOLDER_PLUGIN, DOCUMENT_ASSEMBLER_PLUGIN

ConfidenceThreshold

This field helps Ephesoft Transact decide whether the document should skip document review automatically, which happens when the classification score is higher than the threshold. The document confidence threshold is available in the min_confidence_threshold column of the document_type table. The best practice is to set the threshold so that false positives are minimized.

Module: Document Assembly
Plugin: DOCUMENT_ASSEMBLER_PLUGIN
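The comparison between Confidence and ConfidenceThreshold can be sketched as follows. This is not Ephesoft Transact's own code; it only restates the rule described for these fields. The behavior when the two values are equal is an assumption, since this document only covers the greater-than case, and the sample documents are hand-written.

```python
import xml.etree.ElementTree as ET


def marked_for_review(document):
    """Return True unless Confidence is strictly greater than the threshold."""
    confidence = float(document.findtext("Confidence"))
    threshold = float(document.findtext("ConfidenceThreshold"))
    # Assumption: equal values are treated as needing review.
    return not confidence > threshold


low = ET.fromstring(
    "<Document><Confidence>55.0</Confidence>"
    "<ConfidenceThreshold>70.0</ConfidenceThreshold></Document>"
)
high = ET.fromstring(
    "<Document><Confidence>90.0</Confidence>"
    "<ConfidenceThreshold>70.0</ConfidenceThreshold></Document>"
)
print(marked_for_review(low), marked_for_review(high))
```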

Valid

This tag determines if the document would stop for Document Field Validation review. This applies only when data extraction is part of the batch class. The value of False indicates that the document has fields that need to stop for Document Field Validation review. The value of True indicates that all fields in the document were extracted with high confidence and need not stop for Document Field Validation review. The value is set to True after execution of Review_Document_Plugin if the extraction module is not configured in the batch class.

Modules: Document Assembly, Review Document
Plugins: DOCUMENT_ASSEMBLER_PLUGIN, REVIEW_DOCUMENT_PLUGIN

Reviewed

This tag determines if the document would stop for document classification review. The value of False indicates that the document was assembled/classified with low confidence and needs to stop for document classification review. The value of True indicates that the document was assembled/classified with high confidence and need not stop for document classification review. The value is set to True after execution of Review_Document_Plugin.

Modules: Document Assembly, Review Document
Plugins: DOCUMENT_ASSEMBLER_PLUGIN, REVIEW_DOCUMENT_PLUGIN
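An external tool could use the Valid and Reviewed flags to report which documents will stop for an operator. The Python sketch below is one way to do that on a hand-written sample. The Documents/Document nesting and the lowercase true/false spelling in the sample are assumptions, so the comparison ignores case.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. Nesting and flag spelling are assumptions.
SAMPLE = """<Batch>
  <Documents>
    <Document><Identifier>DOC1</Identifier><Reviewed>false</Reviewed><Valid>false</Valid></Document>
    <Document><Identifier>DOC2</Identifier><Reviewed>true</Reviewed><Valid>false</Valid></Document>
    <Document><Identifier>DOC3</Identifier><Reviewed>true</Reviewed><Valid>true</Valid></Document>
  </Documents>
</Batch>"""


def is_false(text):
    """Treat a missing flag as not false; compare case-insensitively."""
    return (text or "").strip().lower() == "false"


def pending_stops(batch_xml):
    """Return document IDs that stop for review and for validation."""
    root = ET.fromstring(batch_xml)
    review, validation = [], []
    for doc in root.iterfind("Documents/Document"):
        doc_id = doc.findtext("Identifier")
        if is_false(doc.findtext("Reviewed")):
            review.append(doc_id)
        if is_false(doc.findtext("Valid")):
            validation.append(doc_id)
    return review, validation


print(pending_stops(SAMPLE))
```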

ReviewedBy

This field lists users who performed review in the batch and enables administrators to audit users who are active in the batch instance. This node applies at the batch level and the individual page level:

  • Page level: This node accommodates multiple users, if multiple users processed pages in the batch.

Refer to Case Study 3 (ReviewedBy and ValidatedBy Nodes in Batch.xml).

Module: Page Process
Plugins: SEARCH_CLASSIFICATION, MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN

ValidatedBy

This field lists users who performed validation for the batch and enables administrators to audit all users who are active in the batch instance. This node applies at the batch level and the individual page level, with these parameters:

  • Batch level: This node indicates a single user who completed the batch.
  • Page level: This node accommodates multiple users when multiple users processed pages in the batch.

Module: Page Process
Plugins: SEARCH_CLASSIFICATION, MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN

ErrorMessage

A string that contains the error message to be displayed on the Review and Validate screen for a document. The value of this tag can be set using a scripting plugin. Ephesoft Transact does not set a value for this field.

Module: Review/Validation
Plugin: Review/Validation

DocumentDisplayInfo

This field can be used to provide customized names to documents on the Review and Validate screen. The value of this tag can be set using a scripting plugin. Ephesoft Transact does not set a value for this field.

Module: Review/Validation
Plugin: Review/Validation
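The XML change that such a script makes can be illustrated outside of Ephesoft Transact. The standalone Python sketch below sets DocumentDisplayInfo on a Document element, creating the element when it is absent. It does not use the Ephesoft Transact scripting interface, and the helper name and sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def set_display_info(document, label):
    """Set DocumentDisplayInfo on a Document element, adding it if missing."""
    node = document.find("DocumentDisplayInfo")
    if node is None:
        # Caveat: SubElement appends at the end of the Document element.
        # The XSD may require a specific element order, so check it
        # before writing the file back.
        node = ET.SubElement(document, "DocumentDisplayInfo")
    node.text = label
    return document


doc = ET.fromstring("<Document><Identifier>DOC1</Identifier></Document>")
set_display_info(doc, "Invoice 1001")
print(ET.tostring(doc, encoding="unicode"))
```

The same pattern would apply to ErrorMessage, since both are plain string elements that Ephesoft Transact leaves unset.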

DocumentLevelFields

This field contains the parent node for all index fields inside the documents.

Module: Extraction
Plugin: All plugins inside the Extraction module can modify this field.

Pages

This field contains the parent node for all pages inside the document.

Module: Document Assembly
Plugin: DOCUMENT_ASSEMBLER

DataTables

This is the root node for all extracted table information in the document.

Module: Extraction
Plugin: TABLE_EXTRACTION

AutoSuggestedDataTables

All automatically extracted tables from the document are present under this node.

Module: Extraction
Plugin: AUTO_TABLE_EXTRACTION_PLUGIN

MultiPageTiffFile

This value contains the name of the multipage TIFF file created. The COPY_BATCH_XML plugin changes this name to the exact location while exporting the file to a user-defined location. Note that the batch.xml in the ephesoft-system-folder within the shared folders still contains only the file name.

Module: Export
Plugins: CREATEMULTIPAGE_FILES, COPY_BATCH_XML

MultiPagePdfFile

This value contains the name of the multipage PDF file created. The COPY_BATCH_XML plugin changes this name to the exact location while exporting the file to a user-defined location. Note that the batch.xml in the ephesoft-system-folder within the shared folders still contains only the file name.

Module: Export
Plugins: CREATEMULTIPAGE_FILES, COPY_BATCH_XML

FinalMultiPagePdfFilePath

This element is present in the batch.xml in the ephesoft-system-folder and contains the absolute path of the multipage PDF document exported by the COPY_BATCH_XML plugin. This value is different from MultiPagePdfFile, which contains the file name only.

Module: Export
Plugin: COPY_BATCH_XML

FinalMultiPageTiffFilePath

This element is present in the batch.xml in the ephesoft-system-folder and contains the absolute path of the multipage TIFF document exported by the COPY_BATCH_XML plugin. This value is different from MultiPageTiffFile, which contains the file name only.

Module: Export
Plugin: COPY_BATCH_XML
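A downstream process that needs the exported files could collect both path elements per document. The following Python sketch does this on a hand-written sample. The Documents/Document nesting and the sample paths are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. The paths are hypothetical.
SAMPLE = r"""<Batch>
  <Documents>
    <Document>
      <Identifier>DOC1</Identifier>
      <FinalMultiPagePdfFilePath>C:\export\BI1\BI1_DOC1.pdf</FinalMultiPagePdfFilePath>
      <FinalMultiPageTiffFilePath>C:\export\BI1\BI1_DOC1.tif</FinalMultiPageTiffFilePath>
    </Document>
  </Documents>
</Batch>"""


def exported_files(batch_xml):
    """Map each document ID to its exported PDF and TIFF paths (or None)."""
    root = ET.fromstring(batch_xml)
    result = {}
    for doc in root.iterfind("Documents/Document"):
        result[doc.findtext("Identifier")] = {
            "pdf": doc.findtext("FinalMultiPagePdfFilePath"),
            "tiff": doc.findtext("FinalMultiPageTiffFilePath"),
        }
    return result


print(exported_files(SAMPLE))
```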

Page Fields

Page fields in the batch.xml schema and XSD apply to all pages in the document. Refer to the following table for the page fields in the Ephesoft batch.xml schema and XSD.

Note: For email metadata in page fields, refer to Email Metadata in the Batch.xml Schema.

Each entry lists the page field name, followed by its description and the associated module and plugin.

Identifier

This is the page identifier for a page. The sequence for page numbering is PG0, PG1 … PGn.

Note: The Folder Import module breaks up each page in the source PDF into individual TIFF files. Each TIFF file is a page in the XML file. The pages can be grouped as documents.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

OldFileName

This tag contains the name of the mapped individual TIFF file within the input folder for the batch. The input folder path is available in the tag UNCFolderPath under batch-level fields.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

NewFileName

This tag contains the name of the mapped individual TIFF file within the Ephesoft system folder. The Ephesoft system folder path is available in the tag BatchLocalPath. The path to the batch instance folder is <BatchLocalPath>\<BatchInstanceIdentifier>. The name of the file associated with this page is a combination of the batch instance identifier and the page sequence.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN
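The folder rule above, <BatchLocalPath>\<BatchInstanceIdentifier>, can be combined with NewFileName to locate each page image. The Python sketch below builds these paths with PureWindowsPath so that the result does not depend on the operating system that runs it. The sample folder, file name, and element nesting are assumptions made for illustration.

```python
from pathlib import PureWindowsPath
import xml.etree.ElementTree as ET

# Hand-written sample. Folder and file names are hypothetical.
SAMPLE = r"""<Batch>
  <BatchInstanceIdentifier>BI1</BatchInstanceIdentifier>
  <BatchLocalPath>C:\Ephesoft\SharedFolders\ephesoft-system-folder</BatchLocalPath>
  <Documents><Document><Pages>
    <Page><Identifier>PG0</Identifier><NewFileName>BI1_0.tif</NewFileName></Page>
  </Pages></Document></Documents>
</Batch>"""


def page_file_paths(batch_xml):
    """Map each page ID to <BatchLocalPath>\\<BatchInstanceIdentifier>\\<NewFileName>."""
    root = ET.fromstring(batch_xml)
    folder = PureWindowsPath(
        root.findtext("BatchLocalPath"),
        root.findtext("BatchInstanceIdentifier"),
    )
    return {
        page.findtext("Identifier"): str(folder / page.findtext("NewFileName"))
        for page in root.iterfind("Documents/Document/Pages/Page")
    }


print(page_file_paths(SAMPLE))
```

On a Linux installation the separator would differ, so PurePosixPath may be the better choice there.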

SourceEmailID

This element links the page to the source email from which the page originated. It contains only the ID of the source email. This ID can be looked up in a separate section of the batch.xml that contains more details.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER

SourceFileID

This element links the page to the source file that was originally placed in the watch folder. It contains only the ID of the source file. This ID can be looked up in a separate section of the batch.xml that contains more details.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER

PageLevelFields

This value contains the classification information from different configured plugins. The values in this section are used while assembling pages into documents.

Module: Document Assembly
Plugin: DOCUMENT_ASSEMBLER

HocrFileName

The Recostar_HOCR_Generation_Plugin (for Windows) or the NUANCE_HOCR plugin (for Linux) extracts the contents of each page (individual TIFF). The contents are stored in an XML file located in the batch instance folder (<BatchLocalPath>\<BatchInstanceIdentifier>). This tag stores the name of the HOCR XML file for the corresponding page.

Module: Page Process
Plugin: RECOSTAR_HOCR_GENERATION_PLUGIN

ThumbnailFileName

This tag stores the name of the corresponding thumbnail for the page.

The Image_Process_Create_Thumbnails_Plugin is used to create thumbnail images of the batch images. These thumbnails are displayed on the Review and Validate screen, where pages in the documents are shown as thumbnails under the document name. The thumbnails are stored in the batch instance folder (<BatchLocalPath>\<BatchInstanceIdentifier>).

Module: Page Process
Plugin: IMAGE_PROCESS_CREATE_THUMBNAILS_PLUGIN

ComparisonThumbnailFileName

This value contains the name of the thumbnail file that can be used by the CLASSIFY_IMAGES plugin. This element is present only when the Create Compare Thumbnail Switch is on in the CREATE_THUMBNAILS plugin.

Module: Page Process
Plugin: CREATE_THUMBNAILS

DisplayFileName

The Image_Process_Create_Display_Image_Plugin creates the display PNG files for the images being processed. This plugin takes all the images and creates PNG files for the corresponding pages, which are displayed on the Review and Validate screens. The display images are stored in the batch instance folder (<BatchLocalPath>\<BatchInstanceIdentifier>). This tag stores the name of the corresponding display image for the page.

Module: Page Process
Plugin: IMAGE_PROCESS_CREATE_DISPLAY_IMAGE_PLUGIN

OCRInputFileName

This tag stores the name of the file that was used by the Recostar_HOCR_Generation_Plugin to extract the contents of the page. The image is the corresponding individual TIFF for the page, available in the batch instance folder (<BatchLocalPath>\<BatchInstanceIdentifier>).

Module: Page Process
Plugin: RECOSTAR_HOCR_GENERATION_PLUGIN

Direction

This field indicates the direction of a rotated document.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

IsRotated

This field indicates whether a document is rotated on the Review/Validate screen.

Module: Folder Import
Plugin: IMPORT_BATCH_FOLDER_PLUGIN

TreatAsEText

For a PDF, the batch.xml is updated for each page to indicate whether the page is EText compatible. The page-level element is <TreatAsEText>, with a value of true or false. The <TreatAsEText> element does not appear for pages whose input file in the UNC folder is not a PDF.

Module: Page Process
Plugins: Recostar_HOCR, Nuance_HOCR

ImprintedString

If the imprinter was enabled during scanning, the imprinted string is read from the serialized file for each image and added as a page-level field in batch.xml. This string is used by the rescan and insert functionality on the Review and Validate screen.

Module: Folder Import
Plugin: Import_Batch_Folder

IsBlank

IsBlank under the Page tag is used if there is no HOCR content associated with the page (that is, if the HOCR XML associated with the page or image is blank).

Module: Page Process
Plugins: Recostar_HOCR, Nuance_HOCR

Page-Level Fields

Page-level fields apply to each document page in the batch.xml file. Refer to the following table for the page-level fields in the Ephesoft batch.xml schema and XSD.

Note: For email metadata in page-level fields, refer to Email Metadata in the Batch.xml Schema.

Each entry lists the page-level field name, followed by its description and the associated module and plugins.

Name

This tag contains the name of the classification used to classify this page.

Module: Page Process
Plugins: SEARCH_CLASSIFICATION, MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN

Value

Each document type within the batch class is subdivided into page types (first, middle, and last). This tag holds the page type as which this page was classified.

Module: Page Process
Plugins: SEARCH_CLASSIFICATION, MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN

Type

This tag is used in barcode classification only, where it holds the barcode type.

Module: Page Process
Plugins: SEARCH_CLASSIFICATION, MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN

Confidence

This tag holds the confidence score with which the page was classified. This confidence is used while assembling documents in the Document Assembly module.

Module: Page Process
Plugins: SEARCH_CLASSIFICATION, MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN

LearnedFileName

This tag holds the name of the lucene-search-classification-sample matched against this page/image.

Module: Page Process
Plugins: SEARCH_CLASSIFICATION, MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN
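To inspect how a page was classified, a script could pick the page-level field with the highest Confidence. The Python sketch below does this for one Page element. It is only an inspection aid and does not reproduce how the Document Assembly module combines scores. The PageLevelFields/PageLevelField nesting and the sample names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. Nesting and classification names are assumptions.
SAMPLE_PAGE = """<Page>
  <Identifier>PG0</Identifier>
  <PageLevelFields>
    <PageLevelField>
      <Name>Search_Engine_Classification</Name>
      <Value>Invoice_First_Page</Value>
      <Confidence>82.5</Confidence>
    </PageLevelField>
    <PageLevelField>
      <Name>Barcode_Classification</Name>
      <Value>Invoice_Middle_Page</Value>
      <Confidence>10.0</Confidence>
    </PageLevelField>
  </PageLevelFields>
</Page>"""


def best_classification(page):
    """Return (name, value, confidence) of the highest-confidence field."""
    best = None
    for field in page.iterfind("PageLevelFields/PageLevelField"):
        confidence = float(field.findtext("Confidence", "0"))
        if best is None or confidence > best[2]:
            best = (field.findtext("Name"), field.findtext("Value"), confidence)
    return best


print(best_classification(ET.fromstring(SAMPLE_PAGE)))
```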

 

Email Metadata in the Batch.xml Schema

Ephesoft Transact 2019.1 and later releases support passing email-specific metadata through to the batch.xml schema.

For additional information about accessing and ingesting email header information for batch instances and how that metadata is passed through to the batch.xml, refer to the following article:

Accessing Email Headers in the Batch.xml Schema

Case Studies for the Batch.xml File

Case Study 1 — Accessing the Batch.xml File

Perform the following steps to access the batch.xml file for your deployment of Ephesoft Transact.

  1. Open Ephesoft Transact, log in as Administrator, and open the Batch Class Management screen.
  2. Open a batch class that contains a fully configured Export module, and in which at least one batch instance has been processed.
  3. Within the Export module, select the COPY_BATCH_XML plugin. The Plugin Configuration screen for this plugin appears on the right.

Plugin Configuration for COPY_BATCH_XML Plugin

  4. This screen displays the path to the batch.xml file in the Batch XML Export Folder Location field.
  5. To access this batch.xml file, navigate to this path on the Ephesoft Transact server. Within the sub-folder named final-drop-folder, there is another sub-folder that is named for the batch instance. The following snapshot illustrates one such sub-folder for the batch instance.

Sample Batch Instance Subfolder that contains the batch.xml file for that batch instance

The batch.xml file contained in this sub-folder is also named according to the batch instance in which it was created. The following snapshot illustrates one example of a batch.xml filename:

Sample batch.xml filename

  6. Open the batch.xml file with Notepad++ or a similar text editor. Right-click the file, select Open With, and choose an application:

Right click on the batch.xml file and open with a text editor

Case Study 2 — BatchReviewedBy and BatchValidatedBy Nodes

This is a batch-level case study. This example illustrates the BatchReviewedBy and BatchValidatedBy fields in which there are three active users:

  • User1 — reviews some documents in the batch
  • User2 — reviews some documents in the batch
  • Ephesoft — reviews some documents in the batch, and then validates the batch

On the batch instance level, note the following activity from each user:

  • Between User1 and User2, it was User2 who was active last in the review module, so this is the user that is cited in the BatchReviewedBy field.
  • The BatchValidatedBy field indicates the Ephesoft user validated this batch instance.

Batch.xml file example with batch instance information
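The two batch-level audit fields from this case study could be read as in the following Python sketch. The sample mirrors the scenario described above and is hand-written. It assumes both elements are direct children of the root element.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that mirrors the case study scenario.
SAMPLE = (
    "<Batch>"
    "<BatchReviewedBy>User2</BatchReviewedBy>"
    "<BatchValidatedBy>Ephesoft</BatchValidatedBy>"
    "</Batch>"
)


def batch_audit(batch_xml):
    """Return the users cited in BatchReviewedBy and BatchValidatedBy."""
    root = ET.fromstring(batch_xml)
    return {
        "reviewed_by": root.findtext("BatchReviewedBy"),
        "validated_by": root.findtext("BatchValidatedBy"),
    }


print(batch_audit(SAMPLE))
```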

Case Study 3 — ReviewedBy and ValidatedBy Nodes in Batch.xml

This is a document-level case study. Note that the users shown in this case study each performed the following tasks:

  • The ephesoft user performed a partial review of the documents. user1 and user2 then continued document review for this batch instance, and user2 finished the document review.
  • The ephesoft user performed validation for the documents in this batch instance.

Batch.xml example with user information for Document 1