EText Support – Leveraging Existing Text Layer in PDF Documents

Introduction

EText Functionality

Page Process Module

Learn File(s) / Test Classification / Test Extraction

Extraction Configuration Screens

Batch.xml File

Export Module

Web Services

Use Cases

EText Mode – Automatically

EText Mode – Always

EText Mode – Never

Introduction

Ephesoft Transact fully leverages the text embedded in computer generated PDFs (also referred to by RecoStar as EText). This helps to boost the accuracy of extracted text and greatly reduce the effort required for extraction on all projects that include processing of electronically generated documents.

Rather than being OCRed as images, documents with an EText layer are processed by a dedicated mechanism that extracts the data directly. Depending on the configuration, the EText support feature can be applied:

  • AUTOMATICALLY
  • ALWAYS
  • NEVER

This feature also allows the exported PDF artefact(s) to be created from the original file rather than from an image with a text layer attached.

The following languages are supported by the EText feature:

1. English
2. French
3. German
4. Italian
5. Spanish
6. Dutch

EText Functionality

The feature is implemented on the following levels:

Page Process Module

  • RECOSTAR_HOCR plugin (Windows)
  • NUANCE_HOCR plugin (Linux)

Learn File(s) / Test Classification / Test Extraction

Extraction Configuration Screens

  • KV Extraction Rule
  • Paragraph Extraction Rule
  • Cross Section Extraction
  • Table Extraction Rule
  • KV Page Process Extraction Rule

Batch.xml File

  • At root level (<ETextMode>)
  • At page level (<TreatAsEText>)

Export Module

  • CREATEMULTIPAGE_FILES

Web Services

Three new web services have been added and 11 existing web services have been updated to support the new functionality.

Feature implementation on various levels is described in detail below.

 

Page Process Module

RECOSTAR_HOCR plugin (Windows)

Two fields “Use EText Recostar Project File” and “PDF EText Recostar Project File Name” have been added to the plugin. Both new fields are mandatory.

Use EText Recostar Project File: Allows the user to select one of the three options provided for EText support:

– Automatically

– Always

– Never

The default value is Never.

PDF EText Recostar Project File Name: Selects the Recostar project file used for EText OCRing when the option selected in the first field is either Always or Automatically.

If needed, users can edit this .rsp file for more granular control or customization.

There are four .rsp files available in the drop-down: FPR_EText.rsp, FPR.rsp, FPR_MultiLanguage.rsp and FPR_Barcode.rsp. FPR_EText.rsp is selected by default. In the rare case that the user selects any other file, they must change the ProcessingMode parameter in that file from OcrFromImage to ExtractFromEtext for the functionality to work. This approach avoids duplicating .rsp files in the fixed-form-extraction folder, yet gives the user the required flexibility.
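As a quick sanity check before selecting a non-default project file, the sketch below tests whether a file's text already carries the EText processing mode. The layout of .rsp files is not described here, so the function only looks for the two literal values named above; the function name and the sample strings are hypothetical.

```python
# The two ProcessingMode values come from the text above.
# Everything else (names, sample strings) is illustrative only.
OCR_FROM_IMAGE = "OcrFromImage"
EXTRACT_FROM_ETEXT = "ExtractFromEtext"


def etext_ready(rsp_text: str) -> bool:
    """Return True if the text mentions ExtractFromEtext and no longer
    mentions OcrFromImage. This is a plain substring check; it does not
    parse the real .rsp file structure."""
    return EXTRACT_FROM_ETEXT in rsp_text and OCR_FROM_IMAGE not in rsp_text


# Hypothetical fragments; a real .rsp file is laid out differently.
for fragment in ("ProcessingMode=OcrFromImage", "ProcessingMode=ExtractFromEtext"):
    print(fragment, "->", etext_ready(fragment))
```

In practice the text would be read from a copy of the chosen file in the fixed-form-extraction folder, and any change would still be made by hand as described above.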

Depending on the option selected in the first field, the “PDF EText Recostar Project File Name” and “Image OCR Recostar Project File Name” fields will be disabled or enabled:

Use EText Recostar Project File | PDF EText Recostar Project File Name | Image OCR Recostar Project File Name
Automatically | Enabled | Enabled
Always | Enabled | Disabled
Never | Disabled | Enabled
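The enabled/disabled rules above amount to a simple lookup. The following is a minimal sketch of that lookup; the dictionary keys and the function name are made up for illustration and are not part of the product.

```python
# Which project-file fields are enabled for each option of
# "Use EText Recostar Project File". Key names are hypothetical.
FIELD_STATE = {
    "Automatically": {"pdf_etext_project_file": True, "image_ocr_project_file": True},
    "Always": {"pdf_etext_project_file": True, "image_ocr_project_file": False},
    "Never": {"pdf_etext_project_file": False, "image_ocr_project_file": True},
}


def enabled_fields(mode: str) -> list:
    """Return the names of the fields that are enabled for the given option."""
    state = FIELD_STATE[mode]
    return [name for name, enabled in state.items() if enabled]


for mode in FIELD_STATE:
    print(mode, enabled_fields(mode))
```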

Important: The default OCR language for FPR_EText.rsp is English (USA). To change the language(s), edit this file manually (SharedFolders\<Batch Class Id>\fixed-form-extraction): select the required language(s) in the OCR Country/Language field on the RECOSTAR_HOCR plugin screen, copy the selection, paste it under the Country parameter, and save the file.

 

NUANCE_HOCR plugin (Linux)

One mandatory field “Process PDFs as EText Files” has been added to the plugin. Same as in the RECOSTAR_HOCR plugin, this field allows the user to select one of three available options provided for EText support:

– Automatically

– Always

– Never

The default value is Never.

EText support functionality is implemented in a similar way in both Windows and Linux environments.

 

EText Support – AUTOMATICALLY

– All input PDF files are analysed and EText compatible files are identified.

– During batch processing, the EText_Processing folder is created under the ephesoft-system-folder (SharedFolders\ephesoft-system-folder\<Batch Instance Id>). The folder structure and the files it contains for the RECOSTAR_HOCR plugin are presented below.

ephesoft-system-folder

Batch Instance ID (e.g. BI1):

– All extracted TIFF/PNG files, except those that correspond to EText compatible PDF pages in the EText_Processing folder

– Rsbatch file containing the names of the TIFF/PNG files, except those that correspond to EText compatible PDF pages in the EText_Processing folder*

– HOCR files for non-EText-compatible pages

EText_Processing:

– All EText compatible PDF files that are resized and denormalized (later to be used for Export)

– Rsbatch file containing each single page PDF name for EText compatible files*

– HOCR files for all single page EText compatible PDFs

* For NUANCE_HOCR plugin, no rsbatch file is created in the EText_Processing folder since this file is specific to RecoStar.

All HOCR files present in EText_Processing folder are also copied to the Batch Instance folder under the ephesoft-system-folder.

 

EText Support – ALWAYS

– All input PDF files identified during execution of RECOSTAR_HOCR / NUANCE_HOCR plugin are by default considered to be EText compatible.

– During batch processing, the EText_Processing folder is created under the ephesoft-system-folder (SharedFolders\ephesoft-system-folder\<Batch Instance Id>). The folder structure and the files it contains for the RECOSTAR_HOCR plugin are presented below.

ephesoft-system-folder

Batch Instance ID (e.g. BI1):

– All extracted TIFF/PNG files, except those that correspond to PDF pages in the EText_Processing folder

– Rsbatch file containing the names of the TIFF/PNG files, except those that correspond to PDF pages in the EText_Processing folder*

– Blank HOCR files in case of non-PDF input files or for PDF files without the EText layer#

EText_Processing:

– All extracted PDF files that are resized and denormalized (later to be used for Export)

– Rsbatch file containing each single page PDF name*

– HOCR files for all single page PDFs (blank HOCR files in case of non-EText-compatible pages)

* For NUANCE_HOCR plugin, no rsbatch file is created in the EText_Processing folder since this file is specific to RecoStar.

# NUANCE_HOCR plugin generates normal HOCRs for TIFF/PNG files, as the configuration SETTINGS.sts file used for Nuance has both PDF and non-PDF properties which are used according to the file type.

All HOCR files present in EText_Processing folder are also copied to the Batch Instance folder under the ephesoft-system-folder.

 

EText Support – NEVER

– There is no special handling of EText compatible files.

– The EText_Processing folder is not created for a batch instance.

– HOCR files are generated for all TIFF/PNGs under the ephesoft-system-folder (SharedFolders\ephesoft-system-folder\<Batch Instance Id>), as was done prior to the 2019.1 release.
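The three modes described above can be summarised as a small decision function that says where the HOCR file for a page is first written and what kind of HOCR it is. This is only a sketch of the documented behaviour, not the plugin's code: the `Page` class, the function name and the return labels are hypothetical, and the NUANCE_HOCR exception is applied to non-PDF input only, as stated in the footnote for ALWAYS mode.

```python
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    is_pdf: bool
    has_etext: bool = False  # only meaningful for PDF pages


def hocr_plan(mode: str, page: Page, plugin: str = "RECOSTAR_HOCR") -> tuple:
    """Return (folder, hocr_kind) for one input page.

    folder is "EText_Processing" or "batch instance folder";
    hocr_kind is "etext", "ocr" or "blank".
    """
    mode = mode.upper()
    if mode == "NEVER":
        # No special handling: every page is OCRed from its image.
        return ("batch instance folder", "ocr")
    if mode == "AUTOMATICALLY":
        if page.is_pdf and page.has_etext:
            return ("EText_Processing", "etext")
        return ("batch instance folder", "ocr")
    if mode == "ALWAYS":
        if page.is_pdf:
            # Every PDF page is treated as EText; pages without a text
            # layer end up with a blank HOCR file.
            return ("EText_Processing", "etext" if page.has_etext else "blank")
        # Non-PDF input: blank with RecoStar, normal HOCR with Nuance.
        kind = "ocr" if plugin == "NUANCE_HOCR" else "blank"
        return ("batch instance folder", kind)
    raise ValueError(f"unknown EText Support mode: {mode}")


scanned = Page("page1.pdf", is_pdf=True, has_etext=False)
for mode in ("Automatically", "Always", "Never"):
    print(mode, hocr_plan(mode, scanned))
```

Note that HOCR files written to EText_Processing are also copied to the batch instance folder, which the sketch does not model.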

 

Learn File(s) / Test Classification / Test Extraction

All three functions depend on the EText Support mode (ALWAYS/AUTOMATICALLY/NEVER) and generate only one HOCR file at a time.

EText Support Mode

Automatically: Whenever a PDF is uploaded, each page of the PDF is checked for EText availability. If an EText layer is available, the individual PDF pages are used for OCRing; otherwise TIFF files are used.

Always: Whenever a PDF is uploaded, each individual PDF page is used for OCRing. With the RECOSTAR_HOCR plugin, blank HOCR files are created for non-PDF pages/files. With the NUANCE_HOCR plugin, normal HOCR files are created for non-PDF pages/files.

Never: Only TIFF files are used for OCRing, as in previous versions of Transact.

Note on the Learn Files functionality: If the EText support mode is Always or Automatically and the user deletes the HOCR files manually from the corresponding folders, these EText PDF files will need to be re-uploaded and re-learned since individual PDF pages are not saved in the backend folder.

 

Extraction Configuration Screens

To implement the EText support feature, a new dropdown has been added to the Extraction configuration screens, such as KV Extraction Rule, Paragraph Extraction Rule, Cross Section Extraction Rule and Table Extraction Rule (in general, wherever the View OCR Data functionality is provided, except for the Review-Validate screens). The new dropdown contains two options, EText and OCR, with OCR as the default.

Whenever a PDF is uploaded on any Extraction Configuration screen, it is analysed to identify whether it is EText compatible, irrespective of EText Support mode in Recostar_HOCR / NUANCE_HOCR plugins (ALWAYS/AUTOMATICALLY/NEVER).

If the PDF page is EText compatible, both the TIFF and the PDF page are used for OCRing, so the system creates two HOCR files. The new dropdown becomes enabled, allowing the user to select which HOCR file (EText or OCR) to use to view OCR data. The default value of the dropdown is OCR.

Note: The EText option in the dropdown is only used to view OCR data and test extraction. The overlays used to create extraction rules are still generated on the image, as before.

If the PDF page is not EText compatible, only the TIFF file is used for OCRing and the new dropdown is disabled, with the OCR option selected by default.

As for the View OCR functionality on the Review-Validate screens, the system uses only the generated HOCR contained inside ephesoft-system-folder\<Batch Instance Id>.
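The dropdown behaviour on the Extraction Configuration screens can be sketched as a small function. The function name and the dictionary keys are hypothetical; the rules themselves are the ones stated above, and the EText Support mode of the HOCR plugin deliberately plays no part.

```python
def extraction_screen_state(is_pdf: bool, has_etext: bool) -> dict:
    """Describe the HOCR sources and dropdown state for an uploaded page.

    The result does not depend on the EText Support mode configured in
    the RECOSTAR_HOCR / NUANCE_HOCR plugin.
    """
    etext_compatible = is_pdf and has_etext
    return {
        # Two HOCR files for EText compatible pages, one otherwise.
        "hocr_sources": ["OCR", "EText"] if etext_compatible else ["OCR"],
        "dropdown_enabled": etext_compatible,
        "default_selection": "OCR",
    }


print(extraction_screen_state(is_pdf=True, has_etext=True))
print(extraction_screen_state(is_pdf=True, has_etext=False))
```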

Batch.xml File

Two new parameters have been added to the batch.xml file:

– <ETextMode> at the root level with three possible values (Always, Automatically, Never).

– <TreatAsEText> at the page level with two possible values (true, false).

EText Support mode AUTOMATICALLY

– <ETextMode> element at the root level: has a value of Automatically.

– <TreatAsEText> element at the page level: all PDF pages in the batch.xml have the <TreatAsEText> element, with a value of true for EText compatible pages and false for non-compatible pages. Note: This element does not appear for input pages that are not PDF.

EText Support mode ALWAYS

– <ETextMode> element at the root level: has a value of Always.

– <TreatAsEText> element at the page level: all PDF pages (both EText compatible and non-compatible) in the batch.xml have the <TreatAsEText> element with a value of true. Note: This element does not appear for input pages that are not PDF.

EText Support mode NEVER

– <ETextMode> element at the root level: has a value of Never.

– <TreatAsEText> element at the page level: not included in the batch.xml.
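The rules above can be expressed as a function that returns the expected <TreatAsEText> value for one page. This is an illustrative sketch only; the function name and the use of `None` to mean "element absent" are assumptions made for the example.

```python
from typing import Optional


def treat_as_etext(mode: str, is_pdf: bool, has_etext: bool) -> Optional[str]:
    """Return "true", "false", or None when the element is absent."""
    mode = mode.capitalize()
    if mode not in ("Automatically", "Always", "Never"):
        raise ValueError(f"unknown EText Support mode: {mode}")
    if mode == "Never" or not is_pdf:
        # The element is never written in NEVER mode or for non-PDF pages.
        return None
    if mode == "Always":
        return "true"
    # Automatically: the value follows the page's EText compatibility.
    return "true" if has_etext else "false"


# Three-page PDF: page 1 has no text layer, pages 2 and 3 do.
pages_have_etext = [False, True, True]
for mode in ("Automatically", "Always", "Never"):
    print(mode, [treat_as_etext(mode, True, flag) for flag in pages_have_etext])
```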

 

Export Module

To implement the EText Support feature, a new switch “Use EText Original Pages When Available” with values of TRUE and FALSE has been added to the CREATEMULTIPAGE_FILES plugin of the Export module both in Windows and Linux environments. The default value is TRUE.

TRUE: Export PDFs are generated from the input PDFs, provided ALL pages of the document in batch.xml are EText compatible (the <TreatAsEText> element is TRUE for all pages).

If some pages of the document are EText compatible and some are not (the <TreatAsEText> element is FALSE or absent), export files are created after the TIFF/PNG to PDF conversion, as in previous versions of Transact.

FALSE: Export files are created after the TIFF/PNG to PDF conversion, as in previous versions of Transact.

Below is a short overview of functionality in various scenarios:

Switch value TRUE, EText Support mode Automatically or Always:

All pages of the document in the batch.xml are scanned for the <TreatAsEText> element value.

– If all the values are true, the denormalized PDF pages are fetched from the ephesoft-system-folder\<Batch Instance Id>\EText_Processing folder to generate the export PDF for that document. No TIFF to PDF conversion occurs.

– If the <TreatAsEText> element is false or absent for any page of the document, the previous functionality of the plugin is used to generate the export PDF for that document. TIFF to PDF conversion occurs.

Switch value TRUE, EText Support mode Never:

The <TreatAsEText> element in the batch.xml is not considered; the plugin generates export PDFs as before, i.e. by converting TIFF to PDF.

Switch value FALSE, any EText Support mode:

The plugin generates export PDFs as before, i.e. by converting TIFF to PDF.
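The export decision reduces to an "all pages true" check per document. The sketch below illustrates it under stated assumptions: the function name and return strings are hypothetical, each page flag is `True`, `False` or `None` (element absent), and a document with no pages is assumed to fall back to conversion.

```python
def export_source(use_original_pages: bool, page_flags: list) -> str:
    """Pick how the export PDF for one document is produced.

    use_original_pages: value of "Use EText Original Pages When Available".
    page_flags: <TreatAsEText> per page: True, False, or None if absent.
    """
    all_etext = bool(page_flags) and all(flag is True for flag in page_flags)
    if use_original_pages and all_etext:
        return "original PDF pages from EText_Processing"
    # Any false or missing flag, or the switch set to FALSE.
    return "TIFF/PNG to PDF conversion"


print(export_source(True, [True, True, True]))
print(export_source(True, [False, True, True]))
print(export_source(False, [True, True, True]))
```

In NEVER mode the element is absent on every page, so the same function yields the conversion path regardless of the switch.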

 

Web Services

The following new web services have been added to support EText:

  • /splitMultiPagePDFtoPDFs
  • /createMultiPagePDFFromSinglePDFs
  • /batchClass/uploadLearnFiles

The following existing web services have been updated to support EText:

  • /createOCR
  • /batchClass/learnFile/{batchClassIdentifier}
  • /ocrClassifyExtract
  • /ocrClassifyExtractSearchablePdf
  • /initiateOcrClassifyExtract
  • /ocrClassify
  • /v2/ocrClassifyExtract
  • /v2/ocrClassifyExtractBase64
  • /executeMobileUpload
  • /keywordClassification
  • /createHOCRForBatchClass

Detailed documentation for all available web services is provided on Swagger:

http://<IP address>:<port>/dcma/rest/swagger-ui.html

 

Use Cases

EText Mode – Automatically

Pre-requisites:

For Windows environment, in the RECOSTAR_HOCR plugin under the Page Process Module:

– The Automatically option should be selected from the “Use EText Recostar Project File” dropdown.

– Both dropdowns, “PDF EText Recostar Project File Name” and “Image OCR Recostar Project File Name”, are enabled, and the FPR_EText.rsp and FPR.rsp files should be selected by default.

OR

For Linux environment, in the NUANCE_HOCR plugin under the Page Process Module:

– The Automatically option should be selected from the “Process PDFs as EText Files” field.

For both Windows and Linux OS, in the CREATEMULTIPAGE_FILES plugin under the Export Module, the value of the “Use EText Original Pages When Available” switch should be TRUE if you want to export the original PDF files used at input.

Now, let us process a three-page PDF document that contains both EText compatible and non-compatible pages (the first page is non-compatible, the second and third pages contain an EText layer). To do so:

1. On the Batch Class Management screen, create a batch class.

2. Open your batch class and create a document type.

3. On the Document Types screen, upload a Learn File.

4. On the same screen, upload Test Classification File to check classification results, and click Test Classification.

5. On the Test Classification screen, click Classify to see classification results. HOCR files are created both for non-EText-compatible page as well as EText compatible pages, so all three pages are classified successfully.

Note: HOCR files will be created for all types of input files, with a separate mechanism being used for all EText compatible PDF pages.

6. Navigate to the Index Fields screen and create index fields. In this case, two index fields are created:

 – TransID index field on the first non-EText-compatible page, and

 – Product No index field on the second EText compatible page.

7. Create extraction rules for your index fields by expanding Index Fields menu, selecting the index field and required extraction rule. In this case, KV extraction rules will be created for pages 1 and 2.

8. On the Extraction Configuration screen, upload the file and create extraction rules.

Since the first page of our file is not EText compatible, the dropdown menu for viewing OCR data from the EText PDF is disabled.

The second and third pages of the document have an EText layer, so the dropdown becomes enabled on all Extraction Configuration screens, allowing you to select the source for viewing OCR data. EText extracted data is often more accurate. You can check this by selecting EText in the dropdown menu and clicking View OCR Data.

9. Once the extraction rules are created, navigate back to the Document Types screen to test the extraction results. To do so, upload a Test Extraction file and click Test Extraction.

10. On the Test Extraction screen, click Extract to see the extraction results. The values are extracted from all pages, both the non-EText-compatible page (TransID) and the EText compatible page (Product No).

11. Now, let us process a batch. Navigate to the Upload Batch screen, select a batch class and priority, upload the file and click Start Batch.

While the batch is being processed, a new folder called “EText_Processing” is created under ephesoft-system-folder\<Batch Instance Id>.

This folder contains all original and denormalized EText compatible PDF pages present in the batch, along with their HOCR files. The denormalized files are used later, at the Export stage. There is also a FileList.rsbatch file, which lists the names of all single page EText compatible PDFs.

In this case, the batch contains two EText compatible pages, all details of which are saved in the EText_Processing folder.

If a PDF page is not EText compatible, or there are input files of a different format, their HOCR files are saved in the ephesoft-system-folder\<Batch Instance Id> folder. The same folder also contains a separate rsbatch file with the names of these files.

In our case, there is one non-EText-compatible PDF page. It is saved in the BIA folder under the ephesoft-system-folder and its name is included in the rsbatch file.

12. Since all the pages are classified successfully, the batch doesn’t stop at the Review stage.

If the batch stops at the Validation stage, both values will be extracted successfully – both from EText compatible as well as non-compatible pages. In this case, one of the fields was set to Force Review. Click Validate to proceed.

13. The batch processing is complete and the batch.xml, along with the PDF file, is exported to the final-drop-folder, as configured (SharedFolders\final-drop-folder\<Batch Instance Id>).

14. Navigate to the folder and open batch.xml.

The batch.xml file includes two new elements:

– At the root level, the value of the new “ETextMode” element is AUTOMATICALLY as configured.

– At the page level, the new “TreatAsEText” element is FALSE for all non-EText-compatible PDF pages and TRUE for all EText compatible PDF pages. In this case, this element value is FALSE for the first page and TRUE for two remaining pages.

Note: The <TreatAsEText> element (TRUE/FALSE) in the batch.xml applies only to pages in PDF format. This element will be absent for all pages in other formats (TIFF, PNG, Word etc.).
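To inspect these values programmatically, the batch.xml can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a deliberately simplified, hypothetical document: only the <ETextMode> and <TreatAsEText> element names come from this article, while the surrounding structure (<Batch>, <Documents>, <Document>, <Pages>, <Page>, <Identifier>) and the page identifiers are assumptions that may differ from a real export.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified batch.xml for the three-page example.
SAMPLE_BATCH_XML = """<Batch>
  <ETextMode>Automatically</ETextMode>
  <Documents>
    <Document>
      <Pages>
        <Page><Identifier>PG0</Identifier><TreatAsEText>false</TreatAsEText></Page>
        <Page><Identifier>PG1</Identifier><TreatAsEText>true</TreatAsEText></Page>
        <Page><Identifier>PG2</Identifier><TreatAsEText>true</TreatAsEText></Page>
      </Pages>
    </Document>
  </Documents>
</Batch>"""


def read_etext_flags(xml_text: str):
    """Return (mode, {page identifier: True/False/None}).

    None means the <TreatAsEText> element is absent for that page.
    """
    root = ET.fromstring(xml_text)
    mode = root.findtext("ETextMode")  # direct child of the root element
    flags = {}
    for page in root.iter("Page"):
        value = page.findtext("TreatAsEText")
        flags[page.findtext("Identifier")] = (
            None if value is None else value.strip().lower() == "true"
        )
    return mode, flags


mode, flags = read_etext_flags(SAMPLE_BATCH_XML)
print(mode, flags)
```

If the real file declares an XML namespace, the element names passed to `findtext` and `iter` would need to be qualified accordingly.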

15. Check exported PDF.

Since we configured “Use EText Original Pages When Available” switch as TRUE in the CREATEMULTIPAGE_FILES plugin, all the pages of the document in the batch.xml file are scanned for the <TreatAsEText> element value.

Since the EText Support mode is AUTOMATICALLY, the <TreatAsEText> element is FALSE for the first PDF page of our batch and TRUE for the second and third pages. As per the feature implementation, TIFF to PDF conversion is used for all three pages (since the value of the <TreatAsEText> element in the batch.xml file is not TRUE for all pages). This means that the export PDF does not contain an EText layer on any page; none of the exported pages is EText compatible.

Note: If export PDFs are generated as searchable using ITEXT (see configuration below), the output PDF files also become EText compatible. In this case, however, they are not vector-based PDFs like the original input files, but rather have an OCRed layer attached on top of the image.

 

EText Mode – Always

Pre-requisites:

For Windows environment, in the RECOSTAR_HOCR plugin under the Page Process Module:

– The Always option should be selected from the “Use EText Recostar Project File” dropdown.

– The FPR_EText.rsp file should be selected from the “PDF EText Recostar Project File Name” dropdown. The “Image OCR Recostar Project File Name” dropdown is automatically disabled.

OR

For Linux environment, in the NUANCE_HOCR plugin under the Page Process Module:

– The Always option should be selected from the “Process PDFs as EText Files” field.

For both Windows and Linux OS, in the CREATEMULTIPAGE_FILES plugin under the Export Module, the value of the “Use EText Original Pages When Available” switch should be TRUE if you want to export the original PDF files used at input.

Now, let us process a three-page PDF document that contains both EText compatible and non-compatible pages (the first page is non-compatible, the second and third pages are EText compatible). To do so:

1. On the Batch Class Management screen, create a batch class.

2. Open your batch class and create a document type.

3. On the Document Types screen, upload a Learn File.

4. On the same screen, upload Test Classification File to check classification results, and click Test Classification.

5. On the Test Classification screen, click Classify to see classification results. Since no HOCR file is created for non-EText-compatible page, the first page is not classified. In this scenario, HOCR files are created only for EText compatible pages, so the remaining two pages are classified successfully.

Note: When the EText Support mode is ALWAYS, HOCR files are created ONLY for EText compatible PDF pages. HOCR files are not created for any other type of input files.

6. Navigate to the Index Fields screen and create index fields. In this case, two index fields are created:

 – TransID index field on the first non-EText-compatible page, and

 – Product No index field on the second EText compatible page.

7. Create extraction rules for your index fields by expanding Index Fields menu, selecting the index field and required extraction rule. In this case, KV extraction rules will be created for pages 1 and 2.

8. On the Extraction Configuration screen, upload the file and create extraction rules.

Since the first page of our file is not EText compatible, the dropdown menu for viewing OCR data from the EText PDF is disabled.

The second and third pages of the document have an EText layer, so the dropdown becomes enabled on all Extraction Configuration screens, allowing you to select the source for viewing OCR data. EText extracted data is often more accurate. You can check this by selecting EText in the dropdown menu and clicking View OCR Data.

9. Once the extraction rules are created, navigate back to the Document Types screen to test the extraction results. To do so, upload a Test Extraction file and click Test Extraction.

10. On the Test Extraction screen, click Extract to see the extraction results. The value is extracted only from the EText compatible page (Product No). Since the selected EText mode is ALWAYS, only EText compatible pages are processed in this case.

11. Now, let us process a batch. Navigate to the Upload Batch screen, select a batch class and priority, upload the file and click Start Batch.

While the batch is being processed, a new folder called “EText_Processing” is created under ephesoft-system-folder\<Batch Instance Id>.

This folder contains all original and denormalized PDF pages present in the batch, along with their HOCR files. The denormalized files are used later, at the Export stage. There is also a FileList.rsbatch file, which lists the names of all single page PDFs.

If a PDF page is not EText compatible, its HOCR file will be blank. In our case, the first page does not contain an EText layer, so its HOCR file looks as follows:

Note: If a batch contains any non-PDF pages, the HOCR files created for them will also be blank and will be located in the batch instance folder under the ephesoft-system-folder (SharedFolders\ephesoft-system-folder\<Batch Instance Id>).

In case of NUANCE_HOCR plugin, HOCR files will be created as usual both for PDF and non-PDF pages.

12. Our batch contains one page that is not EText compatible and HOCR file for this page is blank. Hence it cannot be classified, and the batch stops at the Review stage. All EText compatible pages are classified successfully.

Select the document type from the dropdown list, merge the selected pages if needed, and click Review to proceed.

Note: The View OCR functionality on the Review-Validate screens will not work for non-EText-compatible pages in this case. When the View OCR button is clicked, the system returns a blank page. When moving to the non-EText-compatible pages, the following message is displayed:

13. On the Validation screen, the values are only extracted for the EText compatible pages. In this case, only the Product No value is extracted, while TransID index field value is blank as its KV extraction rule is configured for a non-EText-compatible page.

Since there is no HOCR file created for non-compatible page, the value cannot be specified by creating overlay on the image on the RV screen and needs to be entered manually, if required. Once done, click Validate to proceed.

14. The batch processing is complete and the batch.xml, along with the PDF file, is exported to the final-drop-folder, as configured (SharedFolders\final-drop-folder\<Batch Instance Id>).

Navigate to the folder and open batch.xml.

Batch.xml contains two new elements:

– At the root level, the value of the new “ETextMode” element is ALWAYS as configured.

– At the page level, the new “TreatAsEText” element is TRUE for all PDF pages, both EText compatible and not (as per implementation of the ALWAYS EText Support mode).

In this case though, the first page is not EText compatible, so no HOCR has been created for it. Hence, this page has one more attribute “IsBlank”, the value of which is TRUE. The second and third pages contain EText layer, HOCR files are created for them, so “IsBlank” element is not present for these pages.

Note: The <TreatAsEText> element (TRUE/FALSE) in the batch.xml applies only to pages in PDF format. This element will be absent for all pages in other formats (TIFF, PNG, Word etc.).

15. Check exported PDF.

Since we configured “Use EText Original Pages When Available” switch as TRUE in the CREATEMULTIPAGE_FILES plugin, all the pages of the document in the batch.xml file are scanned for the <TreatAsEText> element value.

Since the EText Support mode is ALWAYS, the <TreatAsEText> element is TRUE for all PDF pages. This means that the PDF pages are fetched from the ephesoft-system-folder\<Batch Instance Id>\EText_Processing folder to generate the export PDF for that document. No TIFF to PDF conversion occurs in this case. Thus, our document will contain the first non-EText-compatible page and the two remaining pages with an EText layer, the same as at input.

 

EText Mode – Never

In this case, the functionality works the same way as it did prior to the 2019.1 release and the implementation of the EText Support feature.

Pre-requisites:

For Windows environment, in the RECOSTAR_HOCR plugin under the Page Process Module:

– The Never option should be selected from the “Use EText Recostar Project File” dropdown.

– The FPR.rsp file should be selected from the “Image OCR Recostar Project File Name” dropdown. The “PDF EText Recostar Project File Name” dropdown is automatically disabled.

OR

For Linux environment, in the NUANCE_HOCR plugin under the Page Process Module:

– The Never option should be selected from the “Process PDFs as EText Files” field.

Since no EText processing is done in this case, the value of the “Use EText Original Pages When Available” switch in the CREATEMULTIPAGE_FILES plugin under the Export Module does not matter. It can be either TRUE (the default) or FALSE.

Now, let us process a three-page PDF document that contains both EText compatible and non-compatible pages (the first page is non-compatible, the second and third pages are EText compatible). To do so:

1. On the Batch Class Management screen, create a batch class.

2. Open your batch class and create a document type.

3. On the Document Types screen, upload a Learn File.

4. On the same screen, upload Test Classification File to check classification results, and click Test Classification.

5. On the Test Classification screen, click Classify to see classification results. HOCR files are created for all pages as usual using PDF to TIFF conversion, so all three pages are classified successfully.

6. Navigate to the Index Fields screen and create index fields. In this case, two index fields are created:

 – TransID index field on the first non-EText-compatible page, and

 – Product No index field on the second EText compatible page.

7. Create extraction rules for your index fields by expanding Index Fields menu, selecting the index field and required extraction rule. In this case, KV extraction rules will be created for pages 1 and 2.

8. On the Extraction Configuration screen, upload the file and create extraction rules.

Since the first page of our file is not EText compatible, the dropdown menu for viewing OCR data from the EText PDF is disabled.

The second and third pages of the document have an EText layer, so the dropdown becomes enabled on all Extraction Configuration screens, allowing you to select the source for viewing OCR data. EText extracted data is often more accurate. You can check this by selecting EText in the dropdown menu and clicking View OCR Data.

9. Once the extraction rules are created, navigate back to the Document Types screen to test the extraction results. To do so, upload a Test Extraction file and click Test Extraction.

10. On the Test Extraction screen, click Extract to see the extraction results. The values are extracted from all pages (TransID and Product No).

11. Now, let us process a batch. Navigate to the Upload Batch screen, select a batch class and priority, upload the file and click Start Batch.

While the batch is being processed, all converted TIFF/PNG images and HOCR files are saved in the ephesoft-system-folder\<Batch Instance Id> folder, as in previous versions of Transact (no “EText_Processing” folder is created in this case).

The same batch instance folder also contains a FileList.rsbatch file, which includes the names of all single pages of the input file.

12. Since all the pages are classified successfully, the batch doesn’t stop at the Review stage.

If the batch stops at the Validation stage, both values will be extracted successfully since HOCR files are created for all input pages. In this case, one of the fields was set to Force Review. Click Validate to proceed.

13. The batch processing is complete and the batch.xml, along with the PDF file, is exported to the final-drop-folder, as configured (SharedFolders\final-drop-folder\<Batch Instance Id>).

Navigate to the folder and open batch.xml.

Batch.xml contains only one new element:

– At the root level, the value of the new <ETextMode> element is NEVER as configured.

Note: At the page level, the new <TreatAsEText> element will not be used in this case.

14. Check exported PDF.

Since the usual PDF to TIFF / TIFF to PDF conversion is used in this case and no EText processing is involved, the exported PDF will not contain pages with an EText layer, even though some input pages were EText compatible.

Note: If export PDFs are generated as searchable using ITEXT (see configuration below), the output PDF files also become EText compatible. In this case, however, they are not vector-based PDFs like the original input files, but rather have an OCRed layer attached on top of the image.